Archive Project Planning
Comprehensive Project Specification & Core Principles
2. Core Functional Logic & Services
Business logic executed by the core/ layer must strictly map to the following defined services: 1. Ingest: Normalizes filenames, processes fuzzy dates, tokenizes filename labels into initial tags, creates JSON sidecars, calculates SHA-256, routes valid files from ingest/ to YYYYMM/, and executes database insertions. Files failing temporal hierarchy validation or filename format constraints are immediately routed to quarantine/. 2. Audit: Full archive integrity scrub executed manually via user invocation. Re-hashes all binaries across the complete archive. Reconstructs orphaned DB records or sidecars if the binary SHA matches. Strictly routes to quarantine/ if the SHA mismatches (bit-rot). Execution processes sequentially across the entire active index until completion or manual cancellation. 3. Note: Provides append and overwrite functionality to the plain-text metadata note field. 4. Tag: Provides addition and removal of categorization tags. 5. Recovery: Rebuilds the entire SQLite database directly from authoritative JSON sidecars. 6. Search: Queries filenames, tags, and notes via SQLite FTS5. 7. Setup: Initializes the target archive directory structure and database schema during first run or recovery.
3. Concurrency, Locking, & Atomicity
- The Undying Lock File: The OS-level lock file (
chronology/.chronology/archive.lock) must never be deleted. Mutual exclusion must rely on OS-kernel capabilities enforcing atomic non-blocking exclusive acquisition (fcntl.flockwithLOCK_EX | LOCK_NBon POSIX, ormsvcrt.lockingwithLK_NBLCKon Windows) for every command executed by the engine. Existence checks (os.path.exists) are strictly forbidden. The system must attempt the exclusive lock, and if the OS raises an exception indicating contention, the engine must instantly catch the exception, log the collision event to the centralized logs, and safely abort the operation. - Strict Two-Phase Commits (Intent-Logged Protocol):
- Prepare Disk: Write deterministic
.tmpfiles. - Commit DB (Intent): Execute
INSERT/UPDATE. Explicitly setpending_bin_sync = 1and/orpending_json_sync = 1. Callconn.commit(). - Commit Disk: Elevate target OS-level permissions from read-only (
0o444) to writable (0o644or Windows equivalent) -> Executeos.replaceto overwrite the physical file -> Re-apply strict read-only (0o444) permissions. Note: The global application lock (archive.lock) must remain continuously held during this phase. - Acknowledge DB (Finalize): Execute
UPDATEsetting flags to0. Callconn.commit(). - Failure State Protocol: Developers are strictly forbidden from writing exception handlers that attempt manual disk rollbacks. Standard exceptions must safely abort the thread and rely entirely on native Roll-Forward Recovery to resolve desynchronization via the intent flags on the next initialization.
- Prepare Disk: Write deterministic
4. Data Integrity & Immutability
- OS-Level Immutability: Once ingested, binary files must be programmatically locked via OS permissions (e.g.,
0o444/ Read-Only). State mutations temporarily toggle and re-lock this state. - Deduplication & Collision Protocol: Absolute binary deduplication is enforced via SHA-256 before any target OS file transfers or database operations begin. If a matching SHA-256 hash is detected within the active index, the ingest process must route the incoming duplicate file directly to
chronology/quarantine/. The duplication event must be appended to the standard system log, and the ingest service must proceed to the next file in the queue without halting or raising a fatal exception. - Strong Typing: Data entering the system must immediately be parsed into strict, immutable data structures defined in
chronology-engine/models/schema.py. Passing raw dictionaries between layers is forbidden. The dataclasses act as the sole boundary enforcer for array limits and must validate length constraints on instantiation (max 255-character filenames, max 128-character tags, max 50 tags per list), raising a centralizedValidationErrorif exceeded before any database intent or disk staging operations occur.
5. Domain Rules & Naming Conventions
- Filename Format & Validation: Must strictly adhere to a maximum length of 235 (leaving room for versioning, e.g.
-001and possible compound extensions, e.g..ext.ext.json) characters and match the regex^\d{14}(?:-[a-z0-9_]+)*(?:\.[a-z0-9\+\-]+)+$, which enforces hyphens as the sole valid delimiter preceding label segments. Extensionless files are explicitly prohibited. Any file lacking a valid extension must immediately fail validation. To prevent catastrophic file-collision bugs between case-insensitive (Windows NTFS) and case-sensitive (Linux ext4) filesystems, all incoming filenames and extensions must be forcefully normalized to lowercase during ingestion. - JSON Sidecar Naming Convention: To prevent sidecar collisions when multiple files share an identical 14-digit timestamp and label but possess varying extensions (e.g., RAW vs. JPG pairs), JSON sidecars must append
.jsondirectly to the fully resolved binary filename. The format is strictly[filename].[ext].json(e.g.,20260101120000-vacation.jpg.json). - Label-to-Tag Mapping: Filenames may optionally contain descriptive string segments (labels) following the 14-digit timestamp. During ingestion, the system must tokenize the label segment strictly using hyphens (
-) as delimiters. Underscores (_) must be preserved within the extracted tokens to permit compound tagging (e.g.,new_york). The resulting strings must be populated into the JSONtagsarray. The 14-digit prefix, file extensions, and any trailing numeric collision suffixes (e.g.,-001) must be explicitly excluded from tag generation. - Temporal Normalization & Fuzzy Logic: Subordinate temporal values cannot be defined if parent values are undefined. If the Month segment (
MM) is00, the Day segment (DD) must also be00. Permutated fuzzy dates (e.g.,YYYY00DDHHMMSS) are strictly invalid. Valid fuzzy dates (e.g.,YYYY0000HHMMSSorYYYYMM00HHMMSS) shift their00segments to01, append the tagfuzzy-date, and log the original filename in the note field. Year0000is absolutely invalid. Files failing these checks are logged as ‘Invalid Filename - Temporal Permutation’ and moved toquarantine/. - Zero-Fill Cascading Resolution: To maintain database compatibility, legacy filenames utilizing zero-filling for indeterminate dates must be shifted to the nearest valid chronological baseline.
- Unknown month/day/time (
YYYY0000000000) shifts toYYYY0101000000. - Unknown day/time (
YYYYMM00000000) shifts toYYYYMM01000000. - Unknown time (
YYYYMMDD000000) remainsYYYYMMDD000000(treated as exactly midnight). - Any file subjected to zero-fill resolution must automatically receive the
fuzzy-datetag, and its original raw filename must be explicitly recorded in thenotefield.
- Unknown month/day/time (
- Tagging Constraints: Tags consisting entirely of numbers are strictly forbidden. Trailing numeric collision suffixes (e.g., -001) must be stripped during ingestion and not counted as tags. Individual tags are strictly limited to a maximum length of 128 characters. The system must strictly reject tag arrays exceeding a hard limit of 50 items per file. Silent truncation is explicitly prohibited; exceeding the limit must immediately halt the active operation and return a standard validation error.
- Collision Handling: Automatically append
-001through-999. Abort if suffix999is exceeded. - Deterministic Temp Files:
[target_filename].[ext].tmp. Randomized temp files are explicitly forbidden.
6. OS-Level Constraints & File System Routing
- Routing & Housekeeping: Target paths strictly evaluate to
chronology/YYYYMM/. When deleting, remove OS junk (.DS_Store,thumbs.db) and delete theYYYYMMdirectory if empty. - Deterministic Sweep: During initialization or recovery, the engine must immediately delete any orphaned
.tmpfiles on disk that do not possess a corresponding1intent flag in the database. This resolves crashes that occur during the “Prepare Disk” phase before database intents are logged, eliminating the need for arbitrary time-based file sweeping. - Windows Long Paths: Automatically prepend the
\\?\prefix for OS operations on Windows. Abstracted away from business logic. - POSIX DB Paths: SQLite database paths strictly use forward slashes (
/), enforced byCHECK. - Unified Logging: The system must utilize
logging.handlers.RotatingFileHandlerconfigured inchronology-engine/utils/logger.py, explicitly targeting thechronology/.chronology/logs/directory.
7. Database & Schema Specifications
Centralized Connectivity: All PRAGMA execution, connections, and raw SQL isolated to
chronology-engine/storage/db.py.Engine & Version: SQLite 3.37.0 or higher. Enforce
PRAGMA foreign_keys = ON,PRAGMA journal_mode = WAL, andPRAGMA synchronous = NORMAL. All tables must explicitly enforce SQLiteSTRICTtyping to prevent dynamic type corruption at the storage boundary. Table definitions must utilizeCHECK (length(filename) <= 255). To maintain strict compatibility with SQLite 3.37.0, array length limits for tags must not be enforced via SQLite triggers. Boundary enforcement is exclusively delegated to upstream application-layer Dataclass validation.FTS5 Indexing: A virtual table tracks filename, tags, and notes, synchronized purely via SQLite
TRIGGERs. Application code must not manually insert into the FTS5 table.Timezone & Metadata Handling: All OS-level hardware timestamps (file
mtime, filectime, system ingest time) must be forcefully normalized to UTC before database insertion or JSON creation. Logical timestamps derived from filenames remain naive.JSON Schema Contract Updates: The schema must enforce ISO 8601 formatting, utilizing the
Zsuffix exclusively for fields normalized to UTC.{ "schema_version": 1, "original_filename": "string", "archive_filename": "string", "logical_date": "YYYY-MM-DDTHH:MM:SS", // Timezone-agnostic (derived from filename) "created_date": "YYYY-MM-DDTHH:MM:SSZ", // UTC (derived from OS metadata) "modified_date": "YYYY-MM-DDTHH:MM:SSZ", // UTC (derived from OS metadata) "ingest_date": "YYYY-MM-DDTHH:MM:SSZ", // UTC (engine generation time) "file_extension": "string", "file_size_bytes": 0, "sha256": "string", "tags":["array", "of", "strings"], "note": "string" }
8. Disaster Recovery & Repair Mechanics
- Centralized Exceptions: Standard library exceptions must be caught at domain boundaries and re-raised via a centralized hierarchy (e.g.,
ArchiveError,LockCollisionError,InvalidExtensionError) defined inchronology-engine/utils/exceptions.py. The concept of locking timeouts is explicitly out of scope. - Batch Memory Management: Bulk database insertion must support batch chunking (e.g., 10,000 records) to prevent memory overflows during recovery.
- Roll-Forward Recovery & Quarantine: Upon acquiring the global
archive.lockand prior to executing the primary logic of any requested operation (both mutating and read-only), the engine must immediately scan and resolve any DB records wherepending_bin_sync = 1orpending_json_sync = 1.- Binary Resolution: Locate
.tmpand replace. If missing, hash the physical file. If it matches, the disk commit succeeded; update flag to0. If the hash mismatches, declare irrecoverable payload loss, move the corrupted file tochronology/quarantine/, update the database note field to reflect quarantine status, and clear the intent flag.
- Binary Resolution: Locate
- JSON Resolution & Orphan Reconstruction: Locate
.tmpand replace. If missing, the system must execute a strict three-tier state resolution hierarchy to prevent metadata loss:- Primary (JSON Validation): Attempt to read and parse the existing JSON sidecar. If successful and schema-compliant, the file state is resolved.
- Secondary Fallback (DB to JSON Rebuild): If the JSON sidecar is missing, corrupted, or fails schema parsing, cleanly reconstruct the JSON sidecar on disk utilizing the existing database record.
- Tertiary Orphan Reconstruction (Binary to JSON/DB): If both the JSON sidecar is unreadable and the database record is missing, the file is declared a true orphan. The system must securely reconstruct both the database record and JSON sidecar by extracting the logical date from the binary filename and falling back to the physical binary’s UTC-normalized
mtimefor creation/modification dates.
- Logging Requirement: Any file repaired via Secondary or Tertiary fallback mechanisms must trigger a centralized warning log containing the timestamp, filename, and the specific fallback tier invoked.
9. Directory tree and script locations
chronology-engine/
├── main.py # Presentation Layer (CLI)
├── core/
│ ├── __init__.py
│ ├── ingest.py # Ingest Service
│ ├── audit.py # Integrity Scrub Service
│ ├── metadata.py # Note & Tag Services
│ ├── recovery.py # DB Rebuild & Roll-Forward
│ └── search.py # FTS5 Query Service
├── models/
│ ├── __init__.py
│ └── schema.py # Immutable Dataclasses & Validation
├── storage/
│ ├── __init__.py
│ └── db.py # SQLite Interface & Schema Init
├── system/
│ ├── __init__.py
│ ├── filesystem.py # OS-Level I/O, Locking, Permissions
│ └── hashing.py # Chunked SHA-256
└── utils/
├── __init__.py
├── exceptions.py # Centralized Error Hierarchy
└── logger.py # RotatingFileHandler Setup