Archive Project Planning

This is a project I’ve been playing with for a few months now. I had a working prototype, but decided to do a full rebuild right away, because the initial development taught me a lot about what may or may not work.

Comprehensive Project Specification & Core Principles

1. Architectural Boundaries & Structural Authority

  • Zero Dependencies: The system must be built exclusively using the Python 3.11+ Standard Library. No external packages are permitted.
  • Single User & Local Scope: The system is explicitly restricted to single-user operation on local drives. Concurrent multi-user access and NAS/network drive deployments are strictly out of scope to prevent OS-level kernel locking race conditions over SMB/NFS.
  • Directory Segregation: The application code and the user archive must be strictly separated to prevent duplication and bloat during data migrations.
    • Application Root (chronology-engine/): Contains the codebase. Structured rigidly into: main.py (Presentation/CLI), core/ (Services/Logic), models/ (Immutable Dataclasses), storage/ (Database Interface), system/ (OS/Filesystem Interface), and utils/ (Exceptions, Logger).
    • Archive Root (chronology/): Contains the user data. Structured into: .chronology/ (hidden system folder for index.db, archive.lock, and logs/), ingest/ (drop zone for new files), quarantine/ (corrupt or invalid files), and YYYYMM/ (routed archive sibling directories).
  • Unidirectional Import Rule: To prevent contextual bleed, modules may only import from their own layer or layers strictly below them. Models are purely logical and import nothing. Interfaces (storage/, system/) import Models to enforce types. Services (core/) orchestrate Interfaces. Presentation (main.py) translates user input to Services and possesses zero knowledge of SQL or OS paths.
  • Hierarchical Authority & State Resolution:
    • Primary Truth (JSON): The physical filesystem—specifically the binary file and its accompanying JSON sidecar—is the primary authoritative source of truth.
    • Secondary Fallback (Database): The database functions primarily as a highly optimized read-replica, but elevates to Secondary Authoritative Fallback status exclusively when a physical JSON sidecar is missing, locked, or fails schema validation.
    • Parity Verification Constraint: The system must be capable of completely destroying and reconstructing the database using the JSON sidecars. However, destructive database rebuilds are strictly forbidden until a pre-flight validation guarantees a 1:1 parity exists between active database records and readable JSON sidecars on disk.
  • Exclusive Single-Actor Concurrency: All system operations—encompassing both mutating actions (Ingest, Tag, Setup) and read-only actions (Search, Audit)—strictly enforce an exclusive global lock. Concurrent access at the application level is fundamentally prohibited. If the global lock is currently held, any secondary request must instantly catch the OS-level contention exception, log the event, and safely abort.
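
A minimal sketch of the exclusive global lock described above, assuming the fcntl.flock / msvcrt.locking calls named in section 3. The context manager name archive_lock is hypothetical, and LockCollisionError is a local stand-in for the centralized exception; logging the collision is left to the caller.

```python
import os
import sys
from contextlib import contextmanager

if sys.platform == "win32":
    import msvcrt
else:
    import fcntl


class LockCollisionError(Exception):
    """Stand-in for the centralized exception: the lock is already held."""


@contextmanager
def archive_lock(lock_path):
    # Open (creating if needed) without truncating. The file is never deleted.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            if sys.platform == "win32":
                # Non-blocking lock on the first byte of the file.
                msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
            else:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # Contention surfaces as an OSError subclass: abort, never wait.
            raise LockCollisionError(f"archive is locked: {lock_path}") from exc
        try:
            yield fd
        finally:
            if sys.platform == "win32":
                os.lseek(fd, 0, os.SEEK_SET)
                msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
            else:
                fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Every command, read-only or mutating, would wrap its work in this one context manager, so a second invocation fails immediately instead of queueing behind the first.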

2. Core Functional Logic & Services

Business logic executed by the core/ layer must strictly map to the following defined services:

  1. Ingest: Normalizes filenames, processes fuzzy dates, tokenizes filename labels into initial tags, creates JSON sidecars, calculates SHA-256, routes valid files from ingest/ to YYYYMM/, and executes database insertions. Files failing temporal hierarchy validation or filename format constraints are immediately routed to quarantine/.
  2. Audit: Full archive integrity scrub executed manually via user invocation. Re-hashes all binaries across the complete archive. Reconstructs orphaned DB records or sidecars if the binary SHA matches. Strictly routes to quarantine/ if the SHA mismatches (bit-rot). Execution processes sequentially across the entire active index until completion or manual cancellation.
  3. Note: Provides append and overwrite functionality to the plain-text metadata note field.
  4. Tag: Provides addition and removal of categorization tags.
  5. Recovery: Rebuilds the entire SQLite database directly from authoritative JSON sidecars.
  6. Search: Queries filenames, tags, and notes via SQLite FTS5.
  7. Setup: Initializes the target archive directory structure and database schema during first run or recovery.

3. Concurrency, Locking, & Atomicity

  • The Undying Lock File: The OS-level lock file (chronology/.chronology/archive.lock) must never be deleted. Mutual exclusion must rely on OS-kernel capabilities enforcing atomic non-blocking exclusive acquisition (fcntl.flock with LOCK_EX | LOCK_NB on POSIX, or msvcrt.locking with LK_NBLCK on Windows) for every command executed by the engine. Existence checks (os.path.exists) are strictly forbidden. The system must attempt the exclusive lock, and if the OS raises an exception indicating contention, the engine must instantly catch the exception, log the collision event to the centralized logs, and safely abort the operation.
  • Strict Two-Phase Commits (Intent-Logged Protocol):
    1. Prepare Disk: Write deterministic .tmp files.
    2. Commit DB (Intent): Execute INSERT/UPDATE. Explicitly set pending_bin_sync = 1 and/or pending_json_sync = 1. Call conn.commit().
    3. Commit Disk: Elevate target OS-level permissions from read-only (0o444) to writable (0o644 or Windows equivalent) -> Execute os.replace to overwrite the physical file -> Re-apply strict read-only (0o444) permissions. Note: The global application lock (archive.lock) must remain continuously held during this phase.
    4. Acknowledge DB (Finalize): Execute UPDATE setting flags to 0. Call conn.commit().
    5. Failure State Protocol: Developers are strictly forbidden from writing exception handlers that attempt manual disk rollbacks. Standard exceptions must safely abort the thread and rely entirely on native Roll-Forward Recovery to resolve desynchronization via the intent flags on the next initialization.
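
The four commit steps above can be sketched for a JSON sidecar as follows. This is an illustration under stated assumptions, not the engine's code: the files table, its columns other than pending_json_sync, and the function name write_sidecar_two_phase are hypothetical, and the global lock is assumed to be held by the caller.

```python
import os
import sqlite3

# Hypothetical minimal table; only the intent flag comes from the spec.
SCHEMA = """
CREATE TABLE files (
    archive_filename TEXT PRIMARY KEY,
    pending_json_sync INTEGER NOT NULL DEFAULT 0
)
"""


def write_sidecar_two_phase(conn, target_path, payload):
    name = os.path.basename(target_path)
    tmp_path = target_path + ".tmp"  # deterministic, never randomized

    # 1. Prepare Disk: stage the new content next to the target.
    with open(tmp_path, "wb") as fh:
        fh.write(payload)

    # 2. Commit DB (Intent): record that a disk commit is pending.
    conn.execute(
        "INSERT INTO files (archive_filename, pending_json_sync) VALUES (?, 1) "
        "ON CONFLICT(archive_filename) DO UPDATE SET pending_json_sync = 1",
        (name,),
    )
    conn.commit()

    # 3. Commit Disk: unlock, atomically replace, re-lock as read-only.
    try:
        os.chmod(target_path, 0o644)
    except FileNotFoundError:
        pass  # first write: there is no existing file to unlock
    os.replace(tmp_path, target_path)
    os.chmod(target_path, 0o444)

    # 4. Acknowledge DB (Finalize): clear the intent flag.
    conn.execute(
        "UPDATE files SET pending_json_sync = 0 WHERE archive_filename = ?",
        (name,),
    )
    conn.commit()
```

Note that there is deliberately no except block that undoes disk changes. If the process dies after step 2, the flag stays at 1 and the next start-up rolls the operation forward.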

4. Data Integrity & Immutability

  • OS-Level Immutability: Once ingested, binary files must be programmatically locked via OS permissions (e.g., 0o444 / Read-Only). State mutations temporarily toggle and re-lock this state.
  • Deduplication & Collision Protocol: Absolute binary deduplication is enforced via SHA-256 before any target OS file transfers or database operations begin. If a matching SHA-256 hash is detected within the active index, the ingest process must route the incoming duplicate file directly to chronology/quarantine/. The duplication event must be appended to the standard system log, and the ingest service must proceed to the next file in the queue without halting or raising a fatal exception.
  • Strong Typing: Data entering the system must immediately be parsed into strict, immutable data structures defined in chronology-engine/models/schema.py. Passing raw dictionaries between layers is forbidden. The dataclasses act as the sole boundary enforcer for array limits and must validate length constraints on instantiation (max 255-character filenames, max 128-character tags, max 50 tags per list), raising a centralized ValidationError if exceeded before any database intent or disk staging operations occur.
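
As a rough illustration of the strong-typing rule, a frozen dataclass can enforce the length limits in __post_init__. The class name FileRecord and its field selection are assumptions for this sketch, and ValidationError is a local stand-in for the centralized exception.

```python
from dataclasses import dataclass

MAX_FILENAME_LEN = 255
MAX_TAG_LEN = 128
MAX_TAGS = 50


class ValidationError(ValueError):
    """Stand-in for the centralized validation exception."""


@dataclass(frozen=True)
class FileRecord:
    archive_filename: str
    sha256: str
    tags: tuple = ()
    note: str = ""

    def __post_init__(self):
        # Store tags as a tuple so the record stays immutable.
        object.__setattr__(self, "tags", tuple(self.tags))
        if len(self.archive_filename) > MAX_FILENAME_LEN:
            raise ValidationError("filename exceeds 255 characters")
        if len(self.tags) > MAX_TAGS:
            # No silent truncation: exceeding the limit is an error.
            raise ValidationError("more than 50 tags")
        for tag in self.tags:
            if len(tag) > MAX_TAG_LEN:
                raise ValidationError(f"tag exceeds 128 characters: {tag[:20]}...")
            if tag.isdigit():
                raise ValidationError(f"all-numeric tag: {tag}")
```

Because validation runs at construction time, an invalid record can never reach the database intent or disk staging steps.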

5. Domain Rules & Naming Conventions

  • Filename Format & Validation: Filenames must not exceed 235 characters, which leaves room for versioning suffixes (e.g., -001) and possible compound extensions (e.g., .ext.ext.json), and must match the regex ^\d{14}(?:-[a-z0-9_]+)*(?:\.[a-z0-9\+\-]+)+$, which enforces hyphens as the sole valid delimiter preceding label segments. Extensionless files are explicitly prohibited: any file lacking a valid extension must immediately fail validation. To prevent catastrophic file-collision bugs between case-insensitive (Windows NTFS) and case-sensitive (Linux ext4) filesystems, all incoming filenames and extensions must be forcefully normalized to lowercase during ingestion.
  • JSON Sidecar Naming Convention: To prevent sidecar collisions when multiple files share an identical 14-digit timestamp and label but possess varying extensions (e.g., RAW vs. JPG pairs), JSON sidecars must append .json directly to the fully resolved binary filename. The format is strictly [filename].[ext].json (e.g., 20260101120000-vacation.jpg.json).
  • Label-to-Tag Mapping: Filenames may optionally contain descriptive string segments (labels) following the 14-digit timestamp. During ingestion, the system must tokenize the label segment strictly using hyphens (-) as delimiters. Underscores (_) must be preserved within the extracted tokens to permit compound tagging (e.g., new_york). The resulting strings must be populated into the JSON tags array. The 14-digit prefix, file extensions, and any trailing numeric collision suffixes (e.g., -001) must be explicitly excluded from tag generation.
  • Temporal Normalization & Fuzzy Logic: Subordinate temporal values cannot be defined if parent values are undefined. If the Month segment (MM) is 00, the Day segment (DD) must also be 00. Permuted fuzzy dates (e.g., YYYY00DDHHMMSS) are strictly invalid. Valid fuzzy dates (e.g., YYYY0000HHMMSS or YYYYMM00HHMMSS) have their 00 segments shifted to 01, receive the tag fuzzy-date, and have the original filename recorded in the note field. Year 0000 is absolutely invalid. Files failing these checks are logged as ‘Invalid Filename - Temporal Permutation’ and moved to quarantine/.
  • Zero-Fill Cascading Resolution: To maintain database compatibility, legacy filenames utilizing zero-filling for indeterminate dates must be shifted to the nearest valid chronological baseline.
    • Unknown month/day/time (YYYY0000000000) shifts to YYYY0101000000.
    • Unknown day/time (YYYYMM00000000) shifts to YYYYMM01000000.
    • Unknown time (YYYYMMDD000000) remains YYYYMMDD000000 (treated as exactly midnight).
    • Any file subjected to zero-fill resolution must automatically receive the fuzzy-date tag, and its original raw filename must be explicitly recorded in the note field.
  • Tagging Constraints: Tags consisting entirely of numbers are strictly forbidden. Trailing numeric collision suffixes (e.g., -001) must be stripped during ingestion and not counted as tags. Individual tags are strictly limited to a maximum length of 128 characters. The system must strictly reject tag arrays exceeding a hard limit of 50 items per file. Silent truncation is explicitly prohibited; exceeding the limit must immediately halt the active operation and return a standard validation error.
  • Collision Handling: Automatically append -001 through -999. Abort if suffix 999 is exceeded.
  • Deterministic Temp Files: [target_filename].[ext].tmp. Randomized temp files are explicitly forbidden.
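
The filename, fuzzy-date, and label rules above can be combined into one parsing step. The sketch below is one possible reading: the function names are hypothetical, a trailing three-digit token is assumed to be a collision suffix, and the wording of the note is invented for illustration. Rejection of other all-numeric tokens is left to the tag validation.

```python
import re
from datetime import datetime

FILENAME_RE = re.compile(r"^\d{14}(?:-[a-z0-9_]+)*(?:\.[a-z0-9\+\-]+)+$")
MAX_FILENAME_LEN = 235


class ValidationError(ValueError):
    """Stand-in for the centralized validation exception."""


def normalize_timestamp(stamp):
    """Return (resolved_stamp, was_fuzzy) for a 14-digit prefix."""
    year, month, day, clock = stamp[:4], stamp[4:6], stamp[6:8], stamp[8:]
    if year == "0000" or (month == "00" and day != "00"):
        raise ValidationError("Invalid Filename - Temporal Permutation")
    fuzzy = month == "00" or day == "00"
    # Shift unknown (00) month and day segments to 01.
    resolved = year + (month if month != "00" else "01") \
        + (day if day != "00" else "01") + clock
    try:
        datetime.strptime(resolved, "%Y%m%d%H%M%S")
    except ValueError as exc:
        raise ValidationError(f"not a real date: {stamp}") from exc
    return resolved, fuzzy


def parse_filename(raw_name):
    """Return (archive_filename, tags, note) or raise ValidationError."""
    name = raw_name.lower()
    if len(name) > MAX_FILENAME_LEN or not FILENAME_RE.fullmatch(name):
        raise ValidationError(f"invalid filename: {raw_name!r}")
    # Labels cannot contain dots, so the first dot starts the extension(s).
    stem = name.partition(".")[0]
    tokens = stem[15:].split("-") if len(stem) > 14 else []
    # Assumption: a trailing three-digit token is a collision suffix, not a tag.
    if tokens and re.fullmatch(r"\d{3}", tokens[-1]):
        tokens.pop()
    resolved, fuzzy = normalize_timestamp(stem[:14])
    tags = tokens + (["fuzzy-date"] if fuzzy else [])
    note = f"original filename: {raw_name}" if fuzzy else ""
    return resolved + name[14:], tags, note
```

For example, 20260000120000-New_York.JPG would resolve to 20260101120000-new_york.jpg with the tags new_york and fuzzy-date, while 20260015120000-x.jpg would be rejected as a temporal permutation.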

6. OS-Level Constraints & File System Routing

  • Routing & Housekeeping: Target paths strictly evaluate to chronology/YYYYMM/. When deleting a file from the archive, also remove OS junk files (.DS_Store, thumbs.db) and delete the YYYYMM directory if it is left empty.
  • Deterministic Sweep: During initialization or recovery, the engine must immediately delete any orphaned .tmp files on disk that do not possess a corresponding 1 intent flag in the database. This resolves crashes that occur during the “Prepare Disk” phase before database intents are logged, eliminating the need for arbitrary time-based file sweeping.
  • Windows Long Paths: Automatically prepend the \\?\ prefix for OS operations on Windows. Abstracted away from business logic.
  • POSIX DB Paths: Paths stored in the SQLite database strictly use forward slashes (/), enforced by a CHECK constraint.
  • Unified Logging: The system must utilize logging.handlers.RotatingFileHandler configured in chronology-engine/utils/logger.py, explicitly targeting the chronology/.chronology/logs/ directory.
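
The deterministic sweep above could look roughly like this. The function name sweep_orphaned_tmp is hypothetical, and pending_targets is assumed to be the set of target filenames (binary or sidecar) whose intent flag is currently 1, as produced by some query in storage/db.py.

```python
from pathlib import Path


def sweep_orphaned_tmp(archive_root, pending_targets):
    """Delete .tmp files that have no matching intent flag.

    Returns the list of removed paths so the caller can log them.
    """
    removed = []
    for tmp in sorted(Path(archive_root).rglob("*.tmp")):
        # Deterministic naming: the target is the temp name minus ".tmp".
        target_name = tmp.name[: -len(".tmp")]
        if target_name not in pending_targets:
            tmp.unlink()
            removed.append(tmp)
    return removed
```

Because temp names are derived from the target name, the sweep needs no timestamps or age thresholds: a .tmp file is either backed by an intent flag or it is garbage.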

7. Database & Schema Specifications

  • Centralized Connectivity: All PRAGMA execution, connections, and raw SQL are isolated to chronology-engine/storage/db.py.

  • Engine & Version: SQLite 3.37.0 or higher. Enforce PRAGMA foreign_keys = ON, PRAGMA journal_mode = WAL, and PRAGMA synchronous = NORMAL. All tables must explicitly enforce SQLite STRICT typing to prevent dynamic type corruption at the storage boundary. Table definitions must utilize CHECK (length(filename) <= 255). To maintain strict compatibility with SQLite 3.37.0, array length limits for tags must not be enforced via SQLite triggers. Boundary enforcement is exclusively delegated to upstream application-layer Dataclass validation.

  • FTS5 Indexing: A virtual table tracks filename, tags, and notes, synchronized purely via SQLite TRIGGERs. Application code must not manually insert into the FTS5 table.

  • Timezone & Metadata Handling: All OS-level hardware timestamps (file mtime, file ctime, system ingest time) must be forcefully normalized to UTC before database insertion or JSON creation. Logical timestamps derived from filenames remain naive.

  • JSON Schema Contract Updates: The schema must enforce ISO 8601 formatting, utilizing the Z suffix exclusively for fields normalized to UTC.

    {
      "schema_version": 1,
      "original_filename": "string",
      "archive_filename": "string",
      "logical_date": "YYYY-MM-DDTHH:MM:SS",       // Timezone-agnostic (derived from filename)
      "created_date": "YYYY-MM-DDTHH:MM:SSZ",      // UTC (derived from OS metadata)
      "modified_date": "YYYY-MM-DDTHH:MM:SSZ",     // UTC (derived from OS metadata)
      "ingest_date": "YYYY-MM-DDTHH:MM:SSZ",       // UTC (engine generation time)
      "file_extension": "string",
      "file_size_bytes": 0,
      "sha256": "string",
      "tags": ["array", "of", "strings"],
      "note": "string"
    }

8. Disaster Recovery & Repair Mechanics

  • Centralized Exceptions: Standard library exceptions must be caught at domain boundaries and re-raised via a centralized hierarchy (e.g., ArchiveError, LockCollisionError, InvalidExtensionError) defined in chronology-engine/utils/exceptions.py. The concept of locking timeouts is explicitly out of scope.
  • Batch Memory Management: Bulk database insertion must support batch chunking (e.g., 10,000 records) to prevent memory overflows during recovery.
  • Roll-Forward Recovery & Quarantine: Upon acquiring the global archive.lock and prior to executing the primary logic of any requested operation (both mutating and read-only), the engine must immediately scan and resolve any DB records where pending_bin_sync = 1 or pending_json_sync = 1.
    • Binary Resolution: Locate .tmp and replace. If missing, hash the physical file. If it matches, the disk commit succeeded; update flag to 0. If the hash mismatches, declare irrecoverable payload loss, move the corrupted file to chronology/quarantine/, update the database note field to reflect quarantine status, and clear the intent flag.
  • JSON Resolution & Orphan Reconstruction: Locate .tmp and replace. If missing, the system must execute a strict three-tier state resolution hierarchy to prevent metadata loss:
    1. Primary (JSON Validation): Attempt to read and parse the existing JSON sidecar. If successful and schema-compliant, the file state is resolved.
    2. Secondary Fallback (DB to JSON Rebuild): If the JSON sidecar is missing, corrupted, or fails schema parsing, cleanly reconstruct the JSON sidecar on disk utilizing the existing database record.
    3. Tertiary Orphan Reconstruction (Binary to JSON/DB): If both the JSON sidecar is unreadable and the database record is missing, the file is declared a true orphan. The system must securely reconstruct both the database record and JSON sidecar by extracting the logical date from the binary filename and falling back to the physical binary’s UTC-normalized mtime for creation/modification dates.
    • Logging Requirement: Any file repaired via Secondary or Tertiary fallback mechanisms must trigger a centralized warning log containing the timestamp, filename, and the specific fallback tier invoked.
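
For the tertiary tier, the sidecar fields can be derived from the filename and the binary's mtime alone. The following is a sketch under several assumptions: the function names and logger name are hypothetical, the original filename is assumed equal to the archive filename, the ingest date falls back to the current time, and tags are left empty. JSON text is returned rather than a raw dictionary.

```python
import json
import logging
import os
import time
from datetime import datetime, timezone

log = logging.getLogger("chronology")  # hypothetical logger name


def utc_iso(epoch_seconds):
    """Format a POSIX timestamp as ISO 8601 UTC with the Z suffix."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def rebuild_orphan_sidecar(binary_path, sha256_hex):
    """Build sidecar JSON for a binary with no sidecar and no DB record."""
    name = os.path.basename(binary_path)
    info = os.stat(binary_path)
    # The logical date stays naive: it comes from the 14-digit prefix.
    logical = datetime.strptime(name[:14], "%Y%m%d%H%M%S")
    mtime = utc_iso(info.st_mtime)
    sidecar = {
        "schema_version": 1,
        "original_filename": name,  # assumption: the original name is lost
        "archive_filename": name,
        "logical_date": logical.isoformat(timespec="seconds"),
        "created_date": mtime,      # mtime fallback, as the spec requires
        "modified_date": mtime,
        "ingest_date": utc_iso(time.time()),  # assumption: time of repair
        "file_extension": name.partition(".")[2],
        "file_size_bytes": info.st_size,
        "sha256": sha256_hex,
        "tags": [],
        "note": "",
    }
    log.warning("tertiary fallback: rebuilt metadata for %s", name)
    return json.dumps(sidecar, indent=2)
```

The log handler's formatter would supply the timestamp required by the logging rule, so the message itself only carries the filename and tier.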

9. Directory Tree & Script Locations

chronology-engine/
├── main.py                 # Presentation Layer (CLI)
├── core/
│   ├── __init__.py
│   ├── ingest.py           # Ingest Service
│   ├── audit.py            # Integrity Scrub Service
│   ├── metadata.py         # Note & Tag Services
│   ├── recovery.py         # DB Rebuild & Roll-Forward
│   └── search.py           # FTS5 Query Service
├── models/
│   ├── __init__.py
│   └── schema.py           # Immutable Dataclasses & Validation
├── storage/
│   ├── __init__.py
│   └── db.py               # SQLite Interface & Schema Init
├── system/
│   ├── __init__.py
│   ├── filesystem.py       # OS-Level I/O, Locking, Permissions
│   └── hashing.py          # Chunked SHA-256
└── utils/
    ├── __init__.py
    ├── exceptions.py       # Centralized Error Hierarchy
    └── logger.py           # RotatingFileHandler Setup
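
The Unidirectional Import Rule from section 1 could be checked mechanically against this tree. The sketch below uses the standard ast module; the layer ranking is an assumption (in particular, placing utils/ at the bottom is my guess, since the spec does not rank it), and find_violations is a hypothetical name.

```python
import ast
from pathlib import Path

# Assumed ranking, lowest layer first. Interfaces share one rank.
LAYER_RANK = {"models": 0, "utils": 0, "storage": 1, "system": 1,
              "core": 2, "main": 3}


def imported_layers(source):
    """Return the project layers named in absolute imports of a module."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            found.add(node.module.split(".")[0])
    return {name for name in found if name in LAYER_RANK}


def find_violations(root):
    """Return (file, imported_layer) pairs that import upward."""
    root = Path(root)
    violations = []
    for path in sorted(root.rglob("*.py")):
        relative = path.relative_to(root)
        layer = relative.parts[0].removesuffix(".py")
        if layer not in LAYER_RANK:
            continue
        for target in sorted(imported_layers(path.read_text(encoding="utf-8"))):
            if target == layer:
                continue  # imports within the same package are fine
            # Models import no other layer; others may only look downward.
            if layer == "models" or LAYER_RANK[target] > LAYER_RANK[layer]:
                violations.append((relative.as_posix(), target))
    return violations
```

Run against chronology-engine/, an empty result would mean no module reaches upward, for example storage/db.py importing from core/.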