ADR-004: Scraper storage concurrency — writer-queue vs aiosqlite
- Status: Proposed (no implementation until explicitly approved)
- Date: 2026-04-21
- Note: Unified under
docs/adr/as ADR-004 because ADR-002 is reserved for storage layer. - Scope: Forensics scraper pipeline after incremental Phases A–C (narrow
db_locksections,asyncio.to_threadfor short SQL, parallel discovery/metadata where bounded, streaming dedup/export). This ADR addresses the next structural step if further throughput or simplicity is required.
Context
Section titled “Context”The scraper uses a single sqlite3 connection owned by Repository, with an asyncio.Lock (db_lock) serializing access from async tasks. Phases A–C reduce how much work runs under that lock and shrink memory for dedup/export, but the fundamental pattern remains: many concurrent producers (HTTP, parse, disk) funnel into one synchronous database API on one connection.
Two common ways to evolve this are:
- Writer-queue (producer/consumer): Fetch coroutines enqueue completed write intents; one writer task drains the queue and executes SQL (optionally in batched transactions).
aiosqlite: Replace directsqlite3usage inRepositorywith async database calls so coroutines await I/O instead of blocking the event loop or usingasyncio.to_threadper statement.
The project constraint (per planning) is to preserve Repository, the schema contract, resumability, and JSONL side-channels unless a future ADR explicitly changes them.
Candidate 1: Writer-queue pattern
Section titled “Candidate 1: Writer-queue pattern”Sketch
Section titled “Sketch”- Producers (
_fetch_one_article_html, metadata ingest, etc.) build immutable or copy-on-write “write commands” (e.g. upsert payload, or a small enum + fields). - They
await queue.put(cmd)and continue; a single writer task loopscmd = await queue.get(), runsrepomethods on the one connection, and may batch:BEGIN;… N upserts …COMMIT. - Shutdown uses a sentinel on the queue and
asyncio.gatherof the writer with producers so errors propagate deterministically.
- Preserves the
Repositorysurface and synchronous semantics for non-async callers (dedup, export, CLI paths that stay sync). - Removes lock contention from fetchers: producers rarely block each other on
db_lock; the queue absorbs bursts. - Natural batching: the writer can commit every k rows or every t ms, improving SQLite amortization without changing HTTP or parse code.
- Keeps
sqlite3: no new dependency, no wholesale rewrite ofrepository.py.
- Coordination layer: queue depth limits, backpressure, and cancellation must be designed (maxsize,
asyncio.wait_for, structured concurrency). - Error propagation: a failed batch must decide whether to retry, dead-letter, or fail the run; partial batches need clear semantics with resumability.
- Ordering: if any invariant depended on strict global ordering of writes, the queue must preserve per-entity ordering (e.g. one queue per shard, or sequence numbers per article id).
Candidate 2: aiosqlite swap
Section titled “Candidate 2: aiosqlite swap”Sketch
Section titled “Sketch”Repositorymethods becomeasync def(or a parallelAsyncRepository) usingaiosqlitewith a single connection.- Callers
await repo.upsert_article(...); the event loop is not blocked by SQLite in the same way as blockingsqlite3calls.
- Idiomatic async: fewer
to_threadwrappers; mental model aligns with asyncio-first scraper code. - Single dependency with a well-trodden path for async SQLite.
- Large diff:
repository.pyand every async caller must change; anything that assumed syncRepositoryneeds an async boundary or duplication. - Contract change: “Is
Repositorysync or async?” becomes a cross-cutting design decision for tests and CLI. - Throughput ceiling: one connection still serializes writes under the hood; moving to
aiosqlitemainly improves event-loop fairness and code style, not raw write parallelism, unless combined with batching or multiple connections (which would violate the current single-connection model without a new ADR).
Decision (recommendation)
Section titled “Decision (recommendation)”Recommend Candidate 1 (writer-queue) as the next structural step after Phases A–C land and prove stable.
Rationale: Writer-queue attacks the dominant remaining bottleneck—many small transactions and lock-shaped work—while keeping Repository synchronous and reusable. Batched commits are a direct lever on SQLite performance. An aiosqlite migration is a broad API churn for incremental ergonomic gain on a single connection; it does not by itself deliver the same win as a dedicated writer with explicit batching.
Non-goals for this ADR: Changing WordPress endpoints, retry policy, schema, raw/ layout, or removing db_lock until a follow-up design explicitly replaces it with queue semantics.
Consequences
Section titled “Consequences”- If approved: Implement writer-queue in a focused PR: define command types, writer loop, graceful shutdown, and tests for backpressure and failure isolation. Keep
Repositoryimplementation onsqlite3initially. - If deferred: Continue with Phases A–C only; revisit when profiling shows commit overhead or writer starvation despite smaller critical sections.
- If
aiosqliteis chosen instead: Expect a wider migration (all repository consumers), strong test coverage for async transaction boundaries, and explicit documentation of the async contract.
Implementation gate
Section titled “Implementation gate”Do not implement writer-queue, aiosqlite, or hybrid approaches based on this ADR until stakeholders explicitly approve the chosen candidate (expected: writer-queue per above).