ADR-007: Scraper / storage boundary (P2-ARCH-3)
- Status: Accepted (partial)
- Date: 2026-04-21
- Deciders: John Eakin
Context
The Code Review Report (P2-ARCH-3) recommends that crawler/scraper stages return in-memory structures and let an orchestration layer persist them to SQLite, instead of importing Repository directly from crawler.py / fetcher.py.
Decision
Hybrid. Scraper modules still perform persistence (they remain the implementation of the scrape stage), but:
- Repository is optional at the scrape API boundary: collect_article_metadata(..., repo=None) and fetch_articles(..., repo=None) accept an injected Repository instance. When repo is omitted, they open their own session (backwards compatible).
- CLI orchestration reuses one session: forensics.cli.scrape opens one session (with Repository(db_path) as repo:) and passes that repo into the metadata/fetch paths where a full pipeline step should commit atomically (e.g. discover+metadata, full scrape).
Pure in-memory scrape results with a separate persistence stage remain optional future work if a second storage backend is required.
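The optional-repo boundary described in this decision can be sketched as follows. This is a minimal, synchronous illustration, not the project's code: the Repository class, its upsert_article method, the articles table, and collect_metadata_sketch are all hypothetical stand-ins, and the real collect_article_metadata / fetch_articles signatures may differ.

```python
import sqlite3
from contextlib import nullcontext


class Repository:
    """Hypothetical stand-in for the project's Repository (not its real API)."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.conn = None

    def __enter__(self):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT)"
        )
        return self

    def __exit__(self, exc_type, exc, tb):
        self.conn.close()
        self.conn = None
        return False  # never swallow exceptions

    def upsert_article(self, url, title):
        # Assumed behaviour: each upsert commits on its own.
        self.conn.execute(
            "INSERT OR REPLACE INTO articles (url, title) VALUES (?, ?)",
            (url, title),
        )
        self.conn.commit()


def collect_metadata_sketch(items, db_path, repo=None):
    """Write (url, title) pairs, reusing an injected repo when one is given."""
    # Injected repo: nullcontext hands it back and does NOT close it on exit.
    # No repo: open our own session and close it when the block ends.
    ctx = nullcontext(repo) if repo is not None else Repository(db_path)
    with ctx as active:
        for url, title in items:
            active.upsert_article(url, title)
    return len(items)
```

Calling collect_metadata_sketch(items, db_path) exercises the default path, where the function owns its session. An orchestrator that wants one session across several steps would instead open with Repository(db_path) as repo: once and pass repo=repo to each step; the session then stays open until the orchestrator's own with block exits.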
Consequences
- Reduces connection churn for multi-step scrape flows without rewriting HTML/metadata parsers.
- Scraper modules still import Repository for the default path; new code should prefer passing repo= from orchestration when batching writes.
Contract for injected Repository
Callers that pass repo= into collect_article_metadata(...) or fetch_articles(...) must respect both halves of the contract:
- Session lifetime. The injected Repository must remain open (inside its with block) for the full duration of the async call. Closing the repo from another task while the fetch is in flight produces a RuntimeError from _require_conn(); callers, not the scraper, are responsible for that window.
- Partial-failure semantics. When repo=None (default), the scraper opens and closes its own session, and only work committed up to the last successful upsert_* call survives an exception. When an external repo is injected, rollback on partial failure is the caller’s responsibility: callers who require all-or-nothing semantics must wrap the call in an explicit transaction (or re-open a fresh session on retry). The hybrid scraper does not itself emit BEGIN/ROLLBACK.
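The session-lifetime rule can be illustrated with a small asyncio sketch. Everything here is hypothetical: FakeRepo only mimics the "raise RuntimeError when closed" behaviour attributed to _require_conn(), and fetch_sketch stands in for an async scrape call that writes through an injected repo.

```python
import asyncio


class FakeRepo:
    """Hypothetical repo that refuses writes once its with block has exited."""

    def __init__(self):
        self.is_open = False
        self.saved = []

    def __enter__(self):
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc, tb):
        self.is_open = False
        return False

    def save(self, item):
        if not self.is_open:
            raise RuntimeError("repository is closed")
        self.saved.append(item)


async def fetch_sketch(urls, repo):
    for url in urls:
        await asyncio.sleep(0)  # stand-in for network I/O
        repo.save(url)
    return len(urls)


async def good_caller(urls):
    with FakeRepo() as repo:
        # Awaiting inside the with block keeps the repo open for the whole call.
        return await fetch_sketch(urls, repo)


async def bad_caller(urls):
    with FakeRepo() as repo:
        task = asyncio.create_task(fetch_sketch(urls, repo))
    # The with block has already exited, so the task writes to a closed repo.
    return await task
```

good_caller completes normally, while bad_caller surfaces a RuntimeError when awaited, because the task only starts running after the repo has been closed. That is the window the contract assigns to the caller.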
These rules are enforceable by inspection: every call site that injects repo lives under forensics.cli.scrape or forensics.pipeline, and those call sites satisfy both rules today.
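For callers that need all-or-nothing behaviour, a caller-owned transaction might look like the sketch below. It uses the standard sqlite3 module directly rather than the project's Repository, and it assumes a hypothetical write step (write_batch) that never commits on its own. If the real upsert_* methods commit on every call, an outer BEGIN would be ended by the first of those commits, so this pattern would need adapting.

```python
import sqlite3


def write_batch(conn, rows):
    """Hypothetical scrape step: writes rows, never commits or rolls back."""
    for url, title in rows:
        if title is None:
            raise ValueError(f"missing title for {url}")
        conn.execute(
            "INSERT INTO articles (url, title) VALUES (?, ?)", (url, title)
        )


def run_all_or_nothing(conn, rows):
    """Caller-owned transaction: either every row lands or none does."""
    conn.execute("BEGIN")
    try:
        write_batch(conn, rows)
    except Exception:
        conn.execute("ROLLBACK")  # discard the partial batch
        return False
    conn.execute("COMMIT")
    return True


# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN / COMMIT / ROLLBACK statements above are the only ones issued.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE articles (url TEXT PRIMARY KEY, title TEXT)")
```

A batch containing a bad row leaves the table untouched, and the same call with valid rows commits them together. The alternative named in the contract, re-opening a fresh session on retry, avoids the explicit transaction but leaves any already-committed rows in place.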
Related
- Code Review Report: P2-ARCH-3
- ADR-005 (session-scoped Repository)
- ADR-006 (CLI dispatch patterns)