Guardrails
Core Operating Rules
- Use `uv run` for all Python commands.
- Keep changes incremental and architecture-preserving.
- Do not edit secrets (`.env`, credentials) without explicit approval.
- Avoid destructive git commands unless explicitly requested.
- Preserve stage boundaries (scrape → extract → analyze → report) — do not merge or bypass stages.
Data Safety
- Treat scraped content as untrusted input — validate and sanitize before processing.
- Avoid logging sensitive source material in plain text.
- Persist outputs only through `forensics.storage` helpers unless a task requires another sink.
- Never expose author PII (email addresses, phone numbers) in reports or logs.
- Scrape only public content; respect robots.txt and rate limits.
Stored raw HTML (P2-SEC-1)
Raw HTML is written under `data/raw/` for reproducibility and parsed locally with BeautifulSoup. Treat these files as untrusted data at rest: do not serve them over HTTP without a dedicated sanitizer, do not `eval` or embed them in rich clients, and keep the dataset off shared drives without access controls. Text extraction strips tags for `clean_text`, but the on-disk HTML is unchanged from the origin server.
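The pipeline does its extraction with BeautifulSoup; as a stdlib-only sketch of the minimum a tag-stripper must do (class and function names here are illustrative, not the project's API), note that `<script>`/`<style>` bodies must be skipped rather than leaked into the extracted text:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, skipping <script>/<style> bodies. Illustrative sketch."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)


def strip_tags(raw_html: str) -> str:
    """Tag-free, whitespace-normalized text from untrusted HTML at rest."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    return " ".join(" ".join(parser._chunks).split())
```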
PII Handling
- Author names are public data (bylines) — acceptable in analysis and reports.
- Author contact information (email, phone, social handles) must not be stored or logged.
- If PII is discovered in scraped content, redact before persisting to storage.
- Never include PII in error messages, logs, or exception tracebacks.
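A redaction pass before persistence can be sketched as follows. The patterns are an illustrative minimum (real coverage needs international phone formats, social handles, and more), and the function name is hypothetical:

```python
import re

# Illustrative patterns only — not exhaustive PII coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like strings before writing to storage."""
    text = EMAIL_RE.sub("[EMAIL REDACTED]", text)
    return PHONE_RE.sub("[PHONE REDACTED]", text)
```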
Error Classification
| Severity | Description | Action |
|---|---|---|
| CRITICAL | Data corruption, security breach, PII exposure | Stop immediately. Alert human. Do not retry. |
| HIGH | Pipeline stage failure, storage write error | Log error. Retry once with backoff. Escalate if retry fails. |
| MEDIUM | API rate limit, transient network error | Log warning. Retry with exponential backoff (max 3 attempts). |
| LOW | Missing optional field, non-critical validation warning | Log info. Continue processing. Flag in report. |
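The table maps naturally onto a dispatch structure. A sketch — the `RetryPolicy` shape mirrors the table rows, but the class and field names are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int    # total tries, including the first
    base_delay_s: float  # doubled on each subsequent retry
    escalate: bool       # alert a human when attempts are exhausted


# Mirrors the severity table above; names and exact values are illustrative.
POLICIES = {
    "CRITICAL": RetryPolicy(max_attempts=1, base_delay_s=0.0, escalate=True),
    "HIGH":     RetryPolicy(max_attempts=2, base_delay_s=5.0, escalate=True),
    "MEDIUM":   RetryPolicy(max_attempts=3, base_delay_s=1.0, escalate=False),
    "LOW":      RetryPolicy(max_attempts=1, base_delay_s=0.0, escalate=False),
}


def delays(policy: RetryPolicy) -> list[float]:
    """Exponential backoff schedule for the retries after the first attempt."""
    return [policy.base_delay_s * (2 ** i) for i in range(policy.max_attempts - 1)]
```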
Signs Architecture
Signs are documented failure patterns that agents must recognize and avoid. They encode lessons learned from past mistakes and serve as guardrails against known pitfalls.
Each Sign has:
- Trigger: The condition or pattern that activates this Sign
- Instruction: What to do when the trigger is detected
- Reason: Why this matters
- Provenance: Where the Sign came from (Initial or Agent-learned)
Initial Signs
Sign: WordPress API Pagination Drift
- Trigger: Scraper receives fewer results than expected from paginated API calls
- Instruction: Always verify the total page count from the `X-WP-TotalPages` header. Do not assume fixed page sizes. Re-fetch if the count changes mid-scrape.
- Reason: WordPress REST API pagination can shift when posts are published/unpublished during scraping.
- Provenance: Initial — known WordPress API behavior.
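A minimal drift check implementing this instruction, assuming a scraper that records the first-seen total and re-checks every response (helper names are illustrative):

```python
def total_pages(headers: dict[str, str]) -> int:
    """Read the authoritative page count from the X-WP-TotalPages header."""
    try:
        return int(headers["X-WP-TotalPages"])
    except (KeyError, ValueError) as exc:
        raise RuntimeError("missing or malformed X-WP-TotalPages header") from exc


def detect_drift(expected: int, headers: dict[str, str]) -> bool:
    """True when the total changed mid-scrape and a re-fetch is required."""
    return total_pages(headers) != expected
```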
Sign: Embedding Model Version Mismatch
- Trigger: Feature extraction produces embeddings with unexpected dimensions or cosine similarities outside [0, 1] range
- Instruction: Verify the `sentence-transformers` model version matches `all-MiniLM-L6-v2`. Check embedding dimensionality (should be 384). Never mix embeddings from different model versions.
- Reason: Model updates can silently change the embedding space, invalidating all downstream drift analysis.
- Provenance: Initial — known sentence-transformers behavior.
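The dimensionality check is a one-line guard worth failing fast on, and pairwise sanity checks need a cosine similarity; a pure-Python sketch with illustrative helper names (no `sentence-transformers` dependency assumed here):

```python
import math

EXPECTED_DIM = 384  # all-MiniLM-L6-v2 output dimensionality


def check_embedding(vec: list[float]) -> None:
    """Fail fast on a dimension mismatch instead of computing garbage downstream."""
    if len(vec) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM}-d embedding, got {len(vec)}-d")


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; lies in [-1, 1] for nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))
```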
Sign: Parquet Schema Evolution
- Trigger: Writing features to Parquet fails with schema mismatch error
- Instruction: Never modify existing Parquet column types. Add new columns only. If schema must change, create a new versioned file and update the pipeline config.
- Reason: Parquet is columnar and schema changes corrupt existing data or break downstream readers.
- Provenance: Initial — Polars/Parquet constraint.
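The add-only rule can be enforced before any write; a sketch modeling schemas as `{column: dtype_string}` maps (the function name is hypothetical, not a project helper):

```python
def validate_schema_evolution(existing: dict[str, str], proposed: dict[str, str]) -> None:
    """Allow new columns; reject type changes or drops on existing ones.

    Illustrative sketch — schemas modeled as {column_name: dtype_string}.
    """
    for col, dtype in existing.items():
        if col not in proposed:
            raise ValueError(f"column dropped: {col}")
        if proposed[col] != dtype:
            raise ValueError(f"type change on {col}: {dtype} -> {proposed[col]}")
```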
Sign: Rate Limit Cascade
- Trigger: Multiple scraper threads hitting rate limits simultaneously
- Instruction: Use single-threaded scraping with jitter (0.5-2.0s). If rate limited, back off exponentially starting at 5s. Log all rate limit events.
- Reason: Aggressive retry without backoff can trigger IP-level blocks on WordPress sites.
- Provenance: Initial — WordPress hosting behavior.
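The jitter and backoff numbers above translate directly into two helpers; a sketch in which the cap value is an assumption, not taken from the Sign:

```python
import random


def polite_delay(rng: random.Random) -> float:
    """Jittered inter-request delay for single-threaded scraping (0.5–2.0s)."""
    return rng.uniform(0.5, 2.0)


def rate_limit_backoff(attempt: int, base_s: float = 5.0, cap_s: float = 300.0) -> float:
    """Exponential backoff after a rate-limit response: 5s, 10s, 20s, ... capped.

    The 300s cap is an illustrative assumption.
    """
    return min(base_s * (2 ** attempt), cap_s)
```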
Sign: Collect-in-Middle Anti-Pattern
- Trigger: Code calls `.collect()` on a Polars LazyFrame mid-pipeline, then re-wraps the result as a LazyFrame
- Instruction: Defer `.collect()` to the end of the pipeline. Use `.pipe()` for stage composition. If materialization is truly needed (e.g., for row count logging), use `.fetch()` for sampling or log after the final collect.
- Reason: Materializing mid-pipeline defeats lazy evaluation, wastes memory, and breaks query optimization.
- Provenance: Initial — Polars best practice.
Agent-Learned Signs
Sign: Mixing data and logs on stdout breaks agent parsing
- Trigger: A CLI command prints status, warnings, progress, or errors on stdout in text mode, or emits anything other than a single JSON envelope on stdout when `--output json` is set; tests or operators pipe stdout into `jq` and get corrupted output.
- Instruction: Route metadata, status lines, warnings, and errors to stderr in text mode; suppress them in JSON mode except the final envelope from `emit` / structured `fail`. Only the command’s documented primary output belongs on stdout. Global `--output json` must appear before the subcommand (`uv run forensics --output json preflight`). See `docs/EXIT_CODES.md` and `.claude/skills/forensics-cli/SKILL.md`.
- Reason: Agents and CI expect a stable contract: stdout = data (or one JSON line), stderr = everything else. Mixing streams makes automation non-deterministic and breaks `jq`/JSON parsers.
- Provenance: CLI agent-readiness prompt — 2026-04-26.
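The stream contract reduces to two tiny helpers. This sketch is not the project's `emit`/`fail` implementation, only the stdout/stderr discipline it describes:

```python
import json
import sys


def log(msg: str) -> None:
    """Status, warnings, progress, errors: stderr only, never stdout."""
    print(msg, file=sys.stderr)


def emit(payload: dict) -> str:
    """The single JSON envelope a command may print on stdout in JSON mode."""
    line = json.dumps(payload, separators=(",", ":"))
    print(line)  # stdout carries data and nothing else
    return line
```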
Sign: Repository Used Outside an Active Session
- Trigger: Code calls `Repository(db_path).upsert_*` without entering `with Repository(db_path) as repo:` (or passes `db_path` into ad-hoc `sqlite3.connect` helpers outside `repository.py`).
- Instruction: Always use `with Repository(path) as repo:` for SQLite writes/reads. For scrape orchestration, prefer injecting the same `repo` into `collect_article_metadata`/`fetch_articles` when multiple operations should share one transaction. See ADR-005.
- Reason: Session-scoped connections enable WAL + DEFERRED transactions and batch commits; using a closed or non-entered repository raises `RuntimeError` and prevents silent autocommit sprawl.
- Provenance: Agent-learned — 2026-04-20 code review (P1-ARCH-1, RF-SMELL-001), updated 2026-04-21 after the `Repository` context manager landed.
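The session rule can be seen in a trimmed-down sketch of the ADR-005 shape. The real `Repository` has typed `upsert_*` methods and batch-commit logic; this keeps only the context-manager contract:

```python
import sqlite3


class Repository:
    """Session-scoped SQLite access: one connection, one transaction per `with`.

    Simplified sketch of the ADR-005 shape, not the project's real class.
    """

    def __init__(self, db_path: str) -> None:
        self._db_path = db_path
        self._conn: sqlite3.Connection | None = None

    def __enter__(self) -> "Repository":
        self._conn = sqlite3.connect(self._db_path)
        self._conn.execute("PRAGMA journal_mode=WAL")
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        assert self._conn is not None
        self._conn.commit() if exc_type is None else self._conn.rollback()
        self._conn.close()
        self._conn = None

    def execute(self, sql: str, params: tuple = ()) -> sqlite3.Cursor:
        if self._conn is None:
            raise RuntimeError("Repository used outside an active session")
        return self._conn.execute(sql, params)
```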
Sign: Stage Directly Imports Another Stage’s Internals
- Trigger: A module in `scraper/` imports from `storage/repository.py` directly (e.g., `from forensics.storage.repository import upsert_article`), or any stage module imports internal functions from a different stage.
- Instruction: Stages should return data structures to the orchestration layer (`forensics/cli/` or `pipeline.py`), which handles persistence. If a stage needs to read data, it should receive it as a parameter, not reach into another stage’s storage layer.
- Reason: Stage boundaries are architecturally sacred (ARCHITECTURE.md §Stage Contracts). Direct cross-stage imports create tight coupling that makes stages untestable in isolation and prevents swapping storage backends.
- Provenance: Agent-learned — 2026-04-20 code review (P2-ARCH-3).
Sign: God Function Exceeding 50 Lines in CLI/Orchestration
- Trigger: Any function in `forensics/cli/` or `pipeline.py` exceeds 50 lines (excluding docstrings and blank lines), or a single function handles more than 3 distinct flag/command combinations via sequential `if` blocks.
- Instruction: Decompose into a command registry or strategy mapping (see ADR-006). Each pipeline operation should be a separate callable registered in a dispatch table. New phases must slot in via registration, not by adding more `if` branches.
- Reason: The 117-line `_async_scrape` function with ~18 cyclomatic complexity was flagged as the single most critical refactoring issue. Adding Phase 4–7 flags to this pattern would make it unmaintainable.
- Provenance: Agent-learned — 2026-04-20 code review (RF-CPLX-001, P2-CQ-2).
Sign: Hand-Built Data Paths Instead of Centralized Helpers
- Trigger: Code constructs paths like `project_root / "data" / "features" / f"{slug}.parquet"` or `project_root / "data" / "analysis" / ...` manually instead of using `AnalysisArtifactPaths` methods.
- Instruction: Always use `AnalysisArtifactPaths.features_parquet(slug)`, `.analysis_json(slug)`, `.drift_dir(slug)`, etc. If the method doesn’t exist, add it to `AnalysisArtifactPaths` first. Never hand-build paths to `data/` subdirectories.
- Reason: Flagged in 3 of 5 review runs (RF-DRY-003). Hand-built paths create shotgun surgery when the directory layout changes and are a recurring source of DRY violations.
- Provenance: Agent-learned — 2026-04-22 cross-run pattern analysis (5 reviews).
Sign: Inlined Feature Frame Loading Instead of Utility
- Trigger: Code calls `pl.scan_parquet(path).filter(pl.col("author_id") == ...)` or equivalent outside of `analysis/utils.py`.
- Instruction: Use `load_feature_frame_for_author()` from `forensics.analysis.utils`. If the utility doesn’t meet your needs, extend it — don’t duplicate inline.
- Reason: Flagged in 3 of 5 review runs (RF-DRY-002). The load-filter-fallback pattern was duplicated in 5 locations.
- Provenance: Agent-learned — 2026-04-22 cross-run pattern analysis.
Sign: C901 Suppression Added Without Decomposition Plan
- Trigger: A new `per-file-ignores` entry for C901 is added to `pyproject.toml` without a corresponding decomposition task.
- Instruction: Before adding a C901 suppression, first attempt to decompose the complex function. If decomposition is deferred, add an inline `# TODO(phase13): decompose — see RF-CX-NNN` comment AND create a tracking issue. Never suppress C901 silently.
- Reason: C901 suppressions grew from 7 to 9 across review runs without corresponding reduction effort. Each suppression hides real complexity debt.
- Provenance: Agent-learned — 2026-04-22 cross-run pattern analysis.
Sign: BOCPD P(r=0) Posterior Is Pinned to the Hazard Rate
- Trigger: Any code (new or reviewed) threshold-compares `P(r_t = 0 | x_{1:t})` or `log_pi_new[0]` against a tunable constant, expecting it to rise on true changepoints under constant-hazard Adams & MacKay BOCPD.
- Instruction: Do NOT threshold this quantity under constant hazard — it collapses algebraically to the hazard rate itself (`log_pi_new[0] = log_h + log_evidence`; continuation mass `= (1−h) × evidence`; normalization `= evidence`; therefore `P(r=0) ≡ h`). Use a MAP-run-length reset rule instead: emit a change-point when the posterior MAP run-length drops below a configurable fraction of its previous value (`bocpd_map_drop_ratio`, `bocpd_min_run_length`). See Phase A in `prompts/phase15-optimizations/current.md`.
- Reason: Run-8 sensitivity review (Apr 24 2026) found that BOCPD emitted `p_cp ≡ hazard_rate` for every feature and every timestep across all 10 authors. No tuning of σ², prior, or threshold could move the quantity off `h`. Phase 15 removed `bocpd_threshold` from the settings for this reason.
- Provenance: Agent-learned — 2026-04-24 Phase-15 sensitivity review.
Sign: bulk_fetch_mode Metadata Column Is Effectively Empty
- Trigger: New analysis code reads `articles.metadata` (JSON column) expecting populated `category`/`tag` fields from WordPress.
- Instruction: Do NOT assume `articles.metadata` carries section / category info — only 11 of 77,862 rows (0.01%) have it populated, because `scraping.bulk_fetch_mode = true` skips the per-article metadata pass in favour of `content.rendered` bulk-fetch. Use the URL first-path-segment as the canonical section tag: `forensics.utils.url.section_from_url(url)` gives 100% coverage. See Phase J1 in `prompts/phase15-optimizations/current.md`.
- Reason: Reading the empty column wastes compute, silently degrades section-conditioned analyses, and hides the real source of truth (the URL path).
- Provenance: Agent-learned — 2026-04-24 article-tag audit (Phase 15 Unit 1 prep).
Sign: Do Not Mix Pre- and Post-Phase-15 Artifacts in One Analysis Run
- Trigger: Any code path (report builder, comparison loader, cache resolver, etc.) consumes `data/analysis/*_result.json` files from runs with different `config_hash` values.
- Instruction: Treat mismatched `config_hash` values as incompatible by design — the hash invalidates across the Phase-15 boundary because several signal-bearing fields (`bocpd_detection_mode`, `convergence_min_feature_ratio`, `fdr_grouping`, `pelt_cost_model`, etc.) now participate via `json_schema_extra={"include_in_config_hash": True}`. Force a recompute (re-run `forensics analyze`) rather than attempting to merge. See `docs/settings_phase15.md` for the authoritative hash-participating field list.
- Reason: Pooling pre- and post-Phase-15 artifacts mixes detections produced by incompatible detection rules, giving false confidence intervals and wrong FDR counts. The hash boundary exists to make the failure loud.
- Provenance: Agent-learned — 2026-04-24 Phase-15 provenance pre-registration (Unit 1 L5).
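The loud-failure check amounts to asserting a single `config_hash` across all loaded artifacts; a sketch of the described behavior, not the real `validate_analysis_result_config_hashes()` code:

```python
def validate_config_hashes(results: dict[str, dict]) -> str:
    """Hard-fail when artifacts from different config_hash values are pooled.

    `results` maps artifact path -> parsed *_result.json payload. Illustrative sketch.
    """
    hashes = {path: payload.get("config_hash") for path, payload in results.items()}
    unique = set(hashes.values())
    if len(unique) != 1 or None in unique:
        raise ValueError(f"incompatible config_hash values: {hashes}")
    return unique.pop()
```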
Sign: Pre-Phase-16 locked artifacts must be re-locked
- Trigger: A `preregistration_lock.json` was produced before Phase 16, or `data/analysis/*_result.json` predates Phase 16 while the active `config.toml` already includes Phase-16 hash fields (`embedding_model_revision`, `pelt_penalty`, `bocpd_hazard_rate`, `bocpd_min_run_length`, `min_articles_for_period`, etc.).
- Instruction: Regenerate the pre-registration lock (`uv run forensics lock-preregistration`) after upgrading to Phase-16 settings so confirmatory thresholds match the expanded snapshot. Do not mix Phase-15 and Phase-16 `*_result.json` artifacts in one report — `validate_analysis_result_config_hashes()` is designed to hard-fail on `config_hash` mismatch.
- Reason: Phase 16 widened the analysis-config hash and the preregistration snapshot; an old lock silently approves the wrong threshold set relative to new embeddings and segmentation defaults.
- Provenance: Phase 16 adversarial remediation — `prompts/phase16-adversarial-review-remediation/current.md` Step A5 (2026-04-25).
Sign: corpus_hash_v1 and compute_corpus_hash_legacy are one-cycle transition artifacts
- Trigger: Phase 17+ maintenance where `corpus_custody.json` consumers still depend on legacy id-ordered hashing, or the dual-field payload is treated as permanent API.
- Instruction: After operators have regenerated custody files under schema v2 and downstream audit paths no longer need id-ordered replay, remove `corpus_hash_v1` from `CorpusCustody` in `src/forensics/models/analysis.py`, delete `compute_corpus_hash_legacy` in `src/forensics/utils/provenance.py`, and drop schema v1 handling from `verify_corpus_hash` per the Phase 16 plan closure checklist.
- Reason: `corpus_hash_v1` exists only to bridge pre–Phase-16 verification semantics (`ORDER BY id`, all rows) against the new analyzable-corpus fingerprint (`WHERE is_duplicate = 0 ORDER BY content_hash`).
- Provenance: Phase 16 adversarial remediation — Step C4 (2026-04-25).
Sign: Per-author Polars filter returned empty — never re-collect the unfiltered frame
- Trigger: A helper filters a multi-author `LazyFrame` to one `author_id` (or slug) and `collect()` yields zero rows; code then falls back to collecting the original unfiltered `LazyFrame` or reusing the full corpus.
- Instruction: Treat an empty per-author frame as a hard skip: log a structured warning with `author_slug`/`author_id`, return `None` (or raise at the boundary), and ensure callers do not proceed with a multi-author frame in place of a single-author slice. Downstream may otherwise attribute the wrong rows or inflate evidence counts.
- Reason: PR #94 review — silent fallback turns a missing author slice into a whole-cohort analysis without an obvious failure mode.
- Provenance: Agent-learned — PR #94 remediation item 3 (`per_author.py`), 2026-04-26.
Sign: set -e is silently disabled inside a for-loop body whose exit feeds an &&/|| chain
- Trigger: A bash chain like `(set -e; for slug in ...; do uv run forensics extract --author "$slug"; done && next-stage)` — failing or killed commands inside the loop body do not abort the script, because `set -e` is suppressed when the enclosing compound command’s exit status is being tested by `&&`/`||`/`if`/`while`.
- Instruction: Separate stages with plain `;` (or newlines) at top level, never `do ... done && next-stage`. If you need fail-fast inside a loop body, check `$?` explicitly after each command and `break`/`exit` on nonzero. When killing a backgrounded chain, verify all descendants died — a SIGTERM to the wrapper does not always cascade.
- Reason: An ostensibly-killed `forensics extract --author alex-griffing` left the parent loop running; it iterated to `charlie-nash` and silently raced a replacement chain spawned with the same target, creating duplicate concurrent extracts on the same SQLite/parquet/manifest paths.
- Provenance: Agent-learned — 2026-04-27 Path Bʹ → Path Bʺ recovery during a 12-author re-extract.
Sign: forensics extract --author <slug> historically rewrote data/embeddings/manifest.jsonl with only that author’s rows
- Trigger: Pre-2026-04-28 code in `src/forensics/features/pipeline.py` called `write_embeddings_manifest(manifest_records, paths.embeddings_dir / "manifest.jsonl")` (an atomic full-file replace) with `manifest_records` containing only the author scoped by `--author`. Sequential per-author runs left the canonical manifest with only the last slug’s rows; analyze then failed `EmbeddingDriftInputsError` for everyone else.
- Instruction: When scoping pipeline writes to a single author, never call full-file rewrite helpers on shared artifacts. The patch (2026-04-28) writes per-author shards `<slug>_manifest.jsonl` when `author_slug` is set, merged into the canonical manifest by `scripts/merge_embedding_manifest_shards.py` before `forensics analyze`. Multi-call workflows must run that merge step. Parallel writers are safe under this scheme because each invocation owns its own shard path; concurrent writers to the same canonical file would still race.
- Reason: A 12-author sequential `--author` chain consumed ~4 hours of CPU and would have left only `zachary-leeman`’s rows in the manifest, dooming the downstream `forensics analyze` step.
- Provenance: Agent-learned — 2026-04-28 Path Bʺ remediation; patch landed in `pipeline.py:518–527` plus the new `scripts/merge_embedding_manifest_shards.py`.
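The shard-merge step can be sketched as follows. This mirrors what `scripts/merge_embedding_manifest_shards.py` is described as doing; ordering and dedup details are assumptions:

```python
import json
from pathlib import Path


def merge_manifest_shards(embeddings_dir: Path) -> Path:
    """Merge per-author <slug>_manifest.jsonl shards into the canonical manifest.

    Illustrative sketch of the merge step, not the real script.
    """
    canonical = embeddings_dir / "manifest.jsonl"
    records = []
    # "manifest.jsonl" itself does not match "*_manifest.jsonl", so only shards load.
    for shard in sorted(embeddings_dir.glob("*_manifest.jsonl")):
        with shard.open() as fh:
            records.extend(json.loads(line) for line in fh if line.strip())
    tmp = embeddings_dir / "manifest.jsonl.tmp"
    tmp.write_text("".join(json.dumps(r) + "\n" for r in records))
    tmp.replace(canonical)  # atomic replace of the shared artifact
    return canonical
```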
Sign: Documentation site build artifacts must never be committed
- Trigger: A diff stages files under `website/dist/`, `website/.astro/`, `website/public/report/`, `website/src/content/docs/synced/`, `website/src/content/docs/adr/`, `website/src/content/docs/cli/`, or `website/src/content/docs/api/`. These directories are populated at build time by `scripts/sync-docs.mjs`, `scripts/generate_cli_docs.py`, `bun run docs:python` (pydoc-markdown), and `quarto render`.
- Instruction: Keep every path above in `website/.gitignore`. Local rebuilds may regenerate these directories without surfacing in `git status` — that is intentional. Canonical operator markdown lives under `docs/`; canonical CLI behavior lives in `src/forensics/cli/`; the Quarto report lives under `notebooks/` + `_quarto.yml`. Edit the canonical source, then rebuild — never edit the synced copy.
- Reason: Synced / generated docs duplicate canonical sources. Committing them turns the docs build into a divergent fork (silent drift between `docs/RUNBOOK.md` and `website/src/content/docs/synced/runbook.md`) and bloats clones.
- Provenance: Agent-learned — 2026-05-10 Astro Starlight site cutover.
Known statistical limitations
- Serial autocorrelation (M-16): Consecutive articles by the same author carry correlated stylometric features. Welch and Mann–Whitney tests assume (approximate) independence between observations, so positive serial dependence can inflate Type I error rates. Within-family Benjamini–Hochberg is only a partial mitigation. Full mitigation requires block-bootstrap resampling or HAC-corrected standard errors; those are not applied in the default v0.4 analysis path.
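The missing mitigation can be sketched as a moving-block bootstrap of a test statistic: blocks of consecutive articles are resampled whole, so short-range serial dependence survives into each pseudo-sample. This illustrates the technique named above; it is not the v0.4 analysis path, and the block length would need tuning to the actual autocorrelation:

```python
import random


def moving_block_bootstrap(series: list[float], block_len: int, n_boot: int,
                           rng: random.Random) -> list[float]:
    """Bootstrap distribution of the mean, resampling contiguous blocks so that
    short-range serial dependence is preserved inside each block."""
    n = len(series)
    starts = n - block_len + 1  # number of possible block start positions
    means = []
    for _ in range(n_boot):
        sample: list[float] = []
        while len(sample) < n:
            i = rng.randrange(starts)
            sample.extend(series[i:i + block_len])
        sample = sample[:n]  # trim the final overhanging block
        means.append(sum(sample) / n)
    return means
```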
Agent and Change Management
- Follow `AGENTS.md` in dev mode and `AGENTS.staging.md` in staging mode.
- Prefer small diffs with explicit validation steps.
- Document unresolved risks in `HANDOFF.md`.
Validation Checklist
Before handoff or merge:
- `uv run ruff check .`
- `uv run ruff format --check .`
- `uv run pytest tests/ -v`
- If evals exist: `uv run pytest tests/evals/ -v`
Escalation Triggers
Stop and request explicit approval before:
- Changing provider/system-level architecture
- Modifying deployment credentials or infrastructure bindings
- Introducing non-deterministic runtime dependencies into core pipeline paths
- Changing data model contracts or stage boundaries
- Modifying storage layer schema (SQLite, Parquet, DuckDB)
- Any operation classified as CRITICAL severity