Testing Strategy
- Keep pipeline behavior deterministic.
- Validate CLI surface and stage orchestration.
- Catch regressions in output artifacts under `data/`.
- Property-test edge cases in feature extraction and data validation.
- Benchmark performance of compute-heavy stages.
Test Layout
- `tests/unit/` — stage logic, model validation, utility functions
- `tests/integration/` — CLI command coverage, pipeline wiring, storage round-trips
- `tests/evals/` — capability/regression eval scenarios
- `tests/fixtures/` — sample data files (CSV, JSON, Parquet) for reproducible tests
Standard Commands
```bash
# Run all tests
uv run pytest tests/ -v

# Run by category
uv run pytest tests/unit -v
uv run pytest tests/integration -v
uv run pytest tests/evals/ -v

# Run specific test
uv run pytest -k "test_feature_extraction" -v

# With coverage (uses [tool.pytest.ini_options] addopts: --cov=forensics, fail_under)
# Install dev + tui extras so Textual-backed modules count toward the gate:
#   uv sync --extra dev --extra tui
uv run pytest tests/ -v

# Property-based testing with statistics
uv run pytest tests/ -v --hypothesis-show-statistics

# Stop on first failure (fast feedback)
uv run pytest tests/unit -x
```

Quality Gates
- Lint must pass: `uv run ruff check .`
- Format check must pass: `uv run ruff format --check .`
- Test suite must pass before merging.
- Coverage target: 72% (enforced in `pyproject.toml` via `fail_under = 72` on the `forensics` package). Raise this threshold only when the omitted modules are brought under test.
Deslop and hygiene PR checklist
Use this when trimming AI-generated cruft (comments, nesting, defensive noise) or tightening tests—correctness and contracts beat cleanliness (see repository deslop guidance: slice PRs, preserve stage boundaries, avoid silent behavior changes).
- Diff-first: Scope edits to churn or explicitly flagged modules; prefer one mergeable slice (directory or theme) with a green targeted test run before widening scope.
- Comments: Remove redundant restatements of Typer/Pydantic mechanics or duplicate config prose; keep anything needed for forensic traceability, preregistration, or non-obvious invariants, or replace it with a short factual line.
- Typing: Do not add new `# type: ignore` (or `type: ignore[...]`) in `src/` or `tests/` on touched lines without an ADR-tracked exception — CI fails on new ignores in pull requests (see below). Prefer narrowing types, `Protocol`, or a one-line `noqa` with rationale where Ruff applies.
- Exceptions: Narrow `except` to expected types only where tests cover failure modes (see the sketch after this list); do not broaden handlers for readability alone.
- Tests: Do not weaken assertions, shrink parametrization, or replace precise checks with “smoke only” to land a deslop PR — refactor fixtures and constants for clarity without changing the assertion surface.
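A minimal sketch of the narrowing rule; `parse_payload` and `IngestError` are hypothetical names for illustration only:

```python
import json


class IngestError(RuntimeError):
    """Hypothetical domain error, for illustration."""


def parse_payload(raw: str) -> dict:
    # Narrow handler: catch only the decode failure the tests actually
    # exercise, and re-raise with context instead of a broad `except Exception`.
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        raise IngestError(f"malformed payload: {exc}") from exc
```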
Local check (same rule as CI): after fetching your PR base (e.g. main), run:
```bash
git fetch origin main  # or your PR base branch
uv run python scripts/check_no_new_type_ignore.py origin/main
```

Known low-coverage hotspots (triage)
After a full `uv run pytest` coverage report, these areas often remain thin until dedicated work lands. Use them as a backlog hint, not a blocker for unrelated PRs:
- `forensics.analysis.orchestrator` — `run_full_analysis`/`_run_per_author_analysis` (heavy fixtures); `run_compare_only` and `compare_target_to_controls` are covered in `tests/unit/test_comparison_target_controls.py`.
- `forensics.reporting` Quarto subprocess paths — require Quarto on `PATH` for integration tests.
- `forensics.cli` subcommands beyond help smoke — deepen with Typer `CliRunner` scenarios per command, as sketched below.
- `forensics.tui.screens.launch`/`preflight` — mostly manual or snapshot-style coverage.
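A starting point for the CLI item; the `app` import path and the `analyze` subcommand name are placeholders, adjust them to the real Typer application:

```python
from typer.testing import CliRunner

from forensics.cli import app  # assumed import path

runner = CliRunner()


def test_analyze_help_smoke():
    """Each subcommand deserves at least an exit-code and output check."""
    result = runner.invoke(app, ["analyze", "--help"])  # placeholder subcommand
    assert result.exit_code == 0
    assert "Usage" in result.output
```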
Coverage Omission Policy
Coverage omissions in `pyproject.toml` must follow these rules:
- Per-file justification. Every omitted path must be a specific file (not a wildcard like `scraper/*`). Each omission must have an inline comment explaining why it’s excluded (e.g., `# stub — Phase 4, not yet implemented`).
- Implemented modules must never be omitted. Once a module has real logic (not just `pass`), it must be covered by the test suite. Removing it from coverage hides real test gaps.
- Review omissions when stubs are implemented. When you implement a stub module, remove its coverage omission in the same PR. Do not leave stale exclusions.
- Separate profiles for reporting. If you need to distinguish “coverage of implemented code” from “coverage of entire project,” use separate coverage profiles or report configurations — not blanket omissions that hide the true state.
Example of correct omissions:
```toml
[tool.coverage.run]
source = ["forensics"]
branch = true
omit = [
    "*/forensics/features/lexical.py",        # stub — Phase 4
    "*/forensics/features/structural.py",     # stub — Phase 4
    "*/forensics/features/content.py",        # stub — Phase 4
    "*/forensics/features/productivity.py",   # stub — Phase 4
    "*/forensics/features/readability.py",    # stub — Phase 4
    "*/forensics/features/embeddings.py",     # stub — Phase 4
    "*/forensics/features/pipeline.py",       # stub — Phase 4
    "*/forensics/analysis/changepoint.py",    # stub — Phase 5
    "*/forensics/analysis/timeseries.py",     # stub — Phase 5
    "*/forensics/analysis/drift.py",          # stub — Phase 6
    "*/forensics/analysis/convergence.py",    # stub — Phase 6
    "*/forensics/analysis/comparison.py",     # stub — Phase 6
    "*/forensics/analysis/statistics.py",     # stub — Phase 6
    "*/forensics/storage/parquet.py",         # stub — Phase 7
    "*/forensics/storage/duckdb_queries.py",  # stub — Phase 7
    "*/forensics/pipeline.py",                # stub — orchestration not yet wired
]
```

TDD Workflow
For every new feature or bugfix:
- Red — Write a failing test that describes the expected behavior.
- Green — Write the minimum code to make the test pass.
- Refactor — Clean up while keeping tests green.
- Validate — Run full suite (`uv run pytest tests/ -v`) before committing.
For pipeline stages, write the stage contract test first (input type → output type → expected shape), then implement the stage.
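As a sketch, a contract test for the feature-extraction stage might look like this; the import path and column names are assumptions, and `sample_articles` is the fixture from the Fixture Strategy section below:

```python
import polars as pl

from forensics.features.pipeline import extract_features  # assumed import path


def test_feature_extraction_contract(sample_articles):
    """Contract: list[dict] in, one feature row per article out."""
    result = extract_features(sample_articles)

    assert isinstance(result, pl.DataFrame)               # output type
    assert result.height == len(sample_articles)          # expected shape
    assert {"article_id", "ttr"} <= set(result.columns)   # assumed columns
```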
Property-Based Testing
Use Hypothesis for testing functions with many edge cases, especially in feature extraction and data validation.
```python
from hypothesis import given, strategies as st


@given(st.text(min_size=1, max_size=10000))
def test_ttr_is_bounded(text: str):
    """Type-token ratio must always be between 0 and 1."""
    words = text.split()
    if len(words) > 0:
        ttr = len(set(words)) / len(words)
        assert 0.0 <= ttr <= 1.0


@given(st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=2))
def test_cosine_similarity_bounded(values: list[float]):
    """Cosine similarity must be in [-1, 1]."""
    # Test your cosine similarity implementation here
    pass
```

Good candidates for property-based tests in this project:
- Feature extraction functions (TTR, Yule’s K, MATTR — all have mathematical bounds)
- Pydantic model validation (ensure invalid inputs always raise `ValidationError`; see the sketch after this list)
- Simhash deduplication (Hamming distance properties)
- Change-point detection output shapes (number of changepoints ≤ number of observations)
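A sketch of the Pydantic candidate, using a hypothetical `Article` model; substitute the project’s real input models:

```python
import pydantic
import pytest
from hypothesis import given, strategies as st


class Article(pydantic.BaseModel):
    """Hypothetical stand-in for a real input model."""

    url: str
    word_count: int = pydantic.Field(ge=0)


@given(st.integers(max_value=-1))
def test_negative_word_count_always_rejected(count: int):
    with pytest.raises(pydantic.ValidationError):
        Article(url="https://example.com/1", word_count=count)
```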
Hypothesis Requirements
Hypothesis is a declared dev dependency. It must be actively used, not just listed. The following modules require property-based tests:
- Parsing utilities (`scraper/parser.py`) — HTML parsing should handle arbitrary input without crashing. Test with `st.text()` inputs containing malformed HTML, empty strings, and edge-case Unicode.
- Hashing utilities (`utils/hashing.py`) — Simhash fingerprints must be deterministic and the Hamming distance function must satisfy the triangle inequality (see the sketch after this list). Test with `st.binary()` and `st.text()`.
- Feature extraction (Phase 4, when implemented) — All feature functions with mathematical bounds (TTR ∈ [0,1], Yule’s K ≥ 0, sentence lengths ≥ 0) must have property tests asserting those bounds.
- Pydantic model validation — Models accepting external input must reject invalid data consistently. Test with `st.from_type()` or custom strategies.
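A sketch of the triangle-inequality property for 64-bit fingerprints; `hamming_distance` below is a local stand-in, not the real `utils/hashing.py` function:

```python
from hypothesis import given, strategies as st

fingerprints = st.integers(min_value=0, max_value=2**64 - 1)


def hamming_distance(a: int, b: int) -> int:
    """Stand-in XOR/popcount distance, for illustration."""
    return bin(a ^ b).count("1")


@given(fingerprints, fingerprints, fingerprints)
def test_hamming_triangle_inequality(a: int, b: int, c: int):
    """d(a, c) <= d(a, b) + d(b, c) must hold for any three fingerprints."""
    assert hamming_distance(a, c) <= hamming_distance(a, b) + hamming_distance(b, c)
```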
When implementing a new module, check whether property-based tests are appropriate before writing only example-based tests. If the function has invariants (bounded outputs, deterministic behavior, algebraic properties), write a `@given` test.
Performance Benchmarks
For compute-heavy stages, add benchmark tests with `pytest-benchmark` or simple timing assertions:
```python
import time


def test_feature_extraction_performance(large_corpus):
    """Feature extraction should process 1000 articles in under 60 seconds."""
    start = time.perf_counter()
    results = extract_features(large_corpus)
    elapsed = time.perf_counter() - start

    assert elapsed < 60.0, f"Feature extraction took {elapsed:.1f}s (limit: 60s)"
    assert len(results) == len(large_corpus)
```

Benchmark targets:
- Feature extraction: < 60s for 1000 articles
- Embedding generation: < 120s for 1000 articles (GPU) / < 600s (CPU)
- Change-point detection: < 30s for 10-year author timeline
- Full pipeline (all stages): < 30 minutes for a single author
- Embedding batch Parquet/NPZ I/O — `tests/test_embedding_batch_performance.py`: the default pytest run includes a median-timing ratio check (one `write_author_embedding_batch` vs. N per-article `np.save` writes). For the optional large synthetic write+read ceiling, run `uv run pytest tests/test_embedding_batch_performance.py -m slow --no-cov` (excluded from the default `-m 'not slow'` in `pyproject.toml`).
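If you take the `pytest-benchmark` route instead of hand-rolled timing, a minimal sketch reusing the names from the example above (the `benchmark` fixture is provided by the plugin):

```python
def test_feature_extraction_benchmark(benchmark, large_corpus):
    """Median/stddev timings collected by pytest-benchmark."""
    results = benchmark(extract_features, large_corpus)
    assert len(results) == len(large_corpus)
```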
Test Authoring Guidance
Section titled “Test Authoring Guidance”- Keep tests deterministic (no network dependency by default).
- Use temporary directories (`tmp_path`) for file-write assertions.
- Validate both return payloads and filesystem side-effects.
- Add regression tests for any behavior change before/alongside implementation.
- Use `@pytest.mark.slow` for tests that exceed 5 seconds.
- Use `@pytest.mark.integration` for tests requiring external resources.
- Mock external APIs (WordPress REST API, sentence-transformers) in unit tests, as sketched below.
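A sketch of the sentence-transformers mocking point; the patch target `forensics.features.embeddings.SentenceTransformer` is an assumption, adjust it to wherever the model is actually constructed:

```python
from unittest.mock import MagicMock

import numpy as np


def test_embedding_stage_without_model_download(monkeypatch):
    """Unit test that never touches the real sentence-transformers weights."""
    fake_model = MagicMock()
    fake_model.encode.return_value = np.zeros((2, 384), dtype=np.float32)

    # Patch where the name is used, not where it is defined (target is assumed).
    monkeypatch.setattr(
        "forensics.features.embeddings.SentenceTransformer",
        lambda *args, **kwargs: fake_model,
    )

    # Invoke the embedding stage here; assert on shapes, not vector values.
    vectors = fake_model.encode(["first article", "second article"])
    assert vectors.shape == (2, 384)
```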
Fixture Strategy
```python
import polars as pl
import pytest


@pytest.fixture
def sample_articles() -> list[dict]:
    """Minimal article set for unit testing."""
    return [
        {"url": "https://example.com/1", "author_id": 1, "text": "...", "word_count": 500},
        {"url": "https://example.com/2", "author_id": 1, "text": "...", "word_count": 300},
    ]


@pytest.fixture
def feature_vectors(sample_articles) -> pl.DataFrame:
    """Pre-computed feature vectors for analysis tests."""
    return pl.DataFrame({
        "article_id": [1, 2],
        "ttr": [0.65, 0.72],
        "yules_k": [120.5, 135.2],
        "mean_sentence_length": [18.3, 22.1],
    })
```
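Fixtures compose through pytest’s dependency injection, so a consumer test just names what it needs:

```python
def test_sentence_lengths_are_positive(feature_vectors):
    """Example consumer: analysis tests receive the pre-built frame."""
    assert (feature_vectors["mean_sentence_length"] > 0).all()
```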