3 Data Collection (Scraping)

Forensic question: What data was collected, from where, and when?

Input artifacts:
- data/articles.db: canonical article store
- data/authors_manifest.jsonl: discovery output

Output artifacts:
- data/articles.jsonl: export mirror (written when the scrape completes its export step)

Run metadata: (auto-populated by first code cell)

3.0.1 Methodology

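Collection runs in two steps. First, WordPress REST discovery gathers article metadata. Second, each article's HTML is fetched and passed through a parser that extracts clean_text; every record is stored with a content_hash and a scraped_at timestamp as provenance.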
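
The two steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the function names `discover_posts` and `extract_record`, the `_TextExtractor` helper, the choice of SHA-256, and hashing the cleaned text (rather than the raw HTML) are all assumptions. Only the `/wp-json/wp/v2/posts` endpoint and its `id`, `link`, `author`, and `date` fields come from the standard WordPress REST API.

```python
import hashlib
import json
import urllib.request
from datetime import datetime, timezone
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def discover_posts(base_url, page=1, per_page=100):
    """Step 1: list post metadata from the WordPress REST API (one page)."""
    url = f"{base_url}/wp-json/wp/v2/posts?per_page={per_page}&page={page}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        posts = json.load(resp)
    return [
        {"id": p["id"], "link": p["link"], "author": p["author"], "date": p["date"]}
        for p in posts
    ]


def extract_record(url, html, now=None):
    """Step 2: parse fetched HTML into a record with provenance fields."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the hash is stable across trivial reflows.
    clean_text = " ".join(" ".join(parser.parts).split())
    return {
        "url": url,
        "clean_text": clean_text,
        # Assumption: the hash covers the cleaned text, not the raw HTML.
        "content_hash": hashlib.sha256(clean_text.encode("utf-8")).hexdigest(),
        "scraped_at": (now or datetime.now(timezone.utc)).isoformat(),
    }


# Offline usage example with a placeholder page (no network access needed).
sample_html = (
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello  <b>world</b></p><script>x=1</script></body></html>"
)
record = extract_record(
    "https://example.com/post-1",
    sample_html,
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Keeping discovery and extraction as separate functions means the extraction step can be re-run against stored HTML without contacting the site again, and an unchanged content_hash on a re-scrape indicates the article text did not change.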

Summary finding: This chapter documents collection mechanics and empirical coverage; see tables and charts above for per-author counts and cadence.
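
Per-author counts and collection windows of the kind referred to here could be computed with a query along these lines. This is a sketch under stated assumptions: it supposes the article store is a SQLite database with a table named `articles` holding `author` and `scraped_at` columns, none of which is confirmed by this chapter. The rows inserted below are placeholders for the demo, not collected data.

```python
import sqlite3


def per_author_coverage(conn):
    """Return (author, article_count, first_scraped_at, last_scraped_at) rows."""
    return conn.execute(
        """
        SELECT author, COUNT(*), MIN(scraped_at), MAX(scraped_at)
        FROM articles
        GROUP BY author
        ORDER BY author
        """
    ).fetchall()


# Demo on an in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (url TEXT PRIMARY KEY, author TEXT, scraped_at TEXT)"
)
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("https://example.com/a1", "author-a", "2024-01-01T00:00:00+00:00"),
        ("https://example.com/a2", "author-a", "2024-01-02T00:00:00+00:00"),
        ("https://example.com/b1", "author-b", "2024-01-03T00:00:00+00:00"),
    ],
)
rows = per_author_coverage(conn)
for author, count, first_seen, last_seen in rows:
    print(author, count, first_seen, last_seen)
```

If the real store matches these assumptions, the same function could be pointed at it by opening `sqlite3.connect("data/articles.db")` instead of the in-memory database. ISO 8601 timestamps in a single timezone sort correctly as text, which is why MIN and MAX work on the string column here.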