3 Data Collection (Scraping)
Forensic question: What data was collected, from where, and when?
Input artifacts:
- data/articles.db: canonical article store
- data/authors_manifest.jsonl: discovery output
Output artifacts:
- data/articles.jsonl: export mirror (written when the scrape completes its export step)
Run metadata: (auto-populated by first code cell)
3.0.1 Methodology
Collection proceeds in two steps. First, WordPress REST discovery gathers article metadata. Second, each article's HTML is fetched and run through parser extraction to produce clean_text, with content_hash and scraped_at recorded as provenance.
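The two steps can be illustrated with a minimal, network-free sketch. It is not the project's scraper: the function names (discovery_url, extract_record), the choice of SHA-256 over the extracted text for content_hash, the ISO 8601 UTC format for scraped_at, and the standard-library HTML parser are all assumptions made for illustration. The /wp-json/wp/v2/posts route and its author, per_page, page, and _fields parameters are part of the standard WordPress REST API.

```python
import hashlib
from datetime import datetime, timezone
from html.parser import HTMLParser
from urllib.parse import urlencode


def discovery_url(base_url, author_id, page=1):
    """Step 1 (hypothetical helper): build a REST query for one author's post metadata."""
    query = urlencode({
        "author": author_id,
        "per_page": 100,  # WordPress caps per_page at 100
        "page": page,
        "_fields": "id,link,date,title",  # request metadata only
    })
    return f"{base_url.rstrip('/')}/wp-json/wp/v2/posts?{query}"


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of script/style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_record(url, html, scraped_at=None):
    """Step 2 (hypothetical helper): turn fetched HTML into a provenance-stamped record."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the hash is stable across trivial markup changes.
    clean_text = " ".join(" ".join(parser.parts).split())
    stamp = scraped_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "clean_text": clean_text,
        # Assumption: the hash covers the extracted text, not the raw HTML.
        "content_hash": hashlib.sha256(clean_text.encode("utf-8")).hexdigest(),
        "scraped_at": stamp.isoformat(),
    }
```

Hashing the whitespace-normalized text rather than the raw HTML means a re-scrape yields the same content_hash when only markup or ad scripts change, which makes the hash usable for detecting genuine content edits. Whether the actual pipeline hashes text or raw HTML is not stated in this chapter.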
Summary finding: This chapter documents collection mechanics and empirical coverage; see tables and charts above for per-author counts and cadence.