2.0.3 Primary vs exploratory
Confirmatory: the lexical, marker-phrase, and embedding-centroid hypotheses in the table below feed FindingStrength. Exploratory: any post-hoc feature flagged after initial review (documented separately in notebook 07).
Forensic question: Are our hypotheses, methods, and thresholds defined before we look at the data?
Input artifacts:
- config.toml — pipeline configuration and author roster
- docs/pre_registration.md — frozen snapshot from a previous run of this notebook, if present
Output artifacts:
- docs/pre_registration.md — timestamped pre-registration snapshot
Run metadata (auto-populated by the first code cell):
- Config hash: {config_hash}
- Corpus hash: {corpus_hash}
- Timestamp: {run_timestamp}
- Software versions: {python_version}, {key_package_versions}
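The metadata fields above can be populated with a minimal sketch like the following; the function name `run_metadata` and the file paths are illustrative assumptions, not the pipeline's actual API.

```python
import hashlib
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hex digest of a file's bytes, used to pin the exact config/corpus version."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_metadata(config_path: Path, corpus_path: Path) -> dict:
    """Collect the reproducibility fields listed above (names mirror the placeholders)."""
    return {
        "config_hash": sha256_of_file(config_path),
        "corpus_hash": sha256_of_file(corpus_path),
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
    }
```

Hashing file bytes rather than parsed contents means any edit, even whitespace, changes the hash, which is the desired behavior for a frozen pre-registration.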
Pre-registered directional predictions for each confirmatory feature family:

| Family | Expected under AI-like shift |
|---|---|
| Lexical richness | Lower TTR, lower hapax ratio |
| Marker phrases | Higher AI marker frequency |
| Embeddings | Move toward AI baseline centroid |
| Probability (language-model) | Lower perplexity, lower burstiness |
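The lexical-richness row above can be sketched as follows; `lexical_richness` is a hypothetical helper, not the pipeline's actual implementation.

```python
from collections import Counter


def lexical_richness(tokens: list[str]) -> dict:
    """Type-token ratio (TTR) and hapax ratio for a token list.

    Lower values on both are the pre-registered direction
    for an AI-like shift.
    """
    counts = Counter(tokens)
    n = len(tokens)
    return {
        "ttr": len(counts) / n,                                      # distinct types / tokens
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / n,  # words used exactly once
    }
```

Note that both ratios are length-sensitive, so in practice they should be compared on fixed-size token windows rather than whole documents of varying length.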
Primary window: posts dated after 2022-11-30 (ChatGPT release), with a 6-month ramp allowance for ecosystem adoption.
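The window definition above can be sketched as a simple date classifier; the function name, label strings, and the 182-day approximation of six months are assumptions for illustration.

```python
from datetime import date, timedelta

RELEASE = date(2022, 11, 30)                 # ChatGPT public release
RAMP_END = RELEASE + timedelta(days=182)     # ~6-month adoption ramp


def window_label(post_date: date) -> str:
    """Assign a post to the pre-registered analysis windows."""
    if post_date < RELEASE:
        return "pre"
    if post_date <= RAMP_END:
        return "ramp"
    return "post"
```

Keeping the ramp period as its own label lets the confirmatory contrast compare clean "pre" vs "post" samples while the transition months are analyzed separately or excluded.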
Summary finding: Pre-registration, thresholds, and confirmatory families are documented before outcome review, with a frozen snapshot written to docs/pre_registration.md.
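The freeze step can be sketched as a write-once helper; `freeze_snapshot` is a hypothetical name, and the refuse-to-overwrite behavior is an assumption about how the notebook enforces immutability.

```python
from pathlib import Path


def freeze_snapshot(body: str, out_path: Path = Path("docs/pre_registration.md")) -> None:
    """Write the pre-registration text once; refuse to overwrite an
    existing snapshot so hypotheses cannot be silently revised."""
    if out_path.exists():
        raise FileExistsError(f"Snapshot already frozen at {out_path}")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(body, encoding="utf-8")
```

A stricter variant would also record the snapshot's own hash in run metadata, so later notebooks can verify the file they read matches the one that was frozen.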