2  Pre-Registration & Power Analysis

Forensic question: Are our hypotheses, methods, and thresholds defined before we look at the data?

Input artifacts:
  • config.toml — pipeline configuration and author roster
  • docs/pre_registration.md — frozen snapshot output from this notebook

Output artifacts:
  • docs/pre_registration.md — timestamped pre-registration snapshot

Run metadata (auto-populated by first code cell):
  • Config hash: {config_hash}
  • Corpus hash: {corpus_hash}
  • Timestamp: {run_timestamp}
  • Software versions: {python_version}, {key_package_versions}
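The config and corpus hashes in the run metadata can be produced by a short helper. A minimal sketch, assuming the corpus is a directory tree of text files; the function names and the `*.txt` glob are illustrative, not the pipeline's actual API:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hash one file's raw bytes (e.g. config.toml)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def sha256_of_tree(root: Path, pattern: str = "*.txt") -> str:
    """Deterministic corpus hash: fold file names and contents
    into one digest, in sorted path order so re-runs agree."""
    h = hashlib.sha256()
    for p in sorted(root.rglob(pattern)):
        h.update(p.name.encode())
        h.update(p.read_bytes())
    return h.hexdigest()
```

Hashing in sorted path order makes the digest independent of filesystem enumeration order, so an unchanged corpus always reproduces the same hash.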

2.0.1 Hypotheses (directional)

| Family | Expected under AI-like shift |
| --- | --- |
| Lexical richness | Lower TTR, lower hapax ratio |
| Marker phrases | Higher AI marker frequency |
| Embeddings | Move toward AI baseline centroid |
| Probability (language-model) | Lower perplexity, lower burstiness |
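The lexical-richness family can be made concrete with a small sketch of the two statistics named above, type-token ratio (TTR) and hapax ratio. This is an illustrative pure-Python version; the pipeline's actual tokenizer and feature code live in the analysis notebooks:

```python
from collections import Counter


def lexical_richness(tokens: list[str]) -> dict[str, float]:
    """TTR and hapax-legomenon ratio for one tokenized document."""
    counts = Counter(tokens)
    n = len(tokens)
    return {
        # unique word types per token; lower => less varied vocabulary
        "ttr": len(counts) / n,
        # words occurring exactly once per token; lower => fewer rare words
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / n,
    }
```

Under the directional hypothesis, both ratios are expected to drop after the change-point window if AI-like text enters the corpus.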

2.0.2 Change-point window

Primary window: post-2022-11-30 (ChatGPT public release), with a 6-month ramp allowance for ecosystem adoption.
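The window definition above can be sketched as a date-binning helper. The 183-day ramp is an assumed approximation of "6 months", and the function name is illustrative:

```python
from datetime import date, timedelta

CHATGPT_RELEASE = date(2022, 11, 30)
RAMP = timedelta(days=183)  # ~6-month adoption ramp (assumed approximation)


def window_label(d: date) -> str:
    """Bin a document date into pre / ramp / post relative to the change point."""
    if d < CHATGPT_RELEASE:
        return "pre"
    if d < CHATGPT_RELEASE + RAMP:
        return "ramp"
    return "post"
```

Keeping the ramp as its own bin lets the confirmatory tests compare clean "pre" vs "post" periods while treating the adoption ramp separately.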

2.0.3 Primary vs exploratory

  • Confirmatory: the lexical, marker, and embedding-centroid hypotheses above feed FindingStrength.
  • Exploratory: any post-hoc feature flagged after initial review (documented separately in notebook 07).

2.0.4 Statistical thresholds

  • Significance: α = 0.05 after Benjamini–Hochberg correction for multiple comparisons
  • Effect size: Cohen’s d ≥ 0.5 required for a finding to be emphasized
  • Convergence: a STRONG narrative requires ≥3 features shifting within a 90-day window
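The thresholds above, plus the power analysis the section title promises, can be sketched in pure Python. In practice one would likely use `statsmodels.stats.multitest.multipletests` for the correction; the 80% power target and the normal-approximation sample-size formula are illustrative assumptions, not values stated in this document:

```python
import math
from statistics import NormalDist, mean, stdev


def benjamini_hochberg(pvals: list[float], alpha: float = 0.05) -> list[bool]:
    """BH step-up: reject all hypotheses up to the largest rank i
    with p_(i) <= (i / m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff_rank
    return reject


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Two-sample normal-approximation sample size per group
    to detect effect size d at the given alpha and power."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)
```

At the pre-registered d = 0.5 and α = 0.05, this approximation asks for roughly 63 documents per period per feature at 80% power, which gives a rough feasibility check before any outcomes are inspected.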

Summary finding: Pre-registration, thresholds, and confirmatory families are documented before outcome review, with a frozen snapshot written to docs/pre_registration.md.
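The frozen snapshot write-out can be sketched as follows; the function name and the metadata field names mirror the run-metadata placeholders above but are otherwise illustrative:

```python
import hashlib
import json
import sys
from datetime import datetime, timezone
from pathlib import Path


def write_snapshot(out_path: Path, config_bytes: bytes, corpus_hash: str) -> str:
    """Write a timestamped pre-registration snapshot and return the config hash."""
    meta = {
        "config_hash": hashlib.sha256(config_bytes).hexdigest(),
        "corpus_hash": corpus_hash,
        "run_timestamp": datetime.now(timezone.utc).isoformat(),
        "python_version": sys.version.split()[0],
    }
    out_path.write_text(
        "# Pre-registration snapshot\n\n" + json.dumps(meta, indent=2)
    )
    return meta["config_hash"]
```

Because the snapshot embeds both hashes and a timestamp, any later change to the config or corpus is detectable by re-hashing and comparing against the frozen file.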