9  Statistical evidence — adjusted hypothesis tests

Question addressed: After multiple-comparison adjustment, which feature-level tests clear the pre-registered gates for each author in the study panel?

What this chapter shows: Counts and rates of significant tests, feature and family breakdowns, and corrected p-value distributions, all read from the hypothesis-test bundles saved with the analysis run.

Inputs: data/analysis/<slug>_hypothesis_tests.json and the hypothesis_tests field in each *_result.json.

Outputs: tables and figures rendered in this chapter.

Provenance: auto-populated by the next code cell.

9.1 Per-author significance rate

Sorted by significant-test fraction. Shared-byline accounts (mediaite, mediaite-staff) are excluded by the survey gate.

9.2 Top FDR-significant features, grouped by family

Counts only rows where significant == True. Family labels come from forensics.analysis.feature_families.FEATURE_FAMILIES — the same six-family grouping that powers the per-family Benjamini–Hochberg correction.
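The family breakdown is a filter-then-group count. The sketch below assumes FEATURE_FAMILIES maps a family name to its set of feature names (the mapping structure is an assumption; only the six family names are taken from this chapter):

```python
from collections import Counter

def family_counts(records, feature_families):
    """Count significant == True rows per feature family.

    `feature_families` maps family name -> iterable of feature names,
    mirroring the assumed shape of FEATURE_FAMILIES.
    """
    feature_to_family = {
        feat: fam for fam, feats in feature_families.items() for feat in feats
    }
    counts = Counter()
    for r in records:
        if r.get("significant") is True:
            fam = feature_to_family.get(r["feature"])
            if fam is not None:
                counts[fam] += 1
    return counts.most_common()
```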

9.3 Distribution of corrected p-values (log10)

A flat right tail near log10(p) = 0 and a left mass below log10(0.05) ≈ -1.30 are the visual signature of a healthy correction stack: most tests fail to reject (correct), but a substantial minority clear the threshold (real signal).
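The same reading can be checked numerically before plotting: take log10 of the corrected p-values and measure how much mass falls below log10(alpha). A small stdlib-only sketch (the histogram itself would be drawn separately):

```python
import math

def log10_summary(p_corrected, alpha=0.05):
    """Summarize corrected p-values on the log10 scale: total count and
    the fraction falling below log10(alpha) (~ -1.30 for alpha = 0.05).

    Zeros are dropped, since log10(0) is undefined.
    """
    logs = [math.log10(p) for p in p_corrected if p > 0]
    threshold = math.log10(alpha)
    below = sum(1 for v in logs if v < threshold)
    return {"n": len(logs), "frac_below_alpha": below / len(logs)}
```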

9.4 Methodology — hypothesis testing and adjustment

Test battery. Each eligible feature is evaluated with Mann–Whitney (rank-based) and Welch (mean-based) two-sample tests across the pre-registered split; the Kolmogorov–Smirnov test is not used because shape-sensitive rejections dominated features with heavy tails.
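For reference, both statistics are simple to state. The sketch below computes the Welch t statistic with its Satterthwaite degrees of freedom and the Mann–Whitney U by direct pairwise comparison (ties count one half); it is a pedagogical stand-in, not the pipeline's implementation, and in practice the p-values would come from a statistics library such as scipy.stats.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch two-sample t statistic and Satterthwaite df (no p-value)."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny  # variance() uses n-1
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx**2 / (nx - 1) + vy**2 / (ny - 1))
    return t, df

def mann_whitney_u(x, y):
    """U statistic: count of (a, b) pairs with a > b, plus 0.5 per tie.
    O(n*m), fine for a sketch."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y
    )
```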

Multiple comparisons. Benjamini–Hochberg false-discovery control is applied within each feature family (lexical_richness, readability, sentence_structure, entropy, self_similarity, ai_markers) so correlated features share one rejection budget rather than being treated as fully independent.
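The BH step-up adjustment within one family can be sketched in a few lines: sort the family's p-values, scale each by m/k, then enforce monotonicity with a running minimum from the largest p downward. This is a generic BH implementation under the assumption that each family's tests are adjusted independently, as the paragraph above describes:

```python
def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank_from_end, i in enumerate(reversed(order)):
        k = m - rank_from_end  # 1-based rank of pvals[i]
        running_min = min(running_min, pvals[i] * m / k)
        adjusted[i] = running_min
    return adjusted

def adjust_within_families(family_pvals):
    """Apply BH separately to each family's p-value list."""
    return {fam: benjamini_hochberg(ps) for fam, ps in family_pvals.items()}
```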

Computation. Feature time series are cached once per author and reused across windows, which keeps the full panel of tests feasible to compute.
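The cache-once-reuse-everywhere pattern amounts to memoizing on (author, feature). A minimal sketch using functools.lru_cache; the compute function here is a stand-in for the real extraction pass, which this chapter does not specify:

```python
from functools import lru_cache

CALLS = {"n": 0}  # instrumentation to show the cache working

def _compute_series(author, feature):
    """Stand-in for the expensive per-author feature extraction."""
    CALLS["n"] += 1
    return tuple(range(5))  # placeholder time series

@lru_cache(maxsize=None)
def feature_series(author, feature):
    """Memoize per (author, feature): every window slices this cached
    series instead of recomputing it."""
    return _compute_series(author, feature)
```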

Together, these choices yield the corrected p-value distribution and per-author significance rates shown in this chapter.

Summary: In this artifact set, 10,874 / 111,560 tests (about 9.7%) are significant after adjustment across twelve named authors. tommy-christopher shows the highest author-level significance rate (30.4%; 6,080 / 20,010), followed by colby-hall (17.4%) and zachary-leeman (14.2%). Sentence-structure and readability features account for a large share of significant tests, matching the concentration seen in the feature tables.