8  Embedding drift

Question addressed: Whether each author’s embedding-space summaries and AI-baseline distance series show persistent shifts over time, as captured by the embedding-drift score and related convergence flags.

What this chapter shows: Using the stored drift bundles and convergence windows under data/analysis/, the chapter summarizes how strongly the embedding-drift channel fires and how often it fires without simultaneous stylometric confirmation (drift_only versus ab in passes_via), and it plots centroid-velocity summaries. Rankings and counts are descriptive outputs from the artifacts present at render time.

Inputs: per-author drift JSON, baseline curve JSON, centroid archives, and the convergence_windows entries in each *_result.json.

Outputs: summary tables and figures inline with the narrative.

Provenance: filled by the first code cell.

8.1 Methodology — embedding drift score

The embedding-drift channel compares monthly centroid motion, variance trends, and distance to reference centroids (including an AI-style baseline). In the shipped configuration:

  • AI-baseline distance is evaluated in percentile form per author so the score reflects each author’s own distribution rather than a single absolute cutoff.
  • Drift declaration uses the pipeline_b_score threshold recorded in analysis settings (currently 0.3 in the reference configuration).
  • drift_only in passes_via allows a window to register on embedding drift alone when the stylometric ratio test does not also pass; ab marks windows where both channels register.
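
The three rules above can be illustrated with a minimal sketch. It is not the pipeline's implementation: the function names, the use of an inclusive (>=) threshold comparison, and the exact percentile definition are assumptions made for illustration. Only the 0.3 threshold and the passes_via labels come from the text.

```python
from bisect import bisect_right
from typing import Optional, Sequence

DRIFT_THRESHOLD = 0.3  # pipeline_b_score threshold quoted for the reference configuration


def percentile_rank(value: float, history: Sequence[float]) -> float:
    """Share of the author's own observations that are <= value (0.0 to 1.0).

    Assumed definition: ranks a distance against the same author's
    distribution instead of comparing it to one absolute cutoff.
    """
    if not history:
        raise ValueError("history must not be empty")
    ordered = sorted(history)
    return bisect_right(ordered, value) / len(ordered)


def classify_window(
    ratio_passes: bool,
    pipeline_b_score: float,
    threshold: float = DRIFT_THRESHOLD,
) -> Optional[str]:
    """Map the two channel outcomes to a passes_via label (assumed logic)."""
    drift_passes = pipeline_b_score >= threshold  # inclusive comparison is an assumption
    if ratio_passes and drift_passes:
        return "ab"          # both channels register
    if drift_passes:
        return "drift_only"  # embedding drift without stylometric confirmation
    if ratio_passes:
        return "ratio"       # ratio test alone
    return None              # window does not persist
```

Under these assumptions, a window with pipeline_b_score 0.35 that fails the ratio test is labeled drift_only, and the same score with a passing ratio test is labeled ab.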

Embedding vectors must be listed in data/embeddings/manifest.jsonl for drift summaries to load. If drift cache files are missing while embeddings exist, the analysis layer logs a warning when drift summaries are read.

8.2 Per-author Pipeline B summary

For each author we surface:

  • pb_max: maximum pipeline_b_score across persisted convergence windows
  • drift_only_count: windows that persist via drift_only alone (embedding drift without stylometric confirmation)
  • ab_count: windows that pass via ab (stylometric ratio test and embedding drift together); the strongest evidence
  • ratio_count: windows that pass via ratio (the lexical/family ratio test)
  • total_windows: total persisted convergence windows for the author

The table is sorted by drift_only_count in descending order, so the authors with the most drift-only windows appear at the top.
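
A minimal sketch of how such a table could be assembled from the convergence windows follows. The input shape is an assumption: each window is taken to be a dict with a single string under passes_via and a numeric pipeline_b_score, which may not match the layout of the actual *_result.json files. The function name and the author slugs in the usage example are hypothetical.

```python
from typing import Dict, Iterable, List


def summarize_windows(windows_by_author: Dict[str, Iterable[dict]]) -> List[dict]:
    """Build one summary row per author, sorted by drift_only_count descending."""
    rows = []
    for slug, windows in windows_by_author.items():
        windows = list(windows)
        via = [w.get("passes_via") for w in windows]
        scores = [
            w["pipeline_b_score"]
            for w in windows
            if w.get("pipeline_b_score") is not None
        ]
        rows.append(
            {
                "author": slug,
                "pb_max": max(scores) if scores else None,
                "drift_only_count": via.count("drift_only"),
                "ab_count": via.count("ab"),
                "ratio_count": via.count("ratio"),
                "total_windows": len(windows),
            }
        )
    # Heaviest drift-only signal first, matching the table order described above.
    rows.sort(key=lambda row: row["drift_only_count"], reverse=True)
    return rows


# Usage with made-up windows for two hypothetical authors.
example = {
    "author-a": [
        {"passes_via": "ratio", "pipeline_b_score": 0.10},
        {"passes_via": "ab", "pipeline_b_score": 0.45},
    ],
    "author-b": [
        {"passes_via": "drift_only", "pipeline_b_score": 0.40},
        {"passes_via": "drift_only", "pipeline_b_score": 0.52},
    ],
}
summary = summarize_windows(example)
```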

8.3 Monthly centroid velocity — top-3 authors by drift-only count

For the three authors with the highest drift-only window counts, this section plots month-over-month centroid velocity, defined as the cosine distance between consecutive monthly centroids. A velocity spike marks a month in which the author’s average semantic fingerprint moved sharply.

When author_slug is overridden via -P, the chart shows that author only.
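
The velocity series described in this section can be sketched as follows. This assumes monthly centroids are available as a mapping from "YYYY-MM" keys to vectors. That layout and the function names are hypothetical, and the stored centroid archives may be organized differently.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def monthly_velocity(centroids: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Cosine distance between each month's centroid and the previous one.

    Keys are assumed to be zero-padded "YYYY-MM" strings, so a plain sort
    is chronological. The first month has no predecessor and is omitted.
    """
    months = sorted(centroids)
    return [
        (current, cosine_distance(centroids[previous], centroids[current]))
        for previous, current in zip(months, months[1:])
    ]


# Usage with toy 2-d centroids: no movement, then an orthogonal jump.
velocity = monthly_velocity(
    {"2023-01": [1.0, 0.0], "2023-02": [1.0, 0.0], "2023-03": [0.0, 1.0]}
)
```

Note that consecutive keys are compared even when a calendar month is missing from the mapping, so gaps in coverage would show up as distance over a longer interval.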

8.4 Distribution of pipeline_b_score across persisted windows

Histogram of pipeline_b_score for every persisted convergence window across the study authors, overlaid by author. The configured drift threshold (0.3 in the reference settings) is drawn as a vertical reference line; windows to the right of it qualify as drift-positive. Mass in this right-hand region includes windows that register through the drift_only path in passes_via as well as joint ab windows.
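
As a numeric companion to the histogram, the share of each author's windows that falls at or to the right of the threshold line can be computed directly. The sketch below is illustrative only: the function name and input shape are hypothetical, and treating a score exactly equal to the threshold as drift-positive is an assumption.

```python
from typing import Dict, Sequence

DRIFT_THRESHOLD = 0.3  # reference-configuration value quoted in the text


def drift_positive_share(
    scores_by_author: Dict[str, Sequence[float]],
    threshold: float = DRIFT_THRESHOLD,
) -> Dict[str, float]:
    """Fraction of each author's window scores at or above the threshold.

    Authors with no scores are skipped, since a share is undefined for them.
    """
    shares = {}
    for slug, scores in scores_by_author.items():
        scores = list(scores)
        if not scores:
            continue
        positive = sum(1 for score in scores if score >= threshold)
        shares[slug] = positive / len(scores)
    return shares


# Usage with made-up scores for hypothetical authors.
shares = drift_positive_share(
    {"author-a": [0.1, 0.3, 0.5, 0.2], "author-b": [0.1, 0.2], "author-c": []}
)
```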

8.5 Diagnostic block — drift artifact presence

When drift cache files are missing but embeddings exist, the drift loader emits a warning of the form:

drift summary: missing artifact <label> for slug=<slug> but embeddings exist on disk

This cell checks artifact paths directly so missing files are visible even if the analysis log was not reviewed. A complete run should report 0 missing artifacts.
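
A direct path check of this kind might look like the sketch below. The artifact labels and filename templates are hypothetical placeholders, not the pipeline's actual file names, and the function name is likewise an assumption. Substitute the real paths used under data/analysis/.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

# Hypothetical label -> filename template mapping; the real names may differ.
ARTIFACT_TEMPLATES: Dict[str, str] = {
    "drift": "{slug}_drift.json",
    "baseline_curve": "{slug}_baseline_curve.json",
    "centroids": "{slug}_centroids.npz",
}


def find_missing_artifacts(
    analysis_dir: str,
    slugs: Iterable[str],
    templates: Dict[str, str] = ARTIFACT_TEMPLATES,
) -> List[Tuple[str, str]]:
    """Return (slug, label) pairs whose expected file is absent on disk."""
    root = Path(analysis_dir)
    missing = []
    for slug in slugs:
        for label, template in templates.items():
            if not (root / template.format(slug=slug)).is_file():
                missing.append((slug, label))
    return missing
```

An empty list corresponds to the "0 missing artifacts" outcome expected from a complete run.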

8.6 Summary

In the current artifact set, embedding-drift windows appear for all 12 named study authors (12/12). Drift-only persisted windows total 8,042 across the panel. Among named authors, pb_max ranges from 0.520 (zachary-leeman) to 0.598 (david-gilmour). tommy-christopher has the largest drift-only count (1,070 windows). michael-luciano records 958 drift-only windows and 0 ab windows, so none of that author’s persisted drift windows has simultaneous stylometric confirmation. colby-hall records 170 ab windows, where both channels register, the highest such count in this cohort.

Together with the feature and hypothesis-test chapters, these results describe an embedding-space channel that can move independently of the stylometric ratio test.