Ensemble Validation

Testing the Wright Thesis Against CAMS v2.3 Scoring — Germany & USA, 1880–2026

CAMS Ensemble Analysis: Testing the Wright et al. Thesis — comprehensive summary chart
Summary. The standard error of the ensemble mean shrinks by exactly 1/√N (observed = 0.447, theory = 0.447 for N = 5). SNR improves by √5 for both Germany and USA. The ensemble mean is unbiased. Uncertainty correlates positively with Stress (r = 0.53–0.78) and negatively with Node Value (r = −0.75): disagreement reflects contested history, not rater noise.
N = 5 raters · Germany & USA · SEM reduction: exactly 1/√N ✓ · SNR improvement: exactly √5 ✓ · Ensemble mean unbiased ✓ · Cross-node uncertainty r = 0.56–0.82 · Lag-1 autocorrelation r ≈ 0.82

Part I

The Wright Thesis: Confirmed with Qualifications

The core statistical claim holds precisely. With N = 5 raters, the standard error of the ensemble mean falls to exactly 1/√N of the single-rater standard deviation: the observed SEM factor is 0.447 against a theoretical 1/√5 = 0.447 (equivalently, the variance falls by 1/N). Signal-to-noise ratio improves by √5 = 2.236×: Germany moves from SNR 19.2 (single rater) to 42.9 (ensemble); USA from 35.8 to 80.0. The ensemble mean is effectively unbiased (Germany bias = +0.014, USA = −0.062). In simulation, the ensemble's correlation with ground truth exceeds that of any individual rater, and the mean outperforms the median and trimmed estimators in every test.
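The arithmetic behind these figures can be checked in a few lines. The sketch below assumes independent, equal-variance rater errors, which is the condition under which SEM = SD/√N holds; the function names are illustrative and are not taken from cams_framework_v2_3.py.

```python
import math


def sem_factor(n_raters: int) -> float:
    """Factor by which the standard error of the mean shrinks relative to one rater's SD."""
    return 1.0 / math.sqrt(n_raters)


def ensemble_snr(single_rater_snr: float, n_raters: int) -> float:
    """SNR of the ensemble mean, assuming the signal is unchanged and noise shrinks by 1/sqrt(N)."""
    return single_rater_snr * math.sqrt(n_raters)


N = 5
print(round(sem_factor(N), 3))          # 0.447
print(round(ensemble_snr(19.2, N), 1))  # Germany: 42.9
# USA: 80.1 from the rounded input 35.8; the quoted 80.0 presumably
# comes from an unrounded single-rater SNR.
print(round(ensemble_snr(35.8, N), 1))
```

Under the same assumption the variance shrinks by 1/N, which for N = 5 is an 80% reduction.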

2.236× SNR improvement (exactly √5) — Germany and USA
0.447 Observed SEM = 1/√5 · SD — matches theory exactly
+0.014 / −0.062 Ensemble mean bias (Germany / USA): effectively zero
Mean > Median Ensemble mean is optimal aggregator in all tests
Qualification — the noise is not idiosyncratic
Cross-node uncertainty correlation is r = 0.56 (Germany) and r = 0.82 (USA), far above the r < 0.3 expected for uncorrelated rater noise. Uncertainty is strongly autoregressive (lag-1 r ≈ 0.82); it rises with Stress (r = 0.53–0.78) and falls with Node Value (r = −0.75). Ensemble disagreement therefore measures epistemic ambiguity about genuinely contested history, not rater sloppiness. The Wright Thesis holds as a statistical fact; its interpretation requires this caveat.
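As a rough illustration of what "strongly autoregressive" means here, the sketch below computes a lag-1 autocorrelation with a hand-written Pearson coefficient. The two series are invented for illustration and are not CAMS output; `pearson` and `lag1_autocorrelation` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lag1_autocorrelation(series):
    """Correlation between the series and itself shifted by one time step."""
    return pearson(series[:-1], series[1:])


# Hypothetical yearly uncertainty (SEM) values, invented for illustration only.
persistent = [0.20, 0.22, 0.25, 0.30, 0.34, 0.33, 0.28, 0.24, 0.21, 0.20]
alternating = [0.20, 0.34, 0.21, 0.33, 0.22, 0.30, 0.20, 0.28, 0.24, 0.25]

# Uncertainty that builds and decays over several years gives a clearly
# positive value; year-to-year flip-flopping gives a negative one.
print(round(lag1_autocorrelation(persistent), 2))
print(round(lag1_autocorrelation(alternating), 2))
```

The same `pearson` helper applied to the uncertainty series of two different nodes would give a cross-node correlation of the kind quoted above.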
System Node Value with Rater Envelope and Uncertainty vs Stress — Germany and USA
Fig. 1. System Node Value with rater envelope (shaded band = 1 SEM) and uncertainty vs System Stress — Germany (left) and USA (right). Crisis periods highlighted. Rater spread widens precisely during historically contested years.
Signal-to-Noise Ratio over time, SNR distribution comparison, Uncertainty vs Signal Magnitude
Fig. 2. Top: SNR over time for Germany (blue) and USA (green), with low-SNR threshold marked. Bottom-left: SNR distribution, showing that Germany's lower SNR reflects genuine historical contestation, not inferior data. Bottom-right: uncertainty vs signal magnitude, confirming the negative correlation between rater uncertainty and Node Value.
Corrected Node Value Envelope and Uncertainty vs Stress — Germany and USA
Fig. 3. USA envelope corrected for inter-rater calibration (right). Germany (left) remains unchanged. Uncertainty tends to rise with System Stress in both cases (r = 0.53–0.78), consistent with high-disagreement periods being structurally stressful periods.
CAMS Ensemble Analysis: full Wright Thesis dashboard — variance reduction, SNR, ensemble bias test
Fig. 4. Full Wright Thesis dashboard. Top row: variance reduction (observed = theory), SNR improvement (≈√N), cross-node uncertainty correlation. Middle: uncertainty vs Stress regressions. Bottom: SNR by crisis vs peace period, ensemble bias test (mean bias ≈ 0). Key findings inset confirms all primary claims.

Part II

Robustness & Optimal Aggregation

Diminishing returns set in after N = 3 raters. The marginal SNR gain from adding a fourth or fifth rater is minimal, though the fifth rater still contributes to bias reduction. Crisis detection F1 improves modestly with ensemble size (Germany +0.003, USA +0.007 at N = 5 vs single rater) — the benefit is real but not transformative for already-strong detectors. Node-level analysis reveals systematic patterns: Hands and Helm are the most contested nodes in both countries; Archive and Lore show highest inter-rater consensus.
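For intuition about the shape of the returns curve, the idealised √N relation can be tabulated per added rater. This is only a sketch of the independent-noise model, not a reproduction of the robustness analysis behind the N = 3 finding; `snr_multiplier` and `marginal_gains` are illustrative names.

```python
import math


def snr_multiplier(n_raters):
    """SNR of an N-rater mean relative to a single rater, assuming independent noise."""
    return math.sqrt(n_raters)


def marginal_gains(max_raters):
    """List of (N, extra SNR multiplier gained by going from N-1 to N raters)."""
    return [
        (n, snr_multiplier(n) - snr_multiplier(n - 1))
        for n in range(2, max_raters + 1)
    ]


for n, gain in marginal_gains(5):
    print(f"N={n}: +{gain:.3f}")
# N=2: +0.414
# N=3: +0.318
# N=4: +0.268
# N=5: +0.236
```

Each added rater buys less than the one before it, so the curve flattens steadily as N grows.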

N = 3 Diminishing returns begin — optimal rater number for most use cases
+0.003 / +0.007 Crisis detection F1 gain (Germany / USA) from 1→5 raters
Hands · Helm Most contested nodes — highest rater uncertainty in both countries
Archive · Lore Highest consensus nodes — lowest inter-rater SEM
Aggregation verdict
The ensemble mean is the optimal aggregator across all tests. Median and trimmed mean offer no measurable advantage and should not replace the mean. Per cams_framework_v2_3.py, the ensemble mean remains the canonical estimator; the SEM envelope is the confidence qualifier.
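To make the three aggregators concrete, here is a minimal sketch using Python's standard `statistics` module. The five scores are invented for illustration, and `trimmed_mean` is a hypothetical helper, not the aggregation code in cams_framework_v2_3.py.

```python
import statistics


def trimmed_mean(scores, trim=1):
    """Mean after dropping the `trim` lowest and `trim` highest scores."""
    ordered = sorted(scores)
    kept = ordered[trim:len(ordered) - trim]
    return statistics.fmean(kept)


# Hypothetical scores from five raters for one node in one year (invented).
scores = [6.0, 6.5, 7.0, 7.5, 9.0]

ensemble_mean = statistics.fmean(scores)          # 7.2
ensemble_median = statistics.median(scores)       # 7.0
ensemble_trimmed = trimmed_mean(scores, trim=1)   # 7.0

# SEM envelope: sample standard deviation divided by sqrt(N).
sem = statistics.stdev(scores) / len(scores) ** 0.5

print(ensemble_mean, ensemble_median, ensemble_trimmed, round(sem, 3))
```

Reporting `ensemble_mean` together with `sem` matches the pairing described above: the mean as the estimate and the SEM envelope as its confidence qualifier.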
Robustness curves, crisis detection F1, temporal autocorrelation, node-level uncertainty and SNR
Fig. 5. Top row: robustness curve (correlation vs N raters), diminishing returns after N=3, optimal aggregation comparison. Middle: crisis detection F1 scores, temporal autocorrelation of uncertainty by lag. Bottom: node-level uncertainty and SNR — Germany (left) and USA (right). Hands and Helm consistently show widest rater spread.

Part III

CAMS Signature Detection & Contestation

The Contestation Index — normalised ensemble SEM — is validated as a new CAMS diagnostic. Its peaks match known historiographical battlegrounds with high precision. Germany's index peaks at 1921–1923 (Weimar hyperinflation and institutional collapse) and 1930–1932 (Nazi ascent), while the USA peaks at 1929–1933 (Great Depression) and 2008–2009 (GFC). These are exactly the periods where historical interpretation is most contested in the scholarly literature.
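The text defines the index only as "normalised ensemble SEM" and does not say which normalisation is used. The sketch below assumes min-max scaling to [0, 1] within one country's series, which is an assumption; the SEM values are invented for illustration and are not CAMS output.

```python
def contestation_index(sem_by_year):
    """Min-max normalise yearly SEM values to [0, 1].

    The choice of min-max scaling is an assumption; the source does not
    specify the normalisation.
    """
    low = min(sem_by_year.values())
    high = max(sem_by_year.values())
    if high == low:
        # A flat series has no relative contestation to rank.
        return {year: 0.0 for year in sem_by_year}
    return {year: (sem - low) / (high - low) for year, sem in sem_by_year.items()}


# Hypothetical SEM values, invented for illustration only.
sem_by_year = {1919: 0.30, 1920: 0.35, 1921: 0.50, 1922: 0.60, 1923: 0.55, 1924: 0.32}

index = contestation_index(sem_by_year)
peak_year = max(index, key=index.get)
print(peak_year, round(index[peak_year], 2))  # 1922 1.0
```

With this scaling the most contested year in a series always scores 1.0, so index values are comparable within a country but not automatically across countries.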

κ reliability warning
The κ criticality index shows dramatically widened confidence bounds during crisis years. In Germany 1933: κ = 17.25 [13.19, 24.91]. In USA 2020: κ = 8.55 [7.66, 9.66]. The wide bounds in contested periods mean κ is a less reliable early-warning signal precisely when history is most ambiguous. This does not invalidate κ — it correctly identifies when to hedge the call.
1921–1923 Germany contestation peak — Weimar hyperinflation & collapse
1930–1932 Germany contestation peak — Nazi ascent
1929–1933 USA contestation peak — Great Depression
2008–2009 USA contestation peak — Global Financial Crisis
κ = 17.25 [13–25] Germany 1933 — wide bounds signal contested historiography
99.56% → 99.89% Executive Decoupling detection — single rater → ensemble

In the CAMS phase space, high Stress + low V̄ + high σ_V reliably predicts high uncertainty. Executive Decoupling detection improves from 99.56% (single rater) to 99.89% (ensemble) with 79% variance reduction — a practically meaningful gain for a high-stakes diagnostic.

Historical Contestation Index, Phase Space, κ Criticality with bounds, Signature Detection
Fig. 6. Top row: Historical Contestation Index time series (Germany and USA), phase space coloured by uncertainty. Middle: κ criticality with 90% confidence bounds — note widening during 1933 and crisis periods generally. Bottom: uncertainty vs System Health (left), uncertainty vs Node Dispersion (right), signature detection comparison (ensemble vs single rater).

Operational Implications for CAMS

These findings do not revise the CAMS framework — they refine how its outputs should be reported and qualified.

Rule 1 — Canonical estimator
The ensemble mean remains the canonical estimator for all CAMS node values. No alternative aggregator (median, trimmed mean) improves on it. This confirms current practice in cams_framework_v2_3.py.
Rule 2 — Envelope as confidence qualifier
The SEM envelope must be reported alongside all diagnostics, not treated as optional metadata. High-SEM periods change the interpretation of all derived quantities (κ, EDEWS, Contestation Index, Executive Decoupling).
Rule 3 — Amber reliability threshold
When the Contestation Index exceeds 0.7, all κ thresholds and Executive Decoupling calls should carry an amber reliability warning. The signal is still meaningful, but the confidence bounds are wide enough that binary pass/fail calls are misleading. Report as a range, not a point estimate.
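Rules 2 and 3 could be wired into reporting along the following lines. This is a minimal sketch: `KappaCall` and its fields are hypothetical, "exceeds 0.7" is read as a strict inequality, and the Contestation Index values in the example are invented. Only the κ values and bounds are quoted from the reliability warning in this report.

```python
from dataclasses import dataclass

AMBER_THRESHOLD = 0.7  # Contestation Index cut-off stated in Rule 3


@dataclass
class KappaCall:
    """One kappa reading with its confidence bounds (hypothetical container)."""
    label: str
    kappa: float
    lower: float
    upper: float
    contestation_index: float

    @property
    def amber(self) -> bool:
        # "Exceeds 0.7" is read as a strict inequality.
        return self.contestation_index > AMBER_THRESHOLD

    def report(self) -> str:
        if self.amber:
            # Contested period: report the range, not a point estimate.
            return f"{self.label}: kappa in [{self.lower:.2f}, {self.upper:.2f}] (AMBER)"
        return f"{self.label}: kappa = {self.kappa:.2f} [{self.lower:.2f}, {self.upper:.2f}]"


# kappa values and bounds are quoted from the kappa reliability warning;
# the Contestation Index values (0.85, 0.40) are invented for illustration.
germany_1933 = KappaCall("Germany 1933", 17.25, 13.19, 24.91, contestation_index=0.85)
usa_2020 = KappaCall("USA 2020", 8.55, 7.66, 9.66, contestation_index=0.40)

print(germany_1933.report())  # Germany 1933: kappa in [13.19, 24.91] (AMBER)
print(usa_2020.report())      # USA 2020: kappa = 8.55 [7.66, 9.66]
```

In both branches the bounds are always printed, in line with Rule 2; the amber branch simply withholds the point estimate.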
What the ensemble does not do
The ensemble does not merely cancel random noise. Ensemble disagreement is structured, autocorrelated, and historiographically meaningful. A high-SEM period is not a data quality problem; it is a finding. The model is telling you that historians disagree here, and your reported confidence should reflect that disagreement.

Downloads

Full analysis paper and figures from the Kimi ensemble validation run, 2026-05-29.