Chi-Square & Cramér's V · No Machine Learning

Overall accuracy is an average. Averages hide where a system fails.

CONFIRM is a statistical audit for any system that sorts things into categories — reward models, classifiers, lithofacies predictions. It doesn't ask "how good is this on average." It asks whether the system is consistently reliable across every category it claims to handle, and proves the answer with a contingency table anyone can re-check by hand.


Zero AI

Illustrative Model · 5 × 2 Table · n = 1,763

Domain	Correct	Incorrect	Accuracy (%)
Safety	441	9	98
Focus	396	99	80
Factuality	361	114	76
Math	122	61	67
Precise IF	61	99	38

Pattern drawn from the published case study (RAMO‑Llama3.1‑8B, Cramér's V = 0.467): 98% accuracy on Safety, 38% on Precise Instruction Following. Same model, a 60‑point gap. One overall score would have shown neither number.
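The arithmetic behind that gap can be re-checked in a few lines. The following is a minimal Python sketch using the counts from the illustrative table above; the variable and function names are hypothetical and not part of CONFIRM.

```python
# Counts copied from the illustrative 5 x 2 table: (correct, incorrect).
table = {
    "Safety": (441, 9),
    "Focus": (396, 99),
    "Factuality": (361, 114),
    "Math": (122, 61),
    "Precise IF": (61, 99),
}


def accuracy(correct, incorrect):
    """Share of correct calls, in percent."""
    return 100.0 * correct / (correct + incorrect)


# Accuracy inside each domain.
per_domain = {name: accuracy(c, i) for name, (c, i) in table.items()}

# Pooled accuracy: the single number a leaderboard would report.
total_correct = sum(c for c, _ in table.values())
total_cases = sum(c + i for c, i in table.values())
overall = 100.0 * total_correct / total_cases

# Gap between the best and the worst domain.
spread = max(per_domain.values()) - min(per_domain.values())

for name, acc in per_domain.items():
    print(f"{name:<12} {acc:5.1f}%")
print(f"{'Overall':<12} {overall:5.1f}%")
print(f"{'Spread':<12} {spread:5.1f} points")
```

The pooled figure sits between the extremes and says nothing about how far apart they are, which is the point of reporting the table rather than the average.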

01 What It Measures

Two questions, asked of any classification system.

When a system sorts cases into categories — sandstone or shale, approve or decline, correct or incorrect — and you have ground truth to check it against, CONFIRM runs the comparison through two classical tests.

Chi-square test

Is the relationship between the system's calls and the actual outcomes real — or could a pattern this strong have appeared by chance alone? This is the question every contingency table in this study had to answer before anything else was reported.
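As a concrete illustration, here is a minimal Python sketch of the chi-square test of independence, run on the illustrative 5 × 2 counts shown earlier. The function names are hypothetical and this is not CONFIRM's own code. The p-value helper relies on a closed form that only holds for even degrees of freedom, which happens to cover a 5 × 2 table (4 degrees of freedom); other table shapes would need a statistics library or a printed chi-square table.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if category and outcome were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_even_dof(stat, dof):
    """Upper-tail chi-square probability; this closed form needs an even dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form shown here needs a positive even dof")
    half = stat / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return math.exp(-half) * terms


# Rows: Safety, Focus, Factuality, Math, Precise IF; columns: correct, incorrect.
illustrative = [[441, 9], [396, 99], [361, 114], [122, 61], [61, 99]]

stat, dof = chi_square(illustrative)
p = p_value_even_dof(stat, dof)
print(f"chi-square = {stat:.1f}, dof = {dof}, p = {p:.3g}")
```

Every expected count is just row total times column total divided by n, so each cell of the calculation can be verified by hand.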

Cramér's V

How strong is that relationship, and does it hold evenly across every category the system is supposed to handle? V runs from 0 (no association between category and outcome) to 1 (outcome fully determined by category). A system that's excellent everywhere has low V. A system that's poor everywhere also has low V: its failures are as consistent as the other's successes. High V means the reliability is uneven: strong in some categories, weak in others.
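The behaviour described here can be sketched in Python using the standard formula V = sqrt(χ² / (n · (min(rows, columns) − 1))). The two "flat" tables below are made-up toy inputs chosen only to illustrate the point, and the third is the illustrative table from the top of this page. Because those illustrative counts are not the study's data, the resulting V need not match the published 0.467.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c table of counts (no empty rows or columns)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Normalise by sample size and by the smaller table dimension minus one.
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Toy tables (hypothetical counts): columns are correct, incorrect.
excellent_everywhere = [[95, 5]] * 5   # 95% correct in every category
poor_everywhere = [[40, 60]] * 5       # 40% correct in every category

# Illustrative table from this page.
uneven = [[441, 9], [396, 99], [361, 114], [122, 61], [61, 99]]

for label, table in [
    ("excellent everywhere", excellent_everywhere),
    ("poor everywhere", poor_everywhere),
    ("uneven", uneven),
]:
    print(f"{label:<22} V = {cramers_v(table):.3f}")
```

Both flat tables have identical rows, so observed and expected counts coincide and V is zero regardless of how good or bad the system is.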

What a grade isn't

A CONFIRM grade is not a verdict on whether to use a system. It's a measurement of how unevenly its reliability is distributed. Whether that unevenness matters depends entirely on what you need the system to do, and where you can least afford it to fail.

CONFIRM contains no machine learning, no neural networks, no language models. Chi-square and Cramér's V are early-20th-century statistics, in continuous use across medicine, psychology, and engineering. Any result here can be reproduced with a contingency table and a pencil.

02 The RewardBench 2 Study

174 reward models. Every one of them came back uneven.

Reward models are the second model in the room — they score a large language model's candidate answers during training, and that score becomes the training signal. CONFIRM ran a 5×2 contingency table (domain × correct/incorrect) on every reward model publicly scored against RewardBench 2, a benchmark from the Allen Institute for AI covering Factuality, Math, Precise Instruction Following, Focus, and Safety.
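For readers who want to see how a domain × correct/incorrect table comes together, the following Python sketch tallies per-item results into five rows of two counts. The record format (a domain name paired with a true/false outcome) and the sample records are hypothetical; the study's actual data layout is not described here.

```python
from collections import Counter

# Row order for the 5 x 2 table; "Precise IF" abbreviates
# Precise Instruction Following.
DOMAINS = ["Factuality", "Math", "Precise IF", "Focus", "Safety"]


def contingency_table(records):
    """Tally (domain, is_correct) pairs into rows of [correct, incorrect]."""
    counts = Counter(records)
    # Counter returns 0 for combinations that never occurred.
    return [[counts[(d, True)], counts[(d, False)]] for d in DOMAINS]


# Hypothetical per-item results, for illustration only.
sample = [
    ("Safety", True), ("Safety", True), ("Math", False),
    ("Math", True), ("Precise IF", False), ("Focus", True),
]

table = contingency_table(sample)
for domain, (correct, incorrect) in zip(DOMAINS, table):
    print(f"{domain:<12} correct={correct} incorrect={incorrect}")
```

A domain with no scored items produces an all-zero row, which would have to be dropped before running a chi-square test, since its expected counts would be zero.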

Models rejecting independence: 174 / 174
Median Cramér's V: 0.272
Models weakest at Precise IF: 93.1%
Models strongest at Safety: 70.7%
Worst-case significance: p < 0.02

Where the 174 models fall on Cramér's V

Each mark is one model, positioned by association strength. An illustrative spread anchored to the reported percentiles — not the literal per-model dataset.

Axis marks: 0.05, 0.10 (grade C), 0.20 (grade B), 0.30 (grade A), 0.467 (max)

Illustrative Case Study

RAMO‑Llama3.1‑8B — Cramér's V 0.467, the most uneven model in the sample

Two domains, same model, same leaderboard entry:

Safety: 98 / 100
Precise Instruction Following: 38 / 100
Spread, same model: 60 pts

A single leaderboard number for this model would average to something respectable. It would not tell you that anything built on its instruction-following judgment inherits a blind spot the headline score never shows.

03 CONFIRM v2 Grading Scale

Grades are anchored to measured effect sizes — not arbitrary cutoffs.

CONFIRM v2 thresholds are calibrated to meta-analytic effect-size benchmarks (Funder & Ozer, 2019; Gignac & Szodorai, 2016) rather than Cohen's 1988 conventions, which are widely cited but, by current consensus, overly strict.

Grade	Cramér's V (p ≤ 0.05)	Reading
A	≥ 0.30	Strong, uneven pattern: roughly 65 vs. 35 per 100 across domains
B	0.20 to < 0.30	Clear pattern: roughly 60 vs. 40 per 100
C	0.10 to < 0.20	Modest pattern: roughly 55 vs. 45 per 100
D	0.05 to < 0.10	Faint but statistically real pattern
F	< 0.05, powered	No distinguishable unevenness detected
I	underpowered	Too little data to tell, which is not the same as "flat"

In the 174-model study: 64 graded A, 88 graded B, 20 graded C, 2 graded D. None graded F or I — every model had enough data to detect even a small real pattern, and every model showed one.
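The scale can be read as a simple lookup. The Python sketch below is one possible reading of the table, not CONFIRM's implementation: the function name is hypothetical, `powered` is a flag the caller supplies (how statistical power is assessed is not specified on this page), and treating a non-significant result as F when powered and I when underpowered is an assumption.

```python
def confirm_grade(v, p, powered, alpha=0.05):
    """Map Cramér's V and a chi-square p-value to a letter grade.

    `powered` is a caller-supplied flag; how power is assessed is an
    assumption left outside this sketch.
    """
    if p <= alpha and v >= 0.05:
        # Significant pattern: grade by effect size, largest threshold first.
        for threshold, letter in ((0.30, "A"), (0.20, "B"), (0.10, "C"), (0.05, "D")):
            if v >= threshold:
                return letter
    # No gradeable pattern: F only if the data could have shown one.
    return "F" if powered else "I"


# Values quoted on this page: the most uneven model and the study median.
print(confirm_grade(0.467, p=0.001, powered=True))
print(confirm_grade(0.272, p=0.001, powered=True))
```

The p-values passed in the usage lines are placeholders below the 0.05 cutoff; the page reports only that every model in the study had p < 0.02.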

04 Why This Isn't Just About AI

CONFIRM didn't start in AI. It started under the ground.

Major oil & gas operators needed to know whether lithofacies classifications coming out of a geophysical model — sandstone or shale, predicted from seismic and well-log data — could be trusted, not just on average, but across every rock type the model claimed to distinguish. A misclassified formation sends a drilling program at the wrong target. The tool didn't exist, so TraceSeis built it.

Where it started

Lithofacies classification

A model based on self-organizing maps (SOM) predicts sandstone, shale, or carbonate from seismic and well-log signatures. CONFIRM checks whether that classification holds up evenly across every formation type, or performs well in some and collapses in others. That is the difference between a model you can drill on and one you can't.

Where it's applied here

Reward model evaluation

A reward model predicts which of two AI-generated answers a human would prefer, across categories like Safety or Math. CONFIRM checks whether that judgment holds up evenly across every category, or performs well in some and collapses in others — the difference between a training signal you can trust and one with a blind spot baked in.

χ² test of independence → Cramér's V → grade — identical procedure, different categories

Get an Independent Read

Bring us a contingency table. Leave with a defensible answer.

If a system you rely on makes categorical calls — and something downstream depends on it being reliable in every category, not just on average — CONFIRM gives you a reproducible answer in days, not a black box.


TraceSeis, Inc. Houston, Texas
info@traceseis.com

Methods: A. Chaveste Fernandez, "Subset Heterogeneity in Reward Model Benchmarks: A CONFIRM Validation of 174 RewardBench 2 Models," 2026. Available on request.