CONFIRM is a statistical audit for any system that sorts things into categories — reward models, classifiers, lithofacies predictions. It doesn't ask "how good is this on average." It asks whether the system is consistently reliable across every category it claims to handle, and proves the answer with a contingency table anyone can re-check by hand.
| Domain | Correct | Incorrect | Accuracy (%) |
|---|---|---|---|
| Safety | 441 | 9 | 98 |
| Focus | 396 | 99 | 80 |
| Factuality | 361 | 114 | 76 |
| Math | 122 | 61 | 67 |
| Precise IF | 61 | 99 | 38 |
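When a system sorts cases into categories (sandstone or shale, approve or decline, correct or incorrect) and you have ground truth to check it against, CONFIRM runs the comparison through two classical statistics: a chi-square test of independence, which asks whether accuracy differs across categories by more than chance, and Cramér's V, which measures how strong that difference is.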
CONFIRM contains no machine learning, no neural networks, no language models. The chi-square test dates to 1900 and Cramér's V to 1946; both have been in continuous use across medicine, psychology, and engineering ever since. Any result here can be reproduced with a contingency table and a pencil.
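The pencil-and-paper claim can be checked with a few lines of plain Python. The sketch below is not CONFIRM's own code; it applies the textbook Pearson chi-square and Cramér's V formulas to the counts in the domain table at the top of this page. The function names are hypothetical, and the p-value helper uses a closed form that is valid only for an even number of degrees of freedom (a 5×2 table has 4).

```python
import math

# (correct, incorrect) counts copied from the domain table at the top of this page.
table = {
    "Safety": (441, 9),
    "Focus": (396, 99),
    "Factuality": (361, 114),
    "Math": (122, 61),
    "Precise IF": (61, 99),
}


def chi_square(rows):
    """Return (Pearson chi-square statistic, degrees of freedom, N) for an r x c table."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            # Expected count if accuracy were the same in every domain.
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof, n


def cramers_v(stat, n, n_rows, n_cols):
    """Cramér's V: chi-square rescaled to the 0..1 range."""
    return math.sqrt(stat / (n * min(n_rows - 1, n_cols - 1)))


def chi2_sf_even_dof(stat, dof):
    """Upper-tail p-value of the chi-square distribution (even dof only)."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form implemented for positive even dof only")
    half = stat / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))


rows = list(table.values())
stat, dof, n = chi_square(rows)
v = cramers_v(stat, n, len(rows), 2)
p = chi2_sf_even_dof(stat, dof)
print(f"chi2 = {stat:.1f}, dof = {dof}, N = {n}, V = {v:.2f}, p = {p:.2e}")
```

For these counts the statistic is driven mostly by Safety (far above the pooled accuracy) and Precise IF (far below it), which is exactly the unevenness a single average hides.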
Reward models are the second model in the room — they score a large language model's candidate answers during training, and that score becomes the training signal. CONFIRM ran a 5×2 contingency table (domain × correct/incorrect) on every reward model publicly scored against RewardBench 2, a benchmark from the Allen Institute for AI covering Factuality, Math, Precise Instruction Following, Focus, and Safety.
Figure: each mark is one model, positioned by association strength. The spread is illustrative and anchored to the reported percentiles; it is not the literal per-model dataset.
Two domains, same model, same leaderboard entry:
A single leaderboard number for this model would average to something respectable. It would not tell you that anything built on its instruction-following judgment inherits a blind spot the headline score never shows.
CONFIRM v2 thresholds are calibrated to meta-analytic effect-size benchmarks (Funder & Ozer, 2019; Gignac & Szodorai, 2016) rather than Cohen's widely cited 1988 conventions, which are now generally regarded as overly strict.
| Grade | Cramér's V (p ≤ 0.05) | Reading |
|---|---|---|
| A | ≥ 0.30 | Strong, uneven pattern: roughly 65 vs. 35 per 100 across domains |
| B | 0.20 to < 0.30 | Clear pattern: roughly 60 vs. 40 per 100 |
| C | 0.10 to < 0.20 | Modest pattern: roughly 55 vs. 45 per 100 |
| D | 0.05 to < 0.10 | Faint but statistically real pattern |
| F | < 0.05, powered | No distinguishable unevenness detected |
| I | underpowered | Too little data to tell; not the same as "flat" |
In the 174-model study: 64 graded A, 88 graded B, 20 graded C, 2 graded D. None graded F or I — every model had enough data to detect even a small real pattern, and every model showed one.
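The grade table can be read as a simple lookup. The sketch below is one possible encoding, not CONFIRM's implementation: `grade` and `besd_split` are hypothetical names, the `powered` flag stands in for a power analysis the text does not describe, and sending every non-significant result to F or I is an assumption, since the table does not say how a V of 0.05 or more with p > 0.05 is handled. The "per 100" readings appear consistent with the binomial effect size display (rates of 50 ± 50·V per 100), which is strictly a 2×2 interpretation and only a rough guide for a 5×2 table.

```python
def grade(v, p, powered, alpha=0.05):
    """Map Cramér's V and a p-value to a letter grade (assumed reading of the table)."""
    if p <= alpha and v >= 0.05:
        if v >= 0.30:
            return "A"
        if v >= 0.20:
            return "B"
        if v >= 0.10:
            return "C"
        return "D"
    # Assumption: no significant pattern of at least 0.05 was found.
    # With adequate power that is read as "flat" (F), otherwise as inconclusive (I).
    return "F" if powered else "I"


def besd_split(v):
    """Binomial effect size display: the two rates per 100 implied by an effect of size v."""
    return 50 + 50 * v, 50 - 50 * v


for v in (0.30, 0.20, 0.10):
    high, low = besd_split(v)
    print(f"V = {v:.2f}: grade {grade(v, 0.001, True)}, about {high:.0f} vs. {low:.0f} per 100")
```

Lower bounds are inclusive here, so a V of exactly 0.30 grades A and exactly 0.20 grades B, matching the "≥ 0.30" row.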
Major oil & gas operators needed to know whether lithofacies classifications coming out of a geophysical model — sandstone or shale, predicted from seismic and well-log data — could be trusted, not just on average, but across every rock type the model claimed to distinguish. A misclassified formation sends a drilling program at the wrong target. The tool didn't exist, so TraceSeis built it.
A model based on a self-organizing map (SOM) predicts sandstone, shale, or carbonate from seismic and well-log signatures. CONFIRM checks whether that classification holds up evenly across every formation type, or performs well in some and collapses in others: the difference between a model you can drill on and one you can't.
A reward model predicts which of two AI-generated answers a human would prefer, across categories like Safety or Math. CONFIRM checks whether that judgment holds up evenly across every category, or performs well in some and collapses in others — the difference between a training signal you can trust and one with a blind spot baked in.
If a system you rely on makes categorical calls — and something downstream depends on it being reliable in every category, not just on average — CONFIRM gives you a reproducible answer in days, not a black box.