Confidence Calibration Framework
How confidence ratings are assigned
Confidence in a Veridi assessment represents the strength of the evidence supporting the verdict, not certainty that the verdict is correct. A claim can receive a clear verdict (FALSE) with moderate confidence (75%) if the evidence is strong but comes only from Tier 2 sources.
Structural ceilings
Confidence is capped by the quality of the best available sourcing:
| Sourcing Level | Confidence Ceiling |
|---|---|
| Multiple Tier 1 sources in agreement | 95% |
| Tier 1 + Tier 2 corroboration | 90% |
| Tier 2 sources only | 80% |
| Tier 3 with corroboration | 65% |
| Tier 4 only | 50% |
| No sourcing / assertion only | 25% |
These ceilings are structural and non-negotiable. They prevent the system from being tricked by volume; multiple low-quality sources cannot substitute for a single high-quality one.
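The ceiling lookup can be sketched as a simple clamp. This is an illustrative sketch only: the `SourcingLevel` names, the `cap_confidence` function, and the data shapes are hypothetical, not the production schema; the ceiling values come from the table above.

```python
# Hypothetical sketch of the structural-ceiling clamp. Enum names and the
# function signature are illustrative; ceiling values mirror the table.
from enum import Enum

class SourcingLevel(Enum):
    MULTIPLE_TIER1 = "multiple Tier 1 sources in agreement"
    TIER1_PLUS_TIER2 = "Tier 1 + Tier 2 corroboration"
    TIER2_ONLY = "Tier 2 sources only"
    TIER3_CORROBORATED = "Tier 3 with corroboration"
    TIER4_ONLY = "Tier 4 only"
    ASSERTION_ONLY = "no sourcing / assertion only"

CONFIDENCE_CEILING = {
    SourcingLevel.MULTIPLE_TIER1: 0.95,
    SourcingLevel.TIER1_PLUS_TIER2: 0.90,
    SourcingLevel.TIER2_ONLY: 0.80,
    SourcingLevel.TIER3_CORROBORATED: 0.65,
    SourcingLevel.TIER4_ONLY: 0.50,
    SourcingLevel.ASSERTION_ONLY: 0.25,
}

def cap_confidence(raw_confidence: float, sourcing: SourcingLevel) -> float:
    """Clamp a raw confidence estimate to the ceiling for its sourcing level."""
    return min(raw_confidence, CONFIDENCE_CEILING[sourcing])
```

Because the ceiling is a clamp rather than a score component, ten Tier 4 sources still cap out at 50%: the lookup keys on the *best* sourcing level, not on source count.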
Field reliability coefficients
Each academic and scientific field has a reliability coefficient reflecting published replication rates and methodological stability. These coefficients are disclosed annotations, not confidence multipliers.
| Field | Coefficient | Source |
|---|---|---|
| Mathematics | 0.99 | Expert estimate: proofs are deductively verified |
| Physics / Chemistry | 0.95 | Expert estimate: consistent with high replication rates |
| Climate Science (physical) | 0.90 | Expert estimate: informed by IPCC assessment confidence levels |
| Engineering | 0.90 | Expert estimate: needs empirical validation |
| Genomics / Molecular Biology | 0.85 | Expert estimate: needs empirical validation |
| Clinical Medicine (RCTs) | 0.80 | Ioannidis (2005), subsequent replication efforts |
| Economics (micro/experimental) | 0.75 | Camerer et al. (2016) - ~61% replication rate |
| Epidemiology / Public Health | 0.70 | Expert estimate: informed by meta-analytic variation |
| Economics (macro) | 0.60 | Expert estimate: needs empirical validation |
| Political Science | 0.60 | Expert estimate: needs empirical validation |
| Psychology (post-2015) | 0.55 | Open Science Collaboration (2015) - ~36-39% replication |
| Social Media / Digital | 0.55 | Expert estimate: needs empirical validation |
| Nutrition Science | 0.50 | Expert estimate: informed by Ioannidis (2018) |
Sourcing honesty rules
Each coefficient carries an explicit label:
- Peer-reviewed source: cites the specific paper or meta-analysis. Only Clinical Medicine (Ioannidis 2005), Psychology (OSC 2015), and Economics-micro (Camerer et al. 2016) have direct empirical grounding; Physics/Chemistry is inferred from consistently high replication rates rather than a single study.
- Expert estimate: explicitly labeled "expert estimate - needs empirical validation." The label is deliberate: presenting unsourced numbers as authoritative is calibration theater.
Within-field variation can be substantial. A nutrition epidemiology claim (~0.40) and a nutrition RCT claim (~0.70) are very different, even though both fall under “Nutrition Science” (0.50). When evaluating a specific claim, the assessment notes if the sub-field diverges significantly from the field average.
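One way to keep the honesty rule enforceable is to make each coefficient carry its grounding label in the data itself. The record structure and names below are illustrative, not the production format; the values and citations mirror the table above.

```python
# Illustrative sketch: a coefficient record that cannot be separated from
# its sourcing label. Structure and names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldCoefficient:
    field: str
    coefficient: float
    grounding: str   # "peer-reviewed" or "expert estimate"
    citation: str

PSYCHOLOGY = FieldCoefficient(
    field="Psychology (post-2015)",
    coefficient=0.55,
    grounding="peer-reviewed",
    citation="Open Science Collaboration (2015) - ~36-39% replication",
)

MACRO_ECON = FieldCoefficient(
    field="Economics (macro)",
    coefficient=0.60,
    grounding="expert estimate",
    citation="needs empirical validation",
)

def disclosure_note(fc: FieldCoefficient) -> str:
    """Render the label that accompanies every coefficient in an assessment."""
    if fc.grounding == "expert estimate":
        return f"{fc.field}: {fc.coefficient} (expert estimate - needs empirical validation)"
    return f"{fc.field}: {fc.coefficient} (source: {fc.citation})"
```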
How ceilings and coefficients interact
The problem they solve
A prototype used multiplicative interaction: tier ceiling × field coefficient = confidence. This produced absurd results: the Tier 2 ceiling (80%) × the Nutrition coefficient (0.50) yielded 40% confidence for a well-sourced nutrition claim. The math punished good sourcing in contested fields.
The current interaction
Tier ceilings and field coefficients serve different epistemic functions and do not multiply.
- Tier ceilings cap confidence based on sourcing quality. They answer: “How reliable is our evidence chain?”
- Field coefficients are disclosure annotations. They answer: “How often do findings in this field hold up over time?”
A well-sourced nutrition claim with Tier 1 evidence receives confidence up to the tier ceiling. The field coefficient (0.50) is disclosed as context, since the reader should know that nutrition science has a low replication rate, but it does not mechanically reduce the confidence rating.
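The non-multiplicative interaction can be sketched as follows. All names here are hypothetical, and the ceiling and coefficient values are taken from the tables above: the tier ceiling caps the number, while the field coefficient travels alongside it as disclosure only.

```python
# Sketch: the tier ceiling caps confidence; the field coefficient is
# attached as a disclosure annotation and never multiplied in.
TIER_CEILING = {"tier1_multiple": 0.95, "tier1_plus_tier2": 0.90, "tier2_only": 0.80}
FIELD_COEFFICIENT = {"nutrition": 0.50, "psychology": 0.55}

def assess(raw_confidence: float, sourcing: str, field: str) -> dict:
    capped = min(raw_confidence, TIER_CEILING[sourcing])
    return {
        "confidence": capped,                          # capped, not multiplied
        "field_disclosure": FIELD_COEFFICIENT[field],  # shown as context only
    }

# A well-sourced nutrition claim keeps its tier-based confidence:
result = assess(0.92, "tier1_plus_tier2", "nutrition")
# result["confidence"] is 0.90, not 0.90 * 0.50 = 0.45
```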
Brier score tracking
The methodology includes a framework for tracking calibration accuracy over time using Brier scores. For each assessment, the confidence rating is recorded alongside the eventual outcome (when known). This allows measurement of whether, for example, claims rated at 80% confidence are actually correct approximately 80% of the time.
This tracking system is designed but has not yet accumulated enough resolved assessments for statistically meaningful calibration analysis. As the system runs in production, calibration data will accumulate and the framework will provide empirical feedback on confidence accuracy.
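The calibration measurement described above uses the standard Brier score: the mean squared difference between stated confidence and the binary outcome. The record format below is illustrative; the formula is the standard one.

```python
# Brier score over recorded (confidence, outcome) pairs. Lower is better;
# always answering 0.5 scores 0.25, and perfect foresight scores 0.0.
def brier_score(records):
    """records: list of (confidence, outcome), with outcome 1 if the
    verdict held up and 0 if it did not."""
    return sum((p - o) ** 2 for p, o in records) / len(records)

# Well-calibrated 80% claims: 8 of 10 hold up.
sample = [(0.8, 1)] * 8 + [(0.8, 0)] * 2
# brier_score(sample) = (8 * 0.04 + 2 * 0.64) / 10, approximately 0.16
```

A systematically overconfident system (80% claims holding up only half the time) would score noticeably worse, which is exactly the drift this tracking is meant to surface.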
Additional calibration mechanisms
Breaking event ceiling: Claims about events less than 72 hours old receive an automatic confidence ceiling reflecting the unreliability of early reporting.
Auto-escalation: If gaming flags are detected during a Standard-tier assessment, the system automatically escalates to Full tier for more thorough analysis. All 12 ADV-v2 claims triggered this mechanism correctly.
Symmetric evidence standards: The same burden of proof is applied to a claim and its counterclaim. This prevents the selective skepticism attack vector, where impossibly high standards are applied to one side while the other is accepted without evidence.