Confidence Calibration Framework

How confidence ratings are assigned

Confidence in a Veridi assessment represents the strength of the evidence supporting the verdict, not certainty that the verdict is correct. A claim can receive a clear verdict (FALSE) with moderate confidence (75%) if the evidence is strong but comes only from Tier 2 sources.


Structural ceilings

Confidence is capped by the quality of the best available sourcing:

Sourcing Level                         Confidence Ceiling
Multiple Tier 1 sources in agreement   95%
Tier 1 + Tier 2 corroboration          90%
Tier 2 sources only                    80%
Tier 3 with corroboration              65%
Tier 4 only                            50%
No sourcing / assertion only           25%

These ceilings are structural and non-negotiable. They prevent the system from being tricked by volume; multiple low-quality sources cannot substitute for a single high-quality one.
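The ceiling rule can be sketched as a lookup followed by a clamp. This is a minimal illustration; the tier keys and the `cap_confidence` helper are assumptions, not Veridi's actual API:

```python
# Structural ceilings keyed by sourcing level (values from the table above).
# Key names are illustrative, not the production schema.
TIER_CEILINGS = {
    "multiple_tier1": 0.95,
    "tier1_plus_tier2": 0.90,
    "tier2_only": 0.80,
    "tier3_corroborated": 0.65,
    "tier4_only": 0.50,
    "assertion_only": 0.25,
}

def cap_confidence(raw_confidence: float, sourcing_level: str) -> float:
    """Clamp a raw confidence estimate to the ceiling for its sourcing level."""
    ceiling = TIER_CEILINGS[sourcing_level]
    return min(raw_confidence, ceiling)

# Strong internal evidence still cannot exceed the Tier 2 ceiling:
print(cap_confidence(0.92, "tier2_only"))  # 0.8
```

Because the cap is a hard `min`, adding more sources at the same tier raises the raw estimate but never the ceiling, which is what blocks the volume attack described above.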


Field reliability coefficients

Each academic and scientific field has a reliability coefficient reflecting published replication rates and methodological stability. These coefficients are disclosed annotations, not confidence multipliers.

Field                            Coefficient   Source
Mathematics                      0.99          Expert estimate: proofs are deductively verified
Physics / Chemistry              0.95          Expert estimate: consistent with high replication rates
Climate Science (physical)       0.90          Expert estimate: informed by IPCC assessment confidence levels
Engineering                      0.90          Expert estimate: needs empirical validation
Genomics / Molecular Biology     0.85          Expert estimate: needs empirical validation
Clinical Medicine (RCTs)         0.80          Ioannidis (2005), subsequent replication efforts
Economics (micro/experimental)   0.75          Camerer et al. (2016) - ~61% replication rate
Epidemiology / Public Health     0.70          Expert estimate: informed by meta-analytic variation
Economics (macro)                0.60          Expert estimate: needs empirical validation
Political Science                0.60          Expert estimate: needs empirical validation
Psychology (post-2015)           0.55          Open Science Collaboration (2015) - ~36-39% replication
Social Media / Digital           0.55          Expert estimate: needs empirical validation
Nutrition Science                0.50          Expert estimate: informed by Ioannidis (2018)

Sourcing honesty rules

Each coefficient carries an explicit label:

  • Peer-reviewed source: Cites the specific paper or meta-analysis. Only Physics/Chemistry (inferred), Clinical Medicine (Ioannidis), Psychology (OSC 2015), and Economics-micro (Camerer 2016) have strong empirical grounding.
  • Expert estimate: Explicitly labeled as “expert estimate - needs empirical validation.” This is honest. Presenting unsourced numbers as authoritative is calibration theater.

Within-field variation can be substantial. A nutrition epidemiology claim (~0.40) and a nutrition RCT claim (~0.70) are very different, even though both fall under “Nutrition Science” (0.50). When evaluating a specific claim, the assessment notes whether the sub-field diverges significantly from the field average.
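A coefficient with its honesty label and sub-field caveat can be sketched as a small record. The `FieldCoefficient` class and its field names are hypothetical, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class FieldCoefficient:
    """One field's reliability coefficient plus its sourcing label.
    Illustrative structure; field names are assumptions."""
    field: str
    coefficient: float
    label: str          # e.g. "peer-reviewed" or "expert estimate - needs empirical validation"
    subfield_note: str = ""  # disclosed when a sub-field diverges from the field average

nutrition = FieldCoefficient(
    field="Nutrition Science",
    coefficient=0.50,
    label="expert estimate - needs empirical validation",
    subfield_note="epidemiology ~0.40, RCTs ~0.70",
)
```

Carrying the label and the sub-field note alongside the number keeps the disclosure attached wherever the coefficient is reported.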


How ceilings and coefficients interact

The problem they solve

A prototype used multiplicative interaction: tier ceiling × field coefficient = confidence. This produced absurd results: Tier 2 ceiling 80% × Nutrition 0.50 = 40% confidence for a well-sourced nutrition claim. The math punished good sourcing in contested fields.

The current interaction

Tier ceilings and field coefficients serve different epistemic functions and do not multiply.

  • Tier ceilings cap confidence based on sourcing quality. They answer: “How reliable is our evidence chain?”
  • Field coefficients are disclosure annotations. They answer: “How often do findings in this field hold up over time?”

A well-sourced nutrition claim with Tier 1 evidence gets the appropriate confidence based on the tier ceiling. The field coefficient (0.50) is disclosed as context (the reader should know that nutrition science has a low replication rate) but it does not mechanically reduce the confidence rating.
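The non-multiplicative interaction can be sketched as follows. The `assess` function and its return shape are illustrative assumptions, not the real assessment pipeline:

```python
def assess(raw_confidence: float, tier_ceiling: float, field_coefficient: float) -> dict:
    """Cap confidence by sourcing tier; attach the field coefficient
    as disclosed context rather than multiplying it in."""
    return {
        "confidence": min(raw_confidence, tier_ceiling),
        "field_replication_context": field_coefficient,  # disclosure only
    }

# Well-sourced nutrition claim: Tier 1 ceiling applies, 0.50 is context.
result = assess(0.93, 0.95, 0.50)
print(result["confidence"])  # 0.93, not 0.93 * 0.50
```

The design choice is that the coefficient travels with the verdict as an annotation the reader can weigh, while only sourcing quality can move the number itself.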


Brier score tracking

The methodology includes a framework for tracking calibration accuracy over time using Brier scores. For each assessment, the confidence rating is recorded alongside the eventual outcome (when known). This allows measurement of whether, for example, claims rated at 80% confidence are actually correct approximately 80% of the time.

This tracking system is designed but has not yet accumulated enough resolved assessments for statistically meaningful analysis. As the system is used in production, calibration data will accumulate and the framework will provide empirical feedback on confidence accuracy.
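Brier scoring itself is standard: the mean squared difference between stated confidence and the binary outcome. A minimal sketch, using hypothetical (confidence, outcome) records:

```python
def brier_score(forecasts):
    """Mean squared error between stated confidence and outcome
    (1 = verdict held up, 0 = it did not). Lower is better; 0.0 is perfect."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Hypothetical resolved assessments: (stated confidence, outcome)
records = [(0.80, 1), (0.80, 1), (0.80, 0), (0.95, 1)]
print(round(brier_score(records), 3))  # 0.181
```

A well-calibrated system would also show that, when records are bucketed by stated confidence, the observed hit rate in each bucket tracks the bucket's confidence level.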


Additional calibration mechanisms

Breaking event ceiling: Claims about events less than 72 hours old receive an automatic confidence ceiling reflecting the unreliability of early reporting.
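A minimal sketch of the 72-hour rule, assuming an illustrative ceiling of 0.60 (the actual ceiling value is not specified here, and the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

BREAKING_WINDOW = timedelta(hours=72)
BREAKING_CEILING = 0.60  # illustrative value; the real ceiling is not given above

def apply_breaking_ceiling(confidence: float, event_time: datetime,
                           now: datetime) -> float:
    """Cap confidence for claims about events less than 72 hours old."""
    if now - event_time < BREAKING_WINDOW:
        return min(confidence, BREAKING_CEILING)
    return confidence

# With a fixed clock for illustration:
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
print(apply_breaking_ceiling(0.9, now - timedelta(hours=10), now))   # 0.6
print(apply_breaking_ceiling(0.9, now - timedelta(hours=100), now))  # 0.9
```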

Auto-escalation: If gaming flags are detected during a Standard-tier assessment, the system automatically escalates to Full tier for more thorough analysis. All 12 ADV-v2 claims triggered this mechanism correctly.

Symmetric evidence standards: The same burden of proof is applied to a claim and its counterclaim. This prevents the selective skepticism attack vector, where impossibly high standards are applied to one side while the other is accepted without evidence.