Confidence Calibration Framework

How confidence ratings are assigned

Confidence in a Veridi assessment represents the strength of the evidence supporting the verdict, not certainty that the verdict is correct. A claim can receive a clear verdict (FALSE) with moderate confidence (75%) if the evidence is strong but comes only from Tier 2 sources.


Structural ceilings

Confidence is capped by the quality of the best available sourcing:

Sourcing Level                         Confidence Ceiling
Multiple Tier 1 sources in agreement   95%
Tier 1 + Tier 2 corroboration          90%
Tier 2 sources only                    80%
Tier 3 with corroboration              65%
Tier 4 only                            50%
No sourcing / assertion only           25%

These ceilings are structural and non-negotiable. They prevent the system from being tricked by volume; multiple low-quality sources cannot substitute for a single high-quality one.
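The ceiling rule can be sketched as a lookup followed by a clamp. This is a minimal illustration; the tier keys and the `cap_confidence` helper are assumptions, not Veridi's actual API:

```python
# Structural ceilings keyed by sourcing level (values from the table above).
# Key names are illustrative, not the production schema.
TIER_CEILINGS = {
    "multiple_tier1": 0.95,
    "tier1_plus_tier2": 0.90,
    "tier2_only": 0.80,
    "tier3_corroborated": 0.65,
    "tier4_only": 0.50,
    "assertion_only": 0.25,
}

def cap_confidence(raw_confidence: float, sourcing_level: str) -> float:
    """Clamp a raw confidence estimate to the ceiling for its sourcing level."""
    ceiling = TIER_CEILINGS[sourcing_level]
    return min(raw_confidence, ceiling)

# Strong internal evidence still cannot exceed the Tier 2 ceiling:
print(cap_confidence(0.92, "tier2_only"))  # 0.8
```

Because the cap is a hard `min`, adding more sources at the same tier raises the raw estimate but never the ceiling, which is what blocks the volume attack described above.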


Field reliability coefficients

Each academic and scientific field has a reliability coefficient reflecting published replication rates and methodological stability. These coefficients are disclosed annotations, not confidence multipliers.

Field                            Coefficient   Source
Mathematics                      0.99          Expert estimate: proofs are deductively verified
Physics / Chemistry              0.95          Expert estimate: consistent with high replication rates
Climate Science (physical)       0.90          Expert estimate: informed by IPCC assessment confidence levels
Engineering                      0.90          Expert estimate: needs empirical validation
Genomics / Molecular Biology     0.85          Expert estimate: needs empirical validation
Clinical Medicine (RCTs)         0.80          Ioannidis (2005), subsequent replication efforts
Economics (micro/experimental)   0.75          Camerer et al. (2016) - ~61% replication rate
Epidemiology / Public Health     0.70          Expert estimate: informed by meta-analytic variation
Economics (macro)                0.60          Expert estimate: needs empirical validation
Political Science                0.60          Expert estimate: needs empirical validation
Psychology (post-2015)           0.55          Open Science Collaboration (2015) - ~36-39% replication
Social Media / Digital           0.55          Expert estimate: needs empirical validation
Nutrition Science                0.50          Expert estimate: informed by Ioannidis (2018)

Sourcing honesty rules

Each coefficient carries an explicit label:

  • Peer-reviewed source: Cites the specific paper or meta-analysis. Only Physics/Chemistry (inferred), Clinical Medicine (Ioannidis), Psychology (OSC 2015), and Economics-micro (Camerer 2016) have strong empirical grounding.
  • Expert estimate: Explicitly labeled as “expert estimate - needs empirical validation.” This is honest. Presenting unsourced numbers as authoritative is calibration theater.

Within-field variation can be substantial. A nutrition epidemiology claim (~0.40) and a nutrition RCT claim (~0.70) are very different, even though both fall under “Nutrition Science” (0.50). When evaluating a specific claim, the assessment notes whether the sub-field diverges significantly from the field average.
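A coefficient with its honesty label and sub-field caveat can be sketched as a small record. The `FieldCoefficient` class and its field names are hypothetical, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class FieldCoefficient:
    """One field's reliability coefficient plus its sourcing label.
    Illustrative structure; field names are assumptions."""
    field: str
    coefficient: float
    label: str          # e.g. "peer-reviewed" or "expert estimate - needs empirical validation"
    subfield_note: str = ""  # disclosed when a sub-field diverges from the field average

nutrition = FieldCoefficient(
    field="Nutrition Science",
    coefficient=0.50,
    label="expert estimate - needs empirical validation",
    subfield_note="epidemiology ~0.40, RCTs ~0.70",
)
```

Carrying the label and the sub-field note alongside the number keeps the disclosure attached wherever the coefficient is reported.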


How ceilings and coefficients interact

The problem they solve

A prototype used multiplicative interaction: tier ceiling × field coefficient = confidence. This produced absurd results: Tier 2 ceiling 80% × Nutrition 0.50 = 40% confidence for a well-sourced nutrition claim. The math punished good sourcing in contested fields.

The current interaction

Tier ceilings and field coefficients serve different epistemic functions and do not multiply.

  • Tier ceilings cap confidence based on sourcing quality. They answer: “How reliable is our evidence chain?”
  • Field coefficients are disclosure annotations. They answer: “How often do findings in this field hold up over time?”

A well-sourced nutrition claim with Tier 1 evidence gets the appropriate confidence based on the tier ceiling. The field coefficient (0.50) is disclosed as context (the reader should know that nutrition science has a low replication rate) but it does not mechanically reduce the confidence rating.
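The non-multiplicative interaction can be sketched as follows. The `assess` function and its return shape are illustrative assumptions, not the real assessment pipeline:

```python
def assess(raw_confidence: float, tier_ceiling: float, field_coefficient: float) -> dict:
    """Cap confidence by sourcing tier; attach the field coefficient
    as disclosed context rather than multiplying it in."""
    return {
        "confidence": min(raw_confidence, tier_ceiling),
        "field_replication_context": field_coefficient,  # disclosure only
    }

# Well-sourced nutrition claim: Tier 1 ceiling applies, 0.50 is context.
result = assess(0.93, 0.95, 0.50)
print(result["confidence"])  # 0.93, not 0.93 * 0.50
```

The design choice is that the coefficient travels with the verdict as an annotation the reader can weigh, while only sourcing quality can move the number itself.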


Brier score tracking

The methodology includes a framework for tracking calibration accuracy over time using Brier scores. For each assessment, the confidence rating is recorded alongside the eventual outcome (when known). This allows measurement of whether, for example, claims rated at 80% confidence are actually correct approximately 80% of the time.

This tracking system is designed but has not yet accumulated enough resolved assessments for statistically meaningful analysis. As the system is used in production, calibration data will accumulate and the framework will provide empirical feedback on confidence accuracy.
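Brier scoring itself is standard: the mean squared difference between stated confidence and the binary outcome. A minimal sketch, using hypothetical (confidence, outcome) records:

```python
def brier_score(forecasts):
    """Mean squared error between stated confidence and outcome
    (1 = verdict held up, 0 = it did not). Lower is better; 0.0 is perfect."""
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)

# Hypothetical resolved assessments: (stated confidence, outcome)
records = [(0.80, 1), (0.80, 1), (0.80, 0), (0.95, 1)]
print(round(brier_score(records), 3))  # 0.181
```

A well-calibrated system would also show that, when records are bucketed by stated confidence, the observed hit rate in each bucket tracks the bucket's confidence level.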


Additional calibration mechanisms

Breaking event ceiling: Claims about events less than 72 hours old receive an automatic confidence ceiling reflecting the unreliability of early reporting.
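A minimal sketch of the 72-hour rule, assuming an illustrative ceiling of 0.60 (the actual ceiling value is not specified here, and the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

BREAKING_WINDOW = timedelta(hours=72)
BREAKING_CEILING = 0.60  # illustrative value; the real ceiling is not given above

def apply_breaking_ceiling(confidence: float, event_time: datetime,
                           now: datetime) -> float:
    """Cap confidence for claims about events less than 72 hours old."""
    if now - event_time < BREAKING_WINDOW:
        return min(confidence, BREAKING_CEILING)
    return confidence

# With a fixed clock for illustration:
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
print(apply_breaking_ceiling(0.9, now - timedelta(hours=10), now))   # 0.6
print(apply_breaking_ceiling(0.9, now - timedelta(hours=100), now))  # 0.9
```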

Auto-escalation: If gaming flags are detected during a Standard-tier assessment, the system automatically escalates to Full tier for more thorough analysis. All 12 ADV-v2 claims triggered this mechanism correctly.

Symmetric evidence standards: The same burden of proof is applied to a claim and its counterclaim. This prevents the selective skepticism attack vector, where impossibly high standards are applied to one side while the other is accepted without evidence.