Confidence Calibration Framework

How confidence ratings are assigned

Confidence in a Veridi assessment represents the strength of the evidence supporting the verdict, not certainty that the verdict is correct. A claim can receive a clear verdict (FALSE) with Moderate confidence if the evidence is strong but comes only from Tier 2 sources.

As of v2.5, confidence is presented as a verbal band rather than a raw integer percentage:

Band          Indicative Range
Near-Certain  91-95%
High          76-90%
Moderate      51-75%
Low           26-50%
Speculative   ≤25%

The structural ceiling imposed by the source tier is shown alongside the band as context (e.g., “High confidence · ceiling: Tier 2 sources only”), making it clear both how strong the evidence is and what structural factor bounds the rating.
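
In code form, the presentation step reduces to a threshold lookup. The sketch below is illustrative only; the function names are not part of any published Veridi interface, and the band boundaries come from the table above.

```python
# Hypothetical sketch: map a capped confidence percentage to the v2.5
# verbal band. Boundaries follow the table above.
def band_for(confidence_pct: int) -> str:
    if confidence_pct >= 91:
        return "Near-Certain"
    if confidence_pct >= 76:
        return "High"
    if confidence_pct >= 51:
        return "Moderate"
    if confidence_pct >= 26:
        return "Low"
    return "Speculative"

def present(confidence_pct: int, ceiling_note: str) -> str:
    # e.g. present(80, "Tier 2 sources only")
    # -> "High confidence · ceiling: Tier 2 sources only"
    return f"{band_for(confidence_pct)} confidence · ceiling: {ceiling_note}"
```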


Confidence in verdict vs. likelihood

As of v2.3, Veridi explicitly separates two concepts that are often conflated:

  • Confidence in Verdict (what Veridi reports): How well the evidence supports the verdict. A claim rated FALSE with 75% confidence means the evidence strongly points to FALSE, but the sourcing has limitations.
  • Likelihood (the probability the underlying claim is true): A separate question. A claim can be almost certainly false (high likelihood of falsehood) while the available evidence is indirect (moderate confidence in verdict).

This separation follows ICD 203 Standard B, which prohibits mixing analytic confidence with probability assessments. For predictive claims, Veridi also reports a likelihood expression using a standardized verbal probability scale.
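
One way to keep the two quantities from being conflated is to carry them as separate fields. The sketch below is illustrative; the field names are hypothetical, while the seven verbal probability terms are the ICD 203 scale, which the text references but does not enumerate.

```python
from dataclasses import dataclass
from typing import Optional

# The seven-term ICD 203 verbal probability scale (upper bound of each
# band -> term). Terms come from the directive; everything else in this
# sketch, including field names, is hypothetical.
ICD203_TERMS = [
    (0.05, "almost no chance"),
    (0.20, "very unlikely"),
    (0.45, "unlikely"),
    (0.55, "roughly even chance"),
    (0.80, "likely"),
    (0.95, "very likely"),
    (1.00, "almost certain"),
]

@dataclass
class Assessment:
    verdict: str                        # e.g. "FALSE"
    confidence_pct: int                 # confidence in verdict: evidence strength
    likelihood: Optional[float] = None  # P(claim is true); predictive claims only

    def likelihood_expression(self) -> Optional[str]:
        # The two numbers never mix: confidence_pct is reported as a band,
        # likelihood (if present) as a verbal probability term.
        if self.likelihood is None:
            return None
        for upper, term in ICD203_TERMS:
            if self.likelihood <= upper:
                return term
        return None
```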


Structural ceilings

Confidence is capped by the quality of the best available sourcing:

Sourcing Level                        Confidence Ceiling
Multiple Tier 1 sources in agreement  95%
Tier 1 + Tier 2 corroboration         90%
Tier 2 sources only                   80%
Tier 3 with corroboration             65%
Tier 4 only                           50%
No sourcing / assertion only          25%

These ceilings are structural and non-negotiable. They prevent the system from being tricked by volume; multiple low-quality sources cannot substitute for a single high-quality one.
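
Operationally, a ceiling is a hard cap applied after evidence scoring. A minimal sketch, assuming hypothetical names for the sourcing levels:

```python
# Hypothetical sketch: the ceilings from the table above, applied as a
# hard cap. Dict keys and function names are illustrative only.
CONFIDENCE_CEILINGS = {
    "multiple_tier1_agreement": 95,
    "tier1_plus_tier2_corroboration": 90,
    "tier2_only": 80,
    "tier3_with_corroboration": 65,
    "tier4_only": 50,
    "assertion_only": 25,
}

def apply_ceiling(raw_confidence_pct: int, sourcing_level: str) -> int:
    # The cap depends on the *best* sourcing level reached, never on the
    # number of sources, so volume cannot raise it.
    return min(raw_confidence_pct, CONFIDENCE_CEILINGS[sourcing_level])
```

Ten Tier 4 sources still resolve to the tier4_only key, so the cap stays at 50%.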


Field reliability coefficients

Each academic and scientific field has a reliability coefficient reflecting published replication rates and methodological stability. These coefficients are disclosed annotations, not confidence multipliers.

Field                           Coefficient  Source
Mathematics                     0.99         Expert estimate: proofs are deductively verified
Physics / Chemistry             0.95         Expert estimate: consistent with high replication rates
Climate Science (physical)      0.90         Expert estimate: informed by IPCC assessment confidence levels
Engineering                     0.90         Expert estimate: needs empirical validation
Genomics / Molecular Biology    0.85         Expert estimate: needs empirical validation
Clinical Medicine (RCTs)        0.80         Ioannidis (2005) and subsequent replication efforts
Economics (micro/experimental)  0.75         Camerer et al. (2016): ~61% replication rate
Epidemiology / Public Health    0.70         Expert estimate: informed by meta-analytic variation
Economics (macro)               0.60         Expert estimate: needs empirical validation
Political Science               0.60         Expert estimate: needs empirical validation
Psychology (post-2015)          0.55         Open Science Collaboration (2015): ~36-39% replication
Social Media / Digital          0.55         Expert estimate: needs empirical validation
Nutrition Science               0.50         Expert estimate: informed by Ioannidis (2018)

Sourcing honesty rules

Each coefficient carries an explicit label:

  • Peer-reviewed source: Cites the specific paper or meta-analysis. Only Physics/Chemistry (inferred), Clinical Medicine (Ioannidis), Psychology (OSC 2015), and Economics-micro (Camerer 2016) have strong empirical grounding.
  • Expert estimate: Explicitly labeled as “expert estimate - needs empirical validation.” This is transparent. Presenting unsourced numbers as authoritative is calibration theater.

Within-field variation can be substantial. A nutrition epidemiology claim (~0.40) and a nutrition RCT claim (~0.70) are very different, even though both fall under “Nutrition Science” (0.50). When evaluating a specific claim, the assessment notes if the sub-field diverges significantly from the field average.


How ceilings and coefficients interact

The problem they solve

A prototype used multiplicative interaction: tier ceiling × field coefficient = confidence. This produced absurd results: Tier 2 ceiling 80% × Nutrition 0.50 = 40% confidence for a well-sourced nutrition claim. The math punished good sourcing in contested fields.

The current interaction

Tier ceilings and field coefficients serve different epistemic functions and do not multiply.

  • Tier ceilings cap confidence based on sourcing quality. They answer: “How reliable is our evidence chain?”
  • Field coefficients are disclosure annotations. They answer: “How often do findings in this field hold up over time?”

A well-sourced nutrition claim with Tier 1 evidence gets the appropriate confidence based on the tier ceiling. The field coefficient (0.50) is disclosed as context (the reader should know that nutrition science has a low replication rate) but it does not mechanically reduce the confidence rating.
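
The difference between the two designs fits in a few lines. This is an illustrative contrast, not production code, reusing the nutrition numbers from the tables above:

```python
# Illustrative contrast between the abandoned prototype and the current
# behavior, using the nutrition example from the text.
TIER2_CEILING = 80           # Tier 2 sources only
NUTRITION_COEFFICIENT = 0.50

def prototype_confidence(raw_pct: int, ceiling: int, coefficient: float) -> float:
    # Prototype behavior (rejected): multiplying punished well-sourced
    # claims in low-replication fields, e.g. min(85, 80) * 0.50 = 40.
    return min(raw_pct, ceiling) * coefficient

def current_confidence(raw_pct: int, ceiling: int, coefficient: float):
    # Current behavior: the ceiling caps; the coefficient is only disclosed.
    capped = min(raw_pct, ceiling)
    note = f"field reliability coefficient {coefficient:.2f} (disclosed, not applied)"
    return capped, note
```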


Brier score tracking

The methodology includes a framework for tracking calibration accuracy over time using Brier scores. For each assessment, the confidence rating is recorded alongside the eventual outcome (when known).

As of v2.5, the Brier protocol defines “outcome” as correspondence to external ground truth — election results, court rulings, scientific replications, regulatory determinations, and similar independently verifiable events — rather than verdict persistence (whether the system would produce the same answer again). Each resolution is tagged with a resolution type from a defined taxonomy, making it clear what kind of external event resolved the claim.

This allows measurement of whether, for example, claims rated at 80% confidence are actually correct approximately 80% of the time. The calibration dataset currently contains 50+ entries in calibration.jsonl, though most are drawn from known test sets with pre-established ground truth. Real calibration value will come from production claims where the outcome is not known at verification time. As the system is used in production, calibration data will accumulate and the framework will provide empirical feedback on confidence accuracy.
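
The Brier score itself is the mean squared difference between the stated probability and the resolved outcome (1 if the verdict was correct, 0 if not); lower is better, and 0.25 is what always guessing 50% would earn. A minimal sketch over calibration.jsonl, assuming entry fields named confidence and correct, which the document does not specify:

```python
import json

def brier_score(path: str = "calibration.jsonl") -> float:
    """Mean squared error between stated confidence and resolved outcome.

    The field names ("confidence", "correct") are assumptions; the
    document names the file but not its schema.
    """
    scores = []
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            p = entry["confidence"] / 100.0        # e.g. 80 -> 0.80
            outcome = 1.0 if entry["correct"] else 0.0
            scores.append((p - outcome) ** 2)
    return sum(scores) / len(scores)
```

For a bucket of 80%-confidence verdicts that resolve correctly 80% of the time, the expected contribution is 0.8 × (0.8 − 1)² + 0.2 × (0.8 − 0)² = 0.16, which is what a well-calibrated system at that confidence level should approach.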


Additional calibration mechanisms

Breaking event ceiling: Claims about events less than 72 hours old receive an automatic confidence ceiling reflecting the unreliability of early reporting.
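
A sketch of the check, with the caveat that the document states the 72-hour window but not the ceiling value, so the 70% figure below is a placeholder:

```python
from datetime import datetime, timedelta, timezone

BREAKING_WINDOW = timedelta(hours=72)
BREAKING_CEILING_PCT = 70  # placeholder: the actual ceiling value is not specified

def with_breaking_ceiling(event_time: datetime, confidence_pct: int) -> int:
    # event_time must be timezone-aware; claims about events inside the
    # 72-hour window get this extra cap on top of any tier ceiling.
    if datetime.now(timezone.utc) - event_time < BREAKING_WINDOW:
        return min(confidence_pct, BREAKING_CEILING_PCT)
    return confidence_pct
```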

Auto-escalation: If gaming flags are detected during a Standard-tier assessment, the system automatically escalates to Full tier for more thorough analysis. All 12 ADV-v2 claims triggered this mechanism correctly.
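
The escalation rule amounts to a one-line tier override; names here are hypothetical:

```python
def effective_tier(requested: str, gaming_flags: list[str]) -> str:
    # Hypothetical sketch: Standard-tier runs escalate to Full when any
    # gaming flag fires; other tiers are left unchanged.
    if requested == "Standard" and gaming_flags:
        return "Full"
    return requested
```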

Symmetric evidence standards: The same burden of proof is applied to a claim and its counterclaim. This prevents the selective skepticism attack vector, where impossibly high standards are applied to one side while the other is accepted without evidence.

Evidence directness assessment (Standard+): Each assessment classifies evidence as Direct, Partially indirect, or Indirect, noting specific indirectness types (population, context, temporal, metric). This follows GRADE indirectness criteria and helps readers evaluate how closely the cited evidence addresses the specific claim.
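
The classification is small enough to state as two enumerations; the identifiers below are illustrative, with the categories and indirectness types taken from the text:

```python
from enum import Enum

class Directness(Enum):
    DIRECT = "Direct"
    PARTIALLY_INDIRECT = "Partially indirect"
    INDIRECT = "Indirect"

class IndirectnessType(Enum):
    # The four indirectness types named in the text (GRADE criteria).
    POPULATION = "population"  # evidence studies a different population
    CONTEXT = "context"        # different setting or conditions
    TEMPORAL = "temporal"      # evidence from a different time period
    METRIC = "metric"          # a proxy measure stands in for the claimed one
```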

Assumptions register (Full+): For non-straightforward verdicts, explicit assumptions are documented with consequence-if-wrong statements. At Forensic tier, assumption sensitivity analysis assesses whether each assumption, if wrong, would change the verdict.
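
A register entry might carry three fields, sketched here with hypothetical names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Assumption:
    statement: str                          # the assumption itself
    consequence_if_wrong: str               # documented for Full+ assessments
    changes_verdict: Optional[bool] = None  # set by Forensic-tier sensitivity analysis
```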