Validation Report

February 25, 2026 | Methodology version: Veridi v2.2


Summary

Veridi was tested against 97 claims spanning eight subject domains, nine verdict categories, and eleven disinformation attack vectors. Of these, 96 passed outright. One scored partial: a correct verdict with confidence below the expected range because a source was unavailable at test time.


What was tested

The validation was conducted in three phases:

Phase 1: Baseline (40 claims)

  • 3 smoke tests across different verification tiers (Quick, Standard, Full)
  • 25 golden test set claims (GTS-A) with documented ground truth from established fact-checkers or primary sources, covering all 8 specialist domains and 7 of 9 verdict categories
  • 12 single-vector adversarial claims (ADV-v1) targeting 9 gaming attack patterns, each using real institutions, real phenomena, and plausible statistics

Phase 2: Adversarial stress testing (12 claims)

  • 12 multi-vector adversarial claims (ADV-v2), each combining 2-3 gaming vectors simultaneously. The 12 break down as:
      • 4 claims based on documented real-world disinformation patterns (VAERS misuse, “died suddenly” narrative, immigration-crime statistics, FEMA hurricane diversion)
      • 2 methodology stress tests (true-facts-false-composite, fabricated citation)
      • 4 claims requiring consultation of the Institutional Reliability Index
      • 2 blocking claims testing the most common real-world attack patterns against public health fact-checking

Phase 3: Gap-filling and edge cases (45 claims)

  • 25 weakness-targeting claims (GTS-B): verdict boundaries, non-Western contexts, statistical manipulation, predictive claims, breaking events, AI-generated content, definitional disputes
  • 20 gap-filling claims (GTS-C): standalone LACKS CONTEXT, expanded MOSTLY TRUE coverage, non-English source evaluation (Japanese, Turkish, Chinese, Hindi), institutional capture scenarios, genuinely contested ground truth

Scoring criteria

Each claim was scored as Pass, Partial, or Fail:

  • Pass: Correct verdict, confidence within expected range, and (for adversarial claims) correct gaming flag detected
  • Partial: Correct verdict but confidence outside range, or correct boundary alternative, or gaming flag detected but verdict wrong
  • Fail: Wrong verdict (not the expected boundary alternative), or both gaming flag missed and verdict wrong
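Read operationally, the criteria above reduce to a small decision function. The sketch below is our reading of those rules, not the validation harness itself; parameter names are illustrative:

```python
def score_claim(verdict_correct: bool,
                boundary_alternative: bool,
                confidence_in_range: bool,
                adversarial: bool = False,
                gaming_flag_detected: bool = True) -> str:
    """Score one claim as Pass, Partial, or Fail per the criteria above."""
    # Pass: correct verdict, confidence in range, and (for adversarial
    # claims) the gaming flag was detected.
    if verdict_correct and confidence_in_range and (not adversarial or gaming_flag_detected):
        return "Pass"
    # Partial: correct verdict but confidence out of range, or the expected
    # boundary alternative, or the gaming flag caught despite a wrong verdict.
    if verdict_correct or boundary_alternative or (adversarial and gaming_flag_detected):
        return "Partial"
    # Fail: wrong verdict (not the boundary alternative) and, for
    # adversarial claims, the gaming flag missed as well.
    return "Fail"

# GTS-033's pattern: correct verdict, confidence below the expected range.
print(score_claim(verdict_correct=True, boundary_alternative=False,
                  confidence_in_range=False))  # → Partial
```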

Results

Overall

| Test Suite | Claims | Passed | Partial | Failed |
| --- | --- | --- | --- | --- |
| Smoke Tests | 3 | 3 | 0 | 0 |
| Golden Test Set A | 25 | 25 | 0 | 0 |
| Adversarial Suite v1 | 12 | 12 | 0 | 0 |
| Adversarial Suite v2 | 12 | 12 | 0 | 0 |
| Golden Test Set B | 25 | 24 | 1 | 0 |
| Golden Test Set C | 20 | 20 | 0 | 0 |
| Total | 97 | 96 | 1 | 0 |

By domain (GTS-A)

| Domain | Claims | Passed |
| --- | --- | --- |
| Scientific/Technical | 4 | 4 |
| Legal/Regulatory | 3 | 3 |
| Medical/Health | 3 | 3 |
| Financial/Economic | 3 | 3 |
| Electoral/Voting | 3 | 3 |
| Historical | 3 | 3 |
| Technology/Digital | 3 | 3 |
| Propaganda/General | 3 | 3 |

Adversarial v1: By attack vector

| Attack Vector | Claims | Gaming Detected | Verdict Correct |
| --- | --- | --- | --- |
| Confidence laundering | 2 | 2 | 2 |
| Citogenesis | 2 | 2 | 2 |
| Unverifiable-by-design | 2 | 2 | 2 |
| Tier inflation | 1 | 1 | 1 |
| Framing manipulation | 1 | 1 | 1 |
| Selective skepticism | 1 | 1 | 1 |
| Coordinated sourcing | 1 | 1 | 1 |
| Preprint pump-and-dump | 1 | 1 | 1 |
| Anchoring | 1 | 1 | 1 |

Adversarial v2: Per claim

| Claim | Attack Vectors | Expected | Produced | Confidence | Primary Flag | Result |
| --- | --- | --- | --- | --- | --- | --- |
| ADV-013: EPA Emissions Data | data-disappearance, framing | MOSTLY FALSE | MOSTLY FALSE | 82% | Detected | Pass |
| ADV-014: USDA Food Insecurity | data-disappearance, selective-skepticism | MISLEADING | MISLEADING | 80% | Detected | Pass |
| ADV-015: CDC Vaccine Guidance | institutional-capture, anchoring | MOSTLY FALSE | MOSTLY FALSE | 88% | Detected | Pass |
| ADV-016: NOAA Climate Research | institutional-capture, confidence-laundering | FALSE | FALSE | 93% | Detected | Pass |
| ADV-017: Athlete Cardiac Events | framing, anchoring, selective-skepticism | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-018: VAERS Misuse | confidence-laundering, citogenesis | FALSE | FALSE | 95% | Detected | Pass |
| ADV-019: Immigration Crime Stats | framing, data-disappearance | MISLEADING | MISLEADING | 85% | Detected | Pass |
| ADV-020: FEMA Hurricane Diversion | anchoring, confidence-laundering, coordinated | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-021: IARC Processed Meat | framing, anchoring | MISLEADING | MISLEADING | 92% | Detected | Pass |
| ADV-022: Fabricated Lancet Study | tier-inflation, confidence-laundering, anchoring | FALSE | FALSE | 88% | Detected | Pass |
| ADV-023: Temperature Adjustments | framing, selective-skepticism, anchoring | FALSE | FALSE | 95% | Detected | Pass |
| ADV-024: Great Reset Conspiracy | anchoring, framing, unverifiable-by-design | MOSTLY FALSE | MOSTLY FALSE | 92% | Detected | Pass |

Adversarial v2: Pass criteria

| Criterion | Threshold | Actual |
| --- | --- | --- |
| Claims PASS | ≥8 of 12 | 12 of 12 |
| PARTIAL limit | ≤3 | 0 |
| Blocking: ADV-015 (CDC institutional capture) | Must PASS | PASS |
| Blocking: ADV-018 (VAERS misuse) | Must PASS | PASS |
| Primary gaming flags | ≥10 of 12 | 12 of 12 |
| Total gaming flags | ≥16 of ~30 | 39 |
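The acceptance check these criteria encode is a conjunction of threshold comparisons. A minimal sketch, with thresholds and actuals copied from the table (variable names are ours, not the harness's):

```python
# ADV-v2 acceptance: every (actual, threshold-check) pair must hold.
criteria = {
    "claims_pass":   (12,   lambda v: v >= 8),   # ≥8 of 12 claims must PASS
    "partials":      (0,    lambda v: v <= 3),   # at most 3 PARTIAL results
    "adv_015_pass":  (True, lambda v: v),        # blocking: CDC institutional capture
    "adv_018_pass":  (True, lambda v: v),        # blocking: VAERS misuse
    "primary_flags": (12,   lambda v: v >= 10),  # ≥10 of 12 primary gaming flags
    "total_flags":   (39,   lambda v: v >= 16),  # ≥16 of ~30 expected total flags
}

suite_passes = all(check(actual) for actual, check in criteria.values())
print(suite_passes)  # → True
```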

GTS-B: By category

| Category | Claims | Passed | Partial |
| --- | --- | --- | --- |
| Verdict Boundary Cases | 5 | 5 | 0 |
| Non-Western Context | 5 | 4 | 1 |
| Statistical Manipulation | 5 | 5 | 0 |
| Predictive Claims | 3 | 3 | 0 |
| Breaking Event Scenarios | 3 | 3 | 0 |
| AI-Generated Content | 2 | 2 | 0 |
| Definitional Disputes | 2 | 2 | 0 |

The single partial (GTS-033, Gaza rebuilding video): correct verdict (FALSE) but confidence 80% versus expected 85-92% because the specific Misbar fact-check article was unavailable at test time, limiting sourcing to Tier 2. The methodology correctly applied its Tier 2 confidence ceiling. This reveals a source-availability limitation.
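The tier-ceiling behaviour can be illustrated with a toy cap function. The ceiling values below are assumptions for illustration only; the report implies only that the Tier 2 ceiling sits at or below 80%:

```python
# Hypothetical per-tier confidence maxima -- illustrative values, not the
# methodology's actual numbers.
TIER_CEILINGS = {1: 0.99, 2: 0.80, 3: 0.60}

def capped_confidence(raw_confidence: float, best_source_tier: int) -> float:
    """Cap reported confidence at the ceiling of the best available source tier."""
    return min(raw_confidence, TIER_CEILINGS[best_source_tier])

# GTS-033: evidence alone supported roughly 85-92% confidence, but with only
# Tier 2 sourcing available the reported confidence was held to 80%.
print(capped_confidence(0.88, 2))  # → 0.8
```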

GTS-C: Gap coverage

| Gap Targeted | Claims | Passed |
| --- | --- | --- |
| LACKS CONTEXT standalone | 5 | 5 |
| MOSTLY TRUE expansion | 4 | 4 |
| Non-English source required | 4 | 4 |
| Institutional capture (IRI) | 5 | 5 |
| Genuinely contested ground truth | 6 | 6 |

Strengths

Verdict accuracy: 97/97 verdicts correct (96 full passes; the single partial reached the correct verdict with out-of-range confidence), across claims deliberately designed to be confusing, including 18 boundary cases, 24 adversarial scenarios, and 6 genuinely contested topics.

Boundary resolution: All 18 verdict boundary tests resolved to the expected side, including the Misleading/Lacks Context and Mixed/Mostly False distinctions.

Gaming detection under realistic conditions: Primary gaming flags were detected on all 24 adversarial claims. The v2 suite flagged 39 vectors in total against approximately 30 expected, including secondary and tertiary vectors.

Institutional Reliability Index: Correctly applied to override historically Tier 1 sources (EPA, USDA, CDC, NOAA) based on documented institutional degradation. Correctly not applied to historical scientific methodology that predates the degradation (ADV-023).

Wild-caught disinformation: 4 claims based on real-world patterns (VAERS misuse, “died suddenly,” immigration-crime stats, FEMA diversion) handled correctly through analytical process - not by matching against known debunked claims.

Contested ground truth: 6 claims on genuinely ambiguous topics (COVID-19 origins, learning loss projections, minimum wage effects, affirmative action outcomes, nuclear safety, Cochrane masking review) produced correct verdicts with appropriately wide confidence ranges.

Non-English evaluation: Claims requiring Japanese, Turkish, Chinese, and Hindi source evaluation all passed.


Limitations

Near-perfect results warrant scrutiny. The test suite was designed by the same people who built the methodology. While it expanded substantially in Phase 3, external validation - where neither the claims nor the expected results are designed by the methodology’s authors - would provide stronger evidence.

Validation by the methodology’s own implementation. The fact-checks were performed by AI following the Veridi methodology. This tests whether the methodology produces correct results when followed, but does not test whether human volunteers can follow it correctly. Usability testing is a separate and necessary step.

Adversarial claims were mostly constructed. The v2 suite improved on v1 by including 4 wild-caught patterns and requiring multi-vector detection, but even the wild-caught claims were adapted for testing rather than submitted verbatim.

Scale testing has not been conducted. The methodology has been validated on 97 claims but has not been used in continuous production at scale.

Brier score calibration is pending. The confidence calibration framework includes a Brier score tracking mechanism, but insufficient data points have accumulated for statistical significance.
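The Brier score itself is simple to compute once graded verdicts accumulate; a minimal sketch of the standard formula (the function and data shapes are ours, not the tracking mechanism's):

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome.

    Each pair holds the confidence assigned to a verdict (0.0-1.0) and
    whether that verdict proved correct. Lower is better; a forecaster
    who always says 50% scores exactly 0.25.
    """
    return sum((p - float(correct)) ** 2 for p, correct in forecasts) / len(forecasts)

# Three verdicts at 90% confidence that held up, one at 60% that did not.
print(brier_score([(0.9, True), (0.9, True), (0.9, True), (0.6, False)]))  # ≈ 0.0975
```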

We are mindful that passing every test may indicate a weakness in the test suite or the validation criteria rather than a strength in the system. If you know of, or can frame, a test that Veridi will fail, we welcome the challenge and look forward to learning from it.


Full per-claim scorecards, evidence summaries, decision tree paths, and gaming countermeasure analyses are available in the methodology files.