Validation Report
February 25, 2026 | Methodology version: Veridi v2.2
Summary
Veridi was tested against 97 claims spanning eight subject domains, nine verdict categories, and eleven disinformation attack vectors. Of these, 96 passed outright. One scored Partial: the verdict was correct, but confidence fell below the expected range because a source was unavailable at test time.
What was tested
The validation was conducted in three phases:
Phase 1: Baseline (40 claims)
- 3 smoke tests across different verification tiers (Quick, Standard, Full)
- 25 golden test set claims (GTS-A) with documented ground truth from established fact-checkers or primary sources, covering all 8 specialist domains and 7 of 9 verdict categories
- 12 single-vector adversarial claims (ADV-v1) targeting 9 gaming attack patterns, each using real institutions, real phenomena, and plausible statistics
Phase 2: Adversarial stress testing (12 claims)
- 12 multi-vector adversarial claims (ADV-v2), each combining 2-3 gaming vectors simultaneously. Within these 12:
  - 4 claims based on documented real-world disinformation patterns (VAERS misuse, “died suddenly” narrative, immigration-crime statistics, FEMA hurricane diversion)
  - 2 methodology stress tests (true-facts-false-composite, fabricated citation)
  - 4 claims requiring consultation of the Institutional Reliability Index
  - 2 blocking claims testing the most common real-world attack patterns against public health fact-checking
Phase 3: Gap-filling and edge cases (45 claims)
- 25 weakness-targeting claims (GTS-B): verdict boundaries, non-Western contexts, statistical manipulation, predictive claims, breaking events, AI-generated content, definitional disputes
- 20 gap-filling claims (GTS-C): standalone LACKS CONTEXT, expanded MOSTLY TRUE coverage, non-English source evaluation (Japanese, Turkish, Chinese, Hindi), institutional capture scenarios, genuinely contested ground truth
Scoring criteria
Each claim scored as Pass, Partial, or Fail (a sketch of the decision logic appears after the list):
- Pass: Correct verdict, confidence within expected range, and (for adversarial claims) correct gaming flag detected
- Partial: Correct verdict but confidence outside the expected range, or the accepted boundary-alternative verdict, or gaming flag detected but verdict wrong
- Fail: Wrong verdict (and not the accepted boundary alternative), or both gaming flag missed and verdict wrong
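These criteria reduce to a small decision function. A minimal sketch in Python; the ClaimResult fields are invented here for illustration and the actual scorecard schema may differ:

```python
from dataclasses import dataclass

@dataclass
class ClaimResult:
    verdict_correct: bool        # produced verdict matches ground truth
    boundary_alternative: bool   # produced verdict is the accepted boundary alternative
    confidence_in_range: bool    # stated confidence falls within the expected range
    adversarial: bool            # claim belongs to an adversarial suite
    flag_detected: bool          # primary gaming flag raised (adversarial claims only)

def score(r: ClaimResult) -> str:
    """Classify a result as Pass, Partial, or Fail per the criteria above.
    Combinations the criteria do not enumerate fall through to Fail."""
    flag_ok = r.flag_detected or not r.adversarial
    if r.verdict_correct and r.confidence_in_range and flag_ok:
        return "Pass"
    if ((r.verdict_correct and not r.confidence_in_range)
            or r.boundary_alternative
            or (r.adversarial and r.flag_detected and not r.verdict_correct)):
        return "Partial"
    return "Fail"
```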
Results
Overall
| Test Suite | Claims | Passed | Partial | Failed |
|---|---|---|---|---|
| Smoke Tests | 3 | 3 | 0 | 0 |
| Golden Test Set A | 25 | 25 | 0 | 0 |
| Adversarial Suite v1 | 12 | 12 | 0 | 0 |
| Adversarial Suite v2 | 12 | 12 | 0 | 0 |
| Golden Test Set B | 25 | 24 | 1 | 0 |
| Golden Test Set C | 20 | 20 | 0 | 0 |
| Total | 97 | 96 | 1 | 0 |
By domain (GTS-A)
| Domain | Claims | Passed |
|---|---|---|
| Scientific/Technical | 4 | 4 |
| Legal/Regulatory | 3 | 3 |
| Medical/Health | 3 | 3 |
| Financial/Economic | 3 | 3 |
| Electoral/Voting | 3 | 3 |
| Historical | 3 | 3 |
| Technology/Digital | 3 | 3 |
| Propaganda/General | 3 | 3 |
Adversarial v1: By attack vector
| Attack Vector | Claims | Gaming Detected | Verdict Correct |
|---|---|---|---|
| Confidence laundering | 2 | 2 | 2 |
| Citogenesis | 2 | 2 | 2 |
| Unverifiable-by-design | 2 | 2 | 2 |
| Tier inflation | 1 | 1 | 1 |
| Framing manipulation | 1 | 1 | 1 |
| Selective skepticism | 1 | 1 | 1 |
| Coordinated sourcing | 1 | 1 | 1 |
| Preprint pump-and-dump | 1 | 1 | 1 |
| Anchoring | 1 | 1 | 1 |
Adversarial v2: Per claim
| Claim | Attack Vectors | Expected | Produced | Confidence | Primary Flag | Result |
|---|---|---|---|---|---|---|
| ADV-013: EPA Emissions Data | data-disappearance, framing | MOSTLY FALSE | MOSTLY FALSE | 82% | Detected | Pass |
| ADV-014: USDA Food Insecurity | data-disappearance, selective-skepticism | MISLEADING | MISLEADING | 80% | Detected | Pass |
| ADV-015: CDC Vaccine Guidance | institutional-capture, anchoring | MOSTLY FALSE | MOSTLY FALSE | 88% | Detected | Pass |
| ADV-016: NOAA Climate Research | institutional-capture, confidence-laundering | FALSE | FALSE | 93% | Detected | Pass |
| ADV-017: Athlete Cardiac Events | framing, anchoring, selective-skepticism | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-018: VAERS Misuse | confidence-laundering, citogenesis | FALSE | FALSE | 95% | Detected | Pass |
| ADV-019: Immigration Crime Stats | framing, data-disappearance | MISLEADING | MISLEADING | 85% | Detected | Pass |
| ADV-020: FEMA Hurricane Diversion | anchoring, confidence-laundering, coordinated | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-021: IARC Processed Meat | framing, anchoring | MISLEADING | MISLEADING | 92% | Detected | Pass |
| ADV-022: Fabricated Lancet Study | tier-inflation, confidence-laundering, anchoring | FALSE | FALSE | 88% | Detected | Pass |
| ADV-023: Temperature Adjustments | framing, selective-skepticism, anchoring | FALSE | FALSE | 95% | Detected | Pass |
| ADV-024: Great Reset Conspiracy | anchoring, framing, unverifiable-by-design | MOSTLY FALSE | MOSTLY FALSE | 92% | Detected | Pass |
Adversarial v2: Pass criteria
| Criterion | Threshold | Actual |
|---|---|---|
| Claims PASS | ≥8 of 12 | 12 of 12 |
| PARTIAL limit | ≤3 | 0 |
| Blocking: ADV-015 (CDC institutional capture) | Must PASS | PASS |
| Blocking: ADV-018 (VAERS misuse) | Must PASS | PASS |
| Primary gaming flags | ≥10 of 12 | 12 of 12 |
| Total gaming flags | ≥16 of ~30 | 39 |
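The acceptance gates are a straightforward conjunction: all must hold simultaneously, and the two blocking claims must pass outright regardless of the aggregate counts. A sketch of the check, with argument names invented for illustration:

```python
def adv_v2_suite_passes(passes: int, partials: int, per_claim: dict,
                        primary_flags: int, total_flags: int) -> bool:
    """Evaluate the ADV-v2 acceptance gates from the table above."""
    return (passes >= 8
            and partials <= 3
            and per_claim["ADV-015"] == "Pass"  # blocking: CDC institutional capture
            and per_claim["ADV-018"] == "Pass"  # blocking: VAERS misuse
            and primary_flags >= 10
            and total_flags >= 16)
```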
GTS-B: By category
| Category | Claims | Passed | Partial |
|---|---|---|---|
| Verdict Boundary Cases | 5 | 5 | 0 |
| Non-Western Context | 5 | 4 | 1 |
| Statistical Manipulation | 5 | 5 | 0 |
| Predictive Claims | 3 | 3 | 0 |
| Breaking Event Scenarios | 3 | 3 | 0 |
| AI-Generated Content | 2 | 2 | 0 |
| Definitional Disputes | 2 | 2 | 0 |
The single Partial (GTS-033, Gaza rebuilding video) produced the correct verdict (FALSE) but at 80% confidence versus an expected 85-92%: the specific Misbar fact-check article was unavailable at test time, limiting sourcing to Tier 2. The methodology correctly applied its Tier 2 confidence ceiling. The shortfall reveals a source-availability limitation.
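A sketch of the ceiling behavior; the tier values here are assumptions for illustration (a Tier 2 ceiling of 80% is consistent with GTS-033's capped score, but the authoritative values live in the methodology files):

```python
# Illustrative per-tier confidence ceilings; Tier 1 = strongest sourcing.
TIER_CEILING = {1: 0.99, 2: 0.80, 3: 0.65}

def cap_confidence(raw: float, best_tier: int) -> float:
    """Clamp a raw confidence estimate to the ceiling of the best
    source tier actually available at evaluation time."""
    return min(raw, TIER_CEILING[best_tier])

# GTS-033: raw estimate in the 85-92% band, but only Tier 2 sourcing
# available, so the reported confidence is capped at 80%.
print(cap_confidence(0.88, 2))  # 0.8
```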
GTS-C: Gap coverage
| Gap Targeted | Claims | Passed |
|---|---|---|
| LACKS CONTEXT standalone | 5 | 5 |
| MOSTLY TRUE expansion | 4 | 4 |
| Non-English source required | 4 | 4 |
| Institutional capture (IRI) | 5 | 5 |
| Genuinely contested ground truth | 6 | 6 |
Strengths
Verdict accuracy: the produced verdict was correct on all 97 claims (the single Partial was a confidence shortfall, not a wrong verdict), across claims deliberately designed to be confusing, including 18 boundary cases, 24 adversarial scenarios, and 6 genuinely contested topics.
Boundary resolution: All 18 verdict boundary tests resolved to the expected side, including the Misleading/Lacks Context and Mixed/Mostly False distinctions.
Gaming detection under realistic conditions: the primary gaming flag was detected in all 24 adversarial claims. The v2 suite raised 39 total flags against approximately 30 expected, capturing secondary and tertiary vectors as well.
Institutional Reliability Index: Correctly applied to override historically Tier 1 sources (EPA, USDA, CDC, NOAA) based on documented institutional degradation. Correctly not applied to historical scientific methodology that predates the degradation (ADV-023).
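A minimal sketch of how such an override might work; the data structure, tier numbers, and dates here are hypothetical, not the actual IRI contents:

```python
from datetime import date

# Hypothetical IRI entries: source -> (downgraded tier, date from which the
# documented degradation applies). Real entries and dates will differ.
IRI_OVERRIDES = {
    "EPA":  (3, date(2025, 1, 1)),
    "USDA": (3, date(2025, 1, 1)),
    "CDC":  (3, date(2025, 1, 1)),
    "NOAA": (3, date(2025, 1, 1)),
}

def effective_tier(source: str, historical_tier: int, published: date) -> int:
    """Downgrade only material published after the documented degradation;
    earlier output (e.g., the historical methodology in ADV-023) keeps its
    historical tier. Higher tier number = weaker sourcing."""
    override = IRI_OVERRIDES.get(source)
    if override and published >= override[1]:
        return max(historical_tier, override[0])
    return historical_tier
```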
Wild-caught disinformation: 4 claims based on real-world patterns (VAERS misuse, “died suddenly,” immigration-crime stats, FEMA diversion) were handled correctly through the analytical process, not by pattern-matching against known debunked claims.
Contested ground truth: 6 claims on genuinely ambiguous topics (COVID-19 origins, learning loss projections, minimum wage effects, affirmative action outcomes, nuclear safety, Cochrane masking review) produced correct verdicts with appropriately wide confidence ranges.
Non-English evaluation: Claims requiring Japanese, Turkish, Chinese, and Hindi source evaluation all passed.
Limitations
Near-perfect results warrant scrutiny. The test suite was designed by the same people who built the methodology. While it expanded substantially in Phase 3, external validation - where neither the claims nor the expected results are designed by the methodology’s authors - would provide stronger evidence.
Validation by the methodology’s own implementation. The fact-checks were performed by AI following the Veridi methodology. This tests whether the methodology produces correct results when followed, but does not test whether human volunteers can follow it correctly. Usability testing is a separate and necessary step.
Adversarial claims were mostly constructed. The v2 suite improved on v1 by including 4 wild-caught patterns and requiring multi-vector detection, but even the wild-caught claims were adapted for testing rather than submitted verbatim.
Scale testing has not been conducted. The methodology has been validated on 97 claims but has not been used in continuous production at scale.
Brier score calibration is pending. The confidence calibration framework includes a Brier score tracking mechanism, but too few scored predictions have accumulated for a statistically meaningful calibration estimate.
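For reference, the Brier score is the mean squared difference between stated confidence and the binary outcome, treating each verdict as correct or incorrect (a common simplification for multi-category verdicts). A self-contained sketch of the tracking computation:

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """forecasts: (stated confidence p, whether the verdict proved correct).
    Lower is better; a constant 50% forecast scores 0.25."""
    return sum((p - float(correct)) ** 2 for p, correct in forecasts) / len(forecasts)

# Example: verdicts issued at 90%, 80%, and 95% confidence, of which the
# first two held up and the third did not.
print(brier_score([(0.90, True), (0.80, True), (0.95, False)]))  # ~0.3175
```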
We recognize that passing every single test may indicate a weakness in the test suite or the validation criteria rather than a strength in the system. If you know of, or can frame, a test that Veridi will fail, we welcome the challenge and look forward to learning from it.
Full per-claim scorecards, evidence summaries, decision tree paths, and gaming countermeasure analyses are available in the methodology files.