Validation Report

February 25, 2026 | Methodology version: Veridi v2.2


Summary

Veridi was tested against 97 claims spanning eight subject domains, nine verdict categories, and eleven disinformation attack vectors. Of these, 96 passed outright. One scored partial: a correct verdict with confidence below the expected range because a source was unavailable at test time.


What was tested

The validation was conducted in three phases:

Phase 1: Baseline (40 claims)

  • 3 smoke tests across different verification tiers (Quick, Standard, Full)
  • 25 golden test set claims (GTS-A) with documented ground truth from established fact-checkers or primary sources, covering all 8 specialist domains and 7 of 9 verdict categories
  • 12 single-vector adversarial claims (ADV-v1) targeting 9 gaming attack patterns, each using real institutions, real phenomena, and plausible statistics

Phase 2: Adversarial stress testing (12 claims)

  • 12 multi-vector adversarial claims (ADV-v2), each combining 2-3 gaming vectors simultaneously. The 12 break down as:
      • 4 claims based on documented real-world disinformation patterns (VAERS misuse, “died suddenly” narrative, immigration-crime statistics, FEMA hurricane diversion)
      • 2 methodology stress tests (true-facts-false-composite, fabricated citation)
      • 4 claims requiring consultation of the Institutional Reliability Index
      • 2 blocking claims testing the most common real-world attack patterns against public health fact-checking

Phase 3: Gap-filling and edge cases (45 claims)

  • 25 weakness-targeting claims (GTS-B): verdict boundaries, non-Western contexts, statistical manipulation, predictive claims, breaking events, AI-generated content, definitional disputes
  • 20 gap-filling claims (GTS-C): standalone LACKS CONTEXT, expanded MOSTLY TRUE coverage, non-English source evaluation (Japanese, Turkish, Chinese, Hindi), institutional capture scenarios, genuinely contested ground truth

Scoring criteria

Each claim was scored as Pass, Partial, or Fail:

  • Pass: Correct verdict, confidence within expected range, and (for adversarial claims) correct gaming flag detected
  • Partial: Correct verdict but confidence outside range, or correct boundary alternative, or gaming flag detected but verdict wrong
  • Fail: Wrong verdict (not the expected boundary alternative), or both gaming flag missed and verdict wrong
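Read operationally, the criteria above reduce to a small decision function. The sketch below is our reading of those rules, not the validation harness itself; parameter names are illustrative:

```python
def score_claim(verdict_correct: bool,
                boundary_alternative: bool,
                confidence_in_range: bool,
                adversarial: bool = False,
                gaming_flag_detected: bool = True) -> str:
    """Score one claim as Pass, Partial, or Fail per the criteria above."""
    # Pass: correct verdict, confidence in range, and (for adversarial
    # claims) the gaming flag was detected.
    if verdict_correct and confidence_in_range and (not adversarial or gaming_flag_detected):
        return "Pass"
    # Partial: correct verdict but confidence out of range, or the expected
    # boundary alternative, or the gaming flag caught despite a wrong verdict.
    if verdict_correct or boundary_alternative or (adversarial and gaming_flag_detected):
        return "Partial"
    # Fail: wrong verdict (not the boundary alternative) and, for
    # adversarial claims, the gaming flag missed as well.
    return "Fail"

# GTS-033's pattern: correct verdict, confidence below the expected range.
print(score_claim(verdict_correct=True, boundary_alternative=False,
                  confidence_in_range=False))  # → Partial
```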

Results

Overall

| Test Suite | Claims | Passed | Partial | Failed |
| --- | --- | --- | --- | --- |
| Smoke Tests | 3 | 3 | 0 | 0 |
| Golden Test Set A | 25 | 25 | 0 | 0 |
| Adversarial Suite v1 | 12 | 12 | 0 | 0 |
| Adversarial Suite v2 | 12 | 12 | 0 | 0 |
| Golden Test Set B | 25 | 24 | 1 | 0 |
| Golden Test Set C | 20 | 20 | 0 | 0 |
| Total | 97 | 96 | 1 | 0 |

By domain (GTS-A)

| Domain | Claims | Passed |
| --- | --- | --- |
| Scientific/Technical | 4 | 4 |
| Legal/Regulatory | 3 | 3 |
| Medical/Health | 3 | 3 |
| Financial/Economic | 3 | 3 |
| Electoral/Voting | 3 | 3 |
| Historical | 3 | 3 |
| Technology/Digital | 3 | 3 |
| Propaganda/General | 3 | 3 |

Adversarial v1: By attack vector

| Attack Vector | Claims | Gaming Detected | Verdict Correct |
| --- | --- | --- | --- |
| Confidence laundering | 2 | 2 | 2 |
| Citogenesis | 2 | 2 | 2 |
| Unverifiable-by-design | 2 | 2 | 2 |
| Tier inflation | 1 | 1 | 1 |
| Framing manipulation | 1 | 1 | 1 |
| Selective skepticism | 1 | 1 | 1 |
| Coordinated sourcing | 1 | 1 | 1 |
| Preprint pump-and-dump | 1 | 1 | 1 |
| Anchoring | 1 | 1 | 1 |

Adversarial v2: Per claim

| Claim | Attack Vectors | Expected | Produced | Confidence | Primary Flag | Result |
| --- | --- | --- | --- | --- | --- | --- |
| ADV-013: EPA Emissions Data | data-disappearance, framing | MOSTLY FALSE | MOSTLY FALSE | 82% | Detected | Pass |
| ADV-014: USDA Food Insecurity | data-disappearance, selective-skepticism | MISLEADING | MISLEADING | 80% | Detected | Pass |
| ADV-015: CDC Vaccine Guidance | institutional-capture, anchoring | MOSTLY FALSE | MOSTLY FALSE | 88% | Detected | Pass |
| ADV-016: NOAA Climate Research | institutional-capture, confidence-laundering | FALSE | FALSE | 93% | Detected | Pass |
| ADV-017: Athlete Cardiac Events | framing, anchoring, selective-skepticism | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-018: VAERS Misuse | confidence-laundering, citogenesis | FALSE | FALSE | 95% | Detected | Pass |
| ADV-019: Immigration Crime Stats | framing, data-disappearance | MISLEADING | MISLEADING | 85% | Detected | Pass |
| ADV-020: FEMA Hurricane Diversion | anchoring, confidence-laundering, coordinated | MOSTLY FALSE | MOSTLY FALSE | 90% | Detected | Pass |
| ADV-021: IARC Processed Meat | framing, anchoring | MISLEADING | MISLEADING | 92% | Detected | Pass |
| ADV-022: Fabricated Lancet Study | tier-inflation, confidence-laundering, anchoring | FALSE | FALSE | 88% | Detected | Pass |
| ADV-023: Temperature Adjustments | framing, selective-skepticism, anchoring | FALSE | FALSE | 95% | Detected | Pass |
| ADV-024: Great Reset Conspiracy | anchoring, framing, unverifiable-by-design | MOSTLY FALSE | MOSTLY FALSE | 92% | Detected | Pass |

Adversarial v2: Pass criteria

| Criterion | Threshold | Actual |
| --- | --- | --- |
| Claims PASS | ≥8 of 12 | 12 of 12 |
| PARTIAL limit | ≤3 | 0 |
| Blocking: ADV-015 (CDC institutional capture) | Must PASS | PASS |
| Blocking: ADV-018 (VAERS misuse) | Must PASS | PASS |
| Primary gaming flags | ≥10 of 12 | 12 of 12 |
| Total gaming flags | ≥16 of ~30 | 39 |
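The acceptance check these criteria encode is a conjunction of threshold comparisons. A minimal sketch, with thresholds and actuals copied from the table (variable names are ours, not the harness's):

```python
# ADV-v2 acceptance: every (actual, threshold-check) pair must hold.
criteria = {
    "claims_pass":   (12,   lambda v: v >= 8),   # ≥8 of 12 claims must PASS
    "partials":      (0,    lambda v: v <= 3),   # at most 3 PARTIAL results
    "adv_015_pass":  (True, lambda v: v),        # blocking: CDC institutional capture
    "adv_018_pass":  (True, lambda v: v),        # blocking: VAERS misuse
    "primary_flags": (12,   lambda v: v >= 10),  # ≥10 of 12 primary gaming flags
    "total_flags":   (39,   lambda v: v >= 16),  # ≥16 of ~30 expected total flags
}

suite_passes = all(check(actual) for actual, check in criteria.values())
print(suite_passes)  # → True
```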

GTS-B: By category

| Category | Claims | Passed | Partial |
| --- | --- | --- | --- |
| Verdict Boundary Cases | 5 | 5 | 0 |
| Non-Western Context | 5 | 4 | 1 |
| Statistical Manipulation | 5 | 5 | 0 |
| Predictive Claims | 3 | 3 | 0 |
| Breaking Event Scenarios | 3 | 3 | 0 |
| AI-Generated Content | 2 | 2 | 0 |
| Definitional Disputes | 2 | 2 | 0 |

The single partial (GTS-033, Gaza rebuilding video): correct verdict (FALSE) but confidence 80% versus expected 85-92% because the specific Misbar fact-check article was unavailable at test time, limiting sourcing to Tier 2. The methodology correctly applied its Tier 2 confidence ceiling. This reveals a source-availability limitation.
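The tier-ceiling behaviour can be illustrated with a toy cap function. The ceiling values below are assumptions for illustration only; the report implies only that the Tier 2 ceiling sits at or below 80%:

```python
# Hypothetical per-tier confidence maxima -- illustrative values, not the
# methodology's actual numbers.
TIER_CEILINGS = {1: 0.99, 2: 0.80, 3: 0.60}

def capped_confidence(raw_confidence: float, best_source_tier: int) -> float:
    """Cap reported confidence at the ceiling of the best available source tier."""
    return min(raw_confidence, TIER_CEILINGS[best_source_tier])

# GTS-033: evidence alone supported roughly 85-92% confidence, but with only
# Tier 2 sourcing available the reported confidence was held to 80%.
print(capped_confidence(0.88, 2))  # → 0.8
```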

GTS-C: Gap coverage

| Gap Targeted | Claims | Passed |
| --- | --- | --- |
| LACKS CONTEXT standalone | 5 | 5 |
| MOSTLY TRUE expansion | 4 | 4 |
| Non-English source required | 4 | 4 |
| Institutional capture (IRI) | 5 | 5 |
| Genuinely contested ground truth | 6 | 6 |

Strengths

Verdict accuracy: 97/97 verdicts correct (96 full passes; the single partial reached the correct verdict with out-of-range confidence), across claims deliberately designed to be confusing, including 18 boundary cases, 24 adversarial scenarios, and 6 genuinely contested topics.

Boundary resolution: All 18 verdict boundary tests resolved to the expected side, including the Misleading/Lacks Context and Mixed/Mostly False distinctions.

Gaming detection under realistic conditions: Primary gaming flags were detected on all 24 adversarial claims. The v2 suite flagged 39 vectors in total against approximately 30 expected, including secondary and tertiary vectors.

Institutional Reliability Index: Correctly applied to override historically Tier 1 sources (EPA, USDA, CDC, NOAA) based on documented institutional degradation. Correctly not applied to historical scientific methodology that predates the degradation (ADV-023).

Wild-caught disinformation: 4 claims based on real-world patterns (VAERS misuse, “died suddenly,” immigration-crime stats, FEMA diversion) handled correctly through analytical process - not by matching against known debunked claims.

Contested ground truth: 6 claims on genuinely ambiguous topics (COVID-19 origins, learning loss projections, minimum wage effects, affirmative action outcomes, nuclear safety, Cochrane masking review) produced correct verdicts with appropriately wide confidence ranges.

Non-English evaluation: Claims requiring Japanese, Turkish, Chinese, and Hindi source evaluation all passed.


Limitations

Near-perfect results warrant scrutiny. The test suite was designed by the same people who built the methodology. While it expanded substantially in Phase 3, external validation - where neither the claims nor the expected results are designed by the methodology’s authors - would provide stronger evidence.

Validation by the methodology’s own implementation. The fact-checks were performed by AI following the Veridi methodology. This tests whether the methodology produces correct results when followed, but does not test whether human volunteers can follow it correctly. Usability testing is a separate and necessary step.

Adversarial claims were mostly constructed. The v2 suite improved on v1 by including 4 wild-caught patterns and requiring multi-vector detection, but even the wild-caught claims were adapted for testing rather than submitted verbatim.

Scale testing has not been conducted. The methodology has been validated on 97 claims but has not been used in continuous production at scale.

Brier score calibration is pending. The confidence calibration framework includes a Brier score tracking mechanism, but insufficient data points have accumulated for statistical significance.
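The Brier score itself is simple to compute once graded verdicts accumulate; a minimal sketch of the standard formula (the function and data shapes are ours, not the tracking mechanism's):

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome.

    Each pair holds the confidence assigned to a verdict (0.0-1.0) and
    whether that verdict proved correct. Lower is better; a forecaster
    who always says 50% scores exactly 0.25.
    """
    return sum((p - float(correct)) ** 2 for p, correct in forecasts) / len(forecasts)

# Three verdicts at 90% confidence that held up, one at 60% that did not.
print(brier_score([(0.9, True), (0.9, True), (0.9, True), (0.6, False)]))  # ≈ 0.0975
```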

We are mindful that passing every test may indicate a weakness in the test suite or the validation criteria rather than a strength in the system. If you know of, or can frame, a test that Veridi will fail, we welcome the challenge and look forward to learning from it.


Full per-claim scorecards, evidence summaries, decision tree paths, and gaming countermeasure analyses are available in the methodology files.