How Pragma Evaluates Evidence
Three dimensions, not one
Most evidence hierarchies rank studies on a single ladder: observational at the bottom, meta-analyses at the top. This conflates three independent questions that need separate answers.
Where was it published? A meta-analysis in a predatory journal is not the same as a meta-analysis in the Cochrane Library. A government statistical agency under political pressure produces different-quality data than one operating independently. Source quality matters, and it is assessed independently of methodology.
What methodology was used? An observational study in The Lancet is not the same as a randomized controlled trial in a working paper. Study design determines the ceiling on causal claims. But a strong observational design published in a top journal can outperform a weak RCT with high attrition and limited external validity.
How stable is the field? A well-designed RCT in nutrition science faces different replication risks than the same design in physics. Field reliability tells you how much to trust that today’s findings will hold up. It doesn’t mechanically reduce confidence in a specific strong study, but it’s information you should have.
Pragma assesses all three independently and combines them through explicit interaction rules.
Source quality: 4 tiers
Adapted from Veridi’s source hierarchy for policy-relevant research:
Tier 1 - Primary/Authoritative: Peer-reviewed journals, official statistical agencies (when operating with independence), systematic review organizations like Cochrane, government records and legislation, raw survey microdata.
Tier 2 - Established Secondary: Research institutions with published methodology (NBER, Brookings, RAND, Urban Institute), government analytical agencies (CBO, GAO), international organization research divisions (OECD, WHO research departments), pre-registered analysis plans.
Tier 3 - Contextual/Qualified: Advocacy research with disclosed methodology, government sources showing institutional degradation, working papers not yet peer-reviewed, think tanks with known orientation. Usable but requires cross-referencing with higher tiers.
Tier 4 - Use with Extreme Caution: Sources that should not anchor policy recommendations without independent corroboration from higher tiers.
Source quality sets a structural ceiling on confidence. Tier 3 sources cannot support high-confidence recommendations regardless of what their analysis claims to show.
Study design: 6 levels with quality modifiers
Each level reflects the strength of causal inference the methodology can support:
| Level | Design | Maximum Confidence (Strong) | Maximum Confidence (Weak) |
|---|---|---|---|
| 6 | Systematic Review / Meta-analysis | 85% | 65% |
| 5 | Policy Implementation Evidence (at scale) | 80% | 65% |
| 4 | Experimental (RCT, field experiments) | 75% | 55% |
| 3 | Quasi-experimental (DiD, IV, RD, synthetic control) | 60% | 45% |
| 2 | Longitudinal / Panel (controlled) | 50% | 35% |
| 1 | Observational / Cross-sectional | 35% | 25% |
Every study gets a strong or weak execution assessment based on statistical power, pre-registration, appropriate controls, transparent methodology, and whether independent replication exists.
The final structural ceiling is the minimum of the source-quality ceiling and the study-design ceiling. A Tier 2 source (80% ceiling) with a strong Level 3 study design (60% ceiling) produces a 60% ceiling. The weaker dimension binds: a stack of high-tier sources cannot compensate for weak methodology, and strong methodology cannot compensate for questionable sourcing.
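The binding-minimum rule can be sketched directly from the numbers above. The design ceilings come from the level table; the source ceiling is passed in, since the text only states the Tier 2 value (80%):

```python
# Design ceilings (strong, weak) from the study-design table above.
DESIGN_CEILINGS = {
    6: (85, 65),  # Systematic review / meta-analysis
    5: (80, 65),  # Policy implementation evidence at scale
    4: (75, 55),  # Experimental (RCT, field experiments)
    3: (60, 45),  # Quasi-experimental (DiD, IV, RD, synthetic control)
    2: (50, 35),  # Longitudinal / panel (controlled)
    1: (35, 25),  # Observational / cross-sectional
}

def structural_ceiling(source_ceiling: int, level: int, execution: str) -> int:
    """Minimum of the source-quality ceiling and the study-design ceiling."""
    strong, weak = DESIGN_CEILINGS[level]
    design = strong if execution == "strong" else weak
    return min(source_ceiling, design)

# Worked example from the text: Tier 2 (80%) + strong Level 3 (60%) -> 60%.
print(structural_ceiling(80, 3, "strong"))  # 60
```

Whichever dimension is lower wins, so a Tier 2 source paired with a strong Level 6 meta-analysis is still capped at 80% by the source.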
The Level 3 problem: identification strategy assessment
This is the most important innovation in Pragma’s evidence evaluation. Quasi-experimental designs - the workhorse of policy evaluation - derive their causal claims from identification assumptions. A difference-in-differences study assumes parallel trends. An instrumental variables study assumes exclusion restriction. A regression discontinuity assumes continuity at the threshold.
When those assumptions hold, these designs produce credible causal evidence. When they don’t, the study provides no more causal evidence than an observational correlation, regardless of its sample size or statistical sophistication.
Pragma makes this assessment explicit. For every Level 3 study, the identification strategy is named, the assumption is stated, and the credibility is assessed:
| Credibility | Modifier | Meaning |
|---|---|---|
| Strong | 0.90-1.00 | Assumption supported by diagnostics, no credible published challenge |
| Moderate | 0.70-0.89 | Assumption plausible, diagnostics partially supportive |
| Weak | 0.50-0.69 | Assumption questionable, diagnostics fail or suggest violation |
The modifier is applied to the base study design ceiling. A Level 3 study with a 60% ceiling and a Moderate credibility modifier (0.75) produces an adjusted ceiling of 45%. If the identification assumption is shaky, the study can’t support strong policy conclusions no matter how large the dataset.
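The modifier arithmetic is a straight multiplication; the band boundaries come from the credibility table above:

```python
def identification_band(modifier: float) -> str:
    """Map a credibility modifier to its band, per the table above."""
    if modifier >= 0.90:
        return "Strong"
    if modifier >= 0.70:
        return "Moderate"
    if modifier >= 0.50:
        return "Weak"
    raise ValueError("modifier below the assessed range")

def adjusted_ceiling(design_ceiling: float, modifier: float) -> float:
    """Scale the base design ceiling by the identification-credibility modifier."""
    return design_ceiling * modifier

# Worked example from the text: 60% ceiling, Moderate modifier of 0.75 -> 45%.
print(adjusted_ceiling(60, 0.75))  # 45.0
```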
When multiple quasi-experimental studies using different identification strategies reach the same conclusion, that convergence is stronger than replication within a single strategy - because different strategies have different failure modes. When they reach conflicting conclusions, at least one assumption is violated, and the conflict itself is informative.
Evidence directness
Does the evidence directly address the policy question being asked, or is it indirect?
Direct: The study examines the specific policy, population, and outcome in question. A study of rent control’s effect on housing supply in cities similar to the target jurisdiction directly addresses “Should we implement rent control?”
Partially indirect: The study addresses a related but not identical question. A study of price controls in agricultural markets is partially indirect evidence for rent control.
Indirect: The study addresses a different question whose relevance depends on theoretical reasoning. General equilibrium models of price distortions are indirect evidence.
Indirect evidence reduces the study design ceiling. This prevents Pragma from building confident recommendations on chains of theoretical reasoning that happen to originate from strong studies of different questions.
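The text states the direction of the adjustment but not its size, so the multipliers in this sketch are placeholder assumptions; only the ordering (direct > partially indirect > indirect) is implied by the document:

```python
# Placeholder multipliers: the text says indirect evidence lowers the design
# ceiling but gives no numbers, so these values are assumptions for illustration.
DIRECTNESS_MODIFIER = {
    "direct": 1.00,
    "partially_indirect": 0.80,  # assumed
    "indirect": 0.60,            # assumed
}

def directness_adjusted_ceiling(design_ceiling: float, directness: str) -> float:
    """Scale the study-design ceiling by how directly the evidence addresses the question."""
    return design_ceiling * DIRECTNESS_MODIFIER[directness]

# A direct Level 4 RCT keeps its 75% ceiling; indirect evidence loses ground.
print(directness_adjusted_ceiling(75, "direct"))  # 75.0
```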
Transferability: 7 dimensions
Strong evidence from Context A does not automatically apply to Context B. Pragma assesses transferability across seven dimensions:
- Population match - Does the target population resemble the study population in relevant respects?
- Institutional match - Does the target jurisdiction have similar government capacity, administrative competence, and institutional structure?
- Economic match - GDP, inequality, labor market characteristics, existing infrastructure.
- Cultural/social match - Trust levels, social cohesion, civic participation norms.
- Scale match - City-level pilots may not generalize to national implementation. Small homogeneous nations may not generalize to large diverse ones.
- Temporal match - Evidence from 1980s Sweden may not transfer to 2026 anywhere.
- Constitutional/legal match - Some interventions require constitutional structures that don’t exist in the target jurisdiction. This is a hard constraint, not a soft one.
Each dimension is rated Strong, Moderate, Weak, or Unknown. The overall transferability score is High, Moderate, Low, or Indeterminate - and a Level 4 RCT with Low transferability supports no more than Speculative confidence for the target context. The study is still valid for its original context; it just doesn’t travel.
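The document specifies the dimensions and both rating scales but not the combination rule, so the weakest-dimension aggregation below is an illustrative assumption; only the constitutional/legal hard constraint is stated explicitly:

```python
# Per-dimension ratings: "Strong", "Moderate", "Weak", or "Unknown".
# Aggregation rule is an assumption (weakest dimension binds), except the
# constitutional/legal hard constraint, which the text states directly.
def overall_transferability(ratings: dict) -> str:
    if ratings.get("constitutional_legal") == "Weak":
        return "Low"  # hard constraint: required legal structures are absent
    values = ratings.values()
    if "Unknown" in values:
        return "Indeterminate"
    if "Weak" in values:
        return "Low"
    if all(v == "Strong" for v in values):
        return "High"
    return "Moderate"

DIMS = ["population", "institutional", "economic", "cultural_social",
        "scale", "temporal", "constitutional_legal"]
print(overall_transferability({d: "Strong" for d in DIMS}))  # High
```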
Implementation gap
The gap between the policy as the evidence describes it and the policy as it could actually be implemented in the target context. Rated Minimal, Moderate, Substantial, or Prohibitive.
When the gap is Substantial or Prohibitive, recommendation confidence drops by one level. A policy with strong evidence but a Prohibitive implementation gap - where the implementable version bears little resemblance to the evidence-supported version - cannot receive high confidence regardless of the underlying research quality.
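The one-level drop can be sketched over the five ordered confidence bands. The text does not say whether a Prohibitive gap should drop further than one level, so this sketch applies only the stated single-level rule:

```python
# Ordered confidence bands, highest first (from the confidence-bands section).
BANDS = ["High", "Moderate-High", "Moderate", "Low", "Speculative"]

def apply_implementation_gap(confidence: str, gap: str) -> str:
    """Drop one confidence band when the gap is Substantial or Prohibitive."""
    if gap in ("Substantial", "Prohibitive"):
        i = BANDS.index(confidence)
        return BANDS[min(i + 1, len(BANDS) - 1)]  # floor at Speculative
    return confidence

print(apply_implementation_gap("High", "Substantial"))  # Moderate-High
```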
Confidence bands
Pragma uses five confidence levels plus three special assessments:
| Level | Meaning |
|---|---|
| High | Strong evidence, high transferability, mechanism understood, implementation precedent exists |
| Moderate-High | Good evidence base with some transferability or implementation uncertainty |
| Moderate | Reasonable evidence, plausible mechanism, some implementation evidence |
| Low | Suggestive evidence, transferability uncertain, mechanism contested |
| Speculative | Evidence base weak or inapplicable; recommended on theoretical grounds or analogy |
| Contested | Strong evidence on both sides; the primary dispute is about values, not facts |
| Inadvisable | Evidence of harm or strong evidence against effectiveness |
| Not Assessable | Insufficient evidence to make any recommendation |
These are structural estimates based on evidence quality, transferability, and implementation feasibility. They communicate uncertainty explicitly. A “Moderate” confidence recommendation is not “probably right” - it means the evidence base has specific, documented limitations that prevent higher confidence, and the recommendation would change if those limitations were resolved differently.