Calibration feedback loop
How recommendations get held accountable
A recommendation system that doesn’t track its own track record can’t improve. Tetlock’s two decades of expert-political-judgment research established that forecast quality improves when forecasters are held accountable to their own prior calibration. Mellers et al. (2014) found that training, teaming, and tracking each contributed independently to calibration improvement in the Good Judgment Project — none of the three was substitutable.
Veridi makes confidence-calibrated claims. Pragma issues policy recommendations with explicit confidence bands. Praxis assigns leverage-confidence bands to recommended pathways. Without outcome tracking, all three are uncalibrated by construction: there is no closed loop connecting the confidence assignment at time T to what happened at time T+N.
The calibration feedback loop closes that loop.
What the loop computes
Submitters who consent to outcome tracking can report what actually happened at intervals after their original submission: 1 week, 1 month, 6 months, and 1 year (with P9 Litigation extending to 3 years for slow dockets). The schemas vary by methodology and pathway, but every report maps to a numeric signal, as sketched after this list:
- Veridi: verified (the verdict held up) maps to 1.0; falsified maps to 0.0; uncertain is excluded from scoring.
- Pragma: adopted maps to 1.0; partial_adoption to 0.5; rejected and reversed to 0.0; stalled is excluded.
- Praxis: action_taken / sustained_engagement / leverage_realized map to 1.0; harm_experienced maps to 0.0.
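A minimal sketch of that mapping, assuming outcome labels arrive as plain strings and excluded outcomes are represented as `None`; the dictionary and function names here are illustrative, not the shipped schema:

```python
# Illustrative mapping from reported outcome labels to scoring signals.
# None means the outcome is excluded from scoring entirely.
OUTCOME_SIGNALS = {
    "veridi": {"verified": 1.0, "falsified": 0.0, "uncertain": None},
    "pragma": {"adopted": 1.0, "partial_adoption": 0.5,
               "rejected": 0.0, "reversed": 0.0, "stalled": None},
    "praxis": {"action_taken": 1.0, "sustained_engagement": 1.0,
               "leverage_realized": 1.0, "harm_experienced": 0.0},
}

def outcome_signal(methodology: str, outcome: str) -> float | None:
    """Return the numeric signal for a reported outcome, or None if excluded."""
    return OUTCOME_SIGNALS[methodology].get(outcome)
```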
From these signals the system computes Brier-lite scores: per pathway × issue-category for Praxis, per recommendation × jurisdiction-category for Pragma, and methodology-wide for Veridi. Brier is the canonical forecaster-evaluation metric (Tetlock & Gardner 2015): the squared error between predicted probability and realized outcome. Lower scores indicate better calibration.
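Concretely, a Brier-lite cell score is the mean squared error between the confidence assigned at recommendation time and the realized signal. A sketch, assuming each resolved outcome record carries its original predicted probability, its signal, and a cell key such as (pathway, issue_category); the field names are illustrative:

```python
from collections import defaultdict
from statistics import mean

def brier_lite_by_cell(outcomes: list[dict]) -> dict[tuple, float]:
    """Mean squared error between predicted probability and realized signal,
    grouped by cell (e.g. (pathway, issue_category) for Praxis)."""
    errors = defaultdict(list)
    for o in outcomes:
        # o["predicted"]: confidence assigned at time T; o["signal"]: realized 1.0/0.5/0.0.
        errors[o["cell"]].append((o["predicted"] - o["signal"]) ** 2)
    return {cell: mean(sq) for cell, sq in errors.items()}
```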
The live calibration page surfaces 30/60/90 day windows per methodology. When the longest window has fewer than five resolvable outcomes, the panel shows “calibration loop not yet running for this methodology” and waits for signal to accumulate.
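The gating rule is just a count check on the widest window. A small sketch, with only the five-outcome floor and the placeholder message taken from the paragraph above; everything else is assumed:

```python
MIN_RESOLVABLE = 5  # floor before the panel shows real calibration numbers

def panel_placeholder(window_counts: dict[int, int]) -> str | None:
    """window_counts maps window length in days (30/60/90) to resolvable outcomes.
    Returns the placeholder message, or None when the panel can render."""
    if window_counts[max(window_counts)] < MIN_RESOLVABLE:
        return "calibration loop not yet running for this methodology"
    return None
```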
What fires a flag
Three detection passes run on the aggregated outcome data:
Brier drift (Praxis). Runs per pathway × issue-category cell with N ≥ 50 outcomes and fires if the observed rate deviates from the pathway's leverage ceiling by ≥ 0.15 absolute, or if the cell's Brier score deviates from the methodology baseline by ≥ 0.10 absolute.
High harm rate (Praxis). Runs per pathway and fires if the harm_experienced rate exceeds the pathway's risk-class ceiling from Praxis_Sustainability_Risk.md: 5% for low-risk pathways (P1, P4, P5), 15% for medium-risk (P2, P3, P6, P7), and 30% for high-risk (P8 Direct Action, P9 Litigation).
Pragma drift. Runs methodology-wide and fires if the absolute difference between mean_actual and mean_predicted exceeds 0.15.
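All three passes reduce to threshold comparisons over the aggregated cells. A sketch of the decision logic only, with thresholds copied from the passes above; every input structure (cell dictionaries, field names) is an assumption:

```python
# Harm-rate ceilings by risk class, per Praxis_Sustainability_Risk.md.
HARM_CEILINGS = {"low": 0.05, "medium": 0.15, "high": 0.30}

def brier_drift_flags(praxis_cells: list[dict]) -> list[dict]:
    """Praxis pass: per pathway x issue-category cell with N >= 50 outcomes."""
    flags = []
    for cell in praxis_cells:
        if cell["n"] < 50:
            continue
        rate_dev = abs(cell["observed_rate"] - cell["leverage_ceiling"])
        brier_dev = abs(cell["brier"] - cell["methodology_brier"])
        if rate_dev >= 0.15 or brier_dev >= 0.10:
            flags.append({"pass": "brier_drift", "key": cell["key"]})
    return flags

def harm_rate_flags(pathways: list[dict]) -> list[dict]:
    """Praxis pass: per-pathway harm_experienced rate vs. its risk-class ceiling."""
    return [{"pass": "high_harm_rate", "key": p["pathway"]}
            for p in pathways if p["harm_rate"] > HARM_CEILINGS[p["risk_class"]]]

def pragma_drift_fires(mean_actual: float, mean_predicted: float) -> bool:
    """Pragma pass: methodology-wide drift between actual and predicted means."""
    return abs(mean_actual - mean_predicted) > 0.15
```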
When a pass fires, a row is inserted into the review queue. Detection is idempotent — the same key won’t generate duplicate open flags.
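One way to get that idempotency is a uniqueness constraint on the flag key plus its open status, so re-running detection cannot enqueue a second open flag for the same cell. A sketch using SQLite purely to show the shape; the real queue table and its columns are assumptions:

```python
import sqlite3

conn = sqlite3.connect("calibration.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS calibration_flags (
        pass_name TEXT NOT NULL,
        cell_key  TEXT NOT NULL,
        status    TEXT NOT NULL DEFAULT 'open',
        UNIQUE (pass_name, cell_key, status)
    )
""")

def enqueue_flag(pass_name: str, cell_key: str) -> None:
    """Insert a flag; re-running with the same key is a no-op while one is still open."""
    conn.execute(
        "INSERT INTO calibration_flags (pass_name, cell_key) VALUES (?, ?) "
        "ON CONFLICT (pass_name, cell_key, status) DO NOTHING",
        (pass_name, cell_key),
    )
    conn.commit()
```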
The critical discipline: auto-flag, NOT auto-adjust
When a flag fires, a methodology maintainer reviews the queue at /admin/calibration-flags, picks a decision (raise_ceiling / lower_ceiling / add_modifier / no_action), and the decision is logged. The methodology files (Praxis_Leverage_Matching.md, Pragma_Confidence_Calibration.md, etc.) are then edited by a human in the next methodology revision.
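The decision surface is deliberately narrow: a maintainer picks one of four actions and the choice is logged with its rationale, and nothing in this path writes to the methodology files. A sketch of what a logged decision could look like; all names here are assumptions, not the shipped schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class FlagDecision(Enum):
    RAISE_CEILING = "raise_ceiling"
    LOWER_CEILING = "lower_ceiling"
    ADD_MODIFIER = "add_modifier"
    NO_ACTION = "no_action"

@dataclass(frozen=True)
class DecisionLogEntry:
    flag_key: str            # which detection flag was reviewed
    decision: FlagDecision   # the maintainer's call
    reviewer: str            # who made it
    rationale: str           # free-text justification
    decided_at: datetime
    # Deliberately no "apply" step: methodology files are edited by a human
    # in the next revision, never by this code path.

entry = DecisionLogEntry(
    flag_key="brier_drift:P3:example-category",
    decision=FlagDecision.NO_ACTION,
    reviewer="methodology-maintainer",
    rationale="Small-N cell; drift consistent with reporting bias.",
    decided_at=datetime.now(timezone.utc),
)
```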
The system never modifies methodology files programmatically. This is load-bearing.
Praxis_Outcome_Tracking.md §5(a) gives the rationale: “Review, don’t auto-adjust — calibration drift from small N or reporting bias is a risk.” Three reasons make auto-adjustment unsafe even at large N:
- Reporting bias. Outcome submission is voluntary; the cohorts most willing to report skew positive (or negative), so drift signals derived from them are not representative of the population.
- Reverse causation. A pathway’s observed action rate may exceed its ceiling because the system correctly recommended a high-leverage strategy; auto-lowering the ceiling on that signal would degrade the methodology.
- Methodology coherence. Pathway ceilings, risk-class assumptions, and gaming countermeasures form an interdependent system. A change to one ceiling has implications for the others (for example, the Pragma ↔ Praxis confidence inheritance rule). A methodology maintainer reasons about the whole; a detection script does not.
Anonymity and consent
Outcome reporting is opt-in. Without explicit consent at recommendation-delivery time, no tracking occurs. Per-claim opt-out is always available via the same toggle.
All free-text fields in outcome submissions pass through a server-side anonymizer that strips emails, phone numbers, street addresses, and role + employer phrases before aggregation. The raw payload is access-controlled; aggregation queries see only the anonymized form.
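A minimal sketch of that kind of pattern-stripping pass; the actual anonymizer's rules are certainly broader and more careful than these illustrative regexes:

```python
import re

# Illustrative patterns only; the production rules are broader and more careful.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[email]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[phone]"),
    (re.compile(r"\b\d{1,5}\s+\w+(\s\w+)?\s+(Street|St|Avenue|Ave|Road|Rd|Lane|Ln|Drive|Dr)\b",
                re.IGNORECASE), "[address]"),
    # Crude "role at employer" catch; a real pass needs a more structured approach.
    (re.compile(r"\b[a-z]+(\s[a-z]+)?\s+at\s+[A-Z][\w&.-]*(\s[A-Z][\w&.-]*)*"), "[role at employer]"),
]

def anonymize(text: str) -> str:
    """Strip obvious identifiers from free text before aggregation."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```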
K-anonymity floors prevent admin views from exposing per-cell data below a count threshold. The default is k=10. Sensitive Praxis pathways that carry individually identifiable harm exposure (P3 Professional Leverage, P8 Direct Action, P9 Litigation) default to k=20 per Praxis_Outcome_Tracking.md §4. Both thresholds are administrator-tunable.
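Here the floor is a per-cell suppression threshold: any cell whose outcome count falls below its k is withheld from the admin view entirely. A sketch with the thresholds from the paragraph above; the cell structure is assumed:

```python
SENSITIVE_PATHWAYS = {"P3", "P8", "P9"}  # higher floor per Praxis_Outcome_Tracking.md §4
DEFAULT_K = 10
SENSITIVE_K = 20

def admin_visible_cells(cells: dict[str, dict]) -> dict[str, dict]:
    """Return only the cells whose outcome count meets the applicable k floor."""
    visible = {}
    for key, cell in cells.items():
        k = SENSITIVE_K if cell.get("pathway") in SENSITIVE_PATHWAYS else DEFAULT_K
        if cell["n"] >= k:
            visible[key] = cell
    return visible
```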
What ships, what’s deferred
Shipped in v1.3:
- Brier-lite scoring for all three methodologies
- Threshold detection across the three passes above
- Admin review queue with decision logging
- 30/60/90 day calibration surface on the live calibration page
- Per-pathway-sensitivity k-anonymity
Deferred to v1.4:
- Burnout signal detection (requires per-outcome engagement-level capture not present in the current schema)
- LLM-based issue-category classifier (current taxonomy is keyword-matched)
- Outcome → methodology-edit workflow tooling (review-decision → PR generation against the methodology repo). For now the maintainer edits methodology files manually after consulting the review queue.