Reviewer Agreement and Track-Record Signals

Why this page exists

A fact-checking system that refuses some submissions has to be transparent about two things: (1) how reliable its refusals are, and (2) how it handles users whose submission history looks suspicious. This page covers both. Veridi measures the first with a reviewer-pair agreement protocol scored by Krippendorff’s α. Veridi handles the second with a per-user track-record signal that informs review priority but, by structural design, never relaxes the assessment process for any user regardless of how clean their record looks.


What gets reviewed

When a submission produces an ATTACK-DETECTED or REFUSED-TOPIC verdict, the assessment is preserved as a rejection event. Operators (admin users) can later classify the event with one of four tags:

Tag                      Meaning
intake-false-positive    The refusal was the wrong call. The submission was legitimate, and a clean assessment should have run.
bypass-approved          The refusal was technically correct for the input shape, but an operator override is appropriate for this case (for example, a journalist fact-checking adversarial content).
truly-harmful            The refusal correctly caught harm-shaped intake.
truly-malicious          The refusal correctly caught system-attack intake.

These four tags are the substrate for everything that follows. Each rejection event accumulates reviewer tags over time, and the methodology measures how reliably operators agree.


The K=2 reviewer protocol

Every active rejection event is tagged by up to two reviewers (K=2). The protocol:

  1. One reviewer tags first. The event sits in an “awaiting second review” state until another operator independently tags it.
  2. A second reviewer tags. If both reviewers chose the same tag, the event is “tagged, agreed” and the consensus stands.
  3. Disagreement triggers adjudication. If the two tags differ, the event enters “awaiting adjudication”. A third operator (the adjudicator) reviews both reviewer tags and the underlying submission, then issues a binding decision that supersedes the prior two tags.
  4. Withdraw and re-tag. A reviewer who wants to revise their tag uses a soft-withdraw path; the withdrawn row no longer counts against the K=2 quota.

K=2 is the smallest panel size that can produce a disagreement signal. Larger panels would tighten the agreement statistic but cost more operator time per event; the protocol spends that operator time on reviewing more events instead.
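
The four protocol steps can be read as a small state function over an event's tag rows. The sketch below is illustrative only: the `TagRow` class, the `event_state` function, and the "untagged" and "adjudicated" state names are assumptions made for this example, not names taken from the Veridi code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TagRow:
    """One reviewer tag on a rejection event (hypothetical shape)."""
    reviewer: str
    tag: str
    withdrawn: bool = False  # soft-withdrawn rows stay stored but stop counting


def event_state(rows: List[TagRow], adjudicated_tag: Optional[str] = None) -> str:
    """Derive the review state of one rejection event under K=2."""
    if adjudicated_tag is not None:
        # A binding adjudication supersedes both reviewer tags.
        return "adjudicated"
    active = [r for r in rows if not r.withdrawn]
    if len(active) > 2:
        raise ValueError("K=2 allows at most two active tags per event")
    if len({r.reviewer for r in active}) != len(active):
        raise ValueError("the two active tags must come from different reviewers")
    if not active:
        return "untagged"
    if len(active) == 1:
        return "awaiting second review"
    if active[0].tag == active[1].tag:
        return "tagged, agreed"
    return "awaiting adjudication"
```

Because withdrawn rows are filtered out before counting, a reviewer can withdraw and re-tag without exceeding the two-tag quota.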


How reliably reviewers agree: Krippendorff’s α

Reviewer agreement is a measurable property of the rejection-event corpus, not a process input. The methodology computes Krippendorff’s α weekly over the trailing 90-day corpus, with two distance functions:

  • Ordinal δ (substantive agreement). The four tags carry a natural ordering by adversarialness of intake: intake-false-positive < bypass-approved < truly-harmful < truly-malicious. Ordinal δ scores a one-step disagreement as less severe than a two-step disagreement. A pair of reviewers who tagged truly-harmful and truly-malicious are closer to each other than a pair who tagged intake-false-positive and truly-malicious.
  • Nominal δ (categorical agreement). A cross-check that treats the four tags as unordered. Useful when one wants to know “did the reviewers pick the same category at all?” without weighting by severity. Reported alongside ordinal δ as a redundancy.

Reporting both is deliberate. Ordinal δ is the primary signal for the methodology because the tag ordering is the substantive content; nominal δ catches the case where reviewers cluster on one severity level for the wrong reason and the ordinal calculation flatters the result.
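
For readers who want to see how the two distance functions change the statistic, here is a minimal sketch of Krippendorff's α in its standard coincidence-matrix form, α = 1 − D_o / D_e. The function name and the input shape (one tuple of tags per event) are assumptions for illustration; the production computation in Veridi/app/database.py may be organized differently.

```python
from itertools import permutations

# Tag order, from least to most adversarial, as stated above.
TAGS = ["intake-false-positive", "bypass-approved", "truly-harmful", "truly-malicious"]


def krippendorff_alpha(units, metric="ordinal"):
    """units: iterable of tag tuples, one tuple per rejection event.

    Events with fewer than two tags carry no agreement information
    and are skipped. Returns None when alpha is undefined.
    """
    index = {tag: i for i, tag in enumerate(TAGS)}
    q = len(TAGS)
    coincidence = [[0.0] * q for _ in range(q)]
    for tags in units:
        m = len(tags)
        if m < 2:
            continue
        # Every ordered pair of tags within an event adds 1/(m-1).
        for a, b in permutations(tags, 2):
            coincidence[index[a]][index[b]] += 1.0 / (m - 1)
    n_c = [sum(row) for row in coincidence]
    n = sum(n_c)
    if n < 2:
        return None

    def delta2(c, k):
        if c == k:
            return 0.0
        if metric == "nominal":
            return 1.0
        lo, hi = min(c, k), max(c, k)
        # Ordinal distance: mass between the two ranks, minus half the endpoints.
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2) ** 2

    pairs = [(c, k) for c in range(q) for k in range(q)]
    d_o = sum(coincidence[c][k] * delta2(c, k) for c, k in pairs) / n
    d_e = sum(n_c[c] * n_c[k] * delta2(c, k) for c, k in pairs) / (n * (n - 1))
    if d_e == 0:
        return None  # only one tag in use, so alpha is undefined
    return 1.0 - d_o / d_e
```

With otherwise identical data, a truly-harmful / truly-malicious disagreement lowers ordinal α less than an intake-false-positive / truly-malicious disagreement does, while nominal α treats the two disagreements the same.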


The agreement floors

Krippendorff’s α equals 1 for perfect agreement and 0 when agreement is no better than chance; negative values indicate systematic disagreement. Methodology research generally treats α ≥ 0.667 as “tentative reliability” and α ≥ 0.800 as “publication reliability.” Veridi uses two floors per δ, four in total:

Floor                      Ordinal δ    Nominal δ
Working reliability        0.60         0.75
Publication reliability    0.75         0.85

These floors are looser than verdict-α floors elsewhere in the methodology, intentionally: tagging an adversarial intake is genuinely harder than tagging the verdict on a regular claim. The looser floors are a calibration to the harder task, not a quality concession.

When any floor is breached on a given week, the snapshot is flagged for attention until an operator acknowledges it. Continued breaches feed methodology review, which may revise the tag definitions, the reviewer training, or the threshold itself.
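
A floor check over one weekly snapshot could look like the sketch below. The four floor values come from the table above; the function name, the dictionary layout, and the choice to treat a value strictly below the floor as a breach are assumptions for illustration.

```python
# Floor values from the table above; the layout is hypothetical.
FLOORS = {
    "working": {"ordinal": 0.60, "nominal": 0.75},
    "publication": {"ordinal": 0.75, "nominal": 0.85},
}


def breached_floors(alpha_ordinal, alpha_nominal):
    """Return (level, metric) pairs whose floor the snapshot falls below.

    Assumption: a breach means alpha is strictly below the floor.
    """
    observed = {"ordinal": alpha_ordinal, "nominal": alpha_nominal}
    return [
        (level, metric)
        for level, per_metric in FLOORS.items()
        for metric, floor in per_metric.items()
        if observed[metric] < floor
    ]


# A snapshot is flagged for attention when any floor is breached.
breaches = breached_floors(alpha_ordinal=0.70, alpha_nominal=0.80)
requires_attention = bool(breaches)
```

In this example the snapshot clears both working floors but misses both publication floors, so it would be flagged.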


The intake-false-positive rate

Alongside α, the methodology tracks the share of recent refusals that were tagged intake-false-positive: a refusal that, in reviewer judgment, should not have fired. Computed over the trailing 30 days, this rate is a direct calibration on the adversarial pre-filter and the harm-topic refusal gate.

A floor applies, here in the sense of an upper limit: if the 30-day intake-false-positive rate exceeds 15%, the adversarial pre-filter has drifted out of calibration and the next methodology review treats it as priority work. Because the rate is computed over the full 30-day window, the limit is loose enough that a quiet week with two false positives out of four refusals doesn’t trip the alarm on its own, but tight enough that sustained over-refusal cannot accumulate unflagged.
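
As a rough sketch of the arithmetic, the rate is the count of refusals tagged intake-false-positive divided by the count of refusals in the window. The function name, the `(timestamp, tag)` input shape, and the choice to use all refusals in the window as the denominator are assumptions for illustration.

```python
from datetime import datetime, timedelta

FALSE_POSITIVE_LIMIT = 0.15  # the 15% threshold stated above


def false_positive_rate(refusals, now, window_days=30):
    """refusals: list of (timestamp, final_tag) pairs, one per refusal.

    Assumption: every refusal inside the window counts in the denominator.
    Returns None when the window holds no refusals.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [tag for ts, tag in refusals if ts >= cutoff]
    if not recent:
        return None
    false_positives = sum(tag == "intake-false-positive" for tag in recent)
    return false_positives / len(recent)


now = datetime(2024, 6, 30)
refusals = (
    [(now - timedelta(days=3), "intake-false-positive")] * 2
    + [(now - timedelta(days=20), "truly-harmful")] * 18
    + [(now - timedelta(days=60), "intake-false-positive")]  # outside the window
)
rate = false_positive_rate(refusals, now)
drifted = rate is not None and rate > FALSE_POSITIVE_LIMIT
```

Here two false positives out of twenty in-window refusals give a rate of 10%, which stays under the limit; the older false positive falls outside the 30-day window and is ignored.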


Per-user track-record signal

Separately from the rejection-event corpus, the methodology measures per-user adversarial activity as a continuous signal. Two counters are maintained per user over a rolling 90-day window:

  • adversarial_event_count: submissions that produced an ATTACK-DETECTED or REFUSED-TOPIC verdict, or that were tagged truly-harmful or truly-malicious after reviewer agreement.
  • legitimate_claim_count: submissions producing any directional verdict on the regular twelve-category scale.

The ratio adversarial / (adversarial + legitimate) is the per-user track-record. A user with five legitimate submissions and zero adversarial events has a ratio of 0; a user with five adversarial events and zero legitimate has a ratio of 1.
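
The ratio is simple enough to state as a function. The counter names match the two bullets above; the function name and the decision to return None for a user with no counted submissions (where the ratio is undefined) are assumptions for this sketch.

```python
from typing import Optional


def track_record_ratio(adversarial_event_count: int, legitimate_claim_count: int) -> Optional[float]:
    """adversarial / (adversarial + legitimate) over the rolling window.

    Assumption: with no counted submissions the ratio is undefined,
    so this sketch returns None rather than guessing a value.
    """
    total = adversarial_event_count + legitimate_claim_count
    if total == 0:
        return None
    return adversarial_event_count / total
```

For example, `track_record_ratio(0, 5)` gives 0.0 and `track_record_ratio(5, 0)` gives 1.0, matching the two cases described above.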

What the signal is used for

The track-record informs review priority and surfacing: an operator looking at the rejection queue sees the per-user ratio alongside the rejection event. A user with no prior submissions and one adversarial event looks different from a user with fifty prior adversarial events. The track-record helps the operator allocate attention.

What the signal is structurally prevented from doing

The track-record never short-circuits the assessment process for any user. This is the bypass-precondition principle: no per-user history value, alone, is treated as a sufficient condition for relaxing the assessment of any specific submission. A user with a perfect record gets the same scrutiny on every submission as a brand-new account. A user with a long adversarial record cannot have a submission rejected without the same assessment path that any other submission goes through.

This is the structural defense against vector #13, Warm-up-then-defect (per-user trust gaming): an adversary cannot accumulate trust on the system and then exploit it, because trust is not a system input. The signal is informational at the operator-review surface; it does not feed back into the assessor.
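
One way to picture the separation is that the assessor's signature has no user-history parameter at all, and the track-record is attached only when the operator-facing queue row is built. Everything in the sketch below is hypothetical: `assess` is a toy placeholder, not Veridi's assessor, and `QueueRow` and `build_queue_row` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


def assess(submission_text: str) -> str:
    """Toy placeholder for the assessor. It takes no user argument,
    so per-user history cannot influence the verdict."""
    if "ignore previous instructions" in submission_text.lower():
        return "ATTACK-DETECTED"
    return "ASSESSED"


@dataclass(frozen=True)
class QueueRow:
    """What an operator sees in the rejection queue (hypothetical shape)."""
    event_id: int
    verdict: str
    user_track_record: Optional[float]  # shown for context only


def build_queue_row(event_id: int, submission_text: str,
                    user_track_record: Optional[float]) -> QueueRow:
    verdict = assess(submission_text)  # history is never passed in
    return QueueRow(event_id, verdict, user_track_record)
```

Two users with opposite track-records who submit the same text get the same verdict; only the context shown to the operator differs.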


How findings feed methodology revision

The reviewer-agreement and intake-false-positive floors are designed to flag drift, not to trigger automatic methodology changes. When a snapshot breaches a floor:

  1. The snapshot row carries a requires_attention flag visible in the admin cadence-review queue.
  2. An operator acknowledges the snapshot, which is itself recorded for audit.
  3. The methodology maintainer reviews the breached snapshot and decides what changes (if any) to make in the next methodology revision: tag-definition tightening, reviewer-training updates, threshold adjustments, or no change with documented rationale.

The methodology files are never auto-modified. The loop is auto-flag with operator acknowledgement, not auto-adjust.
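
The auto-flag, operator-acknowledge loop can be sketched as follows. The `requires_attention` flag is named in the steps above; the `Snapshot` class, its other fields, and the two functions are assumptions for illustration. Nothing in the sketch touches thresholds or tag definitions, mirroring the rule that methodology files are never auto-modified.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Snapshot:
    """One weekly agreement snapshot (hypothetical shape)."""
    week: str
    breaches: List[str] = field(default_factory=list)
    requires_attention: bool = False
    acknowledged_by: Optional[str] = None
    acknowledged_at: Optional[datetime] = None


def flag(snapshot: Snapshot) -> Snapshot:
    """Step 1: raise the flag when any floor is breached."""
    snapshot.requires_attention = bool(snapshot.breaches)
    return snapshot


def acknowledge(snapshot: Snapshot, operator: str) -> Snapshot:
    """Step 2: record who acknowledged and when, then clear the flag."""
    if not snapshot.requires_attention:
        raise ValueError("nothing to acknowledge on this snapshot")
    snapshot.acknowledged_by = operator
    snapshot.acknowledged_at = datetime.now(timezone.utc)
    snapshot.requires_attention = False
    return snapshot
```

Step 3, the maintainer's decision about what to change, stays a human judgment and is deliberately absent from the code.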


Reading the source

The protocol is canonical in Output_Format_Standard.md §“Rejection-event taxonomy” and operationalized in Regression_Testing_Framework.md §5d. The runtime implementation lives in Veridi/app/database.py (the K=2 storage model with partial unique index, the ordinal-α computation, the trailing-30-day false-positive rate) and Veridi/app/main.py (the admin tagging routes and snapshot acknowledgement endpoints).