Guide

The Three‑Layer QA Wall Kit (rubric template + judge prompts + 10% HITL sampling SOP)

A plug‑and‑play guide to stand up a three‑layer QA wall for AI deliverables: a weighted rubric with acceptance rules, copy‑ready judge prompts (rubric and pairwise), a 5–10% human sampling SOP, and a Monday dashboard you can run from café Wi‑Fi.

Ship a QA wall you can run from café Wi‑Fi: 1) a rubric‑scored LLM judge on every job; 2) a weekly golden‑set replay with bias/drift checks; 3) a 5–10% human‑in‑the‑loop sample with red/amber/green thresholds and a Monday dashboard. Copy the templates below, set the thresholds, and review once a week for 30 minutes.

Layer 1 — Rubric template and acceptance rules

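Use this rubric to score any AI‑assisted deliverable (content, support replies, agent actions). Each criterion is scored 1–5 (0 marks it as not applicable) and carries an explicit weight. The thresholds assume a weighted total on a 0–1 scale, so divide each score by 5 before weighting. A weighted total ≥0.80 with no critical risk flags, and with task fulfillment and factual accuracy both at 4 or higher, ships as Green.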

Rubric scale (apply to each criterion):

  • 5 Excellent — clear, correct, on‑spec; no edits.
  • 4 Good — minor nits; ships with light edit.
  • 3 Adequate — needs targeted edits to meet spec.
  • 2 Poor — multiple issues; rework required.
  • 1 Unacceptable — severe flaws; do not ship.
  • 0 N/A — criterion not applicable (weight ignored).

Copy/paste YAML (edit weights and examples to your domain):

rubric:
  version: "1.2"             # quoted so it stays a string and matches the judge JSON
  scale: 0-5
  weights:
    task_fulfillment: 0.30     # followed instructions/constraints
    factual_accuracy: 0.25     # grounded, verifiable claims
    clarity_structure: 0.15    # organization, headings, legibility
    style_brand_fit: 0.10      # tone, voice, format compliance
    citations_support: 0.10    # sources/links when required
    safety_risk: -0.10         # subtract for any risk (see flags)
  criteria:
    task_fulfillment:
      examples_pos: ["All required fields present", "Respected token/length caps", "Returned valid JSON schema"]
      examples_neg: ["Ignored explicit constraint", "Missing mandatory section"]
    factual_accuracy:
      examples_pos: ["Matches provided source data", "Numbers trace to links"]
      examples_neg: ["Hallucinated product features", "Misquoted stats"]
    clarity_structure:
      examples_pos: ["Uses headings/bullets", "Self‑contained summary"]
      examples_neg: ["Wall of text", "Inconsistent formatting"]
    style_brand_fit:
      examples_pos: ["Plainspoken operator tone", "No hype/jargon"]
      examples_neg: ["Salesy language", "Off‑brand emojis"]
    citations_support:
      examples_pos: ["Inline links for facts", "Source list at end"]
      examples_neg: ['"According to experts" with no source', "Dead links"]
    safety_risk:
      flags_critical: ["PII leakage", "Medical/legal advice without disclaimer", "Security vulnerability disclosure", "Hate/abuse"]
      flags_minor: ["Speculative claim as fact", "Unclear sourcing"]
  acceptance:
    green: "weighted_total >= 0.80 AND no flags_critical AND min(task_fulfillment, factual_accuracy) >= 4"
    amber: "0.70 <= weighted_total < 0.80 OR any single criterion <= 2 OR flags_minor present"
    red:   "weighted_total < 0.70 OR any flags_critical OR JSON invalid/unsafe action"
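The acceptance rules can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation, and it rests on several assumptions the YAML does not spell out: scores are divided by 5; a score of 0 (N/A) drops that criterion and the remaining weights are renormalized (the positive weights sum to 0.90, so without this the total would top out at 0.90); the 0.10 safety penalty is applied once if any flag is present; and rules are checked Red first, then Green, with everything else falling to Amber. The function names and the flat list of flag levels are hypothetical.

```python
from typing import Dict, List

WEIGHTS = {
    "task_fulfillment": 0.30,
    "factual_accuracy": 0.25,
    "clarity_structure": 0.15,
    "style_brand_fit": 0.10,
    "citations_support": 0.10,
}
RISK_PENALTY = 0.10  # magnitude of the safety_risk weight in the rubric


def weighted_total(scores: Dict[str, int], has_risk_flag: bool) -> float:
    """Weighted score on a 0-1 scale, minus the risk penalty if flagged."""
    # Assumption: score 0 means N/A, so its weight is ignored and the
    # remaining weights are renormalized to sum to 1.
    applicable = {name: s for name, s in scores.items() if s > 0}
    weight_sum = sum(WEIGHTS[name] for name in applicable)
    if weight_sum == 0:
        return 0.0
    total = sum(WEIGHTS[name] * (s / 5) for name, s in applicable.items())
    total /= weight_sum
    if has_risk_flag:
        total -= RISK_PENALTY
    return round(max(total, 0.0), 4)


def decide(scores: Dict[str, int], flag_levels: List[str],
           json_valid: bool = True) -> str:
    """Return 'green', 'amber', or 'red' for one scored item."""
    critical = "critical" in flag_levels
    minor = "minor" in flag_levels
    total = weighted_total(scores, has_risk_flag=bool(flag_levels))
    applicable = [s for s in scores.values() if s > 0]

    # Red is checked first so a critical flag always wins.
    if total < 0.70 or critical or not json_valid:
        return "red"
    core_ok = min(scores["task_fulfillment"], scores["factual_accuracy"]) >= 4
    no_weak_spot = all(s > 2 for s in applicable)
    if total >= 0.80 and core_ok and no_weak_spot and not minor:
        return "green"
    # Assumption: anything that is neither Red nor Green is Amber.
    return "amber"
```

Recomputing the decision in code, rather than trusting the judge's own `decision` field, keeps the cut‑lines in one place when you tune them.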

Judge output schema (require the model to return exactly this JSON):

{
  "version": "1.2",
  "item_id": "[UUID]",
  "rubric_scores": {
    "task_fulfillment": {"score": 0-5, "notes": "..."},
    "factual_accuracy": {"score": 0-5, "notes": "..."},
    "clarity_structure": {"score": 0-5, "notes": "..."},
    "style_brand_fit": {"score": 0-5, "notes": "..."},
    "citations_support": {"score": 0-5, "notes": "..."}
  },
  "risk_flags": [{"level": "minor|critical", "tag": "PII|Hallucination|Policy|Other", "notes": "..."}],
  "weighted_total": 0.0,
  "decision": "green|amber|red",
  "judge_confidence": 0.0,
  "timing_ms": 0,
  "tokens": {"in": 0, "out": 0}
}

Green ships automatically. Amber routes to light human edit. Red blocks and escalates.
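Before routing, it helps to check that the judge actually returned the schema. The sketch below uses only the standard library and validates a handful of fields; the route names and function names are hypothetical, and treating unparseable output as Red follows the "JSON invalid" clause of the acceptance rules.

```python
import json

CRITERIA = ("task_fulfillment", "factual_accuracy", "clarity_structure",
            "style_brand_fit", "citations_support")
ROUTES = {  # hypothetical queue names
    "green": "auto_ship",
    "amber": "light_edit_queue",
    "red": "block_and_escalate",
}


def parse_judge_output(raw: str) -> dict:
    """Parse the judge reply and raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")

    scores = data.get("rubric_scores")
    if not isinstance(scores, dict):
        raise ValueError("rubric_scores missing")
    for name in CRITERIA:
        entry = scores.get(name)
        score = entry.get("score") if isinstance(entry, dict) else None
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"{name}: score must be an integer")
        if not 0 <= score <= 5:
            raise ValueError(f"{name}: score out of range")

    if data.get("decision") not in ROUTES:
        raise ValueError("decision must be green, amber, or red")
    for flag in data.get("risk_flags", []):
        if not isinstance(flag, dict) or flag.get("level") not in ("minor", "critical"):
            raise ValueError("risk flag level must be minor or critical")
    return data


def route(raw: str) -> str:
    """Map a raw judge reply to a queue; bad output is treated as Red."""
    try:
        data = parse_judge_output(raw)
    except ValueError:
        return ROUTES["red"]
    if any(f["level"] == "critical" for f in data.get("risk_flags", [])):
        return ROUTES["red"]  # a critical flag overrides the judge's decision
    return ROUTES[data["decision"]]
```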

Judge prompt pack — rubric scoring + pairwise calibration

Run two judge modes: rubric‑scored (pointwise) for every job and pairwise for calibration/borderlines. Always randomize candidate positions and, when feasible, evaluate both orders (A/B and B/A) to kill position bias.

System message (rubric‑scored):

You are a strict QA judge for [OUTPUT_TYPE] using a weighted rubric. Follow the rubric YAML exactly. Return only the required JSON schema; no extra prose. Penalize unsupported claims. If any critical risk flag appears, decision must be "red" regardless of score.

User template (rubric‑scored):

[INSTRUCTIONS]

Rubric YAML:
[rubric_yaml_here]

Candidate OUTPUT:
[model_output_here]

Return the JSON schema exactly.

System message (pairwise):

You are a calibration judge. Compare two candidates for the same prompt. Ignore surface style unless it affects rubric outcomes. Return JSON with: preference ("A"|"B"|"tie"), reasons (≤40 words), and per‑criterion win/loss.

User template (pairwise with position randomization):

Prompt:
[prompt_text]

Rubric YAML:
[rubric_yaml_minified]

Candidate A (order=[A_first_or_second]):
[output_text_A]

Candidate B (order=[B_first_or_second]):
[output_text_B]

Return JSON: {"preference":"A|B|tie","reasons":"...","criteria": {"task_fulfillment":"A|B|tie", "factual_accuracy":"A|B|tie", "clarity_structure":"A|B|tie"}}

Implementation notes:

  • Run both orders when time allows: AB and BA; aggregate with Bradley–Terry or simple majority for quick checks.
  • Enforce max‑tokens and JSON schema via tool/function calling to keep latency/cost predictable.
  • Never let a model judge its own outputs on high‑risk work; use a different provider, or at minimum a different model.
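For quick checks, the AB/BA aggregation can be as simple as a majority vote plus a count of how often the two orders disagree. The sketch below assumes each judge reply has already been mapped back to the candidate labels ("A", "B", or "tie") rather than to screen positions; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

VALID = {"A", "B", "tie"}


def majority(prefs: Iterable[str]) -> str:
    """Majority preference across runs; a split vote counts as a tie."""
    prefs = list(prefs)
    if not prefs or any(p not in VALID for p in prefs):
        raise ValueError("preferences must be a non-empty list of A, B, or tie")
    ranked = Counter(prefs).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def order_flip_rate(pairs: List[Tuple[str, str]]) -> float:
    """Share of pairs where the AB run and the BA run disagree.

    Each tuple is (preference from the AB order, preference from the BA order).
    A high rate suggests the judge is reacting to position, not content.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(ab != ba for ab, ba in pairs) / len(pairs)
```

A pair whose two orders disagree resolves to "tie" under `majority`, which is a conservative default: the judge has not shown a stable preference.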

Layer 2 — Golden‑set replay and weekly drift check

Lock a small, fixed golden set per output type (e.g., 40–60 items) with human‑agreed labels and notes. Re‑evaluate it weekly and any time you change models, prompts, or data.

Golden‑set build (once):

  1. Sample real jobs across difficulty, length, and risk categories.
  2. Get 2–3 human raters to score with the rubric; resolve disagreements; store canonical labels and rationales.
  3. Save canonical inputs, expected properties, and any required sources.

Weekly replay (30–60 minutes):

  1. Run rubric‑judge on the golden set; log weighted_total, decision, risk flags.
  2. Run pairwise on 10–20 borderline pairs to calibrate preference stability (AB and BA orders).
  3. Compute agreement metrics vs canonical labels (e.g., Cohen’s κ or Kendall’s τ). Track week‑over‑week deltas.
  4. If agreement drops below “substantial” (κ < 0.61 on the Landis–Koch scale; τ has no standard bands, so treat τ < 0.59 as a working cut‑off) or flips on known tricky items, investigate and adjust.
  5. Refit thresholds if needed: move the green cut‑line up or down in 0.02 steps to maintain desired precision/recall.
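Step 3 of the replay needs an agreement number. Cohen's κ on the Green/Amber/Red decisions is easy to compute by hand; the sketch below does it with the standard library and applies the 0.61 and 0.50 cut‑offs used in this guide. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_status(kappa: float) -> str:
    """Map kappa to the guide's bands: >=0.61 ok, 0.50-0.61 watch, <0.50 crash."""
    if kappa >= 0.61:
        return "ok"
    if kappa >= 0.50:
        return "watch"
    return "crash"
```

For example, with 10 golden items where the judge and the canonical labels differ on one item (judge says green, humans said amber), κ comes out around 0.83, which sits comfortably in the "substantial" band.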

Bias/drift controls:

  • Always randomize candidate order; when feasible, run both orders and aggregate.
  • Store judge version, model ID, and prompt hash to spot silent regressions.
  • Keep the golden set stable for 4–6 weeks; then rotate ≤15% of items to avoid overfitting.
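Two of these controls are easy to script: a prompt hash to log with every judge run, and a capped rotation of the golden set. The sketch below is one way to do both, assuming SHA‑256 truncated to 12 hex characters is enough to tell prompt versions apart; the function names and the `candidates` pool of freshly labeled items are hypothetical.

```python
import hashlib
import random
from typing import List, Optional


def prompt_hash(system_msg: str, user_template: str) -> str:
    """Short, stable fingerprint of the judge prompt for run logs."""
    digest = hashlib.sha256()
    digest.update(system_msg.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(user_template.encode("utf-8"))
    return digest.hexdigest()[:12]


def rotate_golden_set(current: List, candidates: List,
                      max_percent: int = 15,
                      seed: Optional[int] = None) -> List:
    """Swap out at most max_percent of the golden set for new labeled items."""
    rng = random.Random(seed)
    # Integer arithmetic avoids float rounding on the cap.
    k = min(len(current) * max_percent // 100, len(candidates))
    drop = set(rng.sample(range(len(current)), k))
    kept = [item for i, item in enumerate(current) if i not in drop]
    return kept + rng.sample(candidates, k)
```

With a 60‑item set, the default cap swaps at most 9 items per rotation, so at least 51 items stay comparable week over week.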

Lisbon Test (travel‑proof):

  • Budget: 60 golden items × ~$0.02 ≈ $1.20/week; time ≈ 15–20 minutes per judge run on café Wi‑Fi.
  • If replay fails (timeouts or big deltas), pause auto‑ship and move to Amber sampling until stability returns.

Layer 3 — 5–10% human‑in‑the‑loop sampling SOP

Review a small, smart slice of work each week. Default 5–10% of total jobs, stratified to maximize signal.

Sampling recipe (per week):

  • 50% of the sample: Amber decisions (borderlines).
  • 30%: High‑risk Greens (long outputs, safety‑sensitive domains, new client/style).
  • 20%: Pure random Greens.
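The recipe above can be sketched as a small stratified draw. The job fields (`id`, `decision`, `high_risk`) and the function name are hypothetical; the sketch also assumes that when a stratum runs short, random Greens fill the remainder so the sample still hits its target size.

```python
import random
from typing import Dict, List, Optional


def draw_sample(jobs: List[dict], rate: float = 0.10,
                seed: Optional[int] = None) -> List[dict]:
    """Pick about rate * len(jobs) items: 50% Amber, 30% high-risk Green,
    and the rest random Greens. Reds are excluded (they are already blocked)."""
    rng = random.Random(seed)
    target = max(1, round(len(jobs) * rate))
    chosen: Dict[object, dict] = {}

    def take(pool: List[dict], k: int) -> None:
        # Sample without replacement, skipping items already chosen.
        available = [j for j in pool if j["id"] not in chosen]
        for job in rng.sample(available, max(0, min(k, len(available)))):
            chosen[job["id"]] = job

    amber = [j for j in jobs if j["decision"] == "amber"]
    greens = [j for j in jobs if j["decision"] == "green"]
    risky = [j for j in greens if j.get("high_risk")]

    take(amber, round(target * 0.5))
    take(risky, round(target * 0.3))
    take(greens, target - len(chosen))  # random Greens fill what is left
    return list(chosen.values())
```

Passing a fixed `seed` makes the weekly draw reproducible, which helps if a client later asks how an item ended up in the sample.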

Human review protocol:

  1. Human scores the item with the same rubric and leaves concrete edit notes.
  2. Log human vs judge decision and per‑criterion gaps.
  3. If a human finds a critical miss (e.g., major hallucination), immediately flip that job to Red and run a targeted audit.

Red/Amber/Green thresholds (team‑level):

  • Green (ship): judge_precision_on_green ≥ 0.95 AND human_disagreement_on_green ≤ 10% AND no critical flags this week.
  • Amber (hold + edit): either metric above misses target OR κ/τ agreement dips below substantial.
  • Red (stop + escalate): any critical risk event OR ≥2 major misses in a 50‑item sample OR golden‑set agreement crash (κ < 0.50 or τ < 0.50).

Escalation playbook:

  • Borderline surge (Amber > 25% of volume): raise green cut‑line by +0.02 and increase human sample to 15% for 1 week.
  • Drift suspected: freeze auto‑ship on the affected output type; re‑run golden set; switch judge model/provider if needed.
  • Safety incident: block category, alert owner, require 2‑person human review for 24 hours.
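The playbook is simple enough to encode so the Monday review produces the same actions every time. A minimal sketch follows; the function name, argument names, and action strings are hypothetical, and it only covers the three triggers listed above.

```python
from typing import Dict, List, Union


def escalate(amber_share: float, green_cut: float, sample_rate: float,
             drift_suspected: bool = False,
             safety_incident: bool = False) -> Dict[str, Union[float, List[str]]]:
    """Return adjusted settings plus the list of actions to take this week."""
    actions: List[str] = []

    if amber_share > 0.25:
        # Borderline surge: tighten the cut-line and sample more for one week.
        green_cut = round(green_cut + 0.02, 2)
        sample_rate = max(sample_rate, 0.15)
        actions.append("raise green cut-line by 0.02; sample 15% for 1 week")

    if drift_suspected:
        actions.append("freeze auto-ship on affected output type")
        actions.append("re-run golden set; consider switching judge model")

    if safety_incident:
        actions.append("block category and alert owner")
        actions.append("require 2-person human review for 24 hours")

    return {"green_cut": green_cut, "sample_rate": sample_rate,
            "actions": actions}
```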

Expected outcomes:

  • Maintain trust without full‑time reviewers.
  • Create auditable artifacts clients can inspect asynchronously.

Monday QA dashboard — metrics and thresholds

Stand up a simple table (Notion, Airtable, or a lightweight dashboard) that refreshes weekly. Columns and logic below.

Core widgets:

  • Volume & mix: total items, %Green/%Amber/%Red.
  • Judge health: κ or τ vs golden set; 4‑week trend sparkline.
  • Human QA: judge_precision_on_green (% of Greens humans agree should ship), human_disagreement_on_green (% of Greens humans downgraded), sample_n.
  • Risk: count by risk_flag (PII, Hallucination, Policy), MTTR on escalations.
  • Cost/latency: avg tokens_in/out, $/eval, median seconds/eval.

R/A/G for the dashboard (week‑level):

  • Green: κ or τ ≥ 0.61; judge_precision_on_green ≥ 0.95; 0 critical flags; MTTR < 24h.
  • Amber: κ or τ in [0.50, 0.61) OR judge_precision_on_green in [0.90, 0.95) OR ≥1 minor flag cluster.
  • Red: κ or τ < 0.50 OR ≥1 critical flag OR judge_precision_on_green < 0.90.
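The week‑level colour can be computed straight from the logged metrics. In this sketch Red is checked first, then Green, and anything left is Amber; treating an MTTR of 24 hours or more as Amber is an assumption, since the rules only list MTTR under Green. Function and argument names are hypothetical.

```python
from typing import Sequence


def precision_on_green(human_decisions: Sequence[str]) -> float:
    """Share of sampled judge-Greens that the human also marked green."""
    if not human_decisions:
        raise ValueError("no sampled Greens to score")
    return sum(d == "green" for d in human_decisions) / len(human_decisions)


def week_status(agreement: float, precision: float, critical_flags: int,
                minor_flag_cluster: bool = False,
                mttr_hours: float = 0.0) -> str:
    """Return 'green', 'amber', or 'red' for the weekly dashboard row."""
    if agreement < 0.50 or critical_flags >= 1 or precision < 0.90:
        return "red"
    healthy = (agreement >= 0.61 and precision >= 0.95
               and not minor_flag_cluster and mttr_hours < 24)
    return "green" if healthy else "amber"
```

Under the definitions in the Human QA widget, human_disagreement_on_green is simply 1 minus judge_precision_on_green, so the sketch tracks only the precision figure.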

Owner checklist for Monday review (30 minutes):

  1. Scan R/A/G state. If Amber/Red, apply the escalation playbook.
  2. Open 3 downgraded Greens; add “pattern” notes to the rubric.
  3. Check latency/cost outliers; cap max tokens or tighten prompts if drifting.
  4. Log one improvement action and a due date.

Cost and latency — plan your spend (with defaults you can trust)

Judge costs are tiny compared to human review. Use the formula below to plan budget and set sensible token caps.

Cost formula (o3‑mini pricing as of May 2026):

  • Input: $1.10 per 1M tokens → $0.0000011/token
  • Output: $4.40 per 1M tokens → $0.0000044/token
  • Per‑eval cost ≈ input_tokens×0.0000011 + output_tokens×0.0000044

Quick scenarios:

  • Compact rubric JSON (700 in / 120 out): ~$0.0013 per eval (~0.13¢)
  • Heavier evidence check (2,500 in / 300 out): ~$0.0041 per eval (~0.41¢)
  • Conservative planning envelope: $0.01–$0.03 per eval covers most ops; clinical slide example shows ≈$0.02 and ≈16 s per eval.

Budget examples:

  • 1,000 items/week at $0.02 ≈ $20 and ~4–5 judge‑hours total latency (runs in parallel).
  • Human review at $50 and 10 minutes per item → ≈$50,000 and ≈167 hours for the same volume.
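To rerun these numbers with your own token counts and volumes, the formula fits in a few lines. The per‑token prices are the ones quoted above; the $0.02 and 16‑second defaults come from the planning envelope, and the function names are hypothetical.

```python
from typing import Tuple

INPUT_USD_PER_TOKEN = 1.10 / 1_000_000   # $1.10 per 1M input tokens
OUTPUT_USD_PER_TOKEN = 4.40 / 1_000_000  # $4.40 per 1M output tokens


def eval_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one judge call at the quoted per-token prices."""
    return (input_tokens * INPUT_USD_PER_TOKEN
            + output_tokens * OUTPUT_USD_PER_TOKEN)


def weekly_budget(items: int, usd_per_eval: float = 0.02,
                  seconds_per_eval: float = 16.0) -> Tuple[float, float]:
    """Return (total USD, total judge-hours if calls ran one after another)."""
    return items * usd_per_eval, items * seconds_per_eval / 3600


if __name__ == "__main__":
    print(f"compact: ${eval_cost(700, 120):.4f}")
    print(f"heavy:   ${eval_cost(2500, 300):.4f}")
    usd, hours = weekly_budget(1000)
    print(f"1,000 items: ${usd:.2f}, {hours:.1f} judge-hours")
```

The judge‑hours figure is sequential time; running calls in parallel cuts the wall‑clock wait but not the spend.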

Latency tips:

  • Keep prompts/rubrics minified; enforce JSON‑only output.
  • Cap max output tokens; avoid long rationales (≤40 words).

24‑hour implementation plan — from blank to live

Follow this sequence to ship the wall in a day.

  1. Prep (15 min)
  • Paste the rubric YAML and JSON schema into your repo/wiki. Set Green/Amber/Red cut‑lines.
  2. Judge on every job (60–90 min)
  • Add a judge step to your pipeline (after generation, before delivery).
  • Enforce JSON‑only responses and validate against the schema.
  • Auto‑ship Greens; route Ambers to a light editor queue; block Reds.
  3. Golden set (45–60 min)
  • Collect 40–60 representative items with human‑agreed labels.
  • Script a weekly replay job (cron) that logs agreement metrics and diffs.
  4. HITL sample (30–45 min)
  • Create a “QA Sample” view that pulls 5–10% per the recipe (Amber, high‑risk Green, random Green).
  • Add a one‑click form for human scores/notes; store reviewer/time.
  5. Dashboard (30 min)
  • Build a Notion/Airtable view with the widgets listed; color the week using R/A/G rules.
  6. Guardrails (15 min)
  • Randomize pairwise order; prefer AB+BA for calibration.
  • Use a different model/provider for judging high‑risk outputs.
  • Log model IDs, prompt hashes, tokens, and timing.

Definition of done

  • One week of runs with a Monday review on the calendar.
  • Owner named for QA wall and escalation decisions.

Copy‑ready prompts — drop‑in blocks for your pipeline

Drop these into your judge calls. Replace bracketed fields with your specifics.

Rubric‑scored judge (JSON‑only):

System: You are a strict QA judge for [OUTPUT_TYPE]. Follow the rubric. Return JSON only.
User:
TASK: [paste the original instructions/user prompt]
CONTEXT: [sources or grounding data, if any]
RUBRIC_YAML:
[rubric_yaml_here]
CANDIDATE_OUTPUT:
[model_output_here]
Return JSON per schema. If any critical risk flag appears, set decision="red".

Pairwise judge (calibration/borderlines):

System: You compare two candidates for the same task under the rubric. Ignore style unless it changes rubric outcomes.
User:
PROMPT:
[prompt_text]
RUBRIC_YAML_MIN:
[rubric_yaml_minified]
CANDIDATE_A (order=[A_first_or_second]):
[text_A]
CANDIDATE_B (order=[B_first_or_second]):
[text_B]
Return JSON: {"preference":"A|B|tie","reasons":"≤40 words","criteria":{"task_fulfillment":"A|B|tie","factual_accuracy":"A|B|tie","clarity_structure":"A|B|tie"}}

Automation notes:

  • Store order_id and run AB then BA for the same pair when time permits; aggregate with Bradley–Terry (BT) or majority.
  • For safety‑sensitive domains, require a human check on all Reds and on any item carrying a minor safety flag (under the acceptance rules these land in Amber).
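When you compare more than two candidates (for example, three prompt variants across many pairs), Bradley–Terry turns the pairwise wins into one strength score per candidate. The sketch below uses the standard iterative (minorization–maximization) update with no external libraries. The input format, a dict of win counts keyed by (winner, loser), is an assumption, as is the suggestion to count a tie as half a win for each side before calling it.

```python
from typing import Dict, Tuple


def bradley_terry(wins: Dict[Tuple[str, str], float],
                  iterations: int = 200) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(i, j)] is how many times candidate i beat candidate j.
    Returned strengths are non-negative and sum to 1.
    """
    items = sorted({name for pair in wins for name in pair})
    if not items or sum(wins.values()) == 0:
        raise ValueError("need at least one recorded win")
    strength = {name: 1.0 / len(items) for name in items}

    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(wins.get((i, j), 0) for j in items if j != i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                pair_strength = strength[i] + strength[j]
                if games and pair_strength > 0:
                    denom += games / pair_strength
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {name: value / norm for name, value in updated.items()}
    return strength
```

With only two candidates the result reduces to the raw win share (3 wins against 1 gives 0.75 and 0.25), so majority voting is enough for single pairs and Bradley–Terry earns its keep only on larger comparison sets.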