Minimal Evals Loop Starter Kit (Golden Sets + AI‑Judge + Pairwise A/B)
A copy‑and‑ship template to stand up a minimal LLM evals loop: JSONL golden sets (email rewrite, JSON extraction, summarization), GEval‑style judges, a blind pairwise A/B judge, a promptfoo CI regression gate, a Monday Notion scorecard, and a deprecations checklist for drift control.
Copy this into your repo or Notion and fill in the [BRACKETS]. The kit gives you three shippable eval types (email rewrite, JSON extraction, summarization), GEval‑style judge prompts, a pairwise A/B judge, a CI regression gate, and a Monday scorecard. Keep the generator and judge in different model families, randomize A/B order, and human‑review 10–20% of live jobs weekly.
Repo scaffold (copy/paste)
Drop this structure anywhere in your app repo or a separate /t/evals-loop folder. Keep datasets in JSONL for easy tooling interop.
Folder tree
/t/evals-loop/
  datasets/
    email_rewrite.jsonl
    json_extraction.jsonl
    summarization.jsonl
  schemas/
    extraction.schema.json
  judges/
    rubric_email.md
    rubric_summarization.md
    pairwise_ab.md
  ci/
    promptfoo/promptfooconfig.yaml
    github-actions.yaml
  scripts/
    compute_metrics.py
    sample_human_review.md
  notion/
    Weekly Scorecard (properties).md
  CHECKLIST_deprecations.md
  README.md
Golden set: Email rewrite (email_rewrite.jsonl)
Each line is one test case. Use a small, representative, and slightly adversarial set. Tag risky patterns (tone, jargon, redaction) so you can slice results later.
{"id":"[CASE_ID]","input_email":"[PASTE_RAW_EMAIL]","instruction":"Rewrite to [TONE/TASK] in ≤[N] words; keep [CONSTRAINTS]","expected":"[TARGET_REWRITE_OR_REFERENCE]","tags":["tone:[CASUAL/FORMAL]","region:[US/EU]","pii:[YES/NO]"]}
{"id":"[CASE_ID]","input_email":"...","instruction":"...","expected":"...","tags":["..."]}
Pass criteria example (for judges below):
- Instruction‑following ≥ [THRESHOLD_1] (e.g., 0.8)
- Tone fit ≥ [THRESHOLD_2] (e.g., 0.75)
- Clarity ≥ [THRESHOLD_3]
- No policy/PII leaks (hard fail)
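Before wiring a golden set into CI, a quick loader catches malformed lines and missing keys early. A minimal sketch (hypothetical helper, assuming the email-rewrite field names above):

import json, sys

REQUIRED = {"id", "input_email", "instruction", "expected", "tags"}

def load_golden(path):
    cases = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines
            case = json.loads(line)  # raises with details on malformed JSON
            missing = REQUIRED - case.keys()
            if missing:
                sys.exit(f"{path}:{n} missing fields: {sorted(missing)}")
            cases.append(case)
    return cases

print(f"{len(load_golden('datasets/email_rewrite.jsonl'))} cases OK")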
Golden set: JSON extraction (json_extraction.jsonl + schemas/extraction.schema.json)
Define a single JSON Schema for validation, then pair each input with the expected object. Use EM/F1 per field for softer checks (dates, names).
Schema + example (copy as‑is, then edit)
extraction.schema.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://[YOUR_DOMAIN]/schemas/extraction.schema.json",
  "title": "[ENTITY] Extraction",
  "type": "object",
  "required": ["name", "email", "amount", "currency"],
  "properties": {
    "name": {"type": "string", "minLength": 1},
    "email": {"type": "string", "format": "email"},
    "amount": {"type": "number", "minimum": 0},
    "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    "invoice_date": {"type": "string", "format": "date"}
  },
  "additionalProperties": false
}
json_extraction.jsonl
{"id":"[CASE_ID]","input":"[RAW_TEXT_OR_OCR]","expected":{"name":"[NAME]","email":"[EMAIL]","amount":[NNN.NN],"currency":"[ISO]","invoice_date":"[YYYY-MM-DD]"},"tags":["ocr:[YES/NO]","lang:[EN/ES]"]}
Scoring plan:
- Schema validation: pass/fail
- Field‑level Exact Match and token‑level F1 for [FIELDS] (use compute_metrics.py)
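For the schema pass/fail step, a minimal sketch assuming the jsonschema package (any Draft 2020-12 validator works; email/date format checks only run when a FormatChecker is attached):

import json
from jsonschema import Draft202012Validator, FormatChecker

with open("schemas/extraction.schema.json") as f:
    schema = json.load(f)
validator = Draft202012Validator(schema, format_checker=FormatChecker())

def schema_pass(output: dict) -> bool:
    # Hard pass/fail: any validation error fails the case
    return next(validator.iter_errors(output), None) is None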
Golden set: Summarization (summarization.jsonl)
Pair each source with its reference or QA question. Include tricky, factual passages and noisy inputs.
{"id":"[CASE_ID]","document":"[SOURCE_TEXT]","instruction":"Summarize for [AUDIENCE] in ≤[N] bullets. Must include: [KEY_FACTS].","expected":"[REFERENCE_SUMMARY]","tags":["factuality:high","length:short"]}
Pass criteria: Factuality ≥ [THRESHOLD], Relevance ≥ [THRESHOLD], No fabricated numbers (hard fail).
GEval‑style AI‑judge prompts (analysis‑then‑score)
Use analysis‑then‑score and return strict JSON. Keep the judge in a different model family from the generator.
rubric_email.md
System: You are an impartial writing quality judge. Blind to model names. First analyze, then score. Be concise.
User:
Task: Evaluate the CANDIDATE email against the SOURCE + INSTRUCTION using this rubric (0.0–1.0 each):
1) Instruction‑Following: obeys constraints (length, CTA, do/don’t)
2) Tone Fit: matches target tone + audience
3) Clarity: structure, readability, plain language
4) Safety/Policy: PII leaks, false claims, disallowed content (hard fail)
Return JSON only:
{
  "analysis": "1–3 sentences citing specific lines",
  "scores": {"instruction": [0–1], "tone": [0–1], "clarity": [0–1]},
  "hard_fail": [true|false],
  "pass": [true|false],
  "pass_reason": "short reason"
}
SOURCE:
{{source_text}}
INSTRUCTION:
{{instruction}}
CANDIDATE:
{{candidate_text}}
rubric_summarization.md
System: Impartial summarization judge. Blind to model names. Analyze, then score.
User:
Rubric (0.0–1.0):
- Factuality (faithful to source)
- Relevance (includes required facts, omits fluff)
- Conciseness (meets length)
JSON output: {"analysis":"...","scores":{"factuality":x,"relevance":y,"conciseness":z},"hard_fail":false,"pass":true,"pass_reason":"..."}
SOURCE:
{{document}}
REQUIREMENTS:
{{instruction}}
CANDIDATE:
{{candidate_text}}
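Both rubrics demand JSON only, but judges sometimes wrap the object in code fences or preamble anyway. A defensive parse (minimal sketch, hypothetical helper) avoids silent scoring failures:

import json, re

def parse_judge_reply(text: str) -> dict:
    match = re.search(r"\{.*\}", text, re.DOTALL)  # outermost braces; fences and prose ignored
    if not match:
        raise ValueError("no JSON object found in judge reply")
    return json.loads(match.group(0))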
Pass decision default:
- Email: pass if instruction≥[0.8] AND tone≥[0.75] AND clarity≥[0.75] AND hard_fail=false
- Summarization: pass if factuality≥[0.85] AND relevance≥[0.8] AND conciseness within ±[10]% of target
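Wired into code, these defaults are a couple of comparisons. A sketch assuming the judge JSON shapes above, with the example thresholds filled in and summary lengths passed in separately:

def email_pass(j: dict) -> bool:
    s = j["scores"]
    return (s["instruction"] >= 0.80 and s["tone"] >= 0.75
            and s["clarity"] >= 0.75 and not j["hard_fail"])

def summarization_pass(j: dict, target_len: int, actual_len: int) -> bool:
    s = j["scores"]
    within_length = abs(actual_len - target_len) <= 0.10 * target_len  # ±10% of target
    return (s["factuality"] >= 0.85 and s["relevance"] >= 0.80
            and within_length and not j["hard_fail"])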
Pairwise A/B judge (blind order)
Blind A/B with randomized order. The judge must choose A, B, or Tie and explain briefly.
pairwise_ab.md
System: You are an impartial A/B writing judge. Blind to model names. Compare two candidates for the same task. Analyze briefly, then pick a winner.
User:
Task: Choose the better candidate against the SOURCE + INSTRUCTION using this priority: (1) Instruction‑Following, (2) Factuality/Safety, (3) Tone Fit, (4) Clarity.
Return JSON only:
{
  "analysis": "2–4 sentences comparing A vs B",
  "winner": "A"|"B"|"Tie",
  "confidence": 0.0–1.0
}
SOURCE:
{{source_text_or_document}}
INSTRUCTION:
{{instruction}}
CANDIDATE_A:
{{candidate_a}}
CANDIDATE_B:
{{candidate_b}}
Runtime rule: Randomize which system output is A vs B per case; store mapping for win‑rate math.
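A sketch of that rule with hypothetical record shapes: flip a fair coin per case, store the mapping, then un-flip when computing each system's win rate:

import random

def assign_order(case_id, out_sys1, out_sys2):
    flipped = random.random() < 0.5
    a, b = (out_sys2, out_sys1) if flipped else (out_sys1, out_sys2)
    return {"case_id": case_id, "A": a, "B": b, "flipped": flipped}

def sys1_win_rate(results):
    # results: [{"winner": "A" | "B" | "Tie", "flipped": bool}, ...]
    decided = [r for r in results if r["winner"] != "Tie"]
    wins = sum((r["winner"] == "B") == r["flipped"] for r in decided)
    return wins / len(decided) if decided else 0.5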
CI regression gate (promptfoo + GitHub Actions)
This is a minimal promptfoo-style config for local and CI runs: it evaluates your system’s outputs against the golden sets using the judges above and fails the build on regression. Treat the keys below as a sketch and check them against the config schema of your installed promptfoo version.
ci/promptfoo/promptfooconfig.yaml
version: 1
providers:
  # Judge provider lives in a different family from your generator
  - id: judge
    provider: [JUDGE_PROVIDER_ID] # e.g., anthropic:claude-3-haiku or openai:gpt-4o-mini
    config:
      apiKeyEnv: [JUDGE_API_KEY_ENV]
# Targets are your systems under test (SUTs). Use exec/http to call them.
targets:
  - id: A
    type: exec
    command: [CMD_TO_RUN_SYSTEM_A] # e.g., node scripts/run_a.mjs
  - id: B
    type: exec
    command: [CMD_TO_RUN_SYSTEM_B]
scorers:
  # Rubric judges for pass/fail
  - id: email_rubric
    type: llm-rubric
    provider: judge
    promptPath: ../../judges/rubric_email.md
    mapping:
      source_text: input_email
      instruction: instruction
      candidate_text: output
  - id: sum_rubric
    type: llm-rubric
    provider: judge
    promptPath: ../../judges/rubric_summarization.md
    mapping:
      document: document
      instruction: instruction
      candidate_text: output
  # Pairwise judge for win rate (A vs B)
  - id: pairwise_ab
    type: llm-pairwise
    provider: judge
    promptPath: ../../judges/pairwise_ab.md
    mapping:
      source_text_or_document: source
      instruction: instruction
      candidate_a: output_A
      candidate_b: output_B
# Datasets
datasets:
  - id: email
    path: ../../datasets/email_rewrite.jsonl
  - id: extraction
    path: ../../datasets/json_extraction.jsonl
  - id: summarization
    path: ../../datasets/summarization.jsonl
# Evaluations
runs:
  - dataset: email
    targets: [A]
    scorers: [email_rubric]
  - dataset: summarization
    targets: [A]
    scorers: [sum_rubric]
  - dataset: extraction
    targets: [A]
    asserts:
      - type: json-schema
        schemaPath: ../../schemas/extraction.schema.json
      - type: python
        script: ../../scripts/compute_metrics.py # emits EM/F1 per field
  # Pairwise A/B on a smaller slice
  - dataset: email
    sample: 50%
    targets: [A, B]
    scorers: [pairwise_ab]
thresholds:
  passRate:
    email:
      min: [0.90]
    summarization:
      min: [0.88]
  winRate:
    pairwise_ab:
      A: {min: [0.52]} # require A to win >52% to ship
GitHub Actions (ci/github-actions.yaml)
name: evals
on:
  pull_request:
  workflow_dispatch:
jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: {node-version: '20'}
      - run: npm ci
      - run: npx promptfoo eval --config ./t/evals-loop/ci/promptfoo/promptfooconfig.yaml
      - name: Fail on regression
        run: |
          if [ -f .promptfoo/summary.json ]; then cat .promptfoo/summary.json; fi
          # promptfoo exits non-zero if thresholds not met
Local smoke test:
npx promptfoo eval --config ./t/evals-loop/ci/promptfoo/promptfooconfig.yaml --output .promptfoo/local
Monday scorecard (Notion template)
Create a Notion database called “LLM Evals – Weekly Scorecard”. Add these properties and formulas. Duplicate weekly on Monday.
Properties (columns):
- Week (Date)
- Model (Text) – e.g., [GENERATOR_MODEL_ID]
- Judge (Text) – e.g., [JUDGE_MODEL_ID]
- Datasets (Multi-select): Email, Extraction, Summarization
- Pass Rate – Email (%) [Number]
- Pass Rate – Summ (%) [Number]
- Win Rate – A vs B (%) [Number]
- p95 Latency (ms) [Number]
- Cost / Job ($) [Number]
- Cost / 100 Jobs ($) [Formula]
- Judge Agreement (%) [Number]
- Human Sample Accuracy (%) [Number]
- Incidents (Text)
- Decision (Select): Roll Forward, Hold, Roll Back
- Notes (Text)
Formulas:
- Cost / 100 Jobs = prop("Cost / Job ($)") * 100
How to fill it fast (Monday 30 min):
- Export last week’s .promptfoo/summary.json and latency/cost logs from [OBSERVABILITY_TOOL] or app logs.
- Paste pass/win rates and judge agreement.
- Compute Cost / Job from token usage exports (see “Cost tracking spec”).
- Paste p95 latency from logs.
- Enter Human Sample Accuracy from your review sample.
- Set Decision and owner for next action.
Human sampling rules (10–20%)
Use this for drift calibration without blowing up your week. Start at 10%, go to 20% during changes or incidents.
Sampling plan:
- Weekly sample size: max(ceil(0.1 * WEEKLY_JOB_COUNT), [MIN_SAMPLE]) → aim for 10–20%.
- Randomization: reservoir sampling or rand() < p on job IDs; exclude golden‑set traffic.
- Focused strata: always include borderline judge cases (confidence in [0.45–0.55]) and all judge ties.
- Escalation: any hard‑fail from the AI judge (policy/PII) → 100% human review for that segment until it stays clean for 2 weeks.
- Reviewer guide: use scripts/sample_human_review.md with the same rubric; record pass/fail and comments.
Tracking fields per reviewed job:
- job_id, created_at, test_type, candidate, judge_pass, human_pass, mismatch (true/false), mismatch_reason, followup_action.
Target: Human Sample Accuracy ≥ [TARGET_% e.g., 95%] alignment with judge decisions. If < target two weeks in a row → tighten thresholds or change judge model.
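A sketch of the sampling plan, assuming per-job records carry hypothetical judge_confidence and judge_winner fields:

import math, random

def pick_sample(jobs, rate=0.10, min_sample=25):  # min_sample is a placeholder
    forced = [j for j in jobs
              if 0.45 <= j.get("judge_confidence", 1.0) <= 0.55
              or j.get("judge_winner") == "Tie"]
    forced_ids = {j["job_id"] for j in forced}
    rest = [j for j in jobs if j["job_id"] not in forced_ids]
    n = max(math.ceil(rate * len(jobs)), min_sample)
    top_up = random.sample(rest, min(max(n - len(forced), 0), len(rest)))
    return forced + top_up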
Deprecations watch checklist (CHECKLIST_deprecations.md)
Keep generator and judge models from different families. Pin exact snapshots where possible. Rehearse rollbacks.
Include this file in your repo and review monthly.
[ ] Pin exact model IDs in env: GENERATOR=[PROVIDER:MODEL@SNAPSHOT], JUDGE=[PROVIDER:MODEL@SNAPSHOT]
[ ] Subscribe to provider changelogs + deprecations pages
[ ] Add calendar reminder: quarterly judge recalibration (re‑label 20 golden cases)
[ ] On model update PRs: run full offline evals + pairwise on 50% slice
[ ] Keep last 2 known‑good model snapshots + prompts for rollback
[ ] Refresh golden sets monthly (add 5–10 fresh, risky cases)
[ ] Track provider‑side safety/policy changes that may flip hard‑fail logic
Owner: [NAME]. Review cadence: [DAY_OF_WEEK].
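The first two checklist items are enforceable in code. A tiny startup guard, sketched under the assumption that model IDs follow the PROVIDER:MODEL@SNAPSHOT format above:

import os

GENERATOR = os.environ["GENERATOR"]  # e.g., [PROVIDER:MODEL@SNAPSHOT]
JUDGE = os.environ["JUDGE"]

assert "@" in GENERATOR and "@" in JUDGE, "pin exact snapshots, not floating model IDs"
gen_family = GENERATOR.split(":", 1)[0]
judge_family = JUDGE.split(":", 1)[0]
assert gen_family != judge_family, "generator and judge must come from different families"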
Cost + latency tracking spec
Log token usage and latency per job so cost math is trivial on Monday.
Required fields per job:
- job_id, test_type, started_at, completed_at
- input_tokens, output_tokens
- provider, model
- error (nullable)
Cost formula (per job):
cost_job = (input_tokens/1000 * PRICE_IN_PER_1K) + (output_tokens/1000 * PRICE_OUT_PER_1K)
p95 latency (ms): compute in your warehouse or scripts/compute_metrics.py and paste to scorecard.
Tip: Build a small SQL SELECT over the last 7 days and export just the CSV columns the scorecard needs.
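A sketch of the Monday math over such a CSV export with the fields above (the file name and prices are placeholders):

import csv
from datetime import datetime

PRICE_IN_PER_1K, PRICE_OUT_PER_1K = 0.0005, 0.0015  # placeholders: fill from provider pricing

def ms(start, end):  # assumes ISO-8601 timestamps
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() * 1000

with open("jobs_last_7d.csv") as f:  # placeholder export name
    rows = [r for r in csv.DictReader(f) if not r["error"]]

latencies = sorted(ms(r["started_at"], r["completed_at"]) for r in rows)
p95 = latencies[int(0.95 * (len(latencies) - 1))]
costs = [int(r["input_tokens"]) / 1000 * PRICE_IN_PER_1K
         + int(r["output_tokens"]) / 1000 * PRICE_OUT_PER_1K for r in rows]
print({"p95_latency_ms": round(p95), "avg_cost_per_job": round(sum(costs) / len(costs), 4)})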
Extraction metrics helper (EM/F1) – optional
Drop in this starter script and adapt it to your data source. It computes per-field EM/F1 for extraction cases and an overall pass rate.
scripts/compute_metrics.py
#!/usr/bin/env python3
"""Per-field EM/F1 and overall pass rate for extraction cases (JSONL on stdin)."""
import json, sys, collections

def tokens(s):
    return str(s).lower().split()

def f1(pred, gold):
    p, g = tokens(pred), tokens(gold)
    if not p and not g: return 1.0
    if not p or not g: return 0.0
    inter = collections.Counter(p) & collections.Counter(g)
    tp = sum(inter.values())
    prec, rec = tp / len(p), tp / len(g)
    return 0.0 if (prec + rec) == 0 else 2 * prec * rec / (prec + rec)

cases = [json.loads(line) for line in sys.stdin if line.strip()]
field_scores = collections.defaultdict(list)
passes = 0
for c in cases:
    exp, pred = c.get('expected', {}), c.get('output', {})
    hard_fail = False
    case_ems = []
    for k in exp:
        em = 1.0 if str(exp[k]) == str(pred.get(k)) else 0.0
        f = f1(pred.get(k, ''), exp[k]) if isinstance(exp[k], str) else em
        field_scores[k].append((em, f))
        case_ems.append(em)
        # Example hard fail: email field without an '@'
        if k == 'email' and '@' not in str(pred.get(k, '')):
            hard_fail = True
    # Strict default: a case passes only if every expected field matches exactly
    passes += int(all(em == 1.0 for em in case_ems) and not hard_fail)

print(json.dumps({
    'field_em_f1': {k: {'em': sum(em for em, _ in v) / len(v),
                        'f1': sum(f for _, f in v) / len(v)}
                    for k, v in field_scores.items()},
    'pass_rate': passes / len(cases) if cases else 0.0
}, indent=2))
Usage in config: pipe the dataset with predictions into this script and parse pass_rate from its JSON output.
Weekly rhythm (Lisbon‑proof)
Use this exact cadence to keep the loop light and reliable while traveling.
- Friday (15 min): Add 3–5 fresh cases per dataset from production traces. Commit.
- Sunday (10 min): Open PR with any prompt/model changes; CI must pass thresholds.
- Monday (30 min): Update scorecard; decide Roll Forward/Hold/Roll Back; assign one action.
- Daily: Alert on p95 latency > [MS] or Cost / 100 jobs > [$]; investigate before EOD.
- Monthly: Refresh golden sets; close stale failures; rotate judge if agreement < [TARGET_%].