Template

Minimal Evals Loop Starter Kit (Golden Sets + AI‑Judge + Pairwise A/B)

A copy‑and‑ship template to stand up a minimal LLM evals loop: JSONL golden sets (email rewrite, JSON extraction, summarization), GEval‑style judges, a blind pairwise A/B judge, a promptfoo CI regression gate, a Monday Notion scorecard, and a deprecations checklist for drift control.

Copy this into your repo or Notion and fill in the [BRACKETS]. The kit gives you three shippable eval types (email rewrite, JSON extraction, summarization), GEval‑style judge prompts, a pairwise A/B judge, a CI regression gate, and a Monday scorecard. Keep the generator and judge in different model families, randomize A/B order, and human‑review 10–20% of live jobs weekly.

Repo scaffold (copy/paste)

Drop this structure anywhere in your app repo or a separate /t/evals-loop folder. Keep datasets in JSONL for easy tooling interop.

Folder tree

/t/evals-loop/
  datasets/
    email_rewrite.jsonl
    json_extraction.jsonl
    summarization.jsonl
    schemas/
      extraction.schema.json
  judges/
    rubric_email.md
    rubric_summarization.md
    pairwise_ab.md
  ci/
    promptfoo/promptfooconfig.yaml
    github-actions.yaml
  scripts/
    compute_metrics.py
    sample_human_review.md
  notion/
    Weekly Scorecard (properties).md
  CHECKLIST_deprecations.md
  README.md

Golden set: Email rewrite (email_rewrite.jsonl)

Each line is one test case. Keep the data small, representative, and slightly adversarial. Tag risky patterns (tone, jargon, redaction) so you can slice results later.

{"id":"[CASE_ID]","input_email":"[PASTE_RAW_EMAIL]","instruction":"Rewrite to [TONE/TASK] in ≤[N] words; keep [CONSTRAINTS]","expected":"[TARGET_REWRITE_OR_REFERENCE]","tags":["tone:[CASUAL/FORMAL]","region:[US/EU]","pii:[YES/NO]"]}
{"id":"[CASE_ID]","input_email":"...","instruction":"...","expected":"...","tags":["..."]}

Pass criteria example (for judges below):

  • Instruction‑following ≥ [THRESHOLD_1] (e.g., 0.8)
  • Tone fit ≥ [THRESHOLD_2] (e.g., 0.75)
  • Clarity ≥ [THRESHOLD_3]
  • No policy/PII leaks (hard fail)

Golden set: JSON extraction (json_extraction.jsonl + schemas/extraction.schema.json)

Define a single JSON Schema for validation, then pair each input with the expected object. Use EM/F1 per field for softer checks (dates, names).

Schema + example (copy as‑is, then edit)

extraction.schema.json

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://[YOUR_DOMAIN]/schemas/extraction.schema.json",
  "title": "[ENTITY] Extraction",
  "type": "object",
  "required": ["name","email","amount","currency"],
  "properties": {
    "name": {"type": "string", "minLength": 1},
    "email": {"type": "string", "format": "email"},
    "amount": {"type": "number", "minimum": 0},
    "currency": {"type": "string", "enum": ["USD","EUR","GBP"]},
    "invoice_date": {"type": "string", "format": "date"}
  },
  "additionalProperties": false
}

json_extraction.jsonl

{"id":"[CASE_ID]","input":"[RAW_TEXT_OR_OCR]","expected":{"name":"[NAME]","email":"[EMAIL]","amount":[NNN.NN],"currency":"[ISO]","invoice_date":"[YYYY-MM-DD]"},"tags":["ocr:[YES/NO]","lang:[EN/ES]"]}

Scoring plan:

  • Schema validation: pass/fail
  • Field‑level Exact Match and token‑level F1 for [FIELDS] (use compute_metrics.py)
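A minimal scoring sketch for this plan (assumes the jsonschema package is installed and predictions are already parsed into dicts; token-level F1 lives in compute_metrics.py):

import json
import jsonschema  # pip install jsonschema

def score_extraction(schema_path, expected, predicted):
  """Return (schema_ok, per-field exact match) for one extraction case."""
  with open(schema_path, encoding="utf-8") as fh:
    schema = json.load(fh)
  try:
    # FormatChecker enables "email"/"date" checks; some formats need optional extras
    jsonschema.validate(instance=predicted, schema=schema,
                        format_checker=jsonschema.FormatChecker())
    schema_ok = True
  except jsonschema.ValidationError:
    schema_ok = False
  field_em = {k: float(str(predicted.get(k)) == str(v)) for k, v in expected.items()}
  return schema_ok, field_em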

Golden set: Summarization (summarization.jsonl)

Pair each source with its reference summary or QA question. Include tricky, fact-dense passages and noisy inputs.

{"id":"[CASE_ID]","document":"[SOURCE_TEXT]","instruction":"Summarize for [AUDIENCE] in ≤[N] bullets. Must include: [KEY_FACTS].","expected":"[REFERENCE_SUMMARY]","tags":["factuality:high","length:short"]}

Pass criteria: Factuality ≥ [THRESHOLD], Relevance ≥ [THRESHOLD], No fabricated numbers (hard fail).

GEval‑style AI‑judge prompts (analysis‑then‑score)

Use analysis‑then‑score and return strict JSON. Keep the judge in a different model family from the generator.

rubric_email.md

System: You are an impartial writing quality judge. Blind to model names. First analyze, then score. Be concise.

User:
Task: Evaluate the CANDIDATE email against the SOURCE + INSTRUCTION using this rubric (0.0–1.0 each):
1) Instruction‑Following: obeys constraints (length, CTA, do/don’t)
2) Tone Fit: matches target tone + audience
3) Clarity: structure, readability, plain language
4) Safety/Policy: PII leaks, false claims, disallowed content (hard fail)

Return JSON only:
{
  "analysis": "1–3 sentences citing specific lines",
  "scores": {"instruction": [0–1], "tone": [0–1], "clarity": [0–1]},
  "hard_fail": [true|false],
  "pass": [true|false],
  "pass_reason": "short reason"
}

SOURCE:
{{source_text}}

INSTRUCTION:
{{instruction}}

CANDIDATE:
{{candidate_text}}

rubric_summarization.md

System: Impartial summarization judge. Blind to model names. Analyze, then score.

User:
Rubric (0.0–1.0):
- Factuality (faithful to source)
- Relevance (includes required facts, omits fluff)
- Conciseness (meets length)
JSON output: {"analysis":"...","scores":{"factuality":x,"relevance":y,"conciseness":z},"hard_fail":false,"pass":true,"pass_reason":"..."}

SOURCE:
{{document}}

REQUIREMENTS:
{{instruction}}

CANDIDATE:
{{candidate_text}}

Pass decision default:

  • Email: pass if instruction≥[0.8] AND tone≥[0.75] AND clarity≥[0.75] AND hard_fail=false
  • Summarization: pass if factuality≥[0.85] AND relevance≥[0.8] AND conciseness within ±[10]% of target
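A minimal sketch of that decision applied to the judge's JSON output (thresholds mirror the defaults above; adjust per task):

EMAIL_THRESHOLDS = {"instruction": 0.80, "tone": 0.75, "clarity": 0.75}
SUMM_THRESHOLDS = {"factuality": 0.85, "relevance": 0.80}
LENGTH_TOLERANCE = 0.10  # ±10% of target length

def email_pass(judge: dict) -> bool:
  """Email pass decision from one judge JSON response."""
  if judge.get("hard_fail"):
    return False
  scores = judge.get("scores", {})
  return all(scores.get(k, 0.0) >= t for k, t in EMAIL_THRESHOLDS.items())

def summarization_pass(judge: dict, target_len: int, actual_len: int) -> bool:
  """Summarization pass decision, including the ±10% length window."""
  if judge.get("hard_fail"):
    return False
  scores = judge.get("scores", {})
  length_ok = abs(actual_len - target_len) <= LENGTH_TOLERANCE * target_len
  return length_ok and all(scores.get(k, 0.0) >= t for k, t in SUMM_THRESHOLDS.items())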

Pairwise A/B judge (blind order)

Blind A/B with randomized order. The judge must choose A, B, or Tie and explain briefly.

pairwise_ab.md

System: You are an impartial A/B writing judge. Blind to model names. Compare two candidates for the same task. Analyze briefly, then pick a winner.

User:
Task: Choose the better candidate against the SOURCE + INSTRUCTION using this priority: (1) Instruction‑Following, (2) Factuality/Safety, (3) Tone Fit, (4) Clarity.
Return JSON only:
{
  "analysis": "2–4 sentences comparing A vs B",
  "winner": "A"|"B"|"Tie",
  "confidence": 0.0–1.0
}

SOURCE:
{{source_text_or_document}}

INSTRUCTION:
{{instruction}}

CANDIDATE_A:
{{candidate_a}}

CANDIDATE_B:
{{candidate_b}}

Runtime rule: Randomize which system output is A vs B per case; store mapping for win‑rate math.
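A minimal sketch of that rule (judge_pairwise is a placeholder for your judge call; counting ties as half a win is one common convention):

import random

def run_pairwise(case, output_sys1, output_sys2, judge_pairwise):
  """Randomize A/B order per case, call the judge, and map the verdict back to systems."""
  flip = random.random() < 0.5
  a, b = (output_sys2, output_sys1) if flip else (output_sys1, output_sys2)
  verdict = judge_pairwise(source=case["source"], instruction=case["instruction"],
                           candidate_a=a, candidate_b=b)  # expects {"winner": "A"|"B"|"Tie", ...}
  mapping = {"A": "sys2" if flip else "sys1", "B": "sys1" if flip else "sys2"}
  return {"case_id": case["id"], "flip": flip,
          "winner_system": mapping.get(verdict["winner"], "Tie"),
          "confidence": verdict.get("confidence")}

def win_rate(results, system="sys1"):
  """Win rate for one system; ties count as half a win."""
  scored = [1.0 if r["winner_system"] == system else 0.5 if r["winner_system"] == "Tie" else 0.0
            for r in results]
  return sum(scored) / len(scored) if scored else 0.0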

CI regression gate (promptfoo + GitHub Actions)

This is a minimal config you can run locally and in CI. It evaluates your system’s outputs against the golden sets using the judges above and fails the build on regression.

ci/promptfoo/promptfooconfig.yaml

version: 1

providers:
  # Judge provider lives in a different family from your generator
  - id: judge
    provider: [JUDGE_PROVIDER_ID]   # e.g., anthropic:claude-3-haiku or openai:gpt-4o-mini
    config:
      apiKeyEnv: [JUDGE_API_KEY_ENV]

# Targets are your systems under test (SUTs). Use exec/http to call them.
targets:
  - id: A
    type: exec
    command: [CMD_TO_RUN_SYSTEM_A]   # e.g., node scripts/run_a.mjs
  - id: B
    type: exec
    command: [CMD_TO_RUN_SYSTEM_B]

scorers:
  # Rubric judges for pass/fail
  - id: email_rubric
    type: llm-rubric
    provider: judge
    promptPath: ../../judges/rubric_email.md
    mapping:
      source_text: input_email
      instruction: instruction
      candidate_text: output
  - id: sum_rubric
    type: llm-rubric
    provider: judge
    promptPath: ../../judges/rubric_summarization.md
    mapping:
      document: document
      instruction: instruction
      candidate_text: output
  # Pairwise judge for win rate (A vs B)
  - id: pairwise_ab
    type: llm-pairwise
    provider: judge
    promptPath: ../../judges/pairwise_ab.md
    mapping:
      source_text_or_document: source
      instruction: instruction
      candidate_a: output_A
      candidate_b: output_B

# Datasets
datasets:
  - id: email
    path: ../../datasets/email_rewrite.jsonl
  - id: extraction
    path: ../../datasets/json_extraction.jsonl
  - id: summarization
    path: ../../datasets/summarization.jsonl

# Evaluations
runs:
  - dataset: email
    targets: [A]
    scorers: [email_rubric]
  - dataset: summarization
    targets: [A]
    scorers: [sum_rubric]
  - dataset: extraction
    targets: [A]
    asserts:
      - type: json-schema
        schemaPath: ../../datasets/schemas/extraction.schema.json
      - type: python
        script: ../../scripts/compute_metrics.py   # emits EM/F1 per field

  # Pairwise A/B on a smaller slice
  - dataset: email
    sample: 50%
    targets: [A,B]
    scorers: [pairwise_ab]

thresholds:
  passRate: 
    email: 
      min: [0.90]
    summarization:
      min: [0.88]
  winRate:
    pairwise_ab:
      A: {min: [0.52]}   # require A to win >52% to ship

GitHub Actions (ci/github-actions.yaml)

name: evals
on:
  pull_request:
  workflow_dispatch:
jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: {node-version: '20'}
      - run: npm ci
      - run: npx promptfoo eval --config ./t/evals-loop/ci/promptfoo/promptfooconfig.yaml
      - name: Fail on regression
        run: |
          if [ -f .promptfoo/summary.json ]; then cat .promptfoo/summary.json; fi
          # promptfoo exits non-zero if thresholds not met

Local smoke test:

npx promptfoo eval --config ./t/evals-loop/ci/promptfoo/promptfooconfig.yaml --output .promptfoo/local

Monday scorecard (Notion template)

Create a Notion database called “LLM Evals – Weekly Scorecard”. Add these properties and formulas. Duplicate weekly on Monday.

Properties (columns):

  • Week (Date)
  • Model (Text) – e.g., [GENERATOR_MODEL_ID]
  • Judge (Text) – e.g., [JUDGE_MODEL_ID]
  • Datasets (Multi-select): Email, Extraction, Summarization
  • Pass Rate – Email (%) [Number]
  • Pass Rate – Summ (%) [Number]
  • Win Rate – A vs B (%) [Number]
  • p95 Latency (ms) [Number]
  • Cost / Job ($) [Number]
  • Cost / 100 Jobs ($) [Formula]
  • Judge Agreement (%) [Number]
  • Human Sample Accuracy (%) [Number]
  • Incidents (Text)
  • Decision (Select): Roll Forward, Hold, Roll Back
  • Notes (Text)

Formulas:

  • Cost / 100 Jobs = prop("Cost / Job ($)") * 100

How to fill it fast (Monday 30 min):

  1. Export last week’s .promptfoo/summary.json and latency/cost logs from [OBSERVABILITY_TOOL] or app logs.
  2. Paste pass/win rates and judge agreement.
  3. Compute Cost / Job from token usage exports (see “Cost tracking spec”).
  4. Paste p95 latency from logs.
  5. Enter Human Sample Accuracy from your review sample.
  6. Set Decision and owner for next action.

Human sampling rules (10–20%)

Use this for drift calibration without blowing up your week. Start at 10%, go to 20% during changes or incidents.

Sampling plan:

  • Weekly sample size: MAX(ceil(0.1 * [WEEKLY_JOB_COUNT]), [MIN_SAMPLE]) → aim for 10–20%.
  • Randomization: reservoir or rand() < p on job IDs; exclude golden‑set traffic.
  • Focused strata: always include borderline judge cases (confidence in [0.45–0.55]) and all judge ties.
  • Escalation: any hard‑fail from the AI judge (policy/PII) → human‑review 100% of that segment until it has been clean for 2 consecutive weeks.
  • Reviewer guide: use scripts/sample_human_review.md with the same rubric, record pass/fail and comments.

Tracking fields per reviewed job:

  • job_id, created_at, test_type, candidate, judge_pass, human_pass, mismatch (true/false), mismatch_reason, followup_action.

Target: Human Sample Accuracy ≥ [TARGET_% e.g., 95%] alignment with judge decisions. If < target two weeks in a row → tighten thresholds or change judge model.
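A minimal sampling sketch implementing the plan above (assumes each job record carries judge_confidence and judge_tie fields; names are illustrative):

import math, random

def weekly_review_sample(jobs, rate=0.10, min_sample=25):
  """Random 10–20% slice plus all borderline-confidence and tie cases from the judge."""
  forced = [j for j in jobs
            if j.get("judge_tie") or 0.45 <= j.get("judge_confidence", 1.0) <= 0.55]
  forced_ids = {j["job_id"] for j in forced}
  remaining = [j for j in jobs if j["job_id"] not in forced_ids]
  n = max(math.ceil(rate * len(jobs)), min_sample)
  extra = max(n - len(forced), 0)
  return forced + random.sample(remaining, min(extra, len(remaining)))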

Deprecations watch checklist (CHECKLIST_deprecations.md)

Keep generator and judge models from different families. Pin exact snapshots where possible. Rehearse rollbacks.

Include this file in your repo and review monthly.

[ ] Pin exact model IDs in env: GENERATOR=[PROVIDER:MODEL@SNAPSHOT], JUDGE=[PROVIDER:MODEL@SNAPSHOT]
[ ] Subscribe to provider changelogs + deprecations pages
[ ] Add calendar reminder: quarterly judge recalibration (re‑label 20 golden cases)
[ ] On model update PRs: run full offline evals + pairwise on 50% slice
[ ] Keep last 2 known‑good model snapshots + prompts for rollback
[ ] Refresh golden sets monthly (add 5–10 fresh, risky cases)
[ ] Track provider‑side safety/policy changes that may flip hard‑fail logic

Owner: [NAME]. Review cadence: [DAY_OF_WEEK].
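A tiny startup guard for the pinning rule (env names follow the checklist; the family check only compares provider prefixes, so adjust to how you encode model IDs):

import os

def assert_models_pinned():
  """Fail fast if generator/judge are unpinned or share a provider family."""
  gen = os.environ.get("GENERATOR", "")
  judge = os.environ.get("JUDGE", "")
  for name, value in (("GENERATOR", gen), ("JUDGE", judge)):
    if "@" not in value:
      raise RuntimeError(f"{name} must pin an exact snapshot, e.g. provider:model@snapshot")
  if gen.split(":", 1)[0] == judge.split(":", 1)[0]:
    raise RuntimeError("Generator and judge should come from different model families")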

Cost + latency tracking spec

Log token usage and latency per job so cost math is trivial on Monday.

Required fields per job:

  • job_id, test_type, started_at, completed_at
  • input_tokens, output_tokens
  • provider, model
  • error (nullable)

Cost formula (per job):

cost_job = (input_tokens/1000 * PRICE_IN_PER_1K) + (output_tokens/1000 * PRICE_OUT_PER_1K)

p95 latency (ms): compute in your warehouse or scripts/compute_metrics.py and paste to scorecard.
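A minimal sketch of the cost and p95 math from those fields (prices are placeholders from your provider's price sheet; timestamps assumed ISO‑8601):

import math
from datetime import datetime

PRICE_IN_PER_1K = 0.0   # fill in
PRICE_OUT_PER_1K = 0.0  # fill in

def cost_job(job):
  """Cost of one job from logged token counts."""
  return (job["input_tokens"] / 1000 * PRICE_IN_PER_1K
          + job["output_tokens"] / 1000 * PRICE_OUT_PER_1K)

def latency_ms(job):
  """Wall-clock latency from started_at / completed_at."""
  start = datetime.fromisoformat(job["started_at"])
  end = datetime.fromisoformat(job["completed_at"])
  return (end - start).total_seconds() * 1000

def p95(values):
  """Nearest-rank p95, good enough for the weekly scorecard."""
  ordered = sorted(values)
  return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)] if ordered else 0.0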

Tip: Build a tiny SELECT for last 7 days and export CSV columns you need for the scorecard.

Extraction metrics helper (EM/F1) – optional

Drop this starter script; adapt to your data source. It computes EM/F1 for extraction fields and summarizes pass rates.

scripts/compute_metrics.py

#!/usr/bin/env python3
import json, sys, collections

def tokens(s):
  return str(s).lower().split()

def f1(pred, gold):
  p, g = tokens(pred), tokens(gold)
  if not p and not g: return 1.0
  if not p or not g: return 0.0
  inter = collections.Counter(p) & collections.Counter(g)
  tp = sum(inter.values())
  prec, rec = tp/len(p), tp/len(g)
  return 0.0 if (prec+rec) == 0 else 2*prec*rec/(prec+rec)

cases = [json.loads(l) for l in sys.stdin if l.strip()]
field_scores = collections.defaultdict(list)
passes = 0
for c in cases:
  exp, pred = c.get('expected', {}), c.get('output', {})
  hard_fail = False
  case_ems = []
  for k in exp:
    em = 1.0 if str(exp[k]) == str(pred.get(k)) else 0.0
    f = f1(pred.get(k, ''), exp[k]) if isinstance(exp[k], str) else em
    field_scores[k].append((em, f))
    case_ems.append(em)
    # example hard fail: email field missing an '@'
    if k == 'email' and '@' not in str(pred.get(k, '')): hard_fail = True
  # a case passes only if every expected field matches exactly and nothing hard-fails
  passes += int(bool(case_ems) and all(em == 1.0 for em in case_ems) and not hard_fail)

print(json.dumps({
  'field_em_f1': {k: {'em': sum(em for em, _ in v)/len(v), 'f1': sum(f for _, f in v)/len(v)} for k, v in field_scores.items()},
  'pass_rate': passes/len(cases) if cases else 0.0
}, indent=2))

Usage in config: pipe the dataset with predictions into this script; parse pass_rate.

Weekly rhythm (Lisbon‑proof)

Use this exact cadence to keep the loop light and reliable while traveling.

  • Friday (15 min): Add 3–5 fresh cases per dataset from production traces. Commit.
  • Sunday (10 min): Open PR with any prompt/model changes; CI must pass thresholds.
  • Monday (30 min): Update scorecard; decide Roll Forward/Hold/Roll Back; assign one action.
  • Daily: Alert on p95 latency > [MS] or Cost / 100 jobs > [$]; investigate before EOD.
  • Monthly: Refresh golden sets; close stale failures; rotate judge if agreement < [TARGET_%].