Minimal Evals Loop Starter Kit (Golden Sets + AI‑Judge + Pairwise A/B)
A copy‑and‑ship template to stand up a minimal LLM evals loop: JSONL golden sets (email rewrite, JSON extraction, summarization), GEval‑style judges, a blind pairwise A/B judge, a promptfoo CI regression gate, a Monday Notion scorecard, and a deprecations checklist for drift control.
Copy this into your repo or Notion and fill in the [BRACKETS]. The kit gives you three shippable eval types (email rewrite, JSON extraction, summarization), GEval‑style judge prompts, a pairwise A/B judge, a CI regression gate, and a Monday scorecard. Keep the generator and judge in different model families, randomize A/B order, and human‑review 10–20% of live jobs weekly.
Repo scaffold (copy/paste)
Drop this structure anywhere in your app repo or a separate /t/evals-loop folder. Keep datasets in JSONL for easy tooling interop.
Folder tree
/t/evals-loop/
  datasets/
    email_rewrite.jsonl
    json_extraction.jsonl
    summarization.jsonl
  schemas/
    extraction.schema.json
  judges/
    rubric_email.md
    rubric_summarization.md
    pairwise_ab.md
  ci/
    promptfoo/promptfooconfig.yaml
    github-actions.yaml
  scripts/
    compute_metrics.py
    sample_human_review.md
  notion/
    Weekly Scorecard (properties).md
  CHECKLIST_deprecations.md
  README.md
Golden set: Email rewrite (email_rewrite.jsonl)
Each line is one test case. Use a small, representative, and slightly adversarial set. Tag risky patterns (tone, jargon, redaction) so you can slice results later.
{"id":"[CASE_ID]","input_email":"[PASTE_RAW_EMAIL]","instruction":"Rewrite to [TONE/TASK] in ≤[N] words; keep [CONSTRAINTS]","expected":"[TARGET_REWRITE_OR_REFERENCE]","tags":["tone:[CASUAL/FORMAL]","region:[US/EU]","pii:[YES/NO]"]}
{"id":"[CASE_ID]","input_email":"...","instruction":"...","expected":"...","tags":["..."]}
Pass criteria example (for judges below):
- Instruction‑following ≥ [THRESHOLD_1] (e.g., 0.8)
- Tone fit ≥ [THRESHOLD_2] (e.g., 0.75)
- Clarity ≥ [THRESHOLD_3]
- No policy/PII leaks (hard fail)
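Before wiring a golden set into CI, a quick loader catches malformed lines and missing keys early. A minimal sketch (hypothetical helper, assuming the email-rewrite field names above):

import json, sys

REQUIRED = {"id", "input_email", "instruction", "expected", "tags"}

def load_golden(path):
    cases = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines
            case = json.loads(line)  # raises with details on malformed JSON
            missing = REQUIRED - case.keys()
            if missing:
                sys.exit(f"{path}:{n} missing fields: {sorted(missing)}")
            cases.append(case)
    return cases

print(f"{len(load_golden('datasets/email_rewrite.jsonl'))} cases OK")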
Golden set: JSON extraction (json_extraction.jsonl + schemas/extraction.schema.json)
Define a single JSON Schema for validation, then pair each input with the expected object. Use EM/F1 per field for softer checks (dates, names).
Schema + example (copy as‑is, then edit)
extraction.schema.json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://[YOUR_DOMAIN]/schemas/extraction.schema.json",
  "title": "[ENTITY] Extraction",
  "type": "object",
  "required": ["name", "email", "amount", "currency"],
  "properties": {
    "name": {"type": "string", "minLength": 1},
    "email": {"type": "string", "format": "email"},
    "amount": {"type": "number", "minimum": 0},
    "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    "invoice_date": {"type": "string", "format": "date"}
  },
  "additionalProperties": false
}
json_extraction.jsonl
{"id":"[CASE_ID]","input":"[RAW_TEXT_OR_OCR]","expected":{"name":"[NAME]","email":"[EMAIL]","amount":[NNN.NN],"currency":"[ISO]","invoice_date":"[YYYY-MM-DD]"},"tags":["ocr:[YES/NO]","lang:[EN/ES]"]}
Scoring plan:
- Schema validation: pass/fail
- Field‑level Exact Match and token‑level F1 for [FIELDS] (use compute_metrics.py)
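For the schema pass/fail step, a minimal sketch assuming the jsonschema package (any Draft 2020-12 validator works; email/date format checks only run when a FormatChecker is attached):

import json
from jsonschema import Draft202012Validator, FormatChecker

with open("schemas/extraction.schema.json") as f:
    schema = json.load(f)
validator = Draft202012Validator(schema, format_checker=FormatChecker())

def schema_pass(output: dict) -> bool:
    # Hard pass/fail: any validation error fails the case
    return next(validator.iter_errors(output), None) is None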
Golden set: Summarization (summarization.jsonl)
Pair each source with its reference or QA question. Include tricky, factual passages and noisy inputs.
{"id":"[CASE_ID]","document":"[SOURCE_TEXT]","instruction":"Summarize for [AUDIENCE] in ≤[N] bullets. Must include: [KEY_FACTS].","expected":"[REFERENCE_SUMMARY]","tags":["factuality:high","length:short"]}
Pass criteria: Factuality ≥ [THRESHOLD], Relevance ≥ [THRESHOLD], No fabricated numbers (hard fail).
GEval‑style AI‑judge prompts (analysis‑then‑score)
Use analysis‑then‑score and return strict JSON. Keep the judge in a different model family from the generator.
rubric_email.md
System: You are an impartial writing quality judge. Blind to model names. First analyze, then score. Be concise.
User:
Task: Evaluate the CANDIDATE email against the SOURCE + INSTRUCTION using this rubric (0.0–1.0 each):
1) Instruction‑Following: obeys constraints (length, CTA, do/don’t)
2) Tone Fit: matches target tone + audience
3) Clarity: structure, readability, plain language
4) Safety/Policy: PII leaks, false claims, disallowed content (hard fail)
Return JSON only:
{
  "analysis": "1–3 sentences citing specific lines",
  "scores": {"instruction": [0–1], "tone": [0–1], "clarity": [0–1]},
  "hard_fail": [true|false],
  "pass": [true|false],
  "pass_reason": "short reason"
}
SOURCE:
{{source_text}}
INSTRUCTION:
{{instruction}}
CANDIDATE:
{{candidate_text}}
rubric_summarization.md
System: Impartial summarization judge. Blind to model names. Analyze, then score.
User:
Rubric (0.0–1.0):
- Factuality (faithful to source)
- Relevance (includes required facts, omits fluff)
- Conciseness (meets length)
JSON output: {"analysis":"...","scores":{"factuality":x,"relevance":y,"conciseness":z},"hard_fail":false,"pass":true,"pass_reason":"..."}
SOURCE:
{{document}}
REQUIREMENTS:
{{instruction}}
CANDIDATE:
{{candidate_text}}
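Both rubrics demand JSON only, but judges sometimes wrap the object in code fences or preamble anyway. A defensive parse (minimal sketch, hypothetical helper) avoids silent scoring failures:

import json, re

def parse_judge_reply(text: str) -> dict:
    match = re.search(r"\{.*\}", text, re.DOTALL)  # outermost braces; fences and prose ignored
    if not match:
        raise ValueError("no JSON object found in judge reply")
    return json.loads(match.group(0))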
Pass decision default:
- Email: pass if instruction≥[0.8] AND tone≥[0.75] AND clarity≥[0.75] AND hard_fail=false
- Summarization: pass if factuality≥[0.85] AND relevance≥[0.8] AND conciseness within ±[10]% of target
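Wired into code, these defaults are a couple of comparisons. A sketch assuming the judge JSON shapes above, with the example thresholds filled in and summary lengths passed in separately:

def email_pass(j: dict) -> bool:
    s = j["scores"]
    return (s["instruction"] >= 0.80 and s["tone"] >= 0.75
            and s["clarity"] >= 0.75 and not j["hard_fail"])

def summarization_pass(j: dict, target_len: int, actual_len: int) -> bool:
    s = j["scores"]
    within_length = abs(actual_len - target_len) <= 0.10 * target_len  # ±10% of target
    return (s["factuality"] >= 0.85 and s["relevance"] >= 0.80
            and within_length and not j["hard_fail"])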
Pairwise A/B judge (blind order)
Blind A/B with randomized order. The judge must choose A, B, or Tie and explain briefly.
pairwise_ab.md
System: You are an impartial A/B writing judge. Blind to model names. Compare two candidates for the same task. Analyze briefly, then pick a winner.
User:
Task: Choose the better candidate against the SOURCE + INSTRUCTION using this priority: (1) Instruction‑Following, (2) Factuality/Safety, (3) Tone Fit, (4) Clarity.
Return JSON only:
{
  "analysis": "2–4 sentences comparing A vs B",
  "winner": "A"|"B"|"Tie",
  "confidence": 0.0–1.0
}
SOURCE:
{{source_text_or_document}}
INSTRUCTION:
{{instruction}}
CANDIDATE_A:
{{candidate_a}}
CANDIDATE_B:
{{candidate_b}}
Runtime rule: Randomize which system output is A vs B per case; store mapping for win‑rate math.
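A sketch of that rule with hypothetical record shapes: flip a fair coin per case, store the mapping, then un-flip when computing each system's win rate:

import random

def assign_order(case_id, out_sys1, out_sys2):
    flipped = random.random() < 0.5
    a, b = (out_sys2, out_sys1) if flipped else (out_sys1, out_sys2)
    return {"case_id": case_id, "A": a, "B": b, "flipped": flipped}

def sys1_win_rate(results):
    # results: [{"winner": "A" | "B" | "Tie", "flipped": bool}, ...]
    decided = [r for r in results if r["winner"] != "Tie"]
    wins = sum((r["winner"] == "B") == r["flipped"] for r in decided)
    return wins / len(decided) if decided else 0.5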
CI regression gate (promptfoo + GitHub Actions)
This is a minimal promptfoo-style config for local and CI runs: it evaluates your system’s outputs against the golden sets using the judges above and fails the build on regression. Treat the keys below as a sketch and check them against the config schema of your installed promptfoo version.
ci/promptfoo/promptfooconfig.yaml
version: 1
providers:
  # Judge provider lives in a different family from your generator
  - id: judge
    provider: [JUDGE_PROVIDER_ID] # e.g., anthropic:claude-3-haiku or openai:gpt-4o-mini
    config:
      apiKeyEnv: [JUDGE_API_KEY_ENV]
# Targets are your systems under test (SUTs). Use exec/http to call them.
targets:
  - id: A
    type: exec
    command: [CMD_TO_RUN_SYSTEM_A] # e.g., node scripts/run_a.mjs
  - id: B
    type: exec
    command: [CMD_TO_RUN_SYSTEM_B]
scorers:
  # Rubric judges for pass/fail
  - id: email_rubric
    type: llm-rubric
    provider: judge
    promptPath: ../../judges/rubric_email.md
    mapping:
      source_text: input_email
      instruction: instruction
      candidate_text: output
  - id: sum_rubric
    type: llm-rubric
    provider: judge
    promptPath: ../../judges/rubric_summarization.md
    mapping:
      document: document
      instruction: instruction
      candidate_text: output
  # Pairwise judge for win rate (A vs B)
  - id: pairwise_ab
    type: llm-pairwise
    provider: judge
    promptPath: ../../judges/pairwise_ab.md
    mapping:
      source_text_or_document: source
      instruction: instruction
      candidate_a: output_A
      candidate_b: output_B
# Datasets
datasets:
  - id: email
    path: ../../datasets/email_rewrite.jsonl
  - id: extraction
    path: ../../datasets/json_extraction.jsonl
  - id: summarization
    path: ../../datasets/summarization.jsonl
# Evaluations
runs:
  - dataset: email
    targets: [A]
    scorers: [email_rubric]
  - dataset: summarization
    targets: [A]
    scorers: [sum_rubric]
  - dataset: extraction
    targets: [A]
    asserts:
      - type: json-schema
        schemaPath: ../../schemas/extraction.schema.json
      - type: python
        script: ../../scripts/compute_metrics.py # emits EM/F1 per field
  # Pairwise A/B on a smaller slice
  - dataset: email
    sample: 50%
    targets: [A, B]
    scorers: [pairwise_ab]
thresholds:
  passRate:
    email:
      min: [0.90]
    summarization:
      min: [0.88]
  winRate:
    pairwise_ab:
      A: {min: [0.52]} # require A to win >52% to ship
GitHub Actions (ci/github-actions.yaml)
name: evals
on:
  pull_request:
  workflow_dispatch:
jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: {node-version: '20'}
      - run: npm ci
      - run: npx promptfoo eval --config ./t/evals-loop/ci/promptfoo/promptfooconfig.yaml
      - name: Fail on regression
        run: |
          if [ -f .promptfoo/summary.json ]; then cat .promptfoo/summary.json; fi
          # promptfoo exits non-zero if thresholds not met
Local smoke test:
npx promptfoo eval --config ./t/evals-loop/ci/promptfoo/promptfooconfig.yaml --output .promptfoo/local
Monday scorecard (Notion template)
Create a Notion database called “LLM Evals – Weekly Scorecard”. Add these properties and formulas. Duplicate weekly on Monday.
Properties (columns):
- Week (Date)
- Model (Text) – e.g., [GENERATOR_MODEL_ID]
- Judge (Text) – e.g., [JUDGE_MODEL_ID]
- Datasets (Multi-select): Email, Extraction, Summarization
- Pass Rate – Email (%) [Number]
- Pass Rate – Summ (%) [Number]
- Win Rate – A vs B (%) [Number]
- p95 Latency (ms) [Number]
- Cost / Job ($) [Number]
- Cost / 100 Jobs ($) [Formula]
- Judge Agreement (%) [Number]
- Human Sample Accuracy (%) [Number]
- Incidents (Text)
- Decision (Select): Roll Forward, Hold, Roll Back
- Notes (Text)
Formulas:
- Cost / 100 Jobs = prop("Cost / Job ($)") * 100
How to fill it fast (Monday 30 min):
- Export last week’s .promptfoo/summary.json and latency/cost logs from [OBSERVABILITY_TOOL] or app logs.
- Paste pass/win rates and judge agreement.
- Compute Cost / Job from token usage exports (see “Cost tracking spec”).
- Paste p95 latency from logs.
- Enter Human Sample Accuracy from your review sample.
- Set Decision and owner for next action.
Human sampling rules (10–20%)
Use this for drift calibration without blowing up your week. Start at 10%, go to 20% during changes or incidents.
Sampling plan:
- Weekly sample size: max(ceil(0.1 * WEEKLY_JOB_COUNT), [MIN_SAMPLE]) → aim for 10–20%.
- Randomization: reservoir sampling or rand() < p on job IDs; exclude golden‑set traffic.
- Focused strata: always include borderline judge cases (confidence in [0.45–0.55]) and all judge ties.
- Escalation: any hard‑fail from the AI judge (policy/PII) → 100% human review for that segment until it stays clean for 2 weeks.
- Reviewer guide: use scripts/sample_human_review.md with the same rubric; record pass/fail and comments.
Tracking fields per reviewed job:
- job_id, created_at, test_type, candidate, judge_pass, human_pass, mismatch (true/false), mismatch_reason, followup_action.
Target: Human Sample Accuracy ≥ [TARGET_% e.g., 95%] alignment with judge decisions. If < target two weeks in a row → tighten thresholds or change judge model.
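A sketch of the sampling plan, assuming per-job records carry hypothetical judge_confidence and judge_winner fields:

import math, random

def pick_sample(jobs, rate=0.10, min_sample=25):  # min_sample is a placeholder
    forced = [j for j in jobs
              if 0.45 <= j.get("judge_confidence", 1.0) <= 0.55
              or j.get("judge_winner") == "Tie"]
    forced_ids = {j["job_id"] for j in forced}
    rest = [j for j in jobs if j["job_id"] not in forced_ids]
    n = max(math.ceil(rate * len(jobs)), min_sample)
    top_up = random.sample(rest, min(max(n - len(forced), 0), len(rest)))
    return forced + top_up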
Deprecations watch checklist (CHECKLIST_deprecations.md)
Keep generator and judge models from different families. Pin exact snapshots where possible. Rehearse rollbacks.
Include this file in your repo and review monthly.
[ ] Pin exact model IDs in env: GENERATOR=[PROVIDER:MODEL@SNAPSHOT], JUDGE=[PROVIDER:MODEL@SNAPSHOT]
[ ] Subscribe to provider changelogs + deprecations pages
[ ] Add calendar reminder: quarterly judge recalibration (re‑label 20 golden cases)
[ ] On model update PRs: run full offline evals + pairwise on 50% slice
[ ] Keep last 2 known‑good model snapshots + prompts for rollback
[ ] Refresh golden sets monthly (add 5–10 fresh, risky cases)
[ ] Track provider‑side safety/policy changes that may flip hard‑fail logic
Owner: [NAME]. Review cadence: [DAY_OF_WEEK].
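The first two checklist items are enforceable in code. A tiny startup guard, sketched under the assumption that model IDs follow the PROVIDER:MODEL@SNAPSHOT format above:

import os

GENERATOR = os.environ["GENERATOR"]  # e.g., [PROVIDER:MODEL@SNAPSHOT]
JUDGE = os.environ["JUDGE"]

assert "@" in GENERATOR and "@" in JUDGE, "pin exact snapshots, not floating model IDs"
gen_family = GENERATOR.split(":", 1)[0]
judge_family = JUDGE.split(":", 1)[0]
assert gen_family != judge_family, "generator and judge must come from different families"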
Cost + latency tracking spec
Log token usage and latency per job so cost math is trivial on Monday.
Required fields per job:
- job_id, test_type, started_at, completed_at
- input_tokens, output_tokens
- provider, model
- error (nullable)
Cost formula (per job):
cost_job = (input_tokens/1000 * PRICE_IN_PER_1K) + (output_tokens/1000 * PRICE_OUT_PER_1K)
p95 latency (ms): compute in your warehouse or scripts/compute_metrics.py and paste to scorecard.
Tip: Build a small SQL SELECT over the last 7 days and export just the CSV columns the scorecard needs.
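A sketch of the Monday math over such a CSV export with the fields above (the file name and prices are placeholders):

import csv
from datetime import datetime

PRICE_IN_PER_1K, PRICE_OUT_PER_1K = 0.0005, 0.0015  # placeholders: fill from provider pricing

def ms(start, end):  # assumes ISO-8601 timestamps
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() * 1000

with open("jobs_last_7d.csv") as f:  # placeholder export name
    rows = [r for r in csv.DictReader(f) if not r["error"]]

latencies = sorted(ms(r["started_at"], r["completed_at"]) for r in rows)
p95 = latencies[int(0.95 * (len(latencies) - 1))]
costs = [int(r["input_tokens"]) / 1000 * PRICE_IN_PER_1K
         + int(r["output_tokens"]) / 1000 * PRICE_OUT_PER_1K for r in rows]
print({"p95_latency_ms": round(p95), "avg_cost_per_job": round(sum(costs) / len(costs), 4)})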
Extraction metrics helper (EM/F1) – optional
Drop in this starter script and adapt it to your data source. It computes per-field EM/F1 for extraction cases and an overall pass rate.
scripts/compute_metrics.py
#!/usr/bin/env python3
"""Per-field EM/F1 and overall pass rate for extraction cases (JSONL on stdin)."""
import json, sys, collections

def tokens(s):
    return str(s).lower().split()

def f1(pred, gold):
    p, g = tokens(pred), tokens(gold)
    if not p and not g: return 1.0
    if not p or not g: return 0.0
    inter = collections.Counter(p) & collections.Counter(g)
    tp = sum(inter.values())
    prec, rec = tp / len(p), tp / len(g)
    return 0.0 if (prec + rec) == 0 else 2 * prec * rec / (prec + rec)

cases = [json.loads(line) for line in sys.stdin if line.strip()]
field_scores = collections.defaultdict(list)
passes = 0
for c in cases:
    exp, pred = c.get('expected', {}), c.get('output', {})
    hard_fail = False
    case_ems = []
    for k in exp:
        em = 1.0 if str(exp[k]) == str(pred.get(k)) else 0.0
        f = f1(pred.get(k, ''), exp[k]) if isinstance(exp[k], str) else em
        field_scores[k].append((em, f))
        case_ems.append(em)
        # Example hard fail: email field without an '@'
        if k == 'email' and '@' not in str(pred.get(k, '')):
            hard_fail = True
    # Strict default: a case passes only if every expected field matches exactly
    passes += int(all(em == 1.0 for em in case_ems) and not hard_fail)

print(json.dumps({
    'field_em_f1': {k: {'em': sum(em for em, _ in v) / len(v),
                        'f1': sum(f for _, f in v) / len(v)}
                    for k, v in field_scores.items()},
    'pass_rate': passes / len(cases) if cases else 0.0
}, indent=2))
Usage in config: pipe the dataset with predictions into this script and parse pass_rate from its JSON output.
Weekly rhythm (Lisbon‑proof)
Use this exact cadence to keep the loop light and reliable while traveling.
- Friday (15 min): Add 3–5 fresh cases per dataset from production traces. Commit.
- Sunday (10 min): Open PR with any prompt/model changes; CI must pass thresholds.
- Monday (30 min): Update scorecard; decide Roll Forward/Hold/Roll Back; assign one action.
- Daily: Alert on p95 latency > [MS] or Cost / 100 jobs > [$]; investigate before EOD.
- Monthly: Refresh golden sets; close stale failures; rotate judge if agreement < [TARGET_%].