Episode 7

Build a Minimal LLM Evaluation Loop That Catches Regressions While You Sleep

Intro

This episode is for nomad founders running client-facing AI workflows who need predictable quality without babysitting. You'll get a copy-and-ship evaluation system that takes about 30 minutes a week to run, catches regressions before they ship, and prevents the Monday morning client email that says "something changed."

In This Episode

Santi and Kira build a minimal LLM evaluation loop that catches quality regressions before your clients do. Starting with Santi's real-world failure story—shipping a prompt update that broke client summaries while he was offline in Lisbon—they walk through the three-piece system: golden test sets built from real production data, AI judge prompts with cross-family bias mitigation, and pairwise A/B testing tied to a weekly scorecard. Kira challenges the reliability of AI judges, leading to a discussion of human sampling strategies and judge agreement thresholds. They demonstrate JSONL golden sets for three output types (email rewrites, JSON extraction, summarization), show how to wire Promptfoo into CI for automated regression gates, and design a Monday scorecard that tracks pass rates, costs per 100 jobs, and latency alongside judge reliability metrics. The episode includes practical guidance on model deprecation monitoring, rollback strategies, and scaling from one output type to three as your confidence builds.

Key Takeaways

  • Start with 15-20 real production cases in JSONL format for your highest-volume output type, not hundreds of synthetic examples—real edge cases are what break in production
  • Use cross-family AI judges (OpenAI generator + Anthropic judge) with blind randomized order, and have a human review 10-20% of live jobs each week, prioritizing borderline cases, to hold judge-human agreement at 95%
  • Track cost-per-100-jobs alongside pass rates in a weekly Monday scorecard, with CI regression gates that fail builds when quality drops below your threshold

Santi: Three output types. Fifty golden-set cases. One judge prompt per type. That's the entire LLM evaluation harness I run across both my businesses — and it catches regressions before my clients do.

Kira: How long did it take you to set up?

Santi: The first version? An afternoon. And I'll be honest — I put it off for months because I thought evals meant, like, a whole infrastructure project. Benchmarks, dashboards, a dedicated QA person.

Kira: You thought you needed a team.

Santi: I thought I needed a team. What I actually needed was fifty JSONL lines and a rubric prompt pointed at a different model family than my generator. That's it. That's the whole thing.

Kira: And before you had it?

Santi: Before I had it, I shipped a prompt update to my content repurposing tool on a Thursday — felt great about it, the outputs looked cleaner to me — and by Monday I had four clients asking why their summaries were dropping key facts. Four. And I didn't know because I had no baseline. No test set. No way to compare the new prompt against the old one except reading outputs one at a time in a café in Lisbon.

Kira: The vibes-based evaluation method.

Santi: The vibes-based evaluation method. Which works until it doesn't. And when it doesn't, you find out from your customers.

Kira: If you're running AI workflows in production right now — client-facing, revenue-generating workflows — and you don't have a way to test whether a prompt change or a model deprecation just broke something, you are shipping regressions to paying customers. That's not a quality problem. That's a churn problem. And it's invisible until the damage is done.

Santi: Today we're building the fix. A minimal evals loop you can stand up this week — golden test sets for three output types, AI judge prompts that actually work, pairwise A/B testing for prompt changes, and a Monday scorecard that ties quality to cost so you know exactly what you're shipping and what it's costing you. Thirty minutes a week to keep it running.

Kira: Okay so before we build anything — why does this matter specifically for us? For people running businesses from laptops in different cities every month?

Santi: Because the failure mode is silent. If your website goes down, you get an alert. If your Stripe integration breaks, payments fail and you know immediately. But if your LLM starts producing worse outputs — slightly less accurate summaries, slightly off-tone emails, JSON fields that are almost right but not quite — nobody tells you. The model doesn't throw an error. It just gets worse.

Kira: And you're asleep.

Santi: And you're asleep, or you're on a twelve-hour bus in Peru, or you're doing a visa run in Bangkok. Meanwhile your content repurposing tool is shipping summaries that drop key facts, and your clients are noticing before you do.

Kira: This happened to someone in my Slack community last month. She runs an AI writing service — seven contractors, four continents. Anthropic adjusted their Claude pricing, she rolled forward to a newer snapshot without testing, and the tone of her client emails shifted. Not dramatically — just enough that two enterprise clients flagged it in the same week.

Santi: And she didn't have a baseline to compare against.

Kira: Nothing. She was reading outputs manually. Which, when you're managing thirty-plus client accounts across time zones, means you're reading maybe five percent of what ships.

Santi: Right. So the question isn't whether you need LLM evaluation. The question is what's the smallest version that actually works. Three pieces — a golden test set, an AI judge, and a way to compare prompt A versus prompt B.

Kira: Golden sets first. Because I think people hear "test set" and imagine hundreds of cases and weeks of labeling.

Santi: Fifteen to twenty cases per output type. You're not building a benchmark — you're building a regression detector. Pick your most common output types. For me that's email rewrites, JSON extraction from invoices, and content summaries. Each case is an input paired with a known-good output, stored as JSONL — one line per case. Tag them so you can slice later. Formal tone, has PII, Spanish language, whatever.
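The golden-set format Santi describes could look roughly like the sketch below. The episode doesn't specify a schema, so the field names (`id`, `type`, `input`, `expected`, `tags`) are assumptions, and the two rows are placeholders standing in for real production cases.

```python
import json

# Hypothetical schema: these field names are illustrative, not from the episode.
# The rows are placeholders; a real golden set is pulled from production data.
GOLDEN_JSONL = """\
{"id": "email-001", "type": "email_rewrite", "input": "hey can u send the invoice asap", "expected": "Hello, could you please send the invoice as soon as possible?", "tags": ["formal_tone"]}
{"id": "email-002", "type": "email_rewrite", "input": "Hola, necesito el informe hoy", "expected": "Estimado equipo: necesito el informe hoy, por favor.", "tags": ["formal_tone", "spanish"]}
"""

def load_golden(text):
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def by_tag(cases, tag):
    """Slice the golden set by tag."""
    return [c for c in cases if tag in c.get("tags", [])]

cases = load_golden(GOLDEN_JSONL)
spanish = by_tag(cases, "spanish")
```

Keeping one case per line means adding a fresh production case is a one-line diff in the repo, and tags let you check pass rates for a single slice.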

Kira: And you pull these from real production data, not synthetic examples?

Santi: Always real data. Because synthetic test cases test synthetic problems. Your real edge cases — the invoice with a weird date format, the email where the client used slang your model misreads — those are the ones that break in production. You want the hard stuff in your golden set, not just the clean stuff.

Kira: Okay. Fifteen to twenty cases per type, tagged, JSONL. Now what — you just eyeball the results?

Santi: No. That's the vibes method again. This is where the AI judge comes in — the G-Eval pattern. Liu and colleagues published this at EMNLP in twenty twenty-three. You give a second model — different family from your generator — a rubric and the output, and you ask it to analyze first, then score. Not just "rate this one to five." Cite specific lines, then assign scores per dimension.

Kira: So for an email rewrite, the rubric might be — instruction-following, tone fit, clarity, and a hard-fail check for PII leaks?

Santi: Exactly those four. Each scored zero to one. And the key finding from G-Eval is that this rubric-guided approach aligns better with human ratings than classic metrics like ROUGE or BERTScore. Those older metrics compare the output to a reference text, by word overlap for ROUGE and by embedding similarity for BERTScore, so they can't tell you if a summary is actually faithful to the source.
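A minimal sketch of the rubric judge described above, for the email-rewrite case. It only builds the prompt and parses the reply; the call to the judge model is left out. The prompt wording, the JSON reply shape, the 0.7 pass mark, and treating the PII check as a boolean that overrides the scores are all assumptions, not the episode's actual judge prompt.

```python
import json

# Three scored dimensions plus a PII hard-fail, as in the conversation above.
DIMENSIONS = ["instruction_following", "tone_fit", "clarity"]

def build_judge_prompt(task, output):
    """Ask the judge to analyze first, then score each dimension from 0 to 1."""
    dims = ", ".join(DIMENSIONS)
    return (
        "You are grading an email rewrite.\n"
        f"Task given to the writer:\n{task}\n\n"
        f"Output to grade:\n{output}\n\n"
        "First write an analysis that cites specific lines of the output.\n"
        f"Then score each of these from 0 to 1: {dims}.\n"
        "Set pii_leak to true if the output exposes personal data.\n"
        'Reply as JSON: {"analysis": str, "scores": {...}, "pii_leak": bool}'
    )

def grade(judge_reply, pass_mark=0.7):
    """Parse the judge's JSON reply; a PII leak fails regardless of scores."""
    data = json.loads(judge_reply)
    scores = {d: float(data["scores"][d]) for d in DIMENSIONS}
    mean = sum(scores.values()) / len(scores)
    if data["pii_leak"]:
        return {"passed": False, "reason": "pii_leak", "mean": mean}
    return {"passed": mean >= pass_mark, "reason": "mean_score", "mean": mean}
```

Asking for the analysis before the scores follows the analyze-then-score order Santi mentions; the hard-fail is checked separately so a well-written email that leaks personal data can never pass on averages.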

Kira: Okay but — and this is the important part — the judge is also an LLM. So how do you trust it?

Santi: You don't. Not blindly.

Kira: Wait, you just told me to use an AI judge and now you're saying don't trust it?

Santi: Trust it as triage, not as truth. There's a twenty twenty-four paper — Wataoka and colleagues — that documents self-preference bias in LLM judges. If your generator is GPT-4o and your judge is also GPT-4o, the judge will systematically rate those outputs higher than equivalent outputs from Claude or Gemini. It's measurable.

Kira: So the judge is biased toward its own family.

Santi: Yes. And a twenty twenty-six ICLR paper goes further: preference leakage. Even related model families can inflate each other's scores. So rule one: cross-family judges. Generate with OpenAI, judge with Anthropic. Or vice versa. Rule two: blind the order on pairwise A/B comparisons. Randomize which output the judge sees first. There's documented position bias, where judges tend to prefer whatever they see first, and a separate length bias, where they prefer whatever is longer.
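Rule two can be sketched as follows. This is an illustration under assumptions: the function names and the `first` / `second` / `tie` verdict vocabulary are hypothetical, and the judge call itself is omitted.

```python
import random

def make_blind_pair(output_a, output_b, rng):
    """Shuffle two candidates; return what the judge sees plus the hidden key."""
    labeled = [("A", output_a), ("B", output_b)]
    rng.shuffle(labeled)
    shown = {"first": labeled[0][1], "second": labeled[1][1]}
    key = {"first": labeled[0][0], "second": labeled[1][0]}
    return shown, key

def unblind(verdict, key):
    """Map the judge's verdict ('first', 'second', or 'tie') back to A or B."""
    return "tie" if verdict == "tie" else key[verdict]

def win_rate(results, prompt="B"):
    """Share of non-tie comparisons won by the given prompt."""
    decided = [r for r in results if r != "tie"]
    if not decided:
        return 0.0
    return sum(r == prompt for r in decided) / len(decided)
```

The judge only ever sees `shown`; the `key` stays on your side, so the verdict is mapped back to prompt A or prompt B after the judge has answered. Ties are worth keeping in a separate list, since they are natural candidates for human review.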

Kira: This is the Chatbot Arena pattern. LMSYS has been doing this with human raters since twenty twenty-three — anonymous, randomized, pairwise votes.

Santi: Same principle. We're just doing it locally with an AI judge instead of thousands of human voters.

Kira: But the human voters are the ground truth. An AI judge is an approximation. So where's the check on the check?

Santi: Human sampling. Ten to twenty percent of your live production jobs, reviewed by a human every week. And you focus that sample — pull the ones where the judge was least confident. Borderline scores, ties in pairwise comparisons. That's where the judge is most likely to be wrong, and that's where your human attention has the highest return.
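One way to focus the human sample on borderline cases is to rank jobs by how close the judge's score sits to the pass mark. The `judge_score` field, the 0.7 pass mark, and the 15% default are assumptions for illustration; the episode only gives the ten to twenty percent range.

```python
import math

def pick_for_review(jobs, fraction=0.15, pass_mark=0.7):
    """Return the jobs whose judge score sits closest to the pass mark."""
    if not jobs:
        return []
    n = max(1, math.ceil(len(jobs) * fraction))
    ranked = sorted(jobs, key=lambda j: abs(j["judge_score"] - pass_mark))
    return ranked[:n]
```

A job scored 0.71 is far more likely to be a judge mistake than one scored 0.99, so distance from the pass mark is used here as a cheap proxy for judge uncertainty.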

Kira: Right, so you're not replacing human review — you're focusing it. Going from "review everything" to "review the fifteen jobs where the judge was uncertain."

Santi: That's the difference between a full-time QA person and thirty minutes on Monday morning.

Kira: Okay, so we've got golden sets, a rubric judge with cross-family separation and blind order, and human sampling on the borderlines. Now — how does this actually run? Because I'm imagining someone thinking, "great, another system I have to manually trigger every time I change a prompt."

Santi: No. This runs in CI. You push a prompt change, the eval suite runs automatically, and if your pass rate drops below your threshold, the build fails. You can't ship the regression.

Kira: What tool?

Santi: Promptfoo. Open source, lightweight CLI. YAML config pointing at your golden sets and judge prompts, pass-rate thresholds — ninety percent for email rewrites, eighty-eight for summarization — wired into GitHub Actions. Non-zero exit code if quality regresses. Your PR gets blocked.
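The gate logic itself is small. The sketch below is not Promptfoo's config or API; it is a plain-Python stand-in showing what "fail the build below a pass-rate threshold" means, using the two thresholds quoted above. The result format and function names are hypothetical.

```python
# Thresholds quoted in the episode; no threshold is stated for JSON extraction.
THRESHOLDS = {"email_rewrite": 0.90, "summarization": 0.88}

def pass_rates(results):
    """results: list of {"type": str, "passed": bool} -> pass rate per type."""
    totals, passes = {}, {}
    for r in results:
        totals[r["type"]] = totals.get(r["type"], 0) + 1
        passes[r["type"]] = passes.get(r["type"], 0) + int(r["passed"])
    return {t: passes[t] / totals[t] for t in totals}

def gate(results, thresholds=THRESHOLDS):
    """Return 0 if every type meets its threshold, else 1 (a failing exit code)."""
    rates = pass_rates(results)
    failures = {t: r for t, r in rates.items() if r < thresholds.get(t, 0.0)}
    for t, r in failures.items():
        print(f"FAIL {t}: pass rate {r:.0%} is below {thresholds[t]:.0%}")
    return 1 if failures else 0

# 17 of 20 email cases pass: 85% is below the 90% threshold, so the gate fails.
demo = [{"type": "email_rewrite", "passed": i < 17} for i in range(20)]
exit_code = gate(demo)
```

In a CI job, passing `exit_code` to `sys.exit` is what turns a quality drop into a blocked pull request.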

Kira: So the regression gate is automated. You don't have to remember to run evals.

Santi: You don't have to remember anything. That's the Lisbon Test for this whole system — can it run without you being awake? The CI gate runs on every pull request. Golden sets live in your repo. Judge prompts live in your repo. Push a change, the system tells you if it's safe to ship.

Kira: And the Monday scorecard — that's the human layer on top?

Santi: Right. Every Monday, thirty minutes. Six numbers. Pass rate per output type. Win rate from your pairwise A/Bs. P ninety-five latency. Cost per hundred jobs. Judge agreement with your human sample. And incidents — anything that broke during the week.
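The six numbers fit in one small record. This is a sketch with hypothetical field names; the p95 helper uses the nearest-rank method, which is one common definition and an assumption here since the episode doesn't say how latency is computed. Every value except the cost figure is a placeholder.

```python
from dataclasses import dataclass

@dataclass
class Scorecard:
    pass_rate: dict            # pass rate per output type
    win_rate: float            # from pairwise A/B comparisons
    p95_latency_ms: float
    cost_per_100_usd: float
    judge_agreement: float     # agreement with the human sample
    incidents: int             # anything that broke during the week

def p95(latencies_ms):
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("no latencies recorded")
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]

# Placeholder values, except the cost figure quoted in the episode.
card = Scorecard(
    pass_rate={"email_rewrite": 0.93},
    win_rate=0.6,
    p95_latency_ms=p95([800, 950, 1200, 4000]),
    cost_per_100_usd=5.20,
    judge_agreement=0.96,
    incidents=0,
)
```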

Kira: Walk me through cost per hundred jobs.

Santi: Both OpenAI and Anthropic return token usage on every API call. You multiply by the per-token price, sum it up. My content repurposing tool runs about four cents per job for generation and one point two cents per judge call. So cost per hundred jobs is five dollars twenty cents. That number goes on the scorecard. If it jumps, you see it Monday morning —
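Santi's arithmetic, written out. `cost_per_100` reproduces the figures he quotes; `call_cost` shows the multiply-by-price step, with prices expressed per million tokens as an assumption about how the rates are quoted. Any prices you pass in are your own numbers, not real provider rates.

```python
from decimal import Decimal

def call_cost(input_tokens, output_tokens, in_price_per_mtok, out_price_per_mtok):
    """Dollar cost of one call from token usage and per-million-token prices."""
    total = (Decimal(input_tokens) * Decimal(in_price_per_mtok)
             + Decimal(output_tokens) * Decimal(out_price_per_mtok))
    return total / Decimal(1_000_000)

def cost_per_100(generation_cost, judge_cost):
    """Dollars per hundred jobs from per-job generation and judge costs."""
    return (Decimal(generation_cost) + Decimal(judge_cost)) * 100

# Figures from the episode: 4 cents generation plus 1.2 cents judging per job.
weekly = cost_per_100("0.04", "0.012")
```

Here `weekly` comes to 5.20 dollars, matching the number on the scorecard. `Decimal` with string inputs avoids the small float rounding errors that creep in when you sum thousands of fractional-cent costs.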

Kira: Before it compounds for a month.

Santi: Before it compounds for a month. I lost three weeks to an Anthropic pricing change last year because I wasn't tracking per-job cost. Never again.

Kira: And the decision row — roll forward, hold, or roll back?

Santi: If pass rates are stable and costs are in line, roll forward. If something dipped, hold and investigate. If a model deprecation broke your judge or generator, roll back to the last known-good snapshot. Keep your last two working model versions pinned in your environment variables so rollback is a one-line change.
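The decision row and the pinned-config idea can be sketched together. The environment variable names are hypothetical, and so is the ordering of the checks (a deprecation break wins over everything else); the episode states the three outcomes but not a precedence.

```python
import os

def pinned_models(env=None):
    """Read the current and previous generator snapshots from the environment.

    GENERATOR_MODEL and GENERATOR_MODEL_PREVIOUS are hypothetical names.
    """
    env = os.environ if env is None else env
    return env.get("GENERATOR_MODEL", ""), env.get("GENERATOR_MODEL_PREVIOUS", "")

def decide(pass_rates_stable, costs_in_line, deprecation_broke_something):
    """Return the Monday decision: roll forward, hold, or roll back."""
    if deprecation_broke_something:
        return "roll back"
    if pass_rates_stable and costs_in_line:
        return "roll forward"
    return "hold"
```

With both snapshots pinned, rolling back means swapping the value of one variable for the other.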

Kira: Not a "what were we running three weeks ago" Slack thread.

Santi: Not a Slack thread. A pinned config.

Kira: Okay. So the weekly rhythm is — Friday, add three to five fresh cases from that week's production traces. Sunday, open a PR with any prompt or model changes and let CI run. Monday, fill in the scorecard, make the decision, assign one action item. Daily, alerts on latency and cost thresholds catch spikes before the weekly review.

Santi: That's it. Monthly — refresh golden sets, close stale failures, recalibrate your judge if agreement drops below target.

Kira: What's a reasonable target for judge-human agreement?

Santi: Ninety-five percent. If the judge and my human reviewers disagree on more than five percent of sampled cases for two consecutive weeks, I tighten my rubric prompts or swap the judge model. The research doesn't give us a hard number here — the ten to twenty percent sampling rate is practitioner consensus, not a peer-reviewed standard — so calibrate to your own tolerance.
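Santi's rule of thumb translates into two small functions. This sketch assumes each sampled case reduces to a pass/fail verdict from the judge and from the human reviewer; the function names are hypothetical.

```python
def agreement(pairs):
    """pairs: list of (judge_passed, human_passed) booleans -> agreement rate."""
    if not pairs:
        raise ValueError("no sampled cases to compare")
    return sum(j == h for j, h in pairs) / len(pairs)

def needs_recalibration(weekly_agreement, target=0.95, weeks=2):
    """True when the most recent `weeks` weeks all fall below the target."""
    recent = weekly_agreement[-weeks:]
    return len(recent) == weeks and all(a < target for a in recent)
```

For example, 19 matching verdicts out of 20 sampled cases is exactly 95 percent, which meets the target. A single bad week does not trigger anything; two in a row does.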

Kira: Okay, ninety-five percent. Got it. But I want to push back on something. Someone listening who runs a service business — not a product, a service — might be thinking, "I don't have three clean output types. My workflows are messy. My clients all want different things."

Santi: Pick one. The one output type that ships the most volume. Build fifteen golden-set cases for that one type. Wire up one judge prompt. Run it for two weeks. You'll catch things you didn't know were breaking.

Kira: Start with one. Not three.

Santi: Start with one. Three is the full version. One is the version that fits in an afternoon and still saves you from the Monday morning client email that says "something changed."

Kira: So — the vibes-based evaluation method.

Santi: Yeah.

Kira: That's where most of us are right now. Reading a handful of outputs, feeling good about a prompt change, shipping it, and hoping nothing broke while we're offline. And the thing that gets me is — it's not laziness. It's that we didn't know how small the fix could be. Fifteen golden-set cases. One rubric judge. One CI gate. Thirty minutes on Monday. That's the whole system.

Santi: And it compounds. Every week you add a few fresh cases, your golden set gets sharper. Every Monday you fill in the scorecard, you build a history of what your system actually does under real conditions. Three months from now you've got a quality record that lets you swap models, change prompts, or onboard a new client with confidence — because you can prove the outputs hold up.

Kira: Not "I think it's fine." Here are the pass rates.

Santi: Here are the pass rates, here's the cost, here's the latency. That's the conversation you want to have with a client. Not "trust me."

Kira: We put together a starter kit for this. It's on the Resources page. JSONL golden-set templates for all three output types, the G-Eval-style judge prompts, a pairwise A/B judge prompt, the Promptfoo CI config, and the Monday scorecard as a Notion template you can duplicate. Plus a deprecations checklist so you don't get caught by a model retirement.

Santi: Your one action this week — pick your highest-volume output type, pull fifteen real examples from your production data, and build your first golden set. That's it. Don't try to do all three. Don't build the CI gate yet. Just get fifteen cases into a JSONL file and run one judge prompt against them. You'll know within an hour whether your current outputs are as good as you think they are.

Kira: And if they're not — now you know before your clients do.

Santi: See you Wednesday.

Kira: See you Wednesday.

LLM evaluation, AI quality assurance, prompt testing, regression detection, nomad business operations, AI judge prompts, golden test sets, CI/CD automation, cost tracking, model deprecation