Episode 14

Build a QA Wall That Catches Bad AI Outputs Before Clients See Them

Intro

This episode is for AI content agencies and automation consultants who deliver work asynchronously across time zones. You'll learn how to build a systematic quality control process that prevents client-facing failures without becoming a bottleneck yourself.

In This Episode

Kira walks through the three-layer QA Wall she built after a fabricated citation reached a client's 11,000 LinkedIn followers. Layer one uses golden-set regression tests to catch prompt and model drift before any client work ships. Layer two runs deterministic checks (word count, PII detection, schema validation) plus dual LLM judges with explicit rubrics on every deliverable. Layer three strategically samples 10-20% of outputs for human review, with mathematical detection probabilities and automatic batch pausing when quality drops. The system includes spend guardrails that cap costs per job and per client, plus an offline-first SQLite queue that works from sketchy café wifi.

Key Takeaways

  • Golden-set tests with 20 diverse examples catch model drift and prompt changes before clients see broken outputs
  • Dual LLM judges with explicit rubrics and rationale requirements create better quality gates than single-judge systems
  • Strategic 10-20% human sampling with detection probability math prevents quality issues while controlling review costs

Companion Resource

  • Liu et al., EMNLP 2023 (G-Eval), aclanthology.org: Proposes rubric-guided, chain-of-thought scoring with GPT-4 and reports stronger alignment with human judgments than earlier automatic metrics for tasks like summarization and dialogue.
  • Chiang & Lee 2023, "A Closer Look into Automatic Evaluation Using LLMs", arxiv.org: Shows that prompting details and requiring explanations can significantly improve LLM-as-judge correlation with humans, though reliability varies by task and setup.
  • "How Reliable is Multilingual LLM-as-a-Judge?", Findings of EMNLP 2025, aclanthology.org: Highlights reliability and bias concerns with LLM-as-judge across languages and settings, advising caution and multi-rater/aggregation strategies.
  • Acceptance sampling (overview), en.wikipedia.org: Provides a mathematical basis for choosing sample sizes to detect defect rates with target confidence; detection probability ≈ 1 − (1 − p)^n for large lots under binomial assumptions.
  • OpenAI Help Center, "Controlling the length of model responses", help.openai.com: The Responses API supports hard output-length controls via max_output_tokens, enabling per-job token caps at the request level.
  • Anthropic Docs, rate limits and spend limits, platform.claude.com: Documents organization-level spend limits and rate limits for the Claude API, including customer-set monthly spend caps and tier ceilings.
  • Slack Developer Docs, Incoming Webhooks, docs.slack.dev: Incoming Webhooks accept JSON payloads (including Block Kit) to post alerts into channels without requiring a bot token.
  • OWASP Validation Regex Repository, owasp.org: Maintains vetted validation patterns, including credit cards and U.S. phone numbers, useful for lightweight PII heuristics before model-based checks.
  • AWS Comprehend, detecting PII entities, docs.aws.amazon.com: Provides PII detection and redaction with a taxonomy of universal entities (emails, credit cards, etc.) and confidence scores.
  • OpenAI Evals docs (GitHub), github.com: Outlines building datasets and eval classes, reinforcing the practice of testing against an explicit golden set on each change.
  • LangSmith Evaluation docs, docs.langchain.com: Supports versioned datasets, LLM-as-judge evaluators, human annotation queues, and pairwise comparisons, covering automated and human QA in one loop; maps directly to Layers 1 and 2 of the QA Wall.
  • AWS A2I (Augmented AI) documentation and blog, aws.amazon.com: A named, production-grade pattern for triggering human review when model confidence is low (used with Translate and Textract); this is Layer 3 of the QA Wall.
  • Fly.io, "Introducing LiteFS", fly.io: Positions SQLite + LiteFS replication for edge/offline scenarios; durable local writes and later sync let jobs queue and retry without data loss, supporting the offline-first queue requirement.

Kira: Forty-seven deliverables a week. That's what my agency pushes through AI workflows right now — blog posts, LinkedIn drafts, case study summaries, email sequences. Forty-seven pieces of content, seven contractors, four continents.

Santi: And how many of those get a human eye before they hit the client?

Kira: ...Until three months ago? Maybe five.

Santi: Five out of forty-seven.

Kira: Five out of forty-seven. And the other forty-two just... shipped. Whatever the model produced, whatever the contractor approved with a thumbs-up emoji in Slack — that's what the client got.

Santi: Thumbs-up emoji QA.

Kira: That was the system. And it worked — until it didn't. One Tuesday in February, a client forwarded me a blog post we'd delivered. It cited a study that doesn't exist. Fabricated author, fabricated journal, fabricated year. And she'd already published it.

Santi: How long before she caught it?

Kira: She didn't catch it. Her reader caught it. In the comments. On LinkedIn. In front of eleven thousand followers.

Santi: Oh no.

Kira: That client's still with me — barely. But the conversation we had that week changed how I build everything. Because the problem wasn't the model hallucinating. Models hallucinate. The problem was that nothing between the model and the client's inbox was checking.

Kira: The thing that separates AI agencies that keep clients from AI agencies that churn them isn't output quality. It's whether bad output ever reaches the client at all. That's what a QA Wall does — it catches the junk before anyone outside your team sees it, and it caps your costs so a runaway model doesn't drain your budget while you're on a flight to Medellín.

Santi: Three layers. Golden-set tests, automated judges, human sampling. Plus spend guards. You'll have the whole system by the end of this episode — and the template to wire it up this week.

Santi: So after the fabricated-citation incident — walk me through what you actually did. Because I know you didn't just start reading every deliverable yourself.

Kira: No. That's the trap, right? The instinct is to go full manual. Review everything. But forty-seven deliverables a week — if I'm spending even eight minutes per piece, that's over six hours. I'm back to being a freelancer, not running an agency.

Santi: And you're the bottleneck again. Which is the whole reason you built the AI workflows in the first place.

Kira: Exactly. So instead of reviewing everything, I built a wall. Three layers, each one catching different kinds of failures, and the whole thing runs before any deliverable leaves our system.

Santi: And this is the AI content QA workflow we're breaking down today. Layer one is the part most people skip entirely — the golden set.

Kira: Right. So imagine you're updating a prompt, or you swap from Claude to GPT-4o because the pricing changed, or Anthropic pushes a model update that subtly shifts tone. Any of those changes can break your output in ways you won't notice for days.

Santi: Unless you have a regression test.

Kira: Unless you have a regression test. And that's all a golden set is — twenty items that represent the range of work you deliver. A LinkedIn post, a case study summary, an FAQ answer, a product changelog, a meta description. Each one has the task, the input, and the expected properties. Not the exact expected output — the properties. Brand tone, word count range, must-include elements, must-avoid elements.

Santi: So you're not checking for an exact match. You're checking that the output still behaves the way it should.

Kira: Right. And you version it. Every time you change a prompt or swap a model, you run the golden set first. If something breaks — if your LinkedIn posts suddenly come back at two hundred and fifty words instead of one-twenty to one-fifty, or your case study summaries drop the numeric outcome — you catch it before a single client sees it.
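The property checks Kira describes can be sketched in a few lines. This is a minimal illustration, not the agency's actual code; the item fields, phrases, and word range are hypothetical examples of the "task, input, expected properties" shape.

```python
# One illustrative golden-set item: we check properties, not exact output.
GOLDEN_ITEM = {
    "task": "linkedin_post",
    "input": "Announce the Q3 case study results for Acme Co.",
    "word_range": (120, 150),        # must land inside this range
    "must_include": ["Acme"],        # brand/brief elements that must appear
    "must_avoid": ["guarantee"],     # banned phrases
}

def check_properties(output: str, item: dict) -> list:
    """Return a list of failure reasons; an empty list means the item passes."""
    failures = []
    words = len(output.split())
    lo, hi = item["word_range"]
    if not lo <= words <= hi:
        failures.append("word count %d outside %d-%d" % (words, lo, hi))
    for phrase in item["must_include"]:
        if phrase.lower() not in output.lower():
            failures.append("missing required phrase: " + phrase)
    for phrase in item["must_avoid"]:
        if phrase.lower() in output.lower():
            failures.append("contains banned phrase: " + phrase)
    return failures
```

Running every versioned item through a check like this on each prompt or model change is the whole of Layer 1; a CI step just fails the deploy when any item returns a non-empty list.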

Santi: OpenAI's Evals framework does exactly this. LangSmith too — you define a dataset, attach evaluators, and run it on every deploy. Statsig has written about this from the experimentation side — they stress that your golden set needs diversity and decontamination. You can't just test the easy cases.

Kira: Which is what I did at first. My first golden set was twelve items and they were all blog posts. Passed every time. Felt great. Then a client's FAQ answers started coming back with hallucinated product features and my golden set didn't catch it because I'd never tested FAQ answers.

Santi: Classic. You tested what you were good at, not what was likely to break.

Kira: So now it's twenty items across every content type we deliver. And this is the important part — the golden set runs in CI. It's automated. I don't touch it. If it fails, the deploy doesn't go through.

Santi: Okay, so layer one catches drift when you change something. But what about the forty-seven jobs running every week where nothing changed on your end — the model just had a bad day?

Kira: That's layer two. Every single job gets checked before it ships. And it's two stages — deterministic checks first, then an LLM judge.

Santi: The deterministic checks are the cheap ones. And honestly, they catch more than you'd think. Word count — is this LinkedIn post actually in the one-twenty to one-fifty range, or did the model spit out three hundred words? Schema validation — if you asked for JSON, is it valid JSON? URL checks — are the links pointing to allowed domains, or did the model invent a URL?
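The schema and URL checks Santi lists reduce to a few standard-library calls. A minimal sketch, with a hypothetical allowlist; the domain names are placeholders:

```python
import json
import re

# Illustrative allowlist: only links to these domains may ship.
ALLOWED_DOMAINS = {"example.com", "client-site.com"}

def valid_json(text: str) -> bool:
    """If the brief asked for JSON, the output must actually parse."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def disallowed_urls(text: str) -> list:
    """Return the domain of every URL that is not on the allowlist."""
    domains = re.findall(r"https?://([^/\s]+)", text)
    return [d for d in domains if d.lower() not in ALLOWED_DOMAINS]
```

Each check is instant and deterministic, which is why they run before any model-based stage.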

Kira: You also run PII checks at this stage, right?

Santi: Yeah, and this is where it gets serious. OWASP publishes vetted regex patterns for credit card numbers, US phone numbers, email addresses. You run those first — they're instant, they cost nothing. Then if you want a second pass, AWS Comprehend does model-based PII detection with confidence scores. Set your threshold at ninety percent confidence and auto-fail anything that trips it.

Kira: One of my contractors accidentally left a client's personal email in a draft last month. The regex caught it. Took zero seconds, cost zero dollars. That's the kind of failure an LLM judge might actually miss.

Santi: Which is why you run heuristics first. They're fast, they're deterministic, they never have a bad day. The LLM judge comes after.
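A regex-first PII pass like the one that caught the leaked email can look like this. These patterns are deliberately simplified stand-ins; the OWASP validation-regex repository has more rigorous versions.

```python
import re

# Simplified PII patterns for a zero-cost first pass (not the OWASP originals).
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_pii(text: str) -> dict:
    """Return matched PII strings keyed by pattern name; empty dict = clean."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

Anything this flags auto-fails before a single token is spent; a model-based pass like AWS Comprehend can then sweep up what the patterns miss.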

Kira: So the judge — and this is based on research from Liu and colleagues, the G-Eval paper from EMNLP twenty twenty-three — you give the model a rubric with explicit scoring dimensions. Faithfulness — are all claims supported by the sources you provided? Relevance — does this actually answer the brief? Style and tone — does it match the brand voice?

Santi: Each dimension scored one to five. Pass threshold is an average of four-point-oh or higher, and Faithfulness can't drop below four. If the model invents a citation — like the one that almost cost you that client — Faithfulness tanks and the job gets flagged.

Kira: And you require rationale. The judge has to cite specific lines when it deducts points. That's what Chiang and Lee showed in twenty twenty-three — requiring explanations significantly improves how well LLM judges correlate with human ratings.

Santi: I run two judges. A fast, cheap model and the primary model. If their average scores disagree by more than half a point, the job routes to a human automatically. No one has to make a decision — the system escalates on its own.

Kira: Dual judges. I hadn't done that until you showed me. I was running a single judge and wondering why some garbage still slipped through.

Santi: One judge is a filter. Two judges are a wall.
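The routing logic for the dual-judge gate is simple once the scores exist. This sketch stubs out the actual LLM calls and just encodes the thresholds from the conversation: average 4.0 to pass, faithfulness floor of 4, and more than half a point of disagreement escalates to a human.

```python
def route(primary: dict, secondary: dict) -> str:
    """Decide pass / fail / human from two judges' rubric scores (1-5).

    Each dict maps dimension -> score, e.g.
    {"faithfulness": 5, "relevance": 4, "style": 4}.
    """
    avg_a = sum(primary.values()) / len(primary)
    avg_b = sum(secondary.values()) / len(secondary)
    # Judges disagree by more than half a point: a human decides.
    if abs(avg_a - avg_b) > 0.5:
        return "human"
    # Faithfulness is a hard floor on either judge.
    if min(primary["faithfulness"], secondary["faithfulness"]) < 4:
        return "fail"
    # Passing requires an average of 4.0 or higher from both.
    return "pass" if min(avg_a, avg_b) >= 4.0 else "fail"
```

Note that the faithfulness floor fires even when the averages look healthy, which is exactly how a single invented citation tanks an otherwise polished draft.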

Kira: So layers one and two are fully automated. Layer three is where humans come back in — but strategically.

Santi: Not reviewing everything. Sampling.

Kira: Sampling. And there's actual math behind how much you need to sample. The formula is simple — detection probability equals one minus the quantity one minus your expected error rate, raised to the power of your sample size. So if you think five percent of outputs have escaped errors and you sample twenty out of a hundred jobs, your probability of catching at least one bad one is about sixty-four percent.

Santi: Sixty-four percent doesn't sound great.

Kira: It's not. That's why I sample at twenty percent for new clients and ten percent for established ones. At twenty percent sampling with a ten percent error rate, you're catching at least one defect eighty-eight percent of the time.
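The sampling math Kira quotes is one line each way: the detection probability for a given sample size, and the inverse question of how many samples a target confidence requires.

```python
import math

def detection_probability(error_rate: float, sample_size: int) -> float:
    """P(at least one defect in the sample) = 1 - (1 - p)^n."""
    return 1 - (1 - error_rate) ** sample_size

def sample_size_for(error_rate: float, confidence: float) -> int:
    """Smallest n such that the detection probability reaches the target."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - error_rate))
```

With p = 0.05 and n = 20 this gives roughly 0.64, and with p = 0.10 and n = 20 roughly 0.88 — the two figures from the conversation.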

Santi: And the sampling is random?

Kira: Random, rotating reviewer, and I get a Slack alert for every sampled job. The reviewer has a simple pass-fail form. If they fail it, the whole batch pauses and I get a second alert.

Santi: Wait — the whole batch?

Kira: The whole batch for that client. Because if one job failed human review, the error rate assumption just changed. I'd rather pause and re-check than let three more bad deliverables through while I'm asleep.

Santi: That's actually smart. I would have just flagged the individual job and kept going.

Kira: And that's how you end up with the fabricated-citation problem. One bad output is a signal, not an isolated event.
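The random-sample-plus-batch-pause behavior can be sketched as two small functions. The batch dict shape here is hypothetical; the point is that a single failed human review flips the whole client batch to paused.

```python
import random

def sample_for_review(job_ids: list, rate: float, seed=None) -> list:
    """Randomly pick roughly `rate` of the batch for human review."""
    rng = random.Random(seed)
    k = max(1, round(len(job_ids) * rate))
    return rng.sample(job_ids, k)

def on_review_result(batch: dict, job_id: str, passed: bool) -> dict:
    """Record a reviewer's pass/fail; one failure pauses the whole batch."""
    batch["reviews"][job_id] = passed
    if not passed:
        # The error-rate assumption just changed: stop shipping this client.
        batch["status"] = "paused"
    return batch
```

The pause is conservative on purpose: re-checking a paused batch is cheap compared to three more bad deliverables shipping overnight.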

Santi: Okay, the other half of this — because QA isn't free. Every LLM judge call costs tokens. Every retry costs tokens. And if you're not capping that, you can wake up to a bill that eats your margins.

Kira: This happened to you, right?

Santi: March. I had a retry loop that got stuck — a job kept failing the judge, retrying, failing, retrying. Burned through about forty dollars in tokens on a single deliverable that was worth maybe two hundred to the client. Twenty percent of the job value gone on QA alone.

Kira: So what's the fix?

Santi: Two caps. Per-job — you set max output tokens on every API call. OpenAI's Responses API supports this natively. If the job exceeds the token cap or retries more than twice, it short-circuits to human review instead of burning more money. Per-client — you track monthly spend in a simple database. When a client hits eighty percent of their monthly cap, you get a Slack alert. Ninety percent, another alert. A hundred percent, the queue pauses automatically.
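Both guards reduce to threshold checks. A minimal sketch — the default cap values here are illustrative, not recommendations:

```python
def check_job_guard(tokens_used: int, retries: int,
                    max_tokens: int = 4000, max_retries: int = 2) -> str:
    """Per-job guard: short-circuit to a human instead of retrying forever."""
    if tokens_used > max_tokens or retries > max_retries:
        return "human_review"
    return "continue"

def check_client_guard(month_spend: float, monthly_cap: float) -> str:
    """Per-client guard: alert at 80% and 90%, pause the queue at 100%."""
    ratio = month_spend / monthly_cap
    if ratio >= 1.0:
        return "pause_queue"
    if ratio >= 0.9:
        return "alert_90"
    if ratio >= 0.8:
        return "alert_80"
    return "ok"
```

The per-job token ceiling itself is set on the API call (max_output_tokens on OpenAI's Responses API); this guard is the layer that decides what happens when a job keeps bumping into it.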

Kira: And the queue itself — this ties back to the offline-first stuff we covered a couple episodes ago. The whole QA system runs on SQLite. Jobs table, QA events table, client spend tracking. It works on your laptop with no internet. When you're back online, it syncs.

Santi: Fly dot io's LiteFS handles the replication piece if you want it on a server. But honestly, for most solo operators, a local SQLite file with a cron job is enough.

Kira: The Lisbon Test — can you run this from a café with sketchy wifi?

Santi: Passes. The QA wall runs locally. The Slack alerts queue up and fire when you reconnect. The spend caps are enforced locally because the database is local. The only thing that needs internet is the LLM judge calls, and those retry with exponential backoff.
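A local-first queue along these lines fits in one SQLite file. The schema below is a hypothetical minimal version of the three tables mentioned — jobs, QA events, client spend — not the starter pack's actual schema.

```python
import sqlite3

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the local-first QA schema: jobs, QA events, client spend."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS jobs (
            id      INTEGER PRIMARY KEY,
            client  TEXT NOT NULL,
            status  TEXT NOT NULL DEFAULT 'queued',
            retries INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE IF NOT EXISTS qa_events (
            id     INTEGER PRIMARY KEY,
            job_id INTEGER REFERENCES jobs(id),
            layer  TEXT NOT NULL,   -- 'golden', 'heuristic', 'judge', 'human'
            result TEXT NOT NULL    -- 'pass', 'fail', 'escalated'
        );
        CREATE TABLE IF NOT EXISTS client_spend (
            client    TEXT PRIMARY KEY,
            month_usd REAL NOT NULL DEFAULT 0
        );
    """)
    return conn

def enqueue(conn: sqlite3.Connection, client: str) -> int:
    """Queue a job locally; it syncs (or replicates via LiteFS) when online."""
    cur = conn.execute("INSERT INTO jobs (client) VALUES (?)", (client,))
    conn.commit()
    return cur.lastrowid
```

Because writes land in a local file, the Lisbon Test holds: jobs queue and QA events record with no connectivity, and only the judge calls themselves wait for the network.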

Kira: Okay, I want to push on something. Because there's a real objection here. We're using an LLM to check the output of an LLM. A paper from EMNLP twenty twenty-five — "How Reliable is Multilingual LLM-as-a-Judge" — found significant reliability gaps. Bias, inconsistency across languages, across task types. So aren't we just adding a second failure mode?

Santi: Yeah. If the judge were the only layer, that would be a real problem. And honestly, for the first two weeks I ran a single LLM judge with no heuristics, I had a false sense of security. Stuff was passing that shouldn't have.

Kira: So the judge is triage, not truth.

Santi: Exactly. The judge is triage. The heuristics catch the mechanical failures — PII, word count, broken JSON, hallucinated URLs. The golden set catches drift. The human sample catches whatever the machines miss. The judge sits in the middle and handles the subjective stuff — tone, faithfulness, relevance — where it's actually pretty good when you give it a rubric and force it to explain itself.

Kira: And the dual-judge agreement requirement means if the two models disagree, a human decides. You're not trusting any single model's judgment.

Santi: No single layer is the safety net. The wall is the safety net. That's the whole architecture — redundancy at every level, same principle as the failover system we built in episode eight. No single point of failure.

Kira: Antifragile QA.

Santi: Don't start.

Kira: So here's where I keep coming back to. That blog post — the fabricated citation, the LinkedIn comments, the client call where I had to explain how it happened. If the QA Wall had existed then, layer two catches it. The faithfulness check flags an unsourced claim, the judge scores it below four, the job routes to a human reviewer before it ever leaves our system. My client never publishes it. Her readers never see it. And I don't spend three weeks rebuilding trust.

Santi: That's the whole argument. You don't build QA because you think your AI is bad. You build it because the one time it is bad, you need something between the model and the client that says no.

Kira: And it doesn't have to be complicated. Twenty golden-set items in a YAML file. A handful of regex checks and a judge rubric. A ten percent human sample with Slack alerts. Two spend caps. That's the system.

Santi: If you want to skip the setup from scratch, we put together the QA Wall Starter Pack — it's on the Resources page. The golden set template, the judge rubric, the heuristics file, the Slack alert payloads, the SQLite schema, and the Python guardrails code. It's the same system we walked through today, ready to drop into your repo.

Kira: Your one thing this week — stand up layer one. Just the golden set. Twenty items, run it once manually against your current prompts. See what breaks. You'll be surprised.

Santi: And when something breaks, you'll be glad you caught it before your client did.

Kira: See you Wednesday.

Santi: See you Wednesday.

AI content QA, quality control, golden set testing, LLM judges, human sampling, spend guardrails, offline-first, client retention, automation workflows, nomad business operations