Kira: Forty-seven deliverables a week. That's what my agency pushes through AI workflows right now — blog posts, LinkedIn drafts, case study summaries, email sequences. Forty-seven pieces of content, seven contractors, four continents.
Santi: And how many of those get a human eye before they hit the client?
Kira: ...Until three months ago? Maybe five.
Santi: Five out of forty-seven.
Kira: Five out of forty-seven. And the other forty-two just... shipped. Whatever the model produced, whatever the contractor approved with a thumbs-up emoji in Slack — that's what the client got.
Santi: Thumbs-up emoji QA.
Kira: That was the system. And it worked — until it didn't. One Tuesday in February, a client forwarded me a blog post we'd delivered. It cited a study that doesn't exist. Fabricated author, fabricated journal, fabricated year. And she'd already published it.
Santi: How long before she caught it?
Kira: She didn't catch it. Her reader caught it. In the comments. On LinkedIn. In front of eleven thousand followers.
Santi: Oh no.
Kira: That client's still with me — barely. But the conversation we had that week changed how I build everything. Because the problem wasn't the model hallucinating. Models hallucinate. The problem was that nothing between the model and the client's inbox was checking.
Kira: The thing that separates AI agencies that keep clients from AI agencies that churn them isn't output quality. It's whether bad output ever reaches the client at all. That's what a QA Wall does — it catches the junk before anyone outside your team sees it, and it caps your costs so a runaway model doesn't drain your budget while you're on a flight to Medellín.
Santi: Three layers. Golden-set tests, automated judges, human sampling. Plus spend guards. You'll have the whole system by the end of this episode — and the template to wire it up this week.
Santi: So after the fabricated-citation incident — walk me through what you actually did. Because I know you didn't just start reading every deliverable yourself.
Kira: No. That's the trap, right? The instinct is to go full manual. Review everything. But forty-seven deliverables a week — if I'm spending even eight minutes per piece, that's over six hours. I'm back to being a freelancer, not running an agency.
Santi: And you're the bottleneck again. Which is the whole reason you built the AI workflows in the first place.
Kira: Exactly. So instead of reviewing everything, I built a wall. Three layers, each one catching different kinds of failures, and the whole thing runs before any deliverable leaves our system.
Santi: And this is the AI content QA workflow we're breaking down today. Layer one is the part most people skip entirely — the golden set.
Kira: Right. So imagine you're updating a prompt, or you swap from Claude to GPT-4o because the pricing changed, or Anthropic pushes a model update that subtly shifts tone. Any of those changes can break your output in ways you won't notice for days.
Santi: Unless you have a regression test.
Kira: Unless you have a regression test. And that's all a golden set is — twenty items that represent the range of work you deliver. A LinkedIn post, a case study summary, an FAQ answer, a product changelog, a meta description. Each one has the task, the input, and the expected properties. Not the exact expected output — the properties. Brand tone, word count range, must-include elements, must-avoid elements.
Santi: So you're not checking for an exact match. You're checking that the output still behaves the way it should.
Kira: Right. And you version it. Every time you change a prompt or swap a model, you run the golden set first. If something breaks — if your LinkedIn posts suddenly come back at two hundred and fifty words instead of one-twenty to one-fifty, or your case study summaries drop the numeric outcome — you catch it before a single client sees it.
Santi: OpenAI's Evals framework does exactly this. LangSmith too — you define a dataset, attach evaluators, and run it on every deploy. Statsig has written about this from the experimentation side — they stress that your golden set needs diversity and decontamination. You can't just test the easy cases.
Kira: Which is what I did at first. My first golden set was twelve items and they were all blog posts. Passed every time. Felt great. Then a client's FAQ answers started coming back with hallucinated product features and my golden set didn't catch it because I'd never tested FAQ answers.
Santi: Classic. You tested what you were good at, not what was likely to break.
Kira: So now it's twenty items across every content type we deliver. And this is the important part — the golden set runs in CI. It's automated. I don't touch it. If it fails, the deploy doesn't go through.
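A minimal sketch of what one golden-set item and its property checks could look like in Python — the field names, the example item, and the placeholder output are illustrative, not the Starter Pack format:

```python
# golden_check.py - illustrative golden-set property check (not the Starter Pack code)

def check_item(item: dict, output: str) -> list[str]:
    """Return property violations for one golden-set item."""
    failures = []
    words = len(output.split())
    lo, hi = item["word_count_range"]
    if not lo <= words <= hi:
        failures.append(f"word count {words} outside {lo}-{hi}")
    for must in item.get("must_include", []):
        if must.lower() not in output.lower():
            failures.append(f"missing required element: {must}")
    for banned in item.get("must_avoid", []):
        if banned.lower() in output.lower():
            failures.append(f"contains banned element: {banned}")
    return failures

# One example item; a real golden set has ~20 covering every content type you deliver.
item = {
    "task": "linkedin_post",
    "word_count_range": [120, 150],
    "must_include": ["Acme"],
    "must_avoid": ["click here"],
}
output = "..."  # in CI, this is the fresh output from the prompt or model under test
if failures := check_item(item, output):
    raise SystemExit("Golden set failed: " + "; ".join(failures))  # block the deploy
```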
Santi: Okay, so layer one catches drift when you change something. But what about the forty-seven jobs running every week where nothing changed on your end — the model just had a bad day?
Kira: That's layer two. Every single job gets checked before it ships. And it's two stages — deterministic checks first, then an LLM judge.
Santi: The deterministic checks are the cheap ones. And honestly, they catch more than you'd think. Word count — is this blog post actually in the one-twenty to one-fifty range, or did the model spit out three hundred words? Schema validation — if you asked for JSON, is it valid JSON? URL checks — are the links pointing to allowed domains, or did the model invent a URL?
Kira: You also run PII checks at this stage, right?
Santi: Yeah, and this is where it gets serious. OWASP publishes vetted regex patterns for credit card numbers, US phone numbers, email addresses. You run those first — they're instant, they cost nothing. Then if you want a second pass, AWS Comprehend does model-based PII detection with confidence scores. Set your threshold at ninety percent confidence and auto-fail anything that trips it.
Kira: One of my contractors accidentally left a client's personal email in a draft last month. The regex caught it. Took zero seconds, cost zero dollars. That's the kind of failure an LLM judge might actually miss.
Santi: Which is why you run heuristics first. They're fast, they're deterministic, they never have a bad day. The LLM judge comes after.
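A sketch of what the deterministic stage might look like in Python — the regex patterns here are deliberately simplified stand-ins, not OWASP's vetted ones, and the domain allowlist is hypothetical:

```python
# heuristics.py - illustrative pre-judge checks; simplified regexes, hypothetical allowlist
import json
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"client-site.com", "docs.client-site.com"}   # hypothetical allowlist
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")              # simplified email pattern
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")                 # naive card-number-shaped digits

def run_heuristics(output: str, expect_json: bool = False) -> list[str]:
    failures = []
    if expect_json:
        try:
            json.loads(output)
        except ValueError:
            failures.append("invalid JSON")
    for url in re.findall(r"https?://\S+", output):
        if urlparse(url).netloc.lower() not in ALLOWED_DOMAINS:
            failures.append(f"URL to unapproved domain: {url}")
    if EMAIL_RE.search(output):
        failures.append("possible email address (PII)")
    if CARD_RE.search(output):
        failures.append("possible card number (PII)")
    return failures   # any failure stops the job before the LLM judge ever runs
```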
Kira: So the judge — and this is based on research from Liu and colleagues, the G-Eval paper from EMNLP twenty twenty-three — you give the model a rubric with explicit scoring dimensions. Faithfulness — are all claims supported by the sources you provided? Relevance — does this actually answer the brief? Style and tone — does it match the brand voice?
Santi: Each dimension scored one to five. Pass threshold is an average of four-point-oh or higher, and Faithfulness can't drop below four. If the model invents a citation — like the one that almost cost you that client — Faithfulness tanks and the job gets flagged.
Kira: And you require rationale. The judge has to cite specific lines when it deducts points. That's what Chiang and Lee showed in twenty twenty-three — requiring explanations significantly improves how well LLM judges correlate with human ratings.
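A sketch of what a G-Eval-style rubric prompt and pass rule could look like — the prompt wording, the JSON response shape, and the function name are assumptions, not the exact Starter Pack rubric:

```python
# judge_rubric.py - illustrative rubric prompt and pass rule; wording and response shape are assumptions
JUDGE_PROMPT = """You are a QA reviewer. Score the DRAFT against the BRIEF and SOURCES
on three dimensions, each from 1 to 5:
- faithfulness: every claim is supported by the SOURCES; invented citations or facts score 1-2
- relevance: the draft actually answers the BRIEF
- style: the draft matches the brand voice guide

For every point you deduct, quote the specific line you are deducting for.
Respond as JSON: {"faithfulness": n, "relevance": n, "style": n, "rationale": "..."}"""

def passes(scores: dict) -> bool:
    avg = (scores["faithfulness"] + scores["relevance"] + scores["style"]) / 3
    return avg >= 4.0 and scores["faithfulness"] >= 4   # faithfulness can never dip below 4
```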
Santi: I run two judges. A fast, cheap model and the primary model. If their average scores disagree by more than half a point, the job routes to a human automatically. No one has to make a decision — the system escalates on its own.
Kira: Dual judges. I hadn't done that until you showed me. I was running a single judge and wondering why some garbage still slipped through.
Santi: One judge is a filter. Two judges are a wall.
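And a sketch of the dual-judge escalation Santi describes — the score dictionaries would come from two different judge models, and the routing labels are placeholders:

```python
# dual_judge.py - illustrative disagreement routing between a cheap judge and a primary judge

def avg(scores: dict) -> float:
    return (scores["faithfulness"] + scores["relevance"] + scores["style"]) / 3

def route(cheap: dict, primary: dict) -> str:
    if abs(avg(cheap) - avg(primary)) > 0.5:
        return "human_review"            # judges disagree by more than half a point: escalate
    if avg(primary) >= 4.0 and primary["faithfulness"] >= 4:
        return "ship"
    return "human_review"                # judges agree the output is below threshold: flag it
```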
Kira: So layers one and two are fully automated. Layer three is where humans come back in — but strategically.
Santi: Not reviewing everything. Sampling.
Kira: Sampling. And there's actual math behind how much you need to sample. The formula is simple — detection probability equals one minus the quantity one minus your expected error rate, raised to the power of your sample size. So if you think five percent of outputs have escaped errors and you sample twenty out of a hundred jobs, your probability of catching at least one bad one is about sixty-four percent.
Santi: Sixty-four percent doesn't sound great.
Kira: It's not. That's why I sample at twenty percent for new clients and ten percent for established ones. At twenty percent sampling with a ten percent error rate, you're catching at least one defect eighty-eight percent of the time.
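The detection-probability arithmetic from this exchange, as a quick sketch for picking your own sampling rate:

```python
# sampling.py - probability of catching at least one bad output in a random sample
def detection_probability(error_rate: float, sample_size: int) -> float:
    """P(at least one defect in sample) = 1 - (1 - error_rate) ** sample_size."""
    return 1 - (1 - error_rate) ** sample_size

print(round(detection_probability(0.05, 20), 2))  # ~0.64 - 5% error rate, 20 of 100 jobs sampled
print(round(detection_probability(0.10, 20), 2))  # ~0.88 - 10% error rate, 20% sample
```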
Santi: And the sampling is random?
Kira: Random, rotating reviewer, and I get a Slack alert for every sampled job. The reviewer has a simple pass-fail form. If they fail it, the whole batch pauses and I get a second alert.
Santi: Wait — the whole batch?
Kira: The whole batch for that client. Because if one job failed human review, the error rate assumption just changed. I'd rather pause and re-check than let three more bad deliverables through while I'm asleep.
Santi: That's actually smart. I would have just flagged the individual job and kept going.
Kira: And that's how you end up with the fabricated-citation problem. One bad output is a signal, not an isolated event.
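A sketch of that batch-pause behavior, assuming a local jobs table with client_id and status columns (hypothetical names) like the schema sketched later in the episode:

```python
# sample_fail.py - illustrative batch pause when a sampled job fails human review
import sqlite3

def on_sample_fail(conn: sqlite3.Connection, client_id: str) -> int:
    """One failed sample changes the error-rate assumption: pause everything still queued for that client."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'paused' WHERE client_id = ? AND status = 'queued'",
        (client_id,),
    )
    conn.commit()
    return cur.rowcount   # number of jobs paused; the operator gets a second Slack alert
```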
Santi: Okay, the other half of this — because QA isn't free. Every LLM judge call costs tokens. Every retry costs tokens. And if you're not capping that, you can wake up to a bill that eats your margins.
Kira: This happened to you, right?
Santi: March. I had a retry loop that got stuck — a job kept failing the judge, retrying, failing, retrying. Burned through about forty dollars in tokens on a single deliverable that was worth maybe two hundred to the client. Twenty percent of the job value gone on QA alone.
Kira: So what's the fix?
Santi: Two caps. Per-job — you set max output tokens on every API call. OpenAI's Responses API supports this natively. If the job exceeds the token cap or retries more than twice, it short-circuits to human review instead of burning more money. Per-client — you track monthly spend in a simple database. When a client hits eighty percent of their monthly cap, you get a Slack alert. Ninety percent, another alert. A hundred percent, the queue pauses automatically.
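A sketch of both caps — the thresholds, token limit, and return labels are placeholders, and the per-client numbers would come from the local spend table:

```python
# spend_guard.py - illustrative per-job and per-client spend caps; numbers and labels are placeholders
MAX_RETRIES = 2
MAX_OUTPUT_TOKENS = 1500   # also passed as the max output-token limit on each API call

def after_job_attempt(retries: int, tokens_used: int) -> str:
    """Per-job cap: short-circuit to a human instead of burning more tokens."""
    if retries > MAX_RETRIES or tokens_used > MAX_OUTPUT_TOKENS:
        return "human_review"
    return "continue"

def check_client_cap(spent_usd: float, monthly_cap_usd: float) -> str:
    """Per-client cap: alert at 80% and 90% of the monthly cap, pause the queue at 100%."""
    ratio = spent_usd / monthly_cap_usd
    if ratio >= 1.0:
        return "pause_queue"
    if ratio >= 0.9:
        return "alert_90"
    if ratio >= 0.8:
        return "alert_80"
    return "ok"
```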
Kira: And the queue itself — this ties back to the offline-first stuff we covered a couple episodes ago. The whole QA system runs on SQLite. Jobs table, QA events table, client spend tracking. It works on your laptop with no internet. When you're back online, it syncs.
Santi: Fly dot io's LiteFS handles the replication piece if you want it on a server. But honestly, for most solo operators, a local SQLite file with a cron job is enough.
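A minimal sketch of that local schema — three tables, with column names that are assumptions rather than the Starter Pack's exact schema:

```python
# qa_db.py - minimal local SQLite schema sketch
import sqlite3

conn = sqlite3.connect("qa_wall.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY,
    client_id TEXT NOT NULL,
    content_type TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued',     -- queued / shipped / paused / human_review
    tokens_used INTEGER DEFAULT 0,
    retries INTEGER DEFAULT 0,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS qa_events (
    id INTEGER PRIMARY KEY,
    job_id INTEGER REFERENCES jobs(id),
    layer TEXT NOT NULL,                       -- golden_set / heuristic / judge / human_sample
    result TEXT NOT NULL,                      -- pass / fail
    detail TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS client_spend (
    client_id TEXT NOT NULL,
    month TEXT NOT NULL,                       -- e.g. '2025-06'
    spent_usd REAL DEFAULT 0,
    monthly_cap_usd REAL NOT NULL,
    PRIMARY KEY (client_id, month)
);
""")
conn.commit()
```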
Kira: The Lisbon Test — can you run this from a café with sketchy wifi?
Santi: Passes. The QA wall runs locally. The Slack alerts queue up and fire when you reconnect. The spend caps are enforced locally because the database is local. The only thing that needs internet is the LLM judge calls, and those retry with exponential backoff.
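A sketch of the retry wrapper around those judge calls — the attempt count and delays are placeholders:

```python
# backoff.py - illustrative exponential backoff for the online judge calls
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry fn() with exponential backoff; give up and leave the job queued locally if still offline."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            time.sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, 8s, 16s
    return None
```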
Kira: Okay, I want to push on something. Because there's a real objection here. We're using an LLM to check the output of an LLM. A paper from EMNLP twenty twenty-five — "How Reliable is Multilingual LLM-as-a-Judge" — found significant reliability gaps. Bias, inconsistency across languages, across task types. So aren't we just adding a second failure mode?
Santi: Yeah. If the judge were the only layer, that would be a real problem. And honestly, for the first two weeks, when I was running a single LLM judge with no heuristics, I had a false sense of security. Stuff was passing that shouldn't have.
Kira: So the judge is triage, not truth.
Santi: Exactly. The judge is triage. The heuristics catch the mechanical failures — PII, word count, broken JSON, hallucinated URLs. The golden set catches drift. The human sample catches whatever the machines miss. The judge sits in the middle and handles the subjective stuff — tone, faithfulness, relevance — where it's actually pretty good when you give it a rubric and force it to explain itself.
Kira: And the dual-judge agreement requirement means if the two models disagree, a human decides. You're not trusting any single model's judgment.
Santi: No single layer is the safety net. The wall is the safety net. That's the whole architecture — redundancy at every level, same principle as the failover system we built in episode eight. No single point of failure.
Kira: Antifragile QA.
Santi: Don't start.
Kira: So here's what I keep coming back to. That blog post — the fabricated citation, the LinkedIn comments, the client call where I had to explain how it happened. If the QA Wall had existed then, layer two catches it. The faithfulness check flags an unsourced claim, the judge scores it below four, the job routes to a human reviewer before it ever leaves our system. My client never publishes it. Her readers never see it. And I don't spend three weeks rebuilding trust.
Santi: That's the whole argument. You don't build QA because you think your AI is bad. You build it because the one time it is bad, you need something between the model and the client that says no.
Kira: And it doesn't have to be complicated. Twenty golden-set items in a YAML file. A handful of regex checks and a judge rubric. A ten percent human sample with Slack alerts. Two spend caps. That's the system.
Santi: If you want to skip the setup from scratch, we put together the QA Wall Starter Pack — it's on the Resources page. The golden set template, the judge rubric, the heuristics file, the Slack alert payloads, the SQLite schema, and the Python guardrails code. It's the same system we walked through today, ready to drop into your repo.
Kira: Your one thing this week — stand up layer one. Just the golden set. Twenty items, run it once manually against your current prompts. See what breaks. You'll be surprised.
Santi: And when something breaks, you'll be glad you caught it before your client did.
Kira: See you Wednesday.
Santi: See you Wednesday.