Santi: Two cents. Sixteen seconds. That's what it costs to have an LLM judge score one deliverable against a rubric — two cents and sixteen seconds. UW Health ran the comparison this year. Same evaluation, same rubric, done by a physician? Fifty dollars. Ten minutes.
Kira: Per case.
Santi: Per case. So you do the math on a thousand deliverables a week, which is not crazy if you're running content ops or support workflows for multiple clients. A thousand cases through a human reviewer? Fifty thousand dollars and a hundred sixty-six hours. Through a calibrated LLM judge? Twenty bucks. About four and a half hours of compute if you run them back to back, and less on the clock if you run them in parallel while you sleep.
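The arithmetic Santi runs here can be reproduced in a few lines. This is a minimal sketch that assumes items are reviewed one after another; the function name `weekly_cost` is made up for illustration, and the per-item figures are the ones quoted in the episode.

```python
def weekly_cost(items, dollars_per_item, seconds_per_item):
    """Return (total dollars, total hours) for reviewing `items` one after another."""
    total_dollars = items * dollars_per_item
    total_hours = items * seconds_per_item / 3600
    return total_dollars, total_hours


# Figures quoted in the episode: $50 and 10 minutes per human review,
# 2 cents and 16 seconds per LLM-judge review.
human_dollars, human_hours = weekly_cost(1000, 50.00, 10 * 60)
judge_dollars, judge_hours = weekly_cost(1000, 0.02, 16)

print(f"human: ${human_dollars:,.0f}, {human_hours:.1f} h")
print(f"judge: ${judge_dollars:,.0f}, {judge_hours:.1f} h")
```

The hours are serial totals. Running judge calls concurrently would shorten the wall-clock time but not the bill.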
Kira: And you trust that?
Santi: Not blindly. No. That's the whole problem — people hear "LLM judge" and they either trust it completely or they dismiss it completely. Both are wrong.
Kira: Both will cost you clients.
Santi: Both will cost you clients.
Santi: Every AI deliverable you ship without a scored check is a bet — a bet that the model didn't drift, the prompt didn't break, and the output is still good enough to put your name on. Multiply that bet across a hundred jobs a week and you're not running a business. You're running a casino where the house edge works against you.
Kira: Today we're building the wall that kills that bet. Three layers — an LLM judge on every job, a golden-set replay that catches drift weekly, and a human sample with hard acceptance thresholds. You'll ship it this week. You'll run it in thirty minutes every Monday.
Kira: This happened to me three months ago. My AI task assignment system — the one that routes work to contractors across four continents — started sending complex briefs to the wrong people. Not obviously wrong. Just... subtly mismatched. And I didn't catch it for nine days because every output looked plausible on the surface. My contractor in Lagos caught it because she kept getting briefs outside her specialty and finally said something.
Santi: Nine days.
Kira: Nine days of misrouted work. And the only reason it wasn't worse is that she spoke up. That's not a system. That's luck.
Santi: And that's the failure mode nobody talks about. It's not a crash. It's not an error message. It's a five percent tone shift after a model update. A prompt that worked great in March starts hallucinating citations in May. Your support agent gets slightly more aggressive in its responses and nobody notices for two weeks.
Kira: Until a client notices.
Santi: Until a client notices. And by then you've shipped — what, a hundred, two hundred bad outputs? No record of when it started, no way to trace it back, and no way to prove to the client that you've fixed it.
Kira: So how do you catch that without hiring a full-time reviewer?
Santi: Three layers working together.
Santi: Layer one: every deliverable gets scored by an LLM judge against a weighted rubric before it ships. Every one. Five scored criteria plus a safety penalty. Task fulfillment at thirty percent, meaning did it follow the instructions. Factual accuracy at twenty-five. Clarity and structure at fifteen. Style and brand fit at ten. Citations at ten. And then a negative weight for safety flags: PII leakage, hallucinated claims, anything that could get you or your client in trouble.
Kira: And the thresholds?
Santi: Zero to five per criterion, rolled up into a weighted total on a zero-to-one scale. If the total is point eight or above, there are no critical flags, and the two heaviest criteria, task fulfillment and factual accuracy, both score four or higher, it ships as green. Automatically. No human touches it. Point seven to point eight, or any single criterion at two or below? Amber, and it routes to a human edit queue. Below point seven, or any critical flag? Red. Blocked. Escalated.
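The rubric and the green, amber, red rules described above can be sketched roughly like this. Several details are assumptions rather than anything stated in the episode: the quoted weights add up to ninety percent, so this sketch divides by the weight sum to land on a zero-to-one scale; the size of the negative safety weight is never given, so a critical flag is modeled as a simple boolean that forces red; and the names `WEIGHTS`, `weighted_total`, and `gate` are hypothetical.

```python
# Weights as quoted in the episode (they sum to 0.90).
WEIGHTS = {
    "task_fulfillment": 0.30,
    "factual_accuracy": 0.25,
    "clarity_structure": 0.15,
    "style_brand_fit": 0.10,
    "citations": 0.10,
}


def weighted_total(scores):
    """Map 0-5 criterion scores to a 0-1 total.

    Assumption: divide each score by 5, then normalize by the weight sum.
    """
    raw = sum(WEIGHTS[name] * scores[name] / 5 for name in WEIGHTS)
    return raw / sum(WEIGHTS.values())


def gate(scores, critical_flag=False):
    """Return 'green', 'amber', or 'red' for one deliverable."""
    total = weighted_total(scores)
    if critical_flag or total < 0.7:
        return "red"
    heavy_ok = scores["task_fulfillment"] >= 4 and scores["factual_accuracy"] >= 4
    any_low = any(scores[name] <= 2 for name in WEIGHTS)
    if total >= 0.8 and heavy_ok and not any_low:
        return "green"
    # Assumption: anything that is not red but misses a green condition is amber.
    return "amber"


example = {
    "task_fulfillment": 5,
    "factual_accuracy": 4,
    "clarity_structure": 4,
    "style_brand_fit": 3,
    "citations": 5,
}
print(gate(example))
```

In practice the `scores` dictionary would come from parsing the judge's JSON response. That call is left out here because the episode does not specify a provider or schema.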
Kira: Okay but you've got an AI grading AI. How is that not the same problem one layer deeper?
Santi: Because the rubric constrains the evaluation. This isn't "hey, is this output good?" — it's "score this output on these five specific dimensions using this exact JSON schema." The ICLR twenty twenty-six AutoMetrics study tested this directly. When you anchor a judge to a structured rubric, correlation with human ratings improves by up to thirty-three percent. And they got there with fewer than a hundred human feedback points.
Kira: Fewer than a hundred.
Santi: Fewer than a hundred. The AAAI Think-J paper showed the same thing from a different angle — judges forced to reason step by step through a rubric are more robust, even when the training data is noisy. So the rubric is doing most of the heavy lifting. The model is the engine, but the rubric is the steering.
Kira: So layer one catches bad individual outputs. What catches the judge itself going wrong?
Santi: Layer two. You build a golden set: forty to sixty items per output type, scored by humans, with agreed-upon labels and rationales. Every week, you replay the golden set through your judge and measure agreement. Cohen's kappa if you're comparing labels, Kendall's tau if you're comparing rankings. Kappa above point six one is substantial agreement. You track it week over week, and when it drops, you pause auto-shipping and investigate.
Kira: The PLOS One longitudinal study did exactly this. Ten weeks, two hundred forty prompts, six domains, three model families. Weekly Bradley-Terry recalibration. They hit tau of point five nine to point six eight against blinded human raters — and they caught three different drift patterns. Some models stayed stable. Some improved. Some degraded.
Santi: And without the weekly replay, you'd never know which one you were on. You'd just be shipping and hoping.
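For the weekly golden-set replay, Cohen's kappa is small enough to compute by hand. The sketch below uses the standard formula (observed agreement corrected for chance agreement); the function name and the ten-item toy data are illustrative, not from the episode.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Cohen's kappa between two equal-length lists of categorical labels."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human_labels)
    # Fraction of items where the human and the judge gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy golden set: the judge disagrees with the human on one item out of ten.
human = ["green"] * 5 + ["red"] * 5
judge = ["green"] * 4 + ["red"] * 6
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}, substantial agreement: {kappa > 0.61}")
```

Logging this number every week is what produces the trend line that shows drift.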
Kira: Which is what most of us were doing six months ago.
Santi: Honestly? Yeah.
Kira: Okay but I want to push back here — because there's a twenty twenty-five paper showing that Bradley-Terry rankings can be sensitive to tiny data changes. Remove a few data points and the rankings flip. So how do you trust a system built on something brittle?
Santi: That's real. That's not a strawman. Pairwise rankings can be brittle. Two mitigations. First — randomize position. Run both orders, A-B and B-A, the way Chatbot Arena does it. Kills position bias, which is one of the biggest sources of noise. Second — pairwise isn't your primary gate. The rubric-scored pointwise evaluation is. Pairwise is your calibration tool for borderlines and for comparing prompt versions. Different jobs.
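Running both orders can be sketched as follows. The `judge` argument is a hypothetical callable that sees two outputs and answers "first" or "second"; treating an order-dependent answer as a tie is an assumption of this sketch, not a rule stated in the episode.

```python
def debiased_preference(judge, output_a, output_b):
    """Ask the judge in both orders; count a win only if both orders agree."""
    verdict_ab = judge(output_a, output_b)  # A shown first
    verdict_ba = judge(output_b, output_a)  # B shown first
    if verdict_ab == "first" and verdict_ba == "second":
        return "a"
    if verdict_ab == "second" and verdict_ba == "first":
        return "b"
    # The verdict changed with position, so treat it as unreliable.
    return "tie"


# Stand-in judges for illustration only; a real one would call an LLM.
def always_first(x, y):
    return "first"


def prefers_longer(x, y):
    return "first" if len(x) >= len(y) else "second"


print(debiased_preference(always_first, "draft one", "a longer draft two"))
print(debiased_preference(prefers_longer, "draft one", "a longer draft two"))
```

A judge with pure position bias, like `always_first`, ends up with a tie instead of a spurious win.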
Kira: Rubric is the workhorse. Pairwise is the tiebreaker. Golden set is the canary.
Santi: And this passes the Lisbon Test. Sixty golden items at two cents each — a dollar twenty a week. Fifteen minutes on café wifi.
Kira: A dollar twenty and a coffee.
Santi: That's your weekly QA infrastructure cost.
Kira: Layer one scores everything. Layer two validates the scorer. Layer three is the human layer — and this is the part I care about most because this is what makes it client-auditable.
Santi: Walk me through how you'd pitch this to a client.
Kira: Imagine you're on an async Loom with a client who's nervous about AI-generated deliverables. They want to know their stuff is being checked. You say: every output gets scored by a calibrated judge against a weighted rubric, five quality criteria plus a safety check. Top performers ship automatically. Borderlines get a human edit. And every week, we pull a five to ten percent sample and have a human reviewer score it independently against the same rubric. If the judge and the human disagree on more than ten percent of the green items, we pause and investigate. Here's the dashboard. It updates every Monday.
Santi: That's a sales asset.
Kira: It's a sales asset. It's a retention asset. And the Visa Run Revenue math is simple. Your judge costs twenty dollars a week for a thousand items. Your human sample is fifty to a hundred items, and if each one takes the ten minutes from that UW Health comparison, that's roughly eight to seventeen hours of review. Even at fifteen, twenty dollars an hour, your total QA cost is maybe three hundred fifty a week.
Santi: Versus fifty thousand for full human review.
Kira: If that three fifty keeps one client from churning, it's paid for itself for the quarter.
Santi: The sampling isn't random though, right?
Kira: No — and this is the important part. Half your sample is amber decisions — the borderlines the judge wasn't sure about. Thirty percent is high-risk greens — long outputs, safety-sensitive domains, new client styles the judge hasn't seen much of. Twenty percent is pure random greens, just to keep the judge honest. You're over-sampling exactly where failures hide. A random sample would waste most of your human budget confirming things the judge already got right.
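The fifty, thirty, twenty split Kira describes could be drawn like this. It is a minimal sketch: the function name, the three pool arguments, and the `seed` parameter are hypothetical, and how items get tagged as "high-risk green" is left to whatever metadata you already keep.

```python
import random


def draw_human_sample(ambers, risky_greens, other_greens, sample_size, seed=None):
    """Draw a weekly review sample: 50% amber, 30% high-risk green, 20% random green."""
    rng = random.Random(seed)
    quotas = [(ambers, 0.5), (risky_greens, 0.3), (other_greens, 0.2)]
    sample = []
    for pool, share in quotas:
        # If a pool is smaller than its quota, take everything it has.
        k = min(len(pool), round(sample_size * share))
        sample.extend(rng.sample(pool, k))
    return sample


ambers = [("amber", i) for i in range(120)]
risky_greens = [("risky_green", i) for i in range(200)]
other_greens = [("green", i) for i in range(680)]

weekly = draw_human_sample(ambers, risky_greens, other_greens, sample_size=50, seed=7)
print(len(weekly))
```

Fixing the seed makes a given week's sample reproducible, which helps if a client later asks how the reviewed items were chosen.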
Santi: And the escalation triggers?
Kira: Three levels. Green at the dashboard level means judge precision on greens is ninety-five percent or better, human disagreement under ten percent, no critical flags. Amber means one of those metrics slipped — raise the green cutline by point zero two, bump human sampling to fifteen percent for a week. Red means a critical safety event, or two-plus major misses in a fifty-item sample, or golden-set agreement crashing below kappa point five. Red means you stop shipping that output type until you've diagnosed it.
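Those dashboard triggers translate into a small decision function. This sketch uses the thresholds quoted above; the function and parameter names are hypothetical, and it assumes any critical flag counts as a critical safety event.

```python
def dashboard_state(green_precision, human_disagreement, critical_flags,
                    major_misses, kappa):
    """Return the weekly dashboard state from the metrics quoted in the episode.

    major_misses is counted within the fifty-item human sample.
    """
    if critical_flags > 0 or major_misses >= 2 or kappa < 0.5:
        return "red"    # stop shipping this output type and diagnose
    if green_precision < 0.95 or human_disagreement >= 0.10:
        return "amber"  # a health metric slipped, so apply the playbook
    return "green"


def apply_amber_playbook(green_cutline):
    """Raise the green cutline by 0.02 and bump human sampling to 15% for a week."""
    return round(green_cutline + 0.02, 2), 0.15


state = dashboard_state(green_precision=0.93, human_disagreement=0.06,
                        critical_flags=0, major_misses=1, kappa=0.66)
print(state)
if state == "amber":
    print(apply_amber_playbook(0.80))
```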
Santi: One more rule people skip — never let the same model family judge its own outputs on high-stakes work. If you're generating with GPT-4o, don't judge with GPT-4o. Use a different provider. Models have systematic blind spots to their own failure patterns. It's like proofreading your own essay — you read what you meant to write, not what you actually wrote.
Kira: The AI version of reading your own typos.
Santi: Exactly that.
Kira: All three layers feed into one view. Monday morning — wherever you are — five widgets. Volume and mix — how many items, what percentage green, amber, red. Judge health against the golden set with a four-week trend. Human QA metrics — precision, disagreement rate, sample size. Risk flags by type and resolution speed. And cost per eval.
Santi: Thirty minutes. Scan the state. Green, you're done. Amber, apply the playbook. Red, you already know because the escalation fired during the week.
Kira: And you log one improvement action with a due date. Every Monday. That's how the system gets better instead of just holding steady.
Santi: The whole thing — rubric, judge, golden set, human sample, dashboard — you can stand it up in a day. Not a week. Not a sprint. A focused day. And then it runs on thirty minutes a week plus whatever your human sampling costs.
Santi: So — two cents and sixteen seconds. That's where we started. And I think the mistake people make when they hear that number is they think the point is saving money. It's not. The point is that for two cents you can check every single thing you ship. Not a sample. Not the ones you're worried about. Everything. And that changes the conversation with your clients from "trust me" to "here's the dashboard."
Kira: That's the shift. You go from being a vendor who promises quality to a vendor who proves it — asynchronously, from wherever you happen to be, with artifacts a client can inspect on their own time. And if you want to skip building all of this from scratch, the QA Wall Kit is on the Resources page. It's the exact rubric template, the judge prompts for both rubric and pairwise modes, and the human sampling SOP with the red, amber, green thresholds we just walked through. Clone it and customize the weights for your output types.
Santi: One thing to do this week. Just one. Build your golden set. Pick forty items from your real output — a mix of good, borderline, and bad. Score them yourself. That's the foundation everything else sits on. You can add the judge and the dashboard later, but without the golden set, you're still guessing.
Kira: Schedule the Monday review. Put it on the calendar. Thirty minutes.
Santi: Thirty minutes and a coffee. We'll see you Wednesday.
Kira: See you Wednesday.