QA Wall Starter Pack (Template + Rubrics + Alerts)
Copy‑ready pack to stand up a three‑layer QA Wall: a 20‑item golden set, a strict LLM‑judge rubric, deterministic + PII heuristics, Slack alert payloads, a SQLite offline queue, and Python guardrails for per‑job and per‑client spend caps.
Copy this folder into your repo as /qa and wire it to your CI and job runner. It implements a three‑layer QA Wall: 1) golden‑set regression on every deploy, 2) per‑job heuristic + LLM‑judge checks, and 3) 10–20% human sampling with Slack alerts and spend caps. Replace [BRACKETS] with your details and run a dry test before enabling auto‑pause.
Quick start
- Put your 20‑item golden set in golden_set.yaml; version it per release.
- Point your judge to judge_rubric.md; require rationale + structured score output.
- Load heuristics.yml rules; start strict on PII and lenient on style for week one.
- Create a Slack Incoming Webhook and paste the URL into your secrets as [SLACK_WEBHOOK_URL].
- Create the SQLite tables from queue.sql (works offline; sync later if you use replication).
- Enforce per‑job and per‑client caps with guardrails.py; set alerts at 80/90/100% of [MONTHLY_SPEND_CAP_USD].
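The quick-start steps above can be smoke-tested in a few lines before wiring anything to CI. A minimal bootstrap sketch — the client name and cap value are examples, and the in-memory path stands in for `qa.sqlite3` in production:

```python
import sqlite3

# Create the clients table (a subset of queue.sql) and register one client
# with a monthly cap, then compute percent-of-cap as guardrails.py does.
conn = sqlite3.connect(":memory:")  # use "qa.sqlite3" in production
conn.executescript("""
CREATE TABLE IF NOT EXISTS clients (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  monthly_cap_usd REAL NOT NULL,
  mtd_spend_usd REAL NOT NULL DEFAULT 0,
  status TEXT NOT NULL DEFAULT 'active'
);
""")
conn.execute("INSERT INTO clients (name, monthly_cap_usd) VALUES (?, ?)", ("acme", 500.0))
cap, mtd = conn.execute(
    "SELECT monthly_cap_usd, mtd_spend_usd FROM clients WHERE name='acme'"
).fetchone()
print(f"{mtd / cap:.0%} of cap used")  # → 0% of cap used
conn.commit()
```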
Folder layout
qa/
  golden_set.yaml    # 20-item dataset for Layer 1 regression
  judge_rubric.md    # Rubric + scoring schema for LLM-as-judge
  heuristics.yml     # Deterministic + regex + safety rules
  slack_alerts.json  # Block Kit payload examples
  queue.sql          # SQLite tables + indexes for offline queue
  guardrails.py      # Spend caps, routing, and Slack alerts
  README.md          # Paste the 'Quick start' from this template
golden_set.yaml — 20‑item starter set
Use this as a working starting point. Keep ids stable; change only inputs and expectations as your product evolves. Reference outputs are exemplars, not the only valid answers — judges should use the rubric below to score new outputs against intent and sources.
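Because ids must stay stable across releases, it is worth a CI-time sanity check. A sketch that validates an already-parsed golden set (parsing the YAML itself would use PyYAML; the dict here keeps the example dependency-free):

```python
# Check that every item carries the required keys and that no id repeats.
REQUIRED_KEYS = {"id", "task", "input", "expected_properties", "reference_output", "checks"}

def validate_golden_set(data: dict) -> list[str]:
    errors = []
    seen = set()
    for item in data.get("items", []):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            errors.append(f"{item.get('id', '?')}: missing {sorted(missing)}")
        if item.get("id") in seen:
            errors.append(f"duplicate id {item['id']}")
        seen.add(item.get("id"))
    return errors

sample = {"items": [{"id": "G001", "task": "t", "input": "i",
                     "expected_properties": {}, "reference_output": "r", "checks": {}}]}
print(validate_golden_set(sample))  # → []
```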
# /qa/golden_set.yaml
version: [GOLDENSET_VERSION]
owner: [OWNER_EMAIL]
notes: |
  Scope: [SCOPE_NOTE]. Each item includes task, input, expected_properties, a short exemplar, and checks.
items:
  - id: G001
    task: Generate a 120–150 word LinkedIn post from this brief; include one statistic and a CTA.
    input: |
      Brief: Why async ops beat meetings for cross‑timezone teams.
      Source refs: [URL_1], [URL_2]
    expected_properties:
      brand_tone: pragmatic, B2B, no hype
      must_include: ["one statistic with source", "CTA to book audit"]
      must_avoid: ["ALL CAPS", "PII"]
    reference_output: "[SHORT_ON_BRAND_EXEMPLAR]"
    checks:
      heuristics:
        word_count_range: [120, 150]
        max_caps_words: 0
        max_exclamations: 1
        urls_must_resolve: true
      regex:
        email: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
        us_phone: '(?:\+1[ .-]?)?(?:\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4})'
        credit_card: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6011)(?:[ -]?\d{4}){3}|3[47]\d{13})\b'
    judge_rubric_profile: default
  - id: G002
    task: Summarize a customer interview to 4 bullets with one numeric outcome.
    input: Transcript excerpt (250–400 words)
    expected_properties:
      must_include: ["one metric"]
      must_avoid: ["PII"]
    reference_output: |
      - Cut onboarding time by 43% after…
      - …
    checks:
      heuristics: {bullet_count: 4, numbers_present: true}
      regex:
        credit_card: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6011)(?:[ -]?\d{4}){3}|3[47]\d{13})\b'
    judge_rubric_profile: default
  - id: G003
    task: Write a product update changelog entry (max 90 words) with one benefit line.
    input: Feature diff: [DIFF_SNIPPET]
    expected_properties: {must_include: ["benefit"], must_avoid: ["dates older than [MIN_DATE]"]}
    reference_output: "[EXAMPLE_CHANGELOG]"
    checks: {heuristics: {word_count_range: [60, 90]}}
    judge_rubric_profile: default
  - id: G004
    task: Create an FAQ answer in 2–3 sentences.
    input: Question: "How do per‑client spend caps work?" Context: [DOC_SNIPPET]
    expected_properties: {must_include: ["80/90/100% thresholds"], must_avoid: ["vendor‑specific promises"]}
    reference_output: "[EXAMPLE_FAQ]"
    checks: {heuristics: {sentence_count_range: [2, 3]}}
    judge_rubric_profile: default
  - id: G005
    task: Generate an SEO meta description (≤155 chars) with the target keyword.
    input: Page brief: [PAGE_BRIEF] • keyword: [PRIMARY_KEYWORD]
    expected_properties: {must_include: ["[PRIMARY_KEYWORD]"], must_avoid: ["clickbait"]}
    reference_output: "[META_155]"
    checks: {heuristics: {char_count_max: 155, includes_keyword: true}}
    judge_rubric_profile: default
  - id: G006
    task: Redact PII from a policy snippet while preserving meaning.
    input: Policy text: [TEXT]
    expected_properties: {must_avoid: ["emails", "US phones", "credit cards"]}
    reference_output: "[REDACTED_EXAMPLE]"
    checks:
      regex:
        email: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
        us_phone: '(?:\+1[ .-]?)?(?:\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4})'
        credit_card: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6011)(?:[ -]?\d{4}){3}|3[47]\d{13})\b'
      model_pii_check: {provider: aws_comprehend, min_confidence: 0.9, region: [PII_REGION]}
    judge_rubric_profile: safety_first
  - id: G007
    task: Extract 3 quotable lines with citations.
    input: Article: [URL]
    expected_properties: {must_include: ["source cite"], must_avoid: ["fabricated quotes"]}
    reference_output: |
      - "…" — [AUTHOR], [YEAR] ([URL])
      - …
    checks: {heuristics: {quote_count: 3, has_citations: true}}
    judge_rubric_profile: default
  - id: G008
    task: Generate a JSON object matching the provided schema.
    input: Schema: {"type":"object","properties":{"title":{"type":"string"}}}
    expected_properties: {must_include: ["valid JSON"], must_avoid: ["extra keys"]}
    reference_output: '{"title":"[SAMPLE]"}'
    checks: {heuristics: {json_valid: true, schema_validate: true}}
    judge_rubric_profile: default
  - id: G009
    task: Headline + 3 bullets for a landing section.
    input: Value prop: [VALUE]
    expected_properties: {must_include: ["headline", "3 bullets"], must_avoid: ["superlatives"]}
    reference_output: "[H1] — [B1]; [B2]; [B3]"
    checks: {heuristics: {bullet_count: 3, headline_present: true}}
    judge_rubric_profile: default
  - id: G010
    task: Draft a transactional email (75–120 words) confirming receipt.
    input: Trigger: [EVENT] • Fields: [FIELDS]
    expected_properties: {must_include: ["next step"], must_avoid: ["upsell"]}
    reference_output: "[EMAIL_EXAMPLE]"
    checks: {heuristics: {word_count_range: [75, 120]}}
    judge_rubric_profile: default
  - id: G011
    task: Create a bug‑report summary in 2 bullets + severity tag.
    input: Logs: [SNIPPET]
    expected_properties: {must_include: ["severity"], must_avoid: ["guessing root cause"]}
    reference_output: "[SEV:P1] …"
    checks: {heuristics: {bullet_count: 2, severity_tag_present: true}}
    judge_rubric_profile: default
  - id: G012
    task: Rewrite paragraph in formal tone without changing facts.
    input: Paragraph: [TEXT]
    expected_properties: {must_include: ["formal"], must_avoid: ["new claims"]}
    reference_output: "[FORMAL_VARIANT]"
    checks: {heuristics: {tone_target: formal}}
    judge_rubric_profile: default
  - id: G013
    task: Rewrite paragraph in casual tone without changing facts.
    input: Paragraph: [TEXT]
    expected_properties: {must_include: ["casual"], must_avoid: ["slang"]}
    reference_output: "[CASUAL_VARIANT]"
    checks: {heuristics: {tone_target: casual}}
    judge_rubric_profile: default
  - id: G014
    task: Draft a YouTube description (120–180 words) with two timestamps.
    input: Video outline: [OUTLINE]
    expected_properties: {must_include: ["two timestamps"], must_avoid: ["hashtags"]}
    reference_output: "[DESCRIPTION]"
    checks: {heuristics: {word_count_range: [120, 180], timestamp_count: 2}}
    judge_rubric_profile: default
  - id: G015
    task: Create a support macro answer with one link to docs domain.
    input: Issue: [ISSUE] • docs domain: [ALLOWED_DOCS_DOMAIN]
    expected_properties: {must_include: ["link to docs"], must_avoid: ["external domains"]}
    reference_output: "[MACRO]"
    checks: {heuristics: {allowed_domains: [[ALLOWED_DOCS_DOMAIN]], url_count_max: 1}}
    judge_rubric_profile: default
  - id: G016
    task: Generate a 1‑sentence alt text (≤125 chars) for an image.
    input: Image description: [DESC]
    expected_properties: {must_include: ["what+why"], must_avoid: ["filename"]}
    reference_output: "[ALT_TEXT]"
    checks: {heuristics: {char_count_max: 125}}
    judge_rubric_profile: default
  - id: G017
    task: Compose a release note title (≤60 chars) and body (≤80 words).
    input: Feature: [FEATURE]
    expected_properties: {must_include: ["title", "body"], must_avoid: ["emoji"]}
    reference_output: "[TITLE] — [BODY]"
    checks: {heuristics: {title_char_max: 60, word_count_range: [50, 80]}}
    judge_rubric_profile: default
  - id: G018
    task: Extract a single‑sentence pull quote (≤20 words).
    input: Transcript: [TEXT]
    expected_properties: {must_include: ["≤20 words"], must_avoid: ["attribution inside quote"]}
    reference_output: "[QUOTE]"
    checks: {heuristics: {word_count_max: 20}}
    judge_rubric_profile: default
  - id: G019
    task: Build a comparison table (3 rows) in Markdown.
    input: Competitors: [LIST]
    expected_properties: {must_include: ["3 rows"], must_avoid: ["subjective adjectives"]}
    reference_output: |
      | Option | Price | Notable |
      |---|---:|---|
      | … | … | … |
    checks: {heuristics: {table_rows: 3}}
    judge_rubric_profile: default
  - id: G020
    task: Draft a CTA line (≤12 words) with one verb.
    input: Offer: [OFFER]
    expected_properties: {must_include: ["one verb"], must_avoid: ["two verbs", "exclamation marks"]}
    reference_output: "[CTA]"
    checks: {heuristics: {word_count_max: 12, exclamations_max: 0, verb_count: 1}}
    judge_rubric_profile: default
judge_rubric.md — LLM‑as‑judge rubric + schema
Use this as the single source of truth for judge prompts, scoring, and output schema. Keep it short and mechanical so different models can follow it.
# /qa/judge_rubric.md
name: default
scoring_dimensions:
  - name: Faithfulness
    scale: 1–5
    rule: All claims are supported by the provided input/sources. No inventions.
  - name: Relevance
    scale: 1–5
    rule: Directly answers the task and audience; no off‑topic filler.
  - name: Style/Tone
    scale: 1–5
    rule: Matches brand tone [BRAND_TONE] and avoids hype/clickbait.
optional_dimensions:
  - name: Harm/Policy
    scale: 1–5
    rule: Flags unsafe or policy‑violating content if applicable.
thresholds:
  pass_avg: ">= 4.0"
  min_per_dimension: ">= 3.5"
  required_dimension: Faithfulness ">= 4.0"
  auto_fail_if: ["PII_detected == true", "harm_policy_score < 4"]
  # Dual‑judge agreement: require |avg_primary - avg_secondary| <= 0.5 else route to human.
  agreement_tolerance: 0.5

## Judge output schema (strict JSON)

{
  "scores": {"Faithfulness": 1-5, "Relevance": 1-5, "Style/Tone": 1-5, "Harm/Policy": 1-5?},
  "avg_score": 1-5,
  "pass": true|false,
  "rationale": "2–4 sentences citing specific lines",
  "findings": {"claims_checked": ["…"], "violations": ["…"]}
}

## Prompt preamble (insert task/input after this)

You are a strict QA evaluator. Score the candidate against the rubric. Cite specific phrases when deducting points. Return ONLY the JSON matching the schema. If information is missing, lower Faithfulness rather than guessing.

## Model config

primary_model: [LLM_MODEL_FOR_JUDGE]
secondary_model: [SECONDARY_JUDGE_MODEL]
max_tokens: [JUDGE_MAX_OUTPUT_TOKENS]
temperature: 0
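The thresholds and agreement rule above are mechanical enough to enforce in code. A sketch — dimension names mirror the rubric, but wiring in your parsed judge JSON is up to you:

```python
# Gate a judge result against the rubric thresholds: average >= 4.0,
# every dimension >= 3.5, Faithfulness >= 4.0, plus the auto-fail rules.
def passes_rubric(scores: dict[str, float], pii_detected: bool = False) -> bool:
    if pii_detected:
        return False
    if scores.get("Harm/Policy", 5) < 4:  # optional dimension auto-fail
        return False
    avg = sum(scores.values()) / len(scores)
    return (avg >= 4.0
            and all(v >= 3.5 for v in scores.values())
            and scores.get("Faithfulness", 0) >= 4.0)

# Dual-judge agreement: averages must sit within the tolerance, else escalate.
def judges_agree(avg_primary: float, avg_secondary: float, tol: float = 0.5) -> bool:
    return abs(avg_primary - avg_secondary) <= tol

print(passes_rubric({"Faithfulness": 4.5, "Relevance": 4.0, "Style/Tone": 3.5}))  # → True
print(judges_agree(4.2, 4.9))  # → False (route to human)
```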
heuristics.yml — deterministic + PII checks
Deterministic rules run before any judge call. Keep them fast and cheap. Regex patterns for PII come from vetted sources; add domain‑specific rules as you learn.
# /qa/heuristics.yml
pii_regex:
  email: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
  us_phone: '(?:\+1[ .-]?)?(?:\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4})'
  credit_card: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6011)(?:[ -]?\d{4}){3}|3[47]\d{13})\b'
model_pii:
  provider: aws_comprehend
  region: [PII_REGION]
  min_confidence: 0.9
format_rules:
  json_valid: true|false        # set per task via golden_set.yaml
  schema_validate: true|false   # provide JSON Schema when true
  word_count_range: [min, max]
  char_count_max: [MAX_CHARS]
  bullet_count: [N]
  headline_present: true|false
  table_rows: [N]
  timestamp_count: [N]
style_rules:
  max_caps_words: 0
  max_exclamations: 0
  allowed_domains: [[ALLOWED_DOMAIN_1], [ALLOWED_DOMAIN_2]]
  url_count_max: 2
policy_rules:
  disallow_hate: true
  disallow_unsafe_instructions: true
  block_list_terms: [[TERM_1], [TERM_2]]
routing:
  on_heuristics_fail: "route_to_human|retry|fix_and_rejudge"
  on_pii_detected: "auto_fail_and_route"
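A minimal runner for the deterministic layer can be a handful of counters plus the PII regexes. A sketch using the email pattern from above (phone and card checks would be added the same way; the thresholds are the defaults from style_rules):

```python
import re

# Style counters + one PII regex from heuristics.yml; runs before any judge call.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def run_style_and_pii(text: str, max_caps_words: int = 0, max_exclamations: int = 1) -> dict:
    # ALL-CAPS words (two letters or more), exclamation count, email PII hit.
    caps_words = [w for w in text.split() if len(w) > 1 and w.isalpha() and w.isupper()]
    return {
        "caps_ok": len(caps_words) <= max_caps_words,
        "exclaim_ok": text.count("!") <= max_exclamations,
        "pii_found": bool(EMAIL_RE.search(text)),
    }

print(run_style_and_pii("Book your audit today."))
# → {'caps_ok': True, 'exclaim_ok': True, 'pii_found': False}
```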
slack_alerts.json — Block Kit payloads
Three ready‑to‑paste Block Kit payloads: per‑job fail, budget threshold, and queue paused. Post via an Incoming Webhook stored as [SLACK_WEBHOOK_URL].
// /qa/slack_alerts.json
{
  "job_fail": {
    "text": "QA Wall: FAILED",
    "blocks": [
      {"type": "header", "text": {"type": "plain_text", "text": "🚨 QA Wall: FAILED"}},
      {"type": "section", "text": {"type": "mrkdwn", "text": "Client: *[CLIENT_NAME]* • Job: *#[JOB_ID]* • Layer: *[LAYER]*\nReason: *[REASON]* • Tokens: [TOKENS] • Cost: $[COST_USD]"}},
      {"type": "actions", "elements": [
        {"type": "button", "text": {"type": "plain_text", "text": "Open trace"}, "url": "[TRACE_URL]"},
        {"type": "button", "text": {"type": "plain_text", "text": "Route to human"}, "style": "primary", "url": "[REVIEW_QUEUE_URL]"}
      ]}
    ]
  },
  "budget_threshold": {
    "text": "Budget threshold hit",
    "blocks": [
      {"type": "header", "text": {"type": "plain_text", "text": "⚠️ Spend cap warning"}},
      {"type": "section", "text": {"type": "mrkdwn", "text": "Client: *[CLIENT_NAME]* hit *[THRESHOLD]%* of monthly cap ($[CAP_USD]).\nMonth‑to‑date: $[MTD_USD] • Jobs: [COUNT]"}},
      {"type": "context", "elements": [{"type": "mrkdwn", "text": "Auto‑throttle: [THROTTLE_MODE]. Override in [DASHBOARD_URL]."}]}
    ]
  },
  "queue_paused": {
    "text": "Queue paused",
    "blocks": [
      {"type": "header", "text": {"type": "plain_text", "text": "⏸️ QA Queue Paused"}},
      {"type": "section", "text": {"type": "mrkdwn", "text": "Paused *[CLIENT_NAME]* at *[PAUSE_PERCENT]%* of cap. Reason: *[REASON]*."}},
      {"type": "actions", "elements": [
        {"type": "button", "text": {"type": "plain_text", "text": "Resume 1 hour"}, "value": "resume_1h"},
        {"type": "button", "text": {"type": "plain_text", "text": "Resume 24 hours"}, "value": "resume_24h"}
      ]}
    ]
  }
}
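Posting these payloads needs nothing beyond the standard library if you prefer to avoid `requests` (which guardrails.py below uses). A stdlib sketch; the webhook URL shown is a placeholder, not a real endpoint:

```python
import json
import urllib.request

# Build the webhook request separately from sending it, so the JSON body and
# headers can be checked without touching the network.
def build_slack_request(payload: dict, webhook_url: str) -> urllib.request.Request:
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def post_slack(payload: dict, webhook_url: str) -> int:
    with urllib.request.urlopen(build_slack_request(payload, webhook_url), timeout=5) as resp:
        return resp.status  # Slack returns 200 on success
```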
queue.sql — SQLite tables + indexes
Run this once to create tables and indexes. Safe for edge/offline operation. Use a write‑ahead log (WAL) and sync when online.
-- /qa/queue.sql
PRAGMA journal_mode=WAL;

CREATE TABLE IF NOT EXISTS clients (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  monthly_cap_usd REAL NOT NULL,
  mtd_spend_usd REAL NOT NULL DEFAULT 0,
  status TEXT NOT NULL DEFAULT 'active', -- active|throttled|paused
  updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS jobs (
  id INTEGER PRIMARY KEY,
  client_id INTEGER NOT NULL REFERENCES clients(id),
  payload TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'ready', -- ready|running|done|failed|human_review
  attempts INTEGER NOT NULL DEFAULT 0,
  est_tokens INTEGER DEFAULT 0,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS qa_events (
  id INTEGER PRIMARY KEY,
  job_id INTEGER NOT NULL REFERENCES jobs(id),
  layer TEXT NOT NULL, -- L1|L2|L3
  result TEXT NOT NULL, -- pass|fail|escalated
  details TEXT, -- JSON blob: scores, violations
  tokens INTEGER DEFAULT 0,
  cost_usd REAL DEFAULT 0,
  ts DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Helpful indexes
CREATE INDEX IF NOT EXISTS idx_jobs_status_created ON jobs(status, created_at);
CREATE INDEX IF NOT EXISTS idx_events_job ON qa_events(job_id);
CREATE INDEX IF NOT EXISTS idx_clients_status ON clients(status);
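A quick usage sketch against this schema: enqueue one job, then claim the oldest ready job with the same query `run_once` in guardrails.py uses (the inline DDL repeats the jobs table so the example is self-contained; the payload is a made-up sample):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use "qa.sqlite3" in production
conn.executescript("""
CREATE TABLE IF NOT EXISTS jobs (
  id INTEGER PRIMARY KEY,
  client_id INTEGER NOT NULL,
  payload TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'ready',
  attempts INTEGER NOT NULL DEFAULT 0,
  est_tokens INTEGER DEFAULT 0,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
""")
# Enqueue a job, then claim the oldest ready one and mark it running.
conn.execute("INSERT INTO jobs (client_id, payload, est_tokens) VALUES (?, ?, ?)",
             (1, '{"task": "G001"}', 800))
job = conn.execute(
    "SELECT id, status FROM jobs WHERE status='ready' ORDER BY created_at LIMIT 1"
).fetchone()
conn.execute("UPDATE jobs SET status='running', attempts = attempts + 1 WHERE id=?", (job[0],))
conn.commit()
print(job)  # → (1, 'ready')
```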
guardrails.py — caps, routing, Slack, loop sketch
Wire these helpers into your worker. Keep judge temperatures low. Always check caps before dispatch and again when accounting for usage. Replace the provider stubs with your SDK calls.
# /qa/guardrails.py
import json, os, sqlite3
from typing import Dict, Any

SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL", "[SLACK_WEBHOOK_URL]")
MAX_OUTPUT_TOKENS = int(os.getenv("MAX_OUTPUT_TOKENS", "[MAX_OUTPUT_TOKENS]"))
RETRY_LIMIT = int(os.getenv("RETRY_LIMIT", "2"))
DB_PATH = os.getenv("QA_DB_PATH", "qa.sqlite3")

# --- Spend & caps ---
def est_cost_usd(tokens: int, model: str) -> float:
    # Rough estimator; replace with provider pricing table
    return tokens * float(os.getenv("USD_PER_TOKEN", "0.000002"))

def client_state(conn, client_id: int) -> Dict[str, Any]:
    row = conn.execute("SELECT monthly_cap_usd, mtd_spend_usd, status FROM clients WHERE id=?",
                       (client_id,)).fetchone()
    if not row:
        raise ValueError("Client not found")
    return {"cap": row[0], "mtd": row[1], "status": row[2]}

def should_pause(conn, client_id: int) -> bool:
    s = client_state(conn, client_id)
    return s["mtd"] >= s["cap"]

def throttle_level(conn, client_id: int) -> str:
    s = client_state(conn, client_id)
    pct = (s["mtd"] / s["cap"]) * 100
    if pct >= 100:
        return "paused"
    if pct >= 90:
        return "throttled_high"
    if pct >= 80:
        return "throttled"
    return "normal"
# --- Slack ---
def post_slack(payload: Dict[str, Any]):
    import requests
    requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)

def alert_budget(conn, client_id: int, client_name: str):
    s = client_state(conn, client_id)
    pct = int((s["mtd"] / s["cap"]) * 100)
    if pct < 80:
        return
    with open(os.path.join(os.path.dirname(__file__), "slack_alerts.json")) as f:
        tmpl = json.load(f)["budget_threshold"]
    tmpl["blocks"][1]["text"]["text"] = (
        f"Client: *{client_name}* hit *{pct}%* of monthly cap (${s['cap']:.0f}).\n"
        f"Month‑to‑date: ${s['mtd']:.2f}"
    )
    post_slack(tmpl)

# --- Routing ---
def enforce_per_job_cap(estimated_tokens: int) -> bool:
    return estimated_tokens <= MAX_OUTPUT_TOKENS

def route_to_human(conn, job_id: int, reason: str):
    conn.execute("UPDATE jobs SET status='human_review' WHERE id=?", (job_id,))
    with open(os.path.join(os.path.dirname(__file__), "slack_alerts.json")) as f:
        tmpl = json.load(f)["job_fail"]
    tmpl["blocks"][1]["text"]["text"] = (
        f"Client: *[CLIENT_NAME]* • Job: *#{job_id}* • Layer: *auto*\nReason: *{reason}*"
    )
    post_slack(tmpl)
# --- Core loop sketch (L1→L2→L3) ---
def run_once():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    job = conn.execute("SELECT * FROM jobs WHERE status='ready' ORDER BY created_at LIMIT 1").fetchone()
    if not job:
        return
    # Check caps before starting
    if should_pause(conn, job["client_id"]):
        conn.execute("UPDATE jobs SET status='failed' WHERE id=?", (job["id"],))
        conn.commit()
        return
    # Estimate tokens; short-circuit if too large
    est_tokens = job["est_tokens"] or 0
    if not enforce_per_job_cap(est_tokens):
        route_to_human(conn, job["id"], reason=f"est_tokens>{MAX_OUTPUT_TOKENS}")
        conn.commit()
        return
    # L1: golden-set regression (run in CI on deploy; here we assume passed)
    # L2: heuristics + LLM judges
    if not run_heuristics(job):
        route_to_human(conn, job["id"], reason="heuristics_fail")
        conn.commit()
        return
    scores = dual_judge(job)
    if not scores["pass"]:
        route_to_human(conn, job["id"], reason="judge_fail")
        conn.commit()
        return
    # Record usage and cost
    tokens_used = scores.get("tokens", est_tokens)
    cost = est_cost_usd(tokens_used, model="[GEN_MODEL]")
    conn.execute("INSERT INTO qa_events(job_id, layer, result, tokens, cost_usd) VALUES (?,?,?,?,?)",
                 (job["id"], "L2", "pass", tokens_used, cost))
    conn.execute("UPDATE clients SET mtd_spend_usd = mtd_spend_usd + ? WHERE id=?",
                 (cost, job["client_id"]))
    # L3: human sampling (p = [SAMPLE_RATE])
    import random
    if random.random() < float(os.getenv("SAMPLE_RATE", "0.1")):
        conn.execute("UPDATE jobs SET status='human_review' WHERE id=?", (job["id"],))
    else:
        conn.execute("UPDATE jobs SET status='done' WHERE id=?", (job["id"],))
    alert_budget(conn, job["client_id"], client_name="[CLIENT_NAME]")
    conn.commit()
# --- Stubs to implement ---
def run_heuristics(job_row) -> bool:
    # Load heuristics.yml and validate candidate output accordingly
    return True

def dual_judge(job_row) -> Dict[str, Any]:
    # Call primary and secondary judge models with rubric; enforce agreement
    return {"pass": True, "avg_score": 4.3, "tokens": job_row["est_tokens"]}
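The rough estimator above notes that it should be replaced with a provider pricing table. A sketch of that swap — the per-1K-token rates below are placeholders, not real vendor prices:

```python
# Per-model pricing table; keys match the model placeholders used in this pack.
PRICING_USD_PER_1K = {
    "[GEN_MODEL]": 0.002,
    "[LLM_MODEL_FOR_JUDGE]": 0.001,
}

def est_cost_usd(tokens: int, model: str) -> float:
    # Fall back to a conservative default rate for unknown models.
    rate = PRICING_USD_PER_1K.get(model, 0.002)
    return tokens / 1000 * rate

print(round(est_cost_usd(1500, "[GEN_MODEL]"), 4))  # → 0.003
```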