QA Wall Starter Pack (Template + Rubrics + Alerts)

Copy‑ready pack to stand up a three‑layer QA Wall: a 20‑item golden set, a strict LLM‑judge rubric, deterministic + PII heuristics, Slack alert payloads, a SQLite offline queue, and Python guardrails for per‑job and per‑client spend caps.

Copy this folder into your repo as /qa and wire it to your CI and job runner. It implements a three‑layer QA Wall: 1) golden‑set regression on every deploy, 2) per‑job heuristic + LLM‑judge checks, and 3) 10–20% human sampling with Slack alerts and spend caps. Replace [BRACKETS] with your details and run a dry test before enabling auto‑pause.

Quick start

  • Put your 20‑item golden set in golden_set.yaml; version it per release.
  • Point your judge to judge_rubric.md; require rationale + structured score output.
  • Load heuristics.yml rules; start strict on PII and lenient on style for week one.
  • Create a Slack Incoming Webhook and paste the URL into your secrets as [SLACK_WEBHOOK_URL].
  • Create the SQLite tables from queue.sql (works offline; sync later if you use replication).
  • Enforce per‑job and per‑client caps with guardrails.py; set alerts at 80/90/100% of [MONTHLY_SPEND_CAP_USD].
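
One way to wire Layer 1 into CI is a gate function that fails the build whenever any golden-set item fails. A minimal sketch; the loader that produces the `results` dict from golden_set.yaml is yours to supply, and `gate` is an illustrative name:

```python
def gate(results):
    """CI gate for Layer 1: a nonzero exit code blocks the deploy."""
    failed = sorted(item_id for item_id, ok in results.items() if not ok)
    if failed:
        print(f"QA Wall L1 FAILED: {failed}")
        return 1
    print(f"QA Wall L1 passed ({len(results)} items)")
    return 0

# results would come from running each golden_set.yaml item through your checks.
exit_code = gate({"G001": True, "G002": False})
print(exit_code)  # 1 -> CI fails the build
```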

Folder layout

qa/
  golden_set.yaml           # 20-item dataset for Layer 1 regression
  judge_rubric.md           # Rubric + scoring schema for LLM-as-judge
  heuristics.yml            # Deterministic + regex + safety rules
  slack_alerts.json         # Block Kit payload examples
  queue.sql                 # SQLite tables + indexes for offline queue
  guardrails.py             # Spend caps, routing, and Slack alerts
  README.md                 # Paste the 'Quick start' from this template

golden_set.yaml — 20‑item starter set

Use this as a working starting point. Keep ids stable; change only inputs and expectations as your product evolves. Reference outputs are exemplars, not the only valid answers; judges should use the rubric below to score new outputs against intent and sources.

# /qa/golden_set.yaml
version: [GOLDENSET_VERSION]
owner: [OWNER_EMAIL]
notes: |
  Scope: [SCOPE_NOTE]. Each item includes task, input, expected_properties, a short exemplar, and checks.
items:
  - id: G001
    task: Generate a 120–150 word LinkedIn post from this brief; include one statistic and a CTA.
    input: |
      Brief: Why async ops beat meetings for cross‑timezone teams.
      Source refs: [URL_1], [URL_2]
    expected_properties:
      brand_tone: pragmatic, B2B, no hype
      must_include: ["one statistic with source", "CTA to book audit"]
      must_avoid: ["ALL CAPS", "PII"]
    reference_output: "[SHORT_ON_BRAND_EXEMPLAR]"
    checks:
      heuristics:
        word_count_range: [120, 150]
        max_caps_words: 0
        max_exclamations: 1
        urls_must_resolve: true
      regex:
        email: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
        us_phone: '(?:\+1[ .-]?)?(?:\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4})'
        credit_card: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6011|7\d{3})(?:-?\d{4}){3}|3[47]\d{13})\b'
      judge_rubric_profile: default

  - id: G002
    task: Summarize a customer interview to 4 bullets with one numeric outcome.
    input: Transcript excerpt (250–400 words)
    expected_properties:
      must_include: ["one metric"]
      must_avoid: ["PII"]
    reference_output: |
      - Cut onboarding time by 43% after…
      - …
    checks:
      heuristics: {bullet_count: 4, numbers_present: true}
      regex:
        credit_card: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6011|7\d{3})(?:-?\d{4}){3}|3[47]\d{13})\b'
      judge_rubric_profile: default

  - id: G003
    task: Write a product update changelog entry (max 90 words) with one benefit line.
    input: Feature diff: [DIFF_SNIPPET]
    expected_properties: {must_include: ["benefit"], must_avoid: ["dates older than [MIN_DATE]"]}
    reference_output: "[EXAMPLE_CHANGELOG]"
    checks: {heuristics: {word_count_range: [60, 90]}, judge_rubric_profile: default}

  - id: G004
    task: Create an FAQ answer in 2–3 sentences.
    input: Question: "How do per‑client spend caps work?" Context: [DOC_SNIPPET]
    expected_properties: {must_include: ["80/90/100% thresholds"], must_avoid: ["vendor‑specific promises"]}
    reference_output: "[EXAMPLE_FAQ]"
    checks: {heuristics: {sentence_count_range: [2, 3]}, judge_rubric_profile: default}

  - id: G005
    task: Generate an SEO meta description (≤155 chars) with the target keyword.
    input: Page brief: [PAGE_BRIEF] • keyword: [PRIMARY_KEYWORD]
    expected_properties: {must_include: ["[PRIMARY_KEYWORD]"], must_avoid: ["clickbait"]}
    reference_output: "[META_155]"
    checks: {heuristics: {char_count_max: 155, includes_keyword: true}, judge_rubric_profile: default}

  - id: G006
    task: Redact PII from a policy snippet while preserving meaning.
    input: Policy text: [TEXT]
    expected_properties: {must_avoid: ["emails", "US phones", "credit cards"]}
    reference_output: "[REDACTED_EXAMPLE]"
    checks:
      regex:
        email: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
        us_phone: '(?:\+1[ .-]?)?(?:\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4})'
        credit_card: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6011|7\d{3})(?:-?\d{4}){3}|3[47]\d{13})\b'
      model_pii_check: {provider: aws_comprehend, min_confidence: 0.9, region: [PII_REGION]}
      judge_rubric_profile: safety_first

  - id: G007
    task: Extract 3 quotable lines with citations.
    input: Article: [URL]
    expected_properties: {must_include: ["source cite"], must_avoid: ["fabricated quotes"]}
    reference_output: |
      - "…" — [AUTHOR], [YEAR] ([URL])
      - …
    checks: {heuristics: {quote_count: 3, has_citations: true}, judge_rubric_profile: default}

  - id: G008
    task: Generate a JSON object matching the provided schema.
    input: Schema: {"type":"object","properties":{"title":{"type":"string"}}}
    expected_properties: {must_include: ["valid JSON"], must_avoid: ["extra keys"]}
    reference_output: '{"title":"[SAMPLE]"}'
    checks: {heuristics: {json_valid: true, schema_validate: true}, judge_rubric_profile: default}

  - id: G009
    task: Headline + 3 bullets for a landing section.
    input: Value prop: [VALUE]
    expected_properties: {must_include: ["headline","3 bullets"], must_avoid: ["superlatives"]}
    reference_output: "[H1] — [B1]; [B2]; [B3]"
    checks: {heuristics: {bullet_count: 3, headline_present: true}, judge_rubric_profile: default}

  - id: G010
    task: Draft a transactional email (75–120 words) confirming receipt.
    input: Trigger: [EVENT] • Fields: [FIELDS]
    expected_properties: {must_include: ["next step"], must_avoid: ["upsell"]}
    reference_output: "[EMAIL_EXAMPLE]"
    checks: {heuristics: {word_count_range: [75, 120]}, judge_rubric_profile: default}

  - id: G011
    task: Create a bug‑report summary in 2 bullets + severity tag.
    input: Logs: [SNIPPET]
    expected_properties: {must_include: ["severity"], must_avoid: ["guessing root cause"]}
    reference_output: "[SEV:P1] …"
    checks: {heuristics: {bullet_count: 2, severity_tag_present: true}, judge_rubric_profile: default}

  - id: G012
    task: Rewrite paragraph in formal tone without changing facts.
    input: Paragraph: [TEXT]
    expected_properties: {must_include: ["formal"], must_avoid: ["new claims"]}
    reference_output: "[FORMAL_VARIANT]"
    checks: {heuristics: {tone_target: formal}, judge_rubric_profile: default}

  - id: G013
    task: Rewrite paragraph in casual tone without changing facts.
    input: Paragraph: [TEXT]
    expected_properties: {must_include: ["casual"], must_avoid: ["slang"]}
    reference_output: "[CASUAL_VARIANT]"
    checks: {heuristics: {tone_target: casual}, judge_rubric_profile: default}

  - id: G014
    task: Draft a YouTube description (120–180 words) with two timestamps.
    input: Video outline: [OUTLINE]
    expected_properties: {must_include: ["two timestamps"], must_avoid: ["hashtags"]}
    reference_output: "[DESCRIPTION]"
    checks: {heuristics: {word_count_range: [120, 180], timestamp_count: 2}, judge_rubric_profile: default}

  - id: G015
    task: Create a support macro answer with one link to docs domain.
    input: Issue: [ISSUE] • docs domain: [ALLOWED_DOCS_DOMAIN]
    expected_properties: {must_include: ["link to docs"], must_avoid: ["external domains"]}
    reference_output: "[MACRO]"
    checks: {heuristics: {allowed_domains: [[ALLOWED_DOCS_DOMAIN]], url_count_max: 1}, judge_rubric_profile: default}

  - id: G016
    task: Generate a 1‑sentence alt text (≤125 chars) for an image.
    input: Image description: [DESC]
    expected_properties: {must_include: ["what+why"], must_avoid: ["filename"]}
    reference_output: "[ALT_TEXT]"
    checks: {heuristics: {char_count_max: 125}, judge_rubric_profile: default}

  - id: G017
    task: Compose a release note title (≤60 chars) and body (≤80 words).
    input: Feature: [FEATURE]
    expected_properties: {must_include: ["title","body"], must_avoid: ["emoji"]}
    reference_output: "[TITLE] — [BODY]"
    checks: {heuristics: {title_char_max: 60, word_count_range: [50, 80]}, judge_rubric_profile: default}

  - id: G018
    task: Extract a single‑sentence pull quote (≤20 words).
    input: Transcript: [TEXT]
    expected_properties: {must_include: ["≤20 words"], must_avoid: ["attribution inside quote"]}
    reference_output: "[QUOTE]"
    checks: {heuristics: {word_count_max: 20}, judge_rubric_profile: default}

  - id: G019
    task: Build a comparison table (3 rows) in Markdown.
    input: Competitors: [LIST]
    expected_properties: {must_include: ["3 rows"], must_avoid: ["subjective adjectives"]}
    reference_output: |
      | Option | Price | Notable |
      |---|---:|---|
      | … | … | … |
    checks: {heuristics: {table_rows: 3}, judge_rubric_profile: default}

  - id: G020
    task: Draft a CTA line (≤12 words) with one verb.
    input: Offer: [OFFER]
    expected_properties: {must_include: ["one verb"], must_avoid: ["two verbs", "exclamation marks"]}
    reference_output: "[CTA]"
    checks: {heuristics: {word_count_max: 12, max_exclamations: 0, verb_count: 1}, judge_rubric_profile: default}
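
A Layer-1 runner only needs to walk each item's `checks` block and compare it against a candidate output. A minimal sketch of the heuristic side, assuming items are already parsed into dicts (e.g. with pyyaml); `check_item` is illustrative and covers only `word_count_range`, `max_caps_words`, and the regex rules:

```python
import re

def check_item(item, candidate):
    """Return heuristic violations for one golden-set item; an empty list means pass."""
    violations = []
    heur = item.get("checks", {}).get("heuristics", {})
    words = candidate.split()
    if "word_count_range" in heur:
        lo, hi = heur["word_count_range"]
        if not lo <= len(words) <= hi:
            violations.append(f"word_count {len(words)} outside [{lo}, {hi}]")
    if "max_caps_words" in heur:
        caps = [w for w in words if len(w) > 1 and w.isupper()]
        if len(caps) > heur["max_caps_words"]:
            violations.append(f"caps_words {len(caps)} > {heur['max_caps_words']}")
    for name, pattern in item.get("checks", {}).get("regex", {}).items():
        if re.search(pattern, candidate):
            violations.append(f"pii:{name}")
    return violations

item = {"checks": {"heuristics": {"word_count_range": [3, 10], "max_caps_words": 0},
                   "regex": {"email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"}}}
print(check_item(item, "Ping me at bob@example.com ASAP"))
# ['caps_words 1 > 0', 'pii:email']
```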

judge_rubric.md — LLM‑as‑judge rubric + schema

Use this as the single source of truth for judge prompts, scoring, and output schema. Keep it short and mechanical so different models can follow it.

# /qa/judge_rubric.md

name: default
scoring_dimensions:
  - name: Faithfulness
    scale: 1–5
    rule: All claims are supported by the provided input/sources. No inventions.
  - name: Relevance
    scale: 1–5
    rule: Directly answers the task and audience; no off‑topic filler.
  - name: Style/Tone
    scale: 1–5
    rule: Matches brand tone [BRAND_TONE] and avoids hype/clickbait.
optional_dimensions:
  - name: Harm/Policy
    scale: 1–5
    rule: Flags unsafe or policy‑violating content if applicable.

thresholds:
  pass_avg: ">= 4.0"
  min_per_dimension: ">= 3.5"
  required_dimension: Faithfulness ">= 4.0"
  auto_fail_if: ["PII_detected == true", "harm_policy_score < 4"]

# Dual‑judge agreement: require |avg_primary - avg_secondary| <= 0.5 else route to human.
agreement_tolerance: 0.5

## Judge output schema (strict JSON)
{
  "scores": {"Faithfulness": 1-5, "Relevance": 1-5, "Style/Tone": 1-5, "Harm/Policy": 1-5?},
  "avg_score": 1-5,
  "pass": true|false,
  "rationale": "2–4 sentences citing specific lines",
  "findings": {"claims_checked": ["…"], "violations": ["…"]}
}

## Prompt preamble (insert task/input after this)
You are a strict QA evaluator. Score the candidate against the rubric. Cite specific phrases when deducting points. Return ONLY the JSON matching the schema. If information is missing, lower Faithfulness rather than guessing.

## Model config
primary_model: [LLM_MODEL_FOR_JUDGE]
secondary_model: [SECONDARY_JUDGE_MODEL]
max_tokens: [JUDGE_MAX_OUTPUT_TOKENS]
temperature: 0
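
The thresholds and agreement rule above are mechanical enough to enforce in a few lines. A sketch, assuming each judge returns the strict JSON schema; `evaluate` and `reconcile` are hypothetical helper names:

```python
import json

PASS_AVG, MIN_PER_DIM, AGREEMENT_TOLERANCE = 4.0, 3.5, 0.5

def evaluate(raw):
    """Parse one judge's strict-JSON reply and apply the rubric thresholds."""
    out = json.loads(raw)
    scores = out["scores"]
    avg = sum(scores.values()) / len(scores)
    ok = (avg >= PASS_AVG
          and all(v >= MIN_PER_DIM for v in scores.values())
          and scores["Faithfulness"] >= 4.0)  # required dimension
    return {"avg": avg, "pass": ok, "rationale": out.get("rationale", "")}

def reconcile(primary, secondary):
    """Dual-judge agreement: diverging averages route to a human instead of auto-deciding."""
    if abs(primary["avg"] - secondary["avg"]) > AGREEMENT_TOLERANCE:
        return "human_review"
    return "pass" if primary["pass"] and secondary["pass"] else "fail"

p = evaluate('{"scores": {"Faithfulness": 5, "Relevance": 4, "Style/Tone": 4}, "rationale": "ok"}')
s = evaluate('{"scores": {"Faithfulness": 4, "Relevance": 4, "Style/Tone": 4}, "rationale": "ok"}')
print(reconcile(p, s))  # pass
```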

heuristics.yml — deterministic + PII checks

Deterministic rules run before any judge call; keep them fast and cheap. Treat the PII regex patterns as a pragmatic starting point (they trade precision for recall) and add domain‑specific rules as you learn.

# /qa/heuristics.yml

pii_regex:
  email: '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
  us_phone: '(?:\+1[ .-]?)?(?:\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4})'
  credit_card: '\b(?:(?:4\d{3}|5[1-5]\d{2}|6011|7\d{3})(?:-?\d{4}){3}|3[47]\d{13})\b'

model_pii:
  provider: aws_comprehend
  region: [PII_REGION]
  min_confidence: 0.9

format_rules:
  json_valid: true|false        # set per task via golden_set.yaml
  schema_validate: true|false   # provide JSON Schema when true
  word_count_range: [min, max]
  char_count_max: [MAX_CHARS]
  bullet_count: [N]
  headline_present: true|false
  table_rows: [N]
  timestamp_count: [N]

style_rules:
  max_caps_words: 0
  max_exclamations: 0
  allowed_domains: [[ALLOWED_DOMAIN_1], [ALLOWED_DOMAIN_2]]
  url_count_max: 2

policy_rules:
  disallow_hate: true
  disallow_unsafe_instructions: true
  block_list_terms: [[TERM_1], [TERM_2]]

routing:
  on_heuristics_fail: "route_to_human|retry|fix_and_rejudge"
  on_pii_detected: "auto_fail_and_route"
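
The `on_pii_detected: auto_fail_and_route` rule reduces to scanning with the regex table and short-circuiting before any judge call. A minimal sketch; `route` is illustrative and omits the model-based Comprehend pass:

```python
import re

PII_REGEX = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "us_phone": r"(?:\+1[ .-]?)?(?:\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4})",
}

def scan_pii(text):
    """Return {rule_name: [matches]} for every PII regex that fires."""
    hits = {}
    for name, pattern in PII_REGEX.items():
        found = re.findall(pattern, text)
        if found:
            hits[name] = found
    return hits

def route(text):
    """PII is an auto-fail; everything else proceeds to the judge layer."""
    return "auto_fail_and_route" if scan_pii(text) else "continue_to_judge"

print(route("Call me at (555) 123-4567"))  # auto_fail_and_route
```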

slack_alerts.json — Block Kit payloads

Three ready‑to‑paste Block Kit payloads: per‑job fail, budget threshold, and queue paused. Post via an Incoming Webhook stored as [SLACK_WEBHOOK_URL].

// /qa/slack_alerts.json
{
  "job_fail": {
    "text": "QA Wall: FAILED",
    "blocks": [
      {"type": "header", "text": {"type": "plain_text", "text": "🚨 QA Wall: FAILED"}},
      {"type": "section", "text": {"type": "mrkdwn", "text": "Client: *[CLIENT_NAME]*  • Job: *#[JOB_ID]*  • Layer: *[LAYER]*\nReason: *[REASON]* • Tokens: [TOKENS] • Cost: $[COST_USD]"}},
      {"type": "actions", "elements": [
        {"type": "button", "text": {"type": "plain_text", "text": "Open trace"}, "url": "[TRACE_URL]"},
        {"type": "button", "text": {"type": "plain_text", "text": "Route to human"}, "style": "primary", "url": "[REVIEW_QUEUE_URL]"}
      ]}
    ]
  },
  "budget_threshold": {
    "text": "Budget threshold hit",
    "blocks": [
      {"type": "header", "text": {"type": "plain_text", "text": "⚠️ Spend cap warning"}},
      {"type": "section", "text": {"type": "mrkdwn", "text": "Client: *[CLIENT_NAME]* hit *[THRESHOLD]%* of monthly cap ($[CAP_USD]).\nMonth‑to‑date: $[MTD_USD] • Jobs: [COUNT]"}},
      {"type": "context", "elements": [{"type": "mrkdwn", "text": "Auto‑throttle: [THROTTLE_MODE]. Override in [DASHBOARD_URL]."}]}
    ]
  },
  "queue_paused": {
    "text": "Queue paused",
    "blocks": [
      {"type": "header", "text": {"type": "plain_text", "text": "⏸️ QA Queue Paused"}},
      {"type": "section", "text": {"type": "mrkdwn", "text": "Paused *[CLIENT_NAME]* at *[PAUSE_PERCENT]%* of cap. Reason: *[REASON]*."}},
      {"type": "actions", "elements": [
        {"type": "button", "text": {"type": "plain_text", "text": "Resume 1 hour"}, "value": "resume_1h"},
        {"type": "button", "text": {"type": "plain_text", "text": "Resume 24 hours"}, "value": "resume_24h"}
      ]}
    ]
  }
}
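
Rather than patching blocks by index, you can substitute every `[PLACEHOLDER]` token in a payload before posting. A sketch under the assumption that substituted values are plain strings or numbers:

```python
import json
import re

def fill(template, values):
    """Substitute [PLACEHOLDER] tokens anywhere in a Block Kit payload."""
    def sub(text):
        return re.sub(r"\[([A-Z_0-9]+)\]",
                      lambda m: str(values.get(m.group(1), m.group(0))), text)
    # Round-trips through the serialized form; values containing quotes would need escaping.
    return json.loads(sub(json.dumps(template)))

payload = {"text": "QA Wall: FAILED",
           "blocks": [{"type": "section",
                       "text": {"type": "mrkdwn",
                                "text": "Client: *[CLIENT_NAME]* • Job: *#[JOB_ID]*"}}]}
filled = fill(payload, {"CLIENT_NAME": "Acme", "JOB_ID": 1042})
print(filled["blocks"][0]["text"]["text"])  # Client: *Acme* • Job: *#1042*
```

Unresolved tokens are left intact, so a half-filled payload is visible in Slack rather than silently dropped.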

queue.sql — SQLite tables + indexes

Run this once to create tables and indexes. Safe for edge/offline operation. Use a write‑ahead log (WAL) and sync when online.

-- /qa/queue.sql
PRAGMA journal_mode=WAL;

CREATE TABLE IF NOT EXISTS clients (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  monthly_cap_usd REAL NOT NULL,
  mtd_spend_usd REAL NOT NULL DEFAULT 0,
  status TEXT NOT NULL DEFAULT 'active', -- active|throttled|paused
  updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS jobs (
  id INTEGER PRIMARY KEY,
  client_id INTEGER NOT NULL REFERENCES clients(id),
  payload TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'ready', -- ready|running|done|failed|human_review
  attempts INTEGER NOT NULL DEFAULT 0,
  est_tokens INTEGER DEFAULT 0,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS qa_events (
  id INTEGER PRIMARY KEY,
  job_id INTEGER NOT NULL REFERENCES jobs(id),
  layer TEXT NOT NULL,         -- L1|L2|L3
  result TEXT NOT NULL,        -- pass|fail|escalated
  details TEXT,                -- JSON blob: scores, violations
  tokens INTEGER DEFAULT 0,
  cost_usd REAL DEFAULT 0,
  ts DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Helpful indexes
CREATE INDEX IF NOT EXISTS idx_jobs_status_created ON jobs(status, created_at);
CREATE INDEX IF NOT EXISTS idx_events_job ON qa_events(job_id);
CREATE INDEX IF NOT EXISTS idx_clients_status ON clients(status);
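
Bootstrapping the queue from Python is a single `executescript` call. A self-contained sketch with a subset of the DDL inlined; `enqueue` is an illustrative helper:

```python
import sqlite3

# Subset of queue.sql inlined so this sketch runs standalone;
# in the repo, read the file instead: conn.executescript(open("qa/queue.sql").read())
SCHEMA = """
CREATE TABLE IF NOT EXISTS clients (
  id INTEGER PRIMARY KEY, name TEXT NOT NULL,
  monthly_cap_usd REAL NOT NULL, mtd_spend_usd REAL NOT NULL DEFAULT 0,
  status TEXT NOT NULL DEFAULT 'active');
CREATE TABLE IF NOT EXISTS jobs (
  id INTEGER PRIMARY KEY, client_id INTEGER NOT NULL REFERENCES clients(id),
  payload TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'ready',
  attempts INTEGER NOT NULL DEFAULT 0, est_tokens INTEGER DEFAULT 0);
"""

def enqueue(conn, client_id, payload, est_tokens=0):
    """Insert a ready job and return its id."""
    cur = conn.execute(
        "INSERT INTO jobs(client_id, payload, est_tokens) VALUES (?,?,?)",
        (client_id, payload, est_tokens))
    conn.commit()
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO clients(id, name, monthly_cap_usd) VALUES (1, 'Acme', 500)")
job_id = enqueue(conn, 1, '{"task": "G001"}', est_tokens=800)
print(job_id)  # 1
```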

guardrails.py — caps, routing, Slack, loop sketch

Wire these helpers into your worker. Keep judge temperatures at 0. Always check caps before dispatch and again after usage accounting. Replace the provider stubs with your SDK calls.

# /qa/guardrails.py
import json, os, time, sqlite3
from typing import Dict, Any

SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL", "[SLACK_WEBHOOK_URL]")
MAX_OUTPUT_TOKENS = int(os.getenv("MAX_OUTPUT_TOKENS", "[MAX_OUTPUT_TOKENS]"))
RETRY_LIMIT = int(os.getenv("RETRY_LIMIT", "2"))

DB_PATH = os.getenv("QA_DB_PATH", "qa.sqlite3")

# --- Spend & caps ---

def est_cost_usd(tokens: int, model: str) -> float:
    # Rough estimator; replace with provider pricing table
    return tokens * float(os.getenv("USD_PER_TOKEN", "0.000002"))


def client_state(conn, client_id: int) -> Dict[str, Any]:
    row = conn.execute("SELECT monthly_cap_usd, mtd_spend_usd, status FROM clients WHERE id=?", (client_id,)).fetchone()
    if not row: raise ValueError("Client not found")
    return {"cap": row[0], "mtd": row[1], "status": row[2]}


def should_pause(conn, client_id: int) -> bool:
    s = client_state(conn, client_id)
    return s["mtd"] >= s["cap"]


def throttle_level(conn, client_id: int) -> str:
    s = client_state(conn, client_id)
    pct = (s["mtd"] / s["cap"]) * 100
    if pct >= 100: return "paused"
    if pct >= 90:  return "throttled_high"
    if pct >= 80:  return "throttled"
    return "normal"

# --- Slack ---

def post_slack(payload: Dict[str, Any]):
    import requests
    requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)


def alert_budget(conn, client_id: int, client_name: str):
    s = client_state(conn, client_id)
    pct = int((s["mtd"] / s["cap"]) * 100)
    if pct < 80: return
    with open(os.path.join(os.path.dirname(__file__), "slack_alerts.json")) as f:
        tmpl = json.load(f)["budget_threshold"]
    tmpl["blocks"][1]["text"]["text"] = f"Client: *{client_name}* hit *{pct}%* of monthly cap (${s['cap']:.0f}).\nMonth‑to‑date: ${s['mtd']:.2f}"
    post_slack(tmpl)

# --- Routing ---

def enforce_per_job_cap(estimated_tokens: int) -> bool:
    return estimated_tokens <= MAX_OUTPUT_TOKENS


def route_to_human(conn, job_id: int, reason: str):
    conn.execute("UPDATE jobs SET status='human_review' WHERE id=?", (job_id,))
    with open(os.path.join(os.path.dirname(__file__), "slack_alerts.json")) as f:
        tmpl = json.load(f)["job_fail"]
    tmpl["blocks"][1]["text"]["text"] = f"Client: *[CLIENT_NAME]*  • Job: *#{job_id}*  • Layer: *auto*\nReason: *{reason}*"
    post_slack(tmpl)

# --- Core loop sketch (L1→L2→L3) ---

def run_once():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    job = conn.execute("SELECT * FROM jobs WHERE status='ready' ORDER BY created_at LIMIT 1").fetchone()
    if not job: return

    # Claim the job so a concurrent worker can't pick it up
    conn.execute("UPDATE jobs SET status='running', attempts = attempts + 1 WHERE id=?", (job["id"],))
    conn.commit()

    # Check caps before starting
    if should_pause(conn, job["client_id"]):
        conn.execute("UPDATE jobs SET status='failed' WHERE id=?", (job["id"],))
        conn.commit(); return

    # Estimate tokens; short‑circuit if too large
    est_tokens = job["est_tokens"] or 0
    if not enforce_per_job_cap(est_tokens):
        route_to_human(conn, job["id"], reason=f"est_tokens>{MAX_OUTPUT_TOKENS}")
        conn.commit(); return

    # L1: golden‑set regression (run in CI on deploy; here we assume passed)

    # L2: heuristics + LLM judges
    passed_heuristics = run_heuristics(job)
    if not passed_heuristics:
        route_to_human(conn, job["id"], reason="heuristics_fail"); conn.commit(); return

    scores = dual_judge(job)
    if not scores["pass"]:
        route_to_human(conn, job["id"], reason="judge_fail"); conn.commit(); return

    # Record usage and cost
    tokens_used = scores.get("tokens", est_tokens)
    cost = est_cost_usd(tokens_used, model="[GEN_MODEL]")
    conn.execute("INSERT INTO qa_events(job_id, layer, result, tokens, cost_usd) VALUES (?,?,?,?,?)",
                 (job["id"], "L2", "pass", tokens_used, cost))
    conn.execute("UPDATE clients SET mtd_spend_usd = mtd_spend_usd + ? WHERE id=?", (cost, job["client_id"]))

    # L3: human sampling (p = [SAMPLE_RATE])
    import random
    if random.random() < float(os.getenv("SAMPLE_RATE", "0.1")):
        conn.execute("UPDATE jobs SET status='human_review' WHERE id=?", (job["id"],))
    else:
        conn.execute("UPDATE jobs SET status='done' WHERE id=?", (job["id"],))

    alert_budget(conn, job["client_id"], client_name="[CLIENT_NAME]")
    conn.commit()

# --- Stubs to implement ---

def run_heuristics(job_row) -> bool:
    # Load heuristics.yml and validate candidate output accordingly
    return True


def dual_judge(job_row) -> Dict[str, Any]:
    # Call primary and secondary judge models with rubric; enforce agreement
    return {"pass": True, "avg_score": 4.3, "tokens": job_row["est_tokens"]}
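
The 80/90/100% spend ladder is easy to sanity-check in isolation. A pure-function restatement of `throttle_level` with the same thresholds, without the DB lookup:

```python
def throttle_level(mtd_spend_usd, monthly_cap_usd):
    """Same 80/90/100% ladder as guardrails.throttle_level, as a pure function."""
    pct = (mtd_spend_usd / monthly_cap_usd) * 100
    if pct >= 100: return "paused"
    if pct >= 90:  return "throttled_high"
    if pct >= 80:  return "throttled"
    return "normal"

print([throttle_level(mtd, 500) for mtd in (100, 400, 460, 500)])
# ['normal', 'throttled', 'throttled_high', 'paused']
```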