Bounded Agent Blueprint: Lead‑Qual, Weekly Reporting, and Inbox Triage — SOPs, Schemas, Evals, KPIs
Operator‑grade playbook to stand up three bounded agents that make money: copy‑ready SOPs, strict JSON Schemas, acceptance tests, n8n/Make maps, a golden‑set eval harness, and a KPI + pricing model. Built for nomad founders who want reliable, low‑maintenance revenue units.
Build three revenue‑relevant, bounded agents you can run from anywhere: Lead Qualification, Weekly Reporting, and Inbox Triage. This guide includes copy‑ready SOPs, JSON Schemas, acceptance tests, n8n/Make maps, a lightweight evaluation harness, and a KPI + pricing model so you can ship one agent in 14 days and keep it stable on café Wi‑Fi and async hours.
How to use this blueprint
Use this like a Notion playbook. Clone the sections you need, paste the schemas/prompts into your stack, and follow the 14‑day sprint.
14‑Day ship plan:
- Days 1–2: Pick one agent and define its single job + inputs/outputs. Write the ICP and success criteria.
- Days 3–4: Wire the connectors (CRM/Databox/Inbox) and drop in the JSON Schemas + prompts.
- Days 5–6: Build a 50‑row golden set (5–10 edge cases included). Save as CSV.
- Days 7–8: Add acceptance tests + eval harness. Schedule nightly runs.
- Days 9–10: Add error handling: retries/backoff + global error workflow + alerts. Mask PII in logs.
- Day 11: Soft‑launch behind human‑in‑the‑loop (100% review of outputs).
- Day 12: Drop to 20% spot‑check on low‑risk cases; route sensitive cases to human queue.
- Days 13–14: Track KPIs (parse rate, FP rate, FRT, cost/run). If pass, package price + SLOs and ship to first client.
Lisbon Test (pass/fail):
- Survives shaky Wi‑Fi (async job design + retries).
- Handles auth expiry and 429s without human rescue (re‑auth + backoff paths).
- Produces strict JSON/HTML the first time (schemas + validators).
- Escalates edge cases to a human with context attached (HITL + incident SOP).
Blueprint 1 — Lead Qualification Agent (WhatsApp/Web/Email)
Single job: qualify inbound interest using a 3–5 question ruleset, score fit, and either auto‑book or hand off. Channel examples: WhatsApp, email reply, web form, chat widget.
Recommended stack:
- Transport: WhatsApp Business API (Twilio/WATI) or web form/email.
- Orchestration: n8n or Make.com.
- LLM: any model with JSON Schema support or strong JSON adherence.
- Calendar/CRM: Google Calendar/Calendly + HubSpot/Pipedrive/Airtable.
- Observability: Langfuse (events + eval traces) + Slack alerts.
n8n/Make module map:
- Trigger: Webhook (form) or WhatsApp message received.
- Normalize: Transform to input JSON per schema.
- LLM: Ask only the allowed questions if missing; otherwise score.
- Decision: If confidence ≥ 0.75 and next_step=book_call → create calendar event + send link; else route to nurture/disqualify or human queue.
- Log: Post result + payload to Langfuse; emit metrics.
- Alerts: Confidence < 0.5 or PII/sensitive → Slack channel + assign owner.
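The Decision and Alerts branches above can be sketched as one routing function. This is a minimal sketch assuming the output already passed schema validation; the branch names (`human_review_queue`, `create_calendar_event`, `ask_missing_questions`) are illustrative, not fixed n8n node names.

```python
# Sketch of the Decision step. Assumes `output` already passed schema
# validation; branch names are illustrative placeholders.
def route_lead(output: dict) -> str:
    """Map a validated lead-qual output to a workflow branch."""
    if output.get("status") == "need_more_info":
        return "ask_missing_questions"
    if output["confidence"] < 0.5:
        return "human_review_queue"
    if output["next_step"] == "book_call" and output["confidence"] >= 0.75:
        return "create_calendar_event"
    if output["next_step"] in ("nurture", "disqualify"):
        return output["next_step"]
    return "human_review_queue"  # default: never auto-act on ambiguity
```

Note the default branch: a `book_call` verdict with confidence below 0.75 falls through to the human queue rather than auto-booking.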
Input JSON Schema (v1):
```json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://statelessfounder.com/schemas/lead-qual-input.v1.json",
"type": "object",
"required": ["name", "contact", "source", "answers", "icp"],
"properties": {
"name": {"type": "string", "minLength": 1},
"contact": {"type": "object", "required": ["channel"],
"properties": {
"channel": {"enum": ["email", "whatsapp", "sms", "chat", "webform"]},
"email": {"type": "string", "format": "email"},
"phone": {"type": "string"},
"handle": {"type": "string"}
}
},
"source": {"type": "string"},
"answers": {"type": "object", "additionalProperties": {"type": ["string", "number", "boolean"]}},
"icp": {"type": "object", "description": "Ideal customer profile summary used for scoring.",
"properties": {
"industry": {"type": "string"},
"companysizemin": {"type": "integer"},
"companysizemax": {"type": "integer"},
"geo": {"type": "string"}
},
"required": ["industry"]
}
},
"additionalProperties": false
}
```
Output JSON Schema (v1):
```json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://statelessfounder.com/schemas/lead-qual-output.v1.json",
"type": "object",
"required": ["score", "reasons", "next_step", "confidence"],
"properties": {
"status": {"enum": ["ok", "needmoreinfo"]},
"missing": {"type": "array", "items": {"type": "string"}},
"score": {"type": "integer", "minimum": 0, "maximum": 100},
"reasons": {"type": "array", "items": {"type": "string"}, "minItems": 1},
"nextstep": {"enum": ["bookcall", "nurture", "disqualify"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"booking_link": {"type": "string", "format": "uri"}
},
"additionalProperties": false
}
```
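Before any downstream branch runs, gate the raw LLM reply through a parse-and-validate step. In production you would use a full validator (e.g., the `jsonschema` package against lead-qual-output.v1.json); the dependency-free sketch below checks only the required keys, enum, and bounds, and returns `None` on any failure so the workflow can retry or escalate.

```python
import json

# Hand-rolled sanity check for lead-qual outputs. A full JSON Schema
# validator (e.g., the jsonschema package) is preferable in production.
ALLOWED_NEXT_STEPS = {"book_call", "nurture", "disqualify"}

def parse_or_fail(raw: str):
    """Parse a raw LLM reply; return the dict if it passes, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not {"score", "reasons", "next_step", "confidence"} <= data.keys():
        return None
    if data["next_step"] not in ALLOWED_NEXT_STEPS:
        return None
    if not (isinstance(data["score"], int) and 0 <= data["score"] <= 100):
        return None
    if not (isinstance(data["confidence"], (int, float)) and 0 <= data["confidence"] <= 1):
        return None
    if not (isinstance(data["reasons"], list) and data["reasons"]):
        return None
    return data
```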
Prompt scaffold (system):
```
You are a lead qualification assistant. Return STRICT JSON matching lead-qual-output.v1.json.
Use only data in lead-qual-input.v1.json. If required answers are missing, set status:"need_more_info" and list missing.
Do not invent data. If next_step is "book_call", confidence must be ≥ 0.75.
```
Acceptance tests (ship with these):
- Parse rate: 100 valid JSON outputs over 100 test runs on golden set.
- Business rule: Auto‑fail if next_step=book_call AND confidence < 0.75.
- Safety: Redact email/phone in logs; no PII in Slack messages.
- QA: Randomly sample 10% of “book_call” for human review daily.
Escalation matrix:
- Confidence < 0.5 → Human review queue.
- “High deal value” keywords or competitor mentions → Tag owner and escalate immediately.
- Three consecutive failures in 10 minutes (same node) → open incident, notify on‑call, pause auto‑booking.
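The "three failures in 10 minutes" rule is easy to get wrong without a sliding window. A minimal in-memory sketch (state would live in your workflow store or Redis in production; names are illustrative):

```python
from collections import defaultdict, deque

# Sliding-window incident trigger: 3 failures on the same node within
# 10 minutes. Timestamps are epoch seconds; state is in-memory here.
WINDOW_S = 600
THRESHOLD = 3

_failures = defaultdict(deque)  # node id -> recent failure timestamps

def record_failure(node: str, ts: float) -> bool:
    """Record a failure; return True when the incident threshold is hit."""
    q = _failures[node]
    q.append(ts)
    while q and ts - q[0] > WINDOW_S:
        q.popleft()  # drop failures outside the 10-minute window
    return len(q) >= THRESHOLD
```

When it returns `True`, open the incident, notify on-call, and pause auto-booking as per the matrix above.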
Core KPIs:
- Contact→Qualified% (7‑day rolling), Auto‑book rate, False‑positive rate (human reversals), First‑response time (FRT).
Blueprint 2 — Weekly Client Reporting Agent (Slack + HTML Email)
Single job: pull weekly KPIs from trusted sources, validate math, summarize to Slack and an HTML email. No dashboards to click, no hallucinated numbers.
Recommended stack:
- Data: Databox MCP (to GA4/Ads/CRM) or native connectors.
- Orchestration: n8n (CRON Monday 09:00 local) or Make scheduler.
- LLM: model capable of strict JSON + HTML generation.
- Output: Slack channel + HTML email to client list.
n8n/Make module map:
- Trigger: Schedule.
- Fetch: GA4/Ads/CRM metrics via Databox MCP.
- Validate: Check totals = sum(children); percentages in [0,100]. If missing data, set value="unknown".
- LLM: Produce JSON with {slack[], html}.
- HTML validator: fail fast on malformed tags/styles; if fail → human QA + retry.
- Send: Slack + email. Log artifacts + costs to Langfuse.
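The Validate step above can be a plain function run before the LLM ever sees the numbers. A sketch, assuming each metric is a dict with optional `total`, `children`, and `pct` keys (the shape is illustrative; adapt to your Databox payload):

```python
# Sketch of the Validate step: recompute totals and bound-check
# percentages. Metric shape is an assumption, not the Databox format.
def validate_metrics(metrics: dict, tolerance: float = 0.001) -> list:
    """Return consistency notes; an empty list means all checks pass."""
    notes = []
    for name, m in metrics.items():
        total, children = m.get("total"), m.get("children")
        if total is not None and children is not None:
            # ±0.1% relative tolerance, matching the acceptance tests
            if abs(sum(children) - total) > tolerance * max(abs(total), 1):
                notes.append(f"{name}: total != sum(children)")
        pct = m.get("pct")
        if pct is not None and not 0 <= pct <= 100:
            notes.append(f"{name}: percentage outside [0, 100]")
    return notes
```

Non-empty notes feed the "consistency_note" footer in the HTML output rather than blocking delivery.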
Output JSON Schema (v1):
```json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://statelessfounder.com/schemas/weekly-report-output.v1.json",
"type": "object",
"required": ["slack", "html"],
"properties": {
"slack": {
"type": "array",
"items": {"type": "string", "maxLength": 140},
"minItems": 3, "maxItems": 12
},
"html": {"type": "string", "minLength": 200}
},
"additionalProperties": false
}
```
Prompt scaffold (system):
```
Return STRICT JSON per weekly-report-output.v1.json.
- Slack: ≤12 bullet lines, each < 140 chars.
- HTML: table-based layout, inline CSS only, no external assets. If any metric source is missing, write "unknown" and explain briefly at the bottom.
Validate numeric sums; if inconsistency, emit a "consistency_note" in an HTML footer.
```
Acceptance tests:
- HTML passes validator; no unclosed tags.
- Numeric checks recompute within ±0.1% tolerance.
- If any dataset is missing, HTML includes an “unknown” callout and no fabricated values.
- Slack array contains 3–12 lines; each < 140 characters.
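The "no unclosed tags" check can be a coarse stack-based gate built on the standard library. This is a fail-fast sketch, not a full HTML validator (it ignores attributes and most HTML5 quirks); the void-element set is a minimal assumption:

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (minimal, illustrative set).
VOID = {"br", "hr", "img", "meta", "link", "input"}

class TagBalanceChecker(HTMLParser):
    """Coarse unclosed/mismatched-tag check — a fail-fast gate only."""
    def __init__(self):
        super().__init__()
        self.stack, self.errors = [], []
    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)
    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags are balanced by definition
    def handle_endtag(self, tag):
        if not self.stack or self.stack.pop() != tag:
            self.errors.append(f"mismatched </{tag}>")

def html_is_balanced(html: str) -> bool:
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return not checker.errors and not checker.stack
```

A `False` here routes the run to human QA + retry, per the module map.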
Core KPIs:
- On‑time delivery (% Mondays by 09:15), Parse/validation pass rate, Client replies within 24h, Support tickets opened from report (should be low).
Blueprint 3 — Inbox Triage Agent (Classification → Routing → Draft)
Single job: classify incoming messages, set priority, and route to the right owner/folder. Drafting replies is optional and should require human send.
Recommended stack:
- Inbox: Gmail/Outlook or Helpdesk (Help Scout/Zendesk).
- Orchestration: n8n/Make with per‑module error handling.
- LLM: classification‑focused with deterministic labels.
- Storage: Airtable/DB table for message log + outcomes.
Allowed labels: {support, sales, billing, spam, personal}
Sensitive intents: {legal, refund, escalation}
Input JSON (expect):
```json
{
"message_id": "str",
"thread_id": "str",
"subject": "str",
"from": "email",
"body": "str",
"attachments": [{"filename": "str", "mime": "str"}],
"received_at": "ISO-8601"
}
```
Output JSON Schema (v1):
```json
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://statelessfounder.com/schemas/inbox-triage-output.v1.json",
"type": "object",
"required": ["label", "priority", "reason", "sensitive"],
"properties": {
"label": {"enum": ["support", "sales", "billing", "spam", "personal"]},
"priority": {"type": "integer", "minimum": 1, "maximum": 5},
"reason": {"type": "string"},
"sensitive": {"type": "boolean"},
"route_to": {"type": "string", "description": "Mailbox/folder/owner ID"}
},
"additionalProperties": false
}
```
Prompt scaffold (system):
```
Classify the message into one allowed label. Return STRICT JSON per inbox-triage-output.v1.json.
Never write to customers. If refund/legal/escalation is detected, set sensitive=true and raise priority (1=highest).
```
Acceptance tests:
- Labels limited to the allowed set; any other value auto‑fails.
- PII masking in logs (email/phone regex) before storage/alerts.
- Sensitive=true always routes to human queue; never send auto‑replies.
- Duplicates: same thread_id within 10 minutes → dedupe and skip.
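The dedupe rule translates to a small guard keyed on `thread_id`. In-memory state shown for clarity; in n8n/Make you would back this with your Airtable/DB message log:

```python
# Dedupe guard: skip a message when its thread_id was seen within the
# last 10 minutes. In-memory state; use the message-log DB in production.
DEDUPE_WINDOW_S = 600
_last_seen = {}  # thread_id -> epoch seconds of last processed message

def should_process(thread_id: str, now: float) -> bool:
    last = _last_seen.get(thread_id)
    _last_seen[thread_id] = now
    return last is None or now - last > DEDUPE_WINDOW_S
```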
Core KPIs:
- First response time (FRT), Deflection rate (auto‑resolved without human), Sensitive‑case catch rate, Manual review hit rate.
Lightweight evaluation harness (golden set + nightly gates)
Own your reliability. The eval harness keeps you from shipping regressions and gives you numbers for clients.
Golden set (per agent):
- Size: 50 rows minimum (10 edge cases: missing fields, OOO auto‑replies, malformed HTML, rate‑limit simulation, duplicate thread, high‑value exceptions).
- Format: CSV with columns: input_json, expected_output_json, notes, tags.
Sample CSV (3 rows shown):
```csv
input_json,expected_output_json,notes,tags
"{\"name\":\"Ada\",\"contact\":{\"channel\":\"webform\",\"email\":\"ada@example.com\"},\"source\":\"LP\",\"answers\":{\"budget\":\"2k\"},\"icp\":{\"industry\":\"coaching\"}}","{\"status\":\"needmoreinfo\",\"missing\":[\"timeline\"],\"score\":0,\"reasons\":[\"Missing timeline\"],\"next_step\":\"nurture\",\"confidence\":0.4}","Lead-qual missing timeline","edge,lead-qual"
"{\"slack\":[],\"html\":\"<html><body><table><tr><td>T</td></tr></table></body></html>\"}","SCHEMA_FAIL","Reporting with empty Slack block","edge,reporting"
"{\"messageid\":\"1\",\"threadid\":\"A\",\"subject\":\"Refund\",\"from\":\"c@example.com\",\"body\":\"I want a refund\"}","{\"label\":\"billing\",\"priority\":1,\"reason\":\"Refund keyword\",\"sensitive\":true,\"routeto\":\"billingqueue\"}","Inbox refund path","edge,inbox"
```
Metrics to track:
- Schema parse rate = valid_outputs / total_runs
- Business rule pass rate = passes_business_rules / total_runs
- False‑positive rate (agent booked call but human reversed) = fp / total_book_calls
- Deflection rate (triage only) = auto_resolved / total_tickets
- Manual review hit rate = sent_to_human / total_runs
- Latency p95 and cost/run (sum of LLM + platform costs)
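The metrics above reduce to a small rollup over per-run result rows. A sketch, assuming each row carries boolean flags set by the harness (`valid_json`, `passes_rules`, `booked_call`, `human_reversed`, `sent_to_human` — illustrative names):

```python
# Nightly metric rollup over golden-set results. Flag names are
# illustrative assumptions about the harness's per-run output.
def rollup(results: list) -> dict:
    total = len(results)
    def rate(flag):
        return sum(1 for r in results if r.get(flag)) / total if total else 0.0
    booked = [r for r in results if r.get("booked_call")]
    return {
        "parse_rate": rate("valid_json"),
        "rule_pass_rate": rate("passes_rules"),
        # FP rate is computed over booked calls only, per the definition
        "fp_rate": (sum(1 for r in booked if r.get("human_reversed"))
                    / len(booked)) if booked else 0.0,
        "manual_review_rate": rate("sent_to_human"),
    }
```

These are the numbers the deploy gate compares against the last green run.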
Harness setup (Langfuse/OpenAI‑evals friendly):
- Store golden rows as dataset. Nightly job runs agent on dataset; compare outputs; upload traces + costs.
- Gate deploys: if parse rate or business rule pass rate drops by >2% vs last green, auto‑block and alert.
- Weekly: sample 5 outputs per agent for human rubric scoring (1–5) and trend.
Command sketch (pseudocode):
```
make eval AGENT=lead-qual DATA=golden/leadqual.csv OUT=reports/leadqual_$(date +%F).json
```
Governance defaults:
- Keep 90 days of eval artifacts.
- Tag runs by model+prompt version.
- Treat model changes like code changes (review + rollback path).
Observability + Incident SOP (retries, backoff, escalation)
Most failures aren’t “AI problems.” They’re ops: timeouts, 429s, auth expiry, brittle HTML. Put guardrails in writing.
Global error workflow (copy‑ready SOP):
- Detection: Route all n8n/Make errors to a global Error Workflow. Capture: workflow_id, node, message, execution_url, retryOf.
- Classify:
- HTTP 5xx/timeout/429 → Retry path.
- 4xx auth (401/403) → Re‑auth path.
- Schema/validation fail → Human QA path.
- Retry: 3 attempts with exponential backoff (1m, 5m, 15m). Jitter recommended.
- Escalation: Still failing after retries or confidence < threshold → open ticket, notify on‑call (Slack/PagerDuty). Attach execution_url and a redacted payload snippet.
- Post‑incident: Create a postmortem card. Add a regression case to the golden set. Tag the run with a new incident ID.
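The retry schedule above (1m, 5m, 15m with jitter) can be generated rather than hard-wired into each node. A minimal sketch; the ±20% proportional jitter is an assumption, tune it to your providers:

```python
import random

# Retry delays in seconds: roughly 1m / 5m / 15m with proportional
# jitter, matching the SOP's 3-attempt exponential backoff.
BASE_DELAYS_S = [60, 300, 900]

def backoff_delays(jitter: float = 0.2, rng=random.random):
    """Yield each retry delay with up to ±jitter proportional noise."""
    for base in BASE_DELAYS_S:
        yield base * (1 + jitter * (2 * rng() - 1))
```

Jitter keeps simultaneous failures from retrying in lockstep and re-triggering the same 429s.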
Operational safeguards:
- PII: Mask emails/phones in logs and alerts.
- Rate limits: Centralize backoff settings; never parallelize beyond provider quotas.
- Auth: Monitor token age; auto‑rotate credentials with least privilege; alert before expiry.
- Zapier caveat: Keep trigger work under 30s; push long calls to async callbacks/queues.
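The PII masking safeguard comes down to two substitutions applied to every log line and alert body before it leaves your pipeline. The regexes below are coarse, illustrative patterns — tune them to your locales; they are not exhaustive:

```python
import re

# Coarse PII masks for logs/alerts. Patterns are illustrative, not
# exhaustive — extend per locale (phone formats vary widely).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)
```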
Expected outcomes:
- Incidents resolved within 30 minutes during business hours.
- <1 false‑positive booking per 100 auto‑booked calls.
- Weekly reporting failure rate <2% (auto‑retry recovers most).
KPI dashboard + pricing calculator (copy the math)
Track the numbers that pay the bills and price with a margin, not vibes.
KPI sheet (minimum set):
- Parse rate (schema): valid_json / total_runs
- Business rule pass rate: passes_rules / total_runs
- False‑positive rate (lead‑qual): fp / total_book_calls
- FRT (inbox/lead‑qual): p50 and p95 in minutes
- On‑time delivery (reporting): deliveries_before_0915 / total_weeks
- Cost/run: llm_cost + platform_cost + infra_cost
Pricing calculator (logic):
Inputs:
- setup_hours
- hourly_rate
- platform_monthly (n8n/Make/Zapier tier)
- avg_runs_per_month
- llm_cost_per_run (tokens × price/1k + vector/db/query costs)
- maintenance_hours_per_month
- target_gross_margin (e.g., 0.6)
Formulas:
- setup_fee = setup_hours × hourly_rate + one‑time tooling (if any)
- monthly_direct_cost = platform_monthly + (avg_runs_per_month × llm_cost_per_run) + (maintenance_hours_per_month × hourly_rate)
- recommended_monthly_price = ceil(monthly_direct_cost / (1 - target_gross_margin))
- anchor_ranges (market‑checked): entry $99–$299/mo per unit; mid €500–€3,000/mo; high‑touch $2,500–$5,000+/mo. Use your cost model to pick a lane.
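The formulas above translate directly to code. The example figures in the comment are illustrative, not market data:

```python
import math

# Direct translation of the pricing formulas; names mirror the inputs
# list. one_time_tooling defaults to zero.
def price_agent(setup_hours, hourly_rate, platform_monthly,
                avg_runs_per_month, llm_cost_per_run,
                maintenance_hours_per_month, target_gross_margin,
                one_time_tooling=0.0):
    setup_fee = setup_hours * hourly_rate + one_time_tooling
    monthly_direct_cost = (platform_monthly
                           + avg_runs_per_month * llm_cost_per_run
                           + maintenance_hours_per_month * hourly_rate)
    recommended_monthly_price = math.ceil(
        monthly_direct_cost / (1 - target_gross_margin))
    return setup_fee, monthly_direct_cost, recommended_monthly_price

# Illustrative inputs: 20h setup at $100/h, $50/mo platform, 2,000 runs
# at $0.03, 2h/mo maintenance, 60% target margin.
```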
Packaging notes:
- Productize by agent: one setup + one monthly per agent. Offer a bundle discount for 2–3 agents.
- SLOs in proposal: parse rate ≥ 98%, FP rate ≤ 1%, reporting on Mondays by 09:15, incident response < 2 hours during support window.
- Change management: 90‑day review to tune thresholds and cut human review from 20% → 10% once stable.
Implementation checklist + Lisbon Test
Run this before you turn on fully unattended mode.
Go‑live checklist:
- One job per agent; JSON Schemas validated and versioned.
- Golden set (≥50) loaded; nightly eval passing; deploy gate active.
- Retries/backoff implemented globally; rate‑limit quotas set.
- Auth rotation tested; failure alerts include re‑auth instructions.
- PII masking verified end‑to‑end (logs, alerts, data store).
- Human‑in‑the‑loop paths wired; sensitive cases always human‑owned.
- HTML validation (reporting) green for two consecutive Mondays.
- KPIs visible to client (shared dashboard or weekly footer block).
- Incident SOP rehearsed once with a forced failure.
- Lisbon Test: pull laptop power and Wi‑Fi for 10 minutes mid‑run — does it recover without you?
Minimal SLO card to paste in your client doc:
- Lead‑qual: ≥98% parse rate; FP ≤1%; median FRT ≤5 min.
- Reporting: On‑time Mondays by 09:15; validation pass ≥98%.
- Inbox: Sensitive‑case catch rate ≥95%; deflection rate target agreed per client.
What to iterate next:
- Reduce manual review from 20% → 10% once KPIs hold for 4 weeks.
- Add canary checks (1 synthetic message/report per day) to detect silent failures.
- Graduate autonomy only after three green eval cycles post‑change (model or prompt).