Episode 7

AI Agents That Actually Make Money: The Bounded Automation Playbook

Intro

This episode is for nomad founders tired of AI agents that break at 2 AM when you're twelve time zones away. You'll get three copy-paste blueprints with the operational backbone to run reliable automation businesses from sketchy café wifi.

In This Episode

Santi opens with a hard truth: 78% of the automations he audits are one OAuth expiry away from total failure. He and Kira dissect why "autonomous" agents fail under real client conditions—timeouts, rate limits, credential expiry—and introduce "bounded agents" with single jobs, strict schemas, and human escalation paths. They walk through three revenue-proven blueprints: lead qualification that delivers 27% more conversions, weekly reporting that never misses a Monday, and inbox triage that cuts response time by 60%. Each comes with exact tech stacks (n8n, Make, LLMs), monitoring setup, acceptance tests, and pricing bands of $1,500-$3,000 setup plus $500-$1,200 monthly. The episode culminates in the "Lisbon Test" checklist and a 14-day implementation sprint to ship your first bounded agent.

Key Takeaways

  • Bounded agents with single jobs and strict I/O schemas outperform autonomous agents because they handle the operational failures (timeouts, rate limits, auth expiry) that kill most automations
  • Three specific agent blueprints—lead qualification, weekly reporting, and inbox triage—can generate $1,500-$3,000 setup plus $500-$1,200 monthly revenue when packaged as productized services
  • The "Lisbon Test" framework ensures your agents work asynchronously across time zones: if an agent can't recover from a 10-minute wifi outage without you, it's not nomad-ready

Companion Resource

Grab the Bounded Agent Blueprint in the show notes: the SOP templates, JSON schemas, prompts, acceptance tests, pricing calculator, and KPI dashboard from this episode.

Santi: Agents aren't a business model — revenue is. And right now, seventy-eight percent of the automations I audit are running without error handling, without monitoring, without even basic retry logic. They're one OAuth expiry away from total failure.

Kira: Seventy-eight percent.

Santi: I keep a spreadsheet. Look — everyone's launching "AI agents" this month. Autonomous this, agentic that. But here's what actually happened last week — I watched a founder's entire lead qualification system go down because Zapier hit a thirty-second timeout and nobody knew for four days.

Kira: Four days of dead leads.

Santi: Four days. Twenty-three leads. Probably six grand in lost revenue. All because they built an "agent" without understanding that agents fail. Constantly. Predictably. And usually at two AM your time when you're twelve time zones away from the client who's now freaking out.

Kira: Like your Thailand incident.

Santi: We don't talk about the Thailand incident.

Kira: Eight thousand a month client, gone, because Make.com changed their webhook format while you were island-hopping.

Santi: Right, but that's exactly why we built what we're showing today. Three bounded agents — not autonomous, bounded — that actually survive the Lisbon Test. Lead qualification that delivers twenty-seven percent more conversions. Weekly reporting that never misses a Monday. Inbox triage that cuts response time by sixty percent. All with human escalation paths, retry logic, and acceptance tests that would have saved my eight grand.

Kira: And here's the thing — these aren't theoretical. We're running all three right now. From sketchy café wifi. With clients who don't even know we're in different countries every week.

Santi: Most AI agents are built to impress in demos and fail in production. Today we're building the opposite — agents that are boring, bounded, and bulletproof.

Kira: By the end of this episode, you'll have three complete agent blueprints with JSON schemas, acceptance tests, and the exact pricing math to charge fifteen hundred to three thousand setup plus five hundred to twelve hundred monthly per agent.

Kira: So let's start with the problem. Everyone's building "agents" right now, but most of them break the moment they hit real client conditions.

Santi: Yes! Okay, here's what actually fails — Zapier triggers timeout after thirty seconds. Google APIs throw four-twenty-nine rate limits when you least expect them. OAuth tokens expire while you're sleeping. And the worst part? Most of these failures are silent. Your automation just stops working and nobody knows.

Kira: I had a client last month whose entire reporting automation had been dead for a week. The n8n workflow was showing green checkmarks, but the data wasn't actually flowing because Google Sheets changed their authentication flow.

Santi: A week of no reports.

Kira: A week. And this is the important part — they weren't mad about the technical failure. They were mad that nobody caught it. That's what kills trust.

Santi: So here's what we mean by "bounded agent" — it's an automation with a single, specific job, strict input and output schemas, clear escalation paths, and — this is critical — monitoring that actually tells you when things break.

Kira: Not autonomous. Not trying to do everything. Just one job, done reliably, with guardrails.

Santi: Look at the numbers here. Google's Gemini API now supports JSON Schema for structured outputs. Studies show JSON has higher parseability than YAML or XML. OpenAI's eval team says production reliability comes from golden datasets and explicit outcomes, not generic autonomy. We're talking about, what, like ninety-eight percent parse rates when you enforce schemas versus maybe seventy percent when you let the model freestyle.

Kira: Okay but here's what you're not considering — most nomad founders hear "JSON Schema" and "acceptance tests" and think this is way too technical for them.

Santi: No no no, it's literally just a template that says "the output must have these exact fields." That's it. You're not writing code. You're writing a checklist the AI has to follow.
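
A minimal sketch of that checklist idea in Python, using the jsonschema library; the field names are illustrative placeholders, not the companion blueprint's exact schema:

```python
from jsonschema import ValidationError, validate

# A strict output schema for the lead-qualification agent (illustrative fields).
LEAD_QUAL_OUTPUT = {
    "type": "object",
    "properties": {
        "score": {"type": "integer", "minimum": 0, "maximum": 100},
        "action": {"enum": ["book_call", "nurture", "escalate_to_human"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "reason": {"type": "string"},
    },
    "required": ["score", "action", "confidence", "reason"],
    "additionalProperties": False,  # reject fields downstream steps don't expect
}

def parse_output(raw: dict) -> dict | None:
    """Return the output if it passes the checklist, else None for human QA."""
    try:
        validate(instance=raw, schema=LEAD_QUAL_OUTPUT)
        return raw
    except ValidationError:
        return None  # caller routes this to the human QA queue
```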

Kira: Right. So let's show them. Blueprint number one — lead qualification agent.

Santi: This is the one that gets you that twenty-seven percent conversion uplift we mentioned. Here's the scope — parse inbound leads from WhatsApp, email, or web forms, ask three to five qualifying questions, score them, and either auto-book a call or escalate to a human.

Kira: And the key word there is "bounded." It's not trying to have a full sales conversation. It's not pretending to be human. It's asking specific questions and making a specific decision.

Santi: The stack is simple. n8n or Make for orchestration — I prefer n8n because of the error handling we'll talk about. Any LLM with function calling for the actual qualification logic. Calendar API for booking. CRM for tracking. And Langfuse for monitoring, which nobody talks about but is absolutely critical.

Kira: Walk me through what happens when a lead comes in.

Santi: Webhook receives the message. Transform it to our input schema — and I mean strict JSON with required fields like name, contact channel, and answers to our questions. The LLM scores based on your ICP criteria. If confidence is above seventy-five percent and the score says book, it creates the calendar event and sends the link. If confidence is below that threshold—

Kira: It escalates to a human.

Santi: Always. And here's what actually happened with that Reddit case study — small sales team, implemented exactly this flow with WhatsApp and AI voice handoff. Twenty-seven percent more conversions in thirty days. Not because the AI was smarter than humans, but because it responded instantly, twenty-four seven, and never forgot to follow up.
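
And a sketch of the routing step Santi just described, with the seventy-five percent bar from the episode; the calendar, messaging, and CRM calls are stubs, since those depend on your stack:

```python
CONFIDENCE_THRESHOLD = 0.75  # the "above seventy-five percent" bar from the episode

def create_calendar_event(lead: dict) -> str:
    return "https://example.com/booking-link"  # stand-in for your Calendar API call

def send_link(channel: str, link: str) -> None:
    ...  # stand-in for the WhatsApp/email send

def enqueue_for_human(lead: dict, result: dict) -> None:
    ...  # stand-in for your CRM/escalation queue

def route_lead(lead: dict, result: dict) -> str:
    """Auto-book only on a confident 'book_call'; everything else goes to a human."""
    if result["action"] == "book_call" and result["confidence"] >= CONFIDENCE_THRESHOLD:
        send_link(lead["contact_channel"], create_calendar_event(lead))
        return "auto_booked"
    enqueue_for_human(lead, result)
    return "escalated"
```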

Kira: But here's what actually happens when you're on the ground — you get edge cases. What if someone messages at two AM drunk? What if they're clearly a competitor fishing for information? What if—

Santi: That's why we have acceptance tests! Fifty test cases minimum, including ten edge cases exactly like those. Missing fields, out-of-office replies, obvious spam. You run these every night, and if your parse rate drops below ninety-eight percent, you get an alert.

Kira: And this is where the human-in-the-loop piece matters. You're not trying to handle every edge case automatically. You're identifying them and routing them to a human with context.

Santi: The metrics you track — contact to qualified percentage, auto-book rate, false positive rate where you booked someone you shouldn't have, and first response time. We're seeing median response times under five minutes with this setup.

Kira: Under five minutes, twenty-four seven, from anywhere in the world.

Santi: Blueprint two — weekly reporting. This one's personal for me because it would have prevented my Thailand incident.

Kira: The one we don't talk about.

Santi: The one we're talking about right now apparently. Look — every service business needs to send clients regular updates. Weekly reports, monthly summaries, whatever. Most of us either do it manually, which takes hours, or we don't do it at all, which loses clients.

Kira: And here's what actually happens when you're traveling — Monday rolls around, you're supposed to send reports, but you're on a twelve-hour bus through Guatemala with no wifi. By the time you get online, it's Tuesday, the client's already annoyed, and you're playing catch-up.

Santi: So here's the bounded version. Single job — pull data from Google Analytics, Ads, CRM, whatever sources you have. Validate the numbers actually make sense. Generate a summary. Send it as both a Slack message and an HTML email. Every Monday at nine AM client time. No exceptions.

Kira: The Databox team built exactly this with n8n. Thirty-minute setup, then it runs forever.

Santi: But — and this is important — they noted that LLMs sometimes generate malformed HTML. That's a real failure mode. So you need validation. The output schema requires valid HTML. You run it through a validator before sending. If it fails, it goes to human QA.
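
A minimal well-formedness gate, assuming you validate in Python with lxml (the episode names the failure mode but not a specific validator):

```python
from lxml import etree

def html_is_well_formed(html: str) -> bool:
    """Strict parse; any syntax error sends the report to human QA, not the client."""
    parser = etree.HTMLParser(recover=False)  # don't silently repair broken markup
    try:
        etree.fromstring(html, parser)
        return True
    except etree.XMLSyntaxError:
        return False
```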

Kira: Imagine you're a client. Every Monday morning, like clockwork, you get a report showing your numbers, trends, and what's working. You never have to ask for it. You never wonder what's happening. That's worth—

Santi: Five hundred to eight hundred a month, easy. For something that costs maybe ten dollars in API calls and platform fees.

Kira: But here's what you're not considering — what happens when the data source is down? What if Google Analytics changes their API? What if—

Santi: You handle missing data explicitly! If a data source fails, you don't hallucinate numbers. You write "data unavailable" and explain why. That's in the acceptance tests — if any metric is missing, you must output "unknown," not make something up.

Kira: And this is the important part — clients actually prefer honesty about missing data over made-up numbers that look right but aren't.
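
One way to encode that honesty rule, as a sketch; the wrapper shape is an assumption, not the blueprint's schema:

```python
def metric_or_unknown(fetch, source_name: str) -> dict:
    """Fetch a metric if the source is up; never invent a number if it isn't."""
    try:
        return {"value": fetch(), "status": "ok"}
    except Exception as exc:  # source down, auth expired, API changed...
        return {"value": "unknown", "status": f"{source_name} unavailable: {exc}"}
```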

Santi: The validation is critical here. Totals must equal the sum of their parts. Percentages must be between zero and one hundred. Week-over-week changes must compute correctly. If any of these fail, human review.
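
Those checks, sketched with illustrative field names:

```python
def validate_report(m: dict) -> list[str]:
    """Return a list of problems; anything non-empty means human review first."""
    problems = []
    # Totals must equal the sum of their parts.
    if m["total_leads"] != sum(m["leads_by_channel"].values()):
        problems.append("total_leads does not equal the sum of its channels")
    # Percentages must be between zero and one hundred.
    for name, pct in m["percentages"].items():
        if not 0 <= pct <= 100:
            problems.append(f"{name} is out of range: {pct}")
    # Week-over-week changes must recompute correctly.
    last, this = m["last_week_revenue"], m["this_week_revenue"]
    if last and abs((this - last) / last * 100 - m["wow_change_pct"]) > 0.5:
        problems.append("week-over-week change does not recompute")
    return problems
```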

Kira: You're basically building a system that can't lie to your clients.

Santi: Can't lie, can't fail silently, can't miss a Monday. That's the whole point.

Kira: Okay, blueprint three — inbox triage. This is the one I'm actually most excited about because it solves a problem every nomad has.

Santi: Email overload while you're offline.

Kira: Exactly. Imagine you're in Bali, you wake up to forty-seven emails. Half are spam, some are urgent client issues, a few are invoices, and buried in there somewhere is a legal threat from a competitor.

Santi: That's oddly specific.

Kira: It happened last month. But here's the thing — an inbox triage agent doesn't try to answer these emails. It just classifies them, sets priority, and routes them to the right place.

Santi: Single job. Bounded scope. The classification labels are fixed — support, sales, billing, spam, personal. That's it. No freestyle categorization.

Kira: And certain keywords trigger immediate escalation. "Legal," "refund," "lawsuit," "urgent" — these get flagged as sensitive and go straight to a human queue with the highest priority.

Santi: The Insightfactory case study showed exactly this pattern. Classification, routing, optional draft generation. But the key insight was that just routing emails correctly cut response time by more than half.

Kira: Because you're not wasting time sorting through spam to find the important stuff.

Santi: Here's the stack — Gmail or Outlook API for access, n8n or Make for orchestration, any LLM for classification, and then routing rules based on the output. The schema is dead simple — label, priority one through five, reason, and whether it's sensitive.

Kira: But here's what actually happens on the ground — you get emails that don't fit any category. Weird partnership proposals, random personal messages, edge cases you didn't anticipate.

Santi: Default to human review! If confidence is low or it doesn't match any label cleanly, it goes to the manual queue. You're not trying to handle everything automatically. You're trying to handle the eighty percent that's obvious, so you can focus on the twenty percent that needs your brain.
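
A sketch pulling the triage pieces together: the fixed labels and escalation keywords are the ones named in the episode, while the confidence bar and queue names are assumptions:

```python
FIXED_LABELS = {"support", "sales", "billing", "spam", "personal"}
SENSITIVE_KEYWORDS = {"legal", "refund", "lawsuit", "urgent"}
CONFIDENCE_THRESHOLD = 0.75  # assumed; the episode doesn't pin a bar for triage

def triage(email_text: str, llm: dict) -> dict:
    """Classify, set priority 1 (highest) to 5, and route; default to a human."""
    # Sensitive keywords beat everything: straight to the urgent human queue.
    if any(kw in email_text.lower() for kw in SENSITIVE_KEYWORDS):
        return {"label": llm.get("label"), "priority": 1, "sensitive": True,
                "queue": "human_urgent"}
    # Off-label or low-confidence results go to manual review, never freestyle.
    if llm.get("label") not in FIXED_LABELS or llm.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return {"label": "unclassified", "priority": 3, "sensitive": False,
                "queue": "human_review"}
    return {"label": llm["label"], "priority": llm["priority"], "sensitive": False,
            "queue": llm["label"]}
```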

Kira: And the metrics here are beautiful. First response time drops from hours to minutes. Deflection rate — how many issues get resolved without human intervention — goes up. And most importantly, you never miss a sensitive email because it was buried under newsletters.

Santi: Now here's where most people stop. They build the agent, it works in testing, they ship it, and then it breaks at two AM and nobody knows.

Kira: Like your Make.com webhook situation.

Santi: Exactly like that. So you need an ops layer. This isn't the sexy part, but it's the part that keeps you from losing clients.

Kira: Start with error handling. Every platform has it, but nobody uses it properly.

Santi: n8n has this beautiful thing called Error Workflows. Any node fails, it triggers a global error handler that captures the workflow ID, the node that failed, the error message, and whether this was already a retry. From there, you classify the error.

Kira: Is it a timeout? Retry with backoff. Is it a four-twenty-nine rate limit? Wait and retry. Is it a four-oh-one authentication error? You need to refresh tokens. Is it a schema validation failure? Human QA immediately.

Santi: The retry logic is critical. Three attempts with exponential backoff — one minute, five minutes, fifteen minutes. Add some jitter so you're not hammering the API at exact intervals.
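
That classify-then-retry logic, sketched in Python; n8n users get the equivalent via Error Workflows, and the exception classes here are stand-ins for whatever your HTTP client raises:

```python
import random
import time

BACKOFF_MINUTES = [1, 5, 15]  # the three attempts Santi describes

class RateLimitError(Exception): ...   # HTTP 429
class AuthError(Exception): ...        # HTTP 401
class SchemaError(Exception): ...      # output failed validation

def refresh_tokens() -> None:
    ...  # stand-in for your OAuth refresh

def escalate(payload: dict, reason: str) -> None:
    ...  # stand-in: open a ticket, ping Slack/PagerDuty with a redacted payload

def run_with_retries(step, payload: dict):
    for delay_min in BACKOFF_MINUTES:
        try:
            return step(payload)
        except (TimeoutError, RateLimitError):
            pass                      # transient: wait out the window, then retry
        except AuthError:
            refresh_tokens()          # refresh, then retry on the same schedule
        except SchemaError:
            return escalate(payload, reason="schema")  # no retry: human QA now
        time.sleep(delay_min * 60 + random.uniform(0, 30))  # jitter avoids hammering
    return escalate(payload, reason="retries_exhausted")
```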

Kira: But here's what you're not considering — what about errors that can't be retried? What about data corruption? What about—

Santi: That's where the escalation SOP comes in! If it's still failing after three retries, or if confidence is below threshold, you open a ticket, notify on-call — could be Slack, could be PagerDuty — and you attach the execution URL and a redacted payload so someone can actually debug it.

Kira: Redacted is important. No customer emails or phone numbers in your error logs.

Santi: PII masking everywhere. Logs, alerts, databases. This is non-negotiable.
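
A deliberately simple redaction pass, as a sketch; real PII masking needs more patterns than two regexes:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Mask emails and phone numbers before anything reaches logs or alerts."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))
```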

Kira: And then — this is the part most people skip — you do a post-incident review. You create a postmortem, figure out what broke, and add that case to your golden test set so it never happens again.

Santi: Every failure makes the system stronger. That's the whole point of bounded agents — they get more reliable over time, not more brittle.

Kira: So let's talk about this golden test set that Santi keeps mentioning.

Santi: This is straight from OpenAI's eval playbook. You build a set of fifty test cases per agent — real inputs with expected outputs. Ten of those should be edge cases. Missing fields, malformed data, rate limit simulations, duplicate requests.

Kira: You save these as a CSV. Input JSON, expected output JSON, and notes about what you're testing.

Santi: Then every night, you run your agent against this test set. You measure parse rate — how many outputs were valid JSON. Business rule pass rate — how many followed your actual rules, like not booking calls when confidence is low. False positive rate — how many times you booked someone you shouldn't have.

Kira: And if any of these metrics drop by more than two percent, you block deployment and get an alert.

Santi: This is how you catch regressions before your client does. Model updates, API changes, whatever — your eval harness catches it first.
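
The harness really can be that small. A sketch assuming the CSV layout Kira describes, with exact-match scoring standing in for fuller business-rule checks:

```python
import csv
import json

def alert(message: str) -> None:
    ...  # stand-in: Slack webhook, PagerDuty, whatever wakes you up

def nightly_eval(agent, path: str = "golden_set.csv", threshold: float = 0.98):
    """Run the golden set; alert if parse rate or rule pass rate dips below threshold."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))  # columns: input_json, expected_json, notes
    parsed = passed = 0
    for row in rows:
        raw = agent(json.loads(row["input_json"]))
        try:
            out = json.loads(raw) if isinstance(raw, str) else raw
            parsed += 1
        except json.JSONDecodeError:
            continue  # invalid JSON counts against parse rate
        if out == json.loads(row["expected_json"]):  # simplification: exact match
            passed += 1
    parse_rate, pass_rate = parsed / len(rows), passed / len(rows)
    if parse_rate < threshold or pass_rate < threshold:
        alert(f"eval regression: parse={parse_rate:.1%} pass={pass_rate:.1%}")
    return parse_rate, pass_rate
```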

Kira: But here's what actually happens when you're on the ground — you don't have time to build complex evaluation infrastructure. You're running a business from a café.

Santi: That's why we keep it simple! Langfuse for logging, a basic CSV for test cases, and a nightly cron job. Fifteen minutes to set up, then it runs forever. The tooling exists. OpenAI has guides. Langfuse is free for small teams. There's no excuse.

Kira: And this is the important part — you're not trying to test every possible scenario. You're testing the scenarios that actually happen in your business.

Santi: Alright, let's talk money. Because none of this matters if you can't charge for it.

Kira: Your favorite topic.

Santi: Look at the actual numbers here. Setup takes twelve to twenty hours at a hundred to one-fifty an hour. Platform fees are fifty to three hundred a month depending on volume. LLM costs are basically nothing — maybe ten to eighty dollars a month even at scale. Maintenance is one to three hours monthly.

Kira: So your total cost to deliver one agent is maybe two thousand setup and two hundred monthly.

Santi: Which means at sixty percent margins, you should charge — let me show you the math — three thousand setup and five hundred monthly minimum. But here's what actually happened when I looked at market pricing. Automation agencies are charging seven fifty to fifteen hundred setup, and ninety-nine to two ninety-nine monthly for basic automations. No AI, no monitoring, no evals.
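
One way to read that math, assuming margin means gross margin so that price = cost / (1 - margin): the $500 monthly floor is exactly a 60% margin on $200 of running cost, and $3,000 setup hits 60% at the low-end delivery cost of twelve hours at $100, which is presumably why Santi calls these minimums:

```python
def price_for_margin(cost: float, margin: float = 0.60) -> float:
    """Price needed so that (price - cost) / price == margin."""
    return cost / (1 - margin)

print(price_for_margin(12 * 100))  # low-end setup cost   -> 3000.0
print(price_for_margin(200))       # monthly running cost -> 500.0
```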

Kira: So you're actually underpriced at three thousand.

Santi: You're underpriced! For a bounded agent with monitoring, retries, acceptance tests, and SLAs, you should be at the high end. Fifteen hundred to three thousand setup, five hundred to twelve hundred monthly.

Kira: But here's what you're not considering — how do you position this to clients who think AI should be cheap?

Santi: You don't sell AI. You sell outcomes. Lead qualification agent — "twenty-seven percent more qualified leads, responding twenty-four seven." Weekly reporting — "Never miss another Monday update, with data validation you can trust." Inbox triage — "Cut response time by sixty percent, never miss a critical email."

Kira: And you package it as a productized service. One agent, one price, clear SLAs.

Santi: Parse rate above ninety-eight percent. False positive rate below one percent. First response time under five minutes. On-time delivery for reports. These are measurable, trackable, and you put them in the contract.

Kira: This changes the conversation from "how much for an AI thing" to "how much for a system that delivers these specific results."

Santi: And if you bundle three agents together, you can offer a package discount. Full automation suite for forty-five hundred setup, fifteen hundred monthly. That's eighteen thousand ARR per client for something that costs you maybe three thousand a year to run.

Kira: Okay, so before anyone ships any of this, they need to run what Santi calls the Lisbon Test.

Santi: It's not just what I call it, it's what it is. Can you run this from a Lisbon café with sketchy wifi?

Kira: Walk through the checklist.

Santi: One job per agent — check. JSON schemas validated and versioned — check. Golden test set with at least fifty cases — check. Retries and backoff implemented globally — check. Auth rotation tested, failure alerts wired up — check. PII masking verified end-to-end — check.

Kira: Human-in-the-loop paths for sensitive cases.

Santi: HTML validation for any formatted output. KPIs visible to the client. Incident SOP rehearsed at least once. And here's the real test — pull your laptop power and wifi for ten minutes mid-run. Does it recover without you?

Kira: That's actually brilliant. Force a failure and see if your system handles it.

Santi: Because failures will happen. APIs will change. Tokens will expire. Services will go down. The question isn't whether your agent will fail — it's whether it will recover gracefully and tell you about it.

Kira: And this is the important part for nomads — all of this has to work asynchronously. You can't be on call twenty-four seven when you're traveling.

Santi: The whole system is designed for async. Retries handle transient failures. Escalations go to a queue, not a phone call. Post-mortems happen when you're back online. The client gets consistent service whether you're in Lisbon or on a boat with no signal.

Kira: So let's make this real. You're listening to this, you want to build one of these agents. What's the fourteen-day sprint look like?

Santi: Days one and two — pick one agent, define its single job, write the ICP and success criteria. Don't overthink this. Lead qual, reporting, or inbox triage. Pick one.

Kira: Days three and four — wire up the connectors. CRM, calendar, inbox, whatever you need. Drop in the JSON schemas and prompts from the blueprint.

Santi: Days five and six — build your golden test set. Fifty rows minimum, saved as CSV. Include edge cases. This is tedious but critical.

Kira: Days seven and eight — add the acceptance tests and eval harness. Set up nightly runs. This is where you catch problems before production.

Santi: Days nine and ten — implement error handling. Retries, backoff, global error workflow, alerts. Mask PII in logs. This is the ops layer that keeps you from losing clients.

Kira: Day eleven — soft launch with a hundred percent human review. Every output gets checked before it goes to the client.

Santi: Day twelve — drop to twenty percent spot-checks for low-risk cases. Keep a hundred percent review for anything sensitive.

Kira: Days thirteen and fourteen — track your KPIs. Parse rate, false positive rate, response time, cost per run. If everything's green, package it, price it, and ship to your first client.

Santi: Fourteen days from idea to revenue. That's the whole point of bounded agents — they're shippable in two weeks, not two months.

Kira: But here's what actually happens on the ground — you'll hit snags. The API documentation will be wrong. The LLM will hallucinate. The client will ask for features you didn't plan.

Santi: Stay bounded! The moment you try to make the agent do everything, you're back to the failure modes we started with. One job, done well, with monitoring. That's it.

Kira: And this is where the resource package comes in. We've built out the complete blueprint — all the JSON schemas, the prompts, the acceptance tests, the SOP templates, even a pricing calculator.

Santi: It's literally copy-paste ready. The lead qual schema, the reporting validation rules, the inbox classification labels. Plus a KPI dashboard template so you can track everything we talked about.

Kira: Grab the Bounded Agent Blueprint in the show notes — it's the exact SOPs, JSON schemas, and KPI dashboard we're talking through today.

Kira: So here's what we just built — three agents that actually survive real client conditions. Lead qualification that responds in under five minutes. Weekly reporting that never lies about data. Inbox triage that catches sensitive emails before they become problems.

Santi: And the difference between these and the "autonomous agents" everyone's hyping? Ours actually work when your wifi cuts out. When APIs change. When you're twelve time zones away from everything breaking.

Kira: The fourteen-day sprint we laid out — that's not theoretical. Pick one agent. Monday you start defining the scope. Two weeks from Monday, you're billing your first client fifteen hundred setup plus five hundred monthly.

Santi: And here's the challenge — ship one bounded agent in the next fourteen days. Just one. Don't try to build all three. Don't add features. One job, strict schemas, monitoring, done.

Kira: Tag us when you ship it. We want to see what breaks and how you fix it.

Santi: Because things will break. That's the whole point. Build for failure, monitor everything, and charge like you're running critical infrastructure.

Kira: Which you are.

Santi: Until next week — keep building, keep shipping, and remember: agents aren't a business model.

Kira: Revenue is.

AI agents, automation, bounded agents, lead qualification, client reporting, inbox triage, n8n, Make.com, error handling, monitoring, JSON schema, productized service, pricing strategy, nomad business, location independence