Episode 3

Bulletproof Your AI Automations: Monitoring, Retries, and 2‑AM Alerts (Zapier, Make, n8n)

Intro

This episode is for nomad founders running client deliverables on automation platforms who need production-grade reliability without a full-time DevOps team. You'll get the exact setup to catch failures, route alerts, and implement retries that work across time zones.

In This Episode

Santi shares the painful story of losing an $8,000/month client when a Make.com API change broke his automation for four days while he was offline in Thailand. We walk through platform-specific error handling for Zapier (custom error handlers plus Zapier Manager alerts), Make (incomplete executions with auto-retry), and n8n (global error workflows with per-node controls). We cover retry patterns with exponential backoff, incident classification frameworks (P0/P1/P2), and the January 2026 n8n security vulnerability that exposed 59,000 instances. The episode includes downloadable templates: a Notion incident runbook, CSV error taxonomy, and copy-paste alert recipes for immediate implementation.

Key Takeaways

  • Set up Zapier Manager to Slack alerts immediately — custom error handlers suppress default email notifications, so you need separate monitoring
  • Enable Make's incomplete executions with Break error handlers for automatic retries, plus consecutive error thresholds to auto-deactivate runaway scenarios
  • Build n8n global error workflows with heartbeat monitoring since the Error Trigger won't fire if the main workflow's trigger node fails

Transcript

Santi: —and by the time I got the Slack notification, it had been broken for four days. Four days! Eight thousand a month client, gone. Because Make.com changed their API and I was island-hopping in Thailand with no monitoring set up.

Kira: Wait, no error emails? No alerts?

Santi: Nothing. The automation just... stopped. Silent failure. And here's what actually happened — the webhook format changed, my scenario kept running but the data wasn't getting through, and Make marked everything as successful because technically the scenario executed.

Kira: The scenario executed but nothing happened.

Santi: The scenario executed but nothing happened. So we're talking about, what, like forty-eight leads that just vanished into the void while I was sitting on a beach in Koh Samui thinking everything was fine.

Kira: And this is exactly why I keep a human in the loop for—

Santi: No no no, that's not the solution. The solution isn't adding more humans. The solution is making the automation tell you when it's dying. Look — every platform has error handling built in. Zapier has custom error paths. Make has incomplete executions. n8n has a whole error workflow system. We just don't use them because we're too busy shipping the next feature.

Kira: Okay but here's what you're not considering — even with error handling, you can still have silent failures. What if the trigger itself stops firing? What if you're getting data but it's malformed? What if—

Santi: Yes! Exactly! That's the whole point — you need layers. Error handlers for the obvious stuff, monitoring for the silent stuff, and retries for the transient stuff. Plus — and this is the important part — you need to know which failures to panic about at two AM and which ones can wait until morning.

Kira: If you're running client automations on Zapier, Make, or n8n without proper error handling, monitoring, and retry logic, you are one API change away from losing your biggest client. And when you're twelve time zones away, you won't even know it happened.

Santi: Today we're fixing that. By the end of this episode, you'll have the exact setup to catch failures, route alerts, and auto-retry the stuff that can be retried — all while you're offline.

Santi: Let me paint you the picture of how automations actually fail in production. Not the clean errors you test for — the messy ones that happen at three AM.

Kira: Rate limits that suddenly drop from a thousand requests per hour to fifty.

Santi: OAuth tokens that expire silently after ninety days even though the docs say they're permanent.

Kira: Schema changes where a field that was always a string suddenly becomes an array.

Santi: Provider outages that return success codes but empty responses. And here's what actually happened last month — OpenAI's API started returning HTTP 200 success but with an error message in the body. Every automation platform marked it as successful. Every single one.

Kira: Because they check the status code, not the content.

Santi: Right. So your automation thinks it worked, your monitoring thinks it worked, but your customer gets blank emails. This is why you need defense in depth. Not just error handling — layers of error handling.
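
The HTTP-200-with-an-error-in-the-body failure mode Santi describes can be guarded against with one extra check after every API call: validate the payload, not just the status code. A minimal sketch in Python (the body shapes are illustrative; real APIs signal errors in different fields):

```python
def is_real_success(status_code: int, body: dict) -> bool:
    """Treat a response as successful only if the status code AND the
    payload look healthy. A 200 with an empty body, or with an error
    message smuggled into the payload, is a failure in disguise."""
    if status_code != 200:
        return False
    if not body:                 # empty payload despite a 200
        return False
    if "error" in body:          # error reported inside the body
        return False
    return True
```

Drop a check like this between the API call and the next step, and route failures to your error handler even when the platform thinks the call succeeded.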

Kira: Okay, let's get specific. Platform by platform. Starting with Zapier because that's where most people start.

Santi: Zapier. Here's the setup that would have saved me in Thailand. First thing — custom error handling. You add it to any step that touches external APIs. Click the three dots on the step, select "Error Handler," and you get this whole alternate path that runs when things break.

Kira: But here's the catch that nobody tells you—

Santi: The email notifications stop.

Kira: The email notifications stop! When you add a custom error handler, Zapier stops sending you those default error emails. It assumes you're handling it yourself.

Santi: Which means you need to wire up your own alerts. This is where Zapier Manager comes in. You build a separate Zap — trigger is "New Zap Error" from Zapier Manager, action is "Send Channel Message" to Slack. Now every error hits your Slack immediately, even if you have custom handling on the original Zap.

Kira: Even with the error handler running?

Santi: Even with the error handler. That's the beautiful part. The Manager trigger still fires regardless. So you can handle the error gracefully for the user while still getting notified that something went wrong.

Kira: And on team accounts?

Santi: On team accounts it gets weird. The error notification goes to whoever owns the errored Zap, not whoever built the monitoring Zap. So if you're managing client Zaps under different ownership, you need to check who actually gets the alerts.

Kira: This is exactly the kind of thing that breaks when you're traveling. You think you've set up monitoring, but the alerts are going to an email you don't check.

Santi: Alright, Make. Make is actually better at this than Zapier in some ways. They have this thing called incomplete executions.

Kira: Which sounds like a bug but is actually a feature.

Santi: Right. When a module fails, instead of just stopping, Make stores the entire execution state — all the data, where it failed, everything. And then — this is the key part — it can retry automatically.

Kira: Walk me through the setup.

Santi: Scenario settings, toggle on "Store incomplete executions." That's step one. Step two, you add a Break error handler to any module that might fail. The Break handler has two fields that matter: number of attempts and delay between retries.

Kira: What do you set them to?

Santi: Three attempts, exponential backoff. So like, one minute, then five minutes, then twenty minutes. Gives transient issues time to resolve. Rate limits reset. APIs come back online.

Kira: But what if it's not transient? What if the API actually changed?

Santi: That's where the consecutive errors setting comes in. Scenario settings again — "Number of consecutive errors." Set it to, let's say, five. After five failures in a row, Make automatically deactivates the entire scenario.

Kira: It just turns itself off.

Santi: Turns itself off and sends you a notification. Better to have a scenario stop than keep burning through operations on something that's fundamentally broken. Oh, and here's a gotcha — instant trigger scenarios? They deactivate after the first error. Not five. One.

Kira: Wait, so a webhook scenario would just stop after one failure?

Santi: One failure and it's done. Make assumes if your webhook is failing, something is seriously wrong. You need to know immediately.

Kira: Let's talk about n8n. Because if you're self-hosting, the error handling is completely different.

Santi: n8n is... look, it's the most powerful but also the most complex. You have node-level error handling and workflow-level error handling, and they interact in ways that aren't always obvious.

Kira: Start with the global error workflow. This is the thing that catches everything.

Santi: Right. You create a dedicated workflow with the Error Trigger node. This trigger fires whenever any workflow in your n8n instance throws an error. From there, you can route to Slack, Telegram, email, whatever. The template library has a great multi-channel example — workflow 5629 if anyone wants to look it up.

Kira: And it includes the execution context?

Santi: Everything. Which workflow failed, what node, what the error was, even the input data that caused it. Except — and this is important — if the error happens in the trigger node itself, you get no error data.

Kira: No error data?

Santi: The trigger never fired, so there's no execution to report on. This is your silent failure scenario. Your webhook stops receiving data, or your schedule trigger breaks, and the error workflow has nothing to tell you.

Kira: So you need a heartbeat monitor.

Santi: Exactly. A separate workflow that runs every hour and checks "did my main workflow run in the last hour?" If not, alert. It's primitive but it works.
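
The heartbeat Santi describes reduces to a single comparison: when did the main workflow last run, and is that longer ago than the allowed window? A sketch of the check the hourly monitor workflow would perform (how you record the last-run timestamp is up to your platform; names here are illustrative):

```python
from datetime import datetime, timedelta

def heartbeat_ok(last_run: datetime, now: datetime,
                 max_age: timedelta = timedelta(hours=1)) -> bool:
    """Return True if the monitored workflow ran recently enough.
    Run this from a separate scheduled workflow; on False, fire an
    alert through your normal channel, because the main workflow's
    own error handling will never report a trigger that went silent."""
    return (now - last_run) <= max_age
```

The main workflow writes its timestamp on every successful run; the monitor reads it and calls this check on a schedule.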

Kira: What about node-level handling?

Santi: Each node has an "On Error" setting. Three options: Stop Workflow, Continue, or Continue Using Error Output. Most of the time you want Stop. But for non-critical stuff — like enrichment data that's nice to have — set it to Continue.

Kira: And retry logic?

Santi: Toggle on "Retry On Fail" and set your max attempts. But here's what nobody talks about — n8n doesn't have built-in exponential backoff. It just hammers the retry immediately. So if you're hitting rate limits, you need to add manual delays.

Kira: Or use the Wait node between retries.

Santi: Or use the Wait node, yeah. But that gets complex fast.

Kira: We should talk about the January incident.

Santi: The Ni8mare vulnerability. CVE-2026-21858. If you're self-hosting n8n and you haven't patched to 1.121.0, you need to stop this podcast and update right now.

Kira: Fifty-nine thousand exposed instances as of January eleventh.

Santi: Fifty-nine thousand. And it's a CVSS 10.0 — remote code execution, no authentication required. If you have public webhooks or form triggers exposed to the internet, you're vulnerable.

Kira: This is the other side of self-hosting. You're responsible for security patches.

Santi: Look, I love n8n. I run two instances. But if you're going to self-host, you need update notifications, you need to actually apply patches, and you need to restrict public endpoints. Basic security hygiene that nobody talks about in the "self-host everything" movement.

Kira: Let's zoom out for a second. We've talked about platform-specific error handling. But there are patterns that apply everywhere.

Santi: Exponential backoff with jitter.

Kira: You and your jitter.

Santi: No, seriously! AWS has been writing about this since 2015. When you retry failed requests, you don't want all your retries happening at the exact same intervals. You add random jitter — random delays — to spread them out.

Kira: Because otherwise you create a thundering herd.

Santi: Exactly. Your API is down, comes back up, and immediately gets hammered by all the retries happening at once. With jitter, the retries spread out naturally. The load distributes.
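
The backoff-with-jitter pattern the hosts are describing can be sketched in a few lines. This uses "full jitter" (sleep a random amount between zero and the exponential cap) so retries from many clients spread out instead of landing at the same instant; the parameter values are just starting points:

```python
import random
import time

def retry_with_jitter(call, attempts: int = 3,
                      base: float = 1.0, cap: float = 60.0):
    """Retry `call` with exponential backoff and full jitter:
    before each retry, sleep a random duration in
    [0, min(cap, base * 2**attempt)]."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise               # out of attempts: surface the error
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

With `base=60` this approximates the one-minute, five-minute, twenty-minute schedule mentioned earlier, minus the synchronized-retry stampede.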

Kira: And idempotency?

Santi: Yes! This is huge and nobody talks about it. Idempotency means you can safely retry the same operation multiple times without creating duplicates. Before you create a record, check if it already exists. Use unique identifiers that survive retries.

Kira: This is where people create duplicate invoices.

Santi: Or send the same notification five times. I've seen automations that charged customers multiple times because the payment succeeded but the confirmation step failed, so the whole thing retried.
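
The duplicate-invoice problem comes down to keying every create operation on an identifier that survives retries. A minimal sketch, with a plain dict standing in for whatever database or data store your platform gives you:

```python
def create_once(store: dict, idempotency_key: str, make_record):
    """Create a record only if this idempotency key hasn't been seen.
    On a retry with the same key, return the original result instead
    of creating a duplicate invoice, charge, or notification."""
    if idempotency_key in store:
        return store[idempotency_key]    # retry: reuse the first result
    record = make_record()
    store[idempotency_key] = record
    return record
```

The key should come from the source event (an order ID, a lead ID), never from a timestamp or random value generated inside the automation, or every retry mints a fresh key.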

Kira: We need to talk about incident classification. Because not every error is a hair-on-fire emergency.

Santi: P0, P1, P2. That's the framework. P0 is "customer-facing service is completely down." P1 is "degraded but limping along." P2 is "broken but nobody notices yet."

Kira: And you handle them differently.

Santi: P0 pages you immediately. Telegram, phone call, whatever it takes. P1 sends a Slack alert that you'll see within an hour. P2 goes into a queue for tomorrow.
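
The severity-to-channel routing above is simple enough to express as a lookup table. A sketch (channel names are hypothetical; the one design choice worth copying is failing closed, so anything unclassified pages like a P0):

```python
# Map each severity to how fast a human needs to see it.
ROUTES = {
    "P0": "telegram",   # pages immediately, survives bad connections
    "P1": "slack",      # seen within the hour
    "P2": "queue",      # triaged tomorrow
}

def route_alert(severity: str) -> str:
    """Return the delivery channel for an incident severity.
    Unknown severities are treated as P0 rather than dropped."""
    return ROUTES.get(severity, ROUTES["P0"])
```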

Kira: And you need different runbooks for each.

Santi: Here's what actually happened when I finally built this out. P0 runbook: First, pause billing if you're on usage-based pricing. Second, send the "we're aware and working on it" message to affected clients. Third, implement the manual workaround while you fix the automation.

Kira: What's the manual workaround?

Santi: Whatever you did before you automated it! Export the data, process it locally, upload the results. It's ugly but it keeps the client happy while you debug.

Kira: What about coverage when you're actually offline?

Santi: You need a buddy system. Another nomad who can receive your P0 alerts and at least acknowledge them. They don't need to fix anything — just send the "we're aware" message and create a ticket.

Kira: Let's make this concrete. What's the actual setup someone should implement today?

Santi: Three things, in order. First, add error notifications to your existing automations. Zapier Manager to Slack, Make webhook to Slack, n8n Error Workflow to Telegram. Just get the alerts flowing.

Kira: Even if you don't handle them gracefully yet.

Santi: Right. Better to know about failures than not. Second, add retry logic to anything that touches external APIs. Start conservative — two retries with thirty-second delays. You can tune it later.

Kira: And third?

Santi: Document your runbooks. What do you do when each type of error happens? Who do you notify? What's the manual fallback? Put it in a Notion doc and share it with anyone who might need to cover for you.

Kira: Because you will be offline when things break.

Santi: You will be on a boat to Gili Air with no internet when your biggest automation fails. That's not pessimism, that's just the nomad reality.

Kira: There's another layer to this that we haven't talked about. The silent failures.

Santi: The ones that don't throw errors.

Kira: Right. Your automation runs successfully, but the output is wrong. Or it's not running at all because the trigger stopped firing.

Santi: This is where you need volumetric monitoring. "I expect fifty leads per day. If I get less than forty, something's wrong."

Kira: How do you build that?

Santi: Depends on the platform. In Zapier, you can use Storage to count executions and alert if it's too low. In Make, you use a data store. In n8n, you write to a database and query it.

Kira: But that's adding complexity to catch edge cases.

Santi: It's adding one check per day to catch the failures that cost you clients. Remember my Thailand story? If I'd had a simple "did any leads come through today?" check, I would have known in twenty-four hours, not four days.
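
The volumetric check itself is one comparison against a tolerance floor. A sketch of the daily check, assuming you already count executions in Storage, a data store, or a database as described above:

```python
def volume_alert(count_today: int, expected: int,
                 tolerance: float = 0.2) -> bool:
    """Return True if today's volume fell below the allowed floor.
    With expected=50 and tolerance=0.2 the floor is 40: fewer leads
    than that and something upstream is silently broken, even though
    no run ever threw an error."""
    floor = expected * (1 - tolerance)
    return count_today < floor
```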

Kira: Fair point.

Kira: We've covered a lot. Let's give people the actual checklist.

Santi: The minimum viable error handling stack. Ready?

Kira: Go.

Santi: One: Error notifications. Pick one channel — Slack, Telegram, whatever — and route all errors there. Zapier Manager to Slack. Make webhook to Slack. n8n Error Trigger to Telegram.

Kira: Two?

Santi: Retry logic on external API calls. Two attempts minimum, exponential backoff. In Zapier, use error handlers. In Make, use Break handlers. In n8n, use the Retry On Fail setting.

Kira: Three?

Santi: Classification and runbooks. Define P0, P1, P2. Write down what to do for each. Share it with your coverage buddy.

Kira: Four?

Santi: Heartbeat monitoring for critical workflows. If something should run daily, check that it ran. Simple as that.

Kira: And five?

Santi: Security patches if you're self-hosting. Set up update notifications. Actually apply updates. Don't be one of the fifty-nine thousand exposed n8n instances.

Kira: That's the foundation.

Santi: That would have saved my eight-thousand-dollar client. That would have caught the Make API change within hours, not days.

Kira: Before we wrap up, let's address the elephant in the room. This feels like a lot of work.

Santi: It is a lot of work.

Kira: You're agreeing with me?

Santi: Of course I'm agreeing! Error handling isn't fun. It doesn't demo well. Clients don't see it. But here's what actually happened — after I lost that client, I spent two weeks building proper error handling into everything. Two full weeks.

Kira: Two weeks. That's basically a month of client revenue just to fix what should have been built right the first time.

Santi: Hundred hours at my consulting rate? That's twelve thousand dollars of opportunity cost. But you know what? I haven't lost a client to a technical failure since. Not one.

Kira: So it's infrastructure investment.

Santi: It's the difference between a freelance operation and a real business. Freelancers fix things when they break. Businesses prevent them from breaking.

Kira: Or at least know immediately when they break.

Santi: Right. You can't prevent every failure. But you can prevent every failure from becoming a client loss.

Kira: The Lisbon Test for error handling.

Santi: Oh, here we go.

Kira: No, seriously. Can your error handling setup notify you at a Lisbon café with sketchy wifi? Can you diagnose the issue from your phone? Can you implement a workaround without your laptop?

Santi: Those are actually good criteria.

Kira: Mobile-first error handling. If you can't fix it from your phone, your setup is too complex.

Santi: Or at least implement the workaround from your phone. Pause the automation, notify the client, delegate the manual process. Full fix can wait until you have a real keyboard.

Kira: And this is why Telegram beats Slack for alerts.

Santi: Telegram works everywhere. Slack needs a stable connection. When you're on a ferry between islands, Telegram will get through when nothing else will.

Kira: Let's talk about the downloads we're providing.

Santi: Three things. First, the Notion incident runbook template. It's got P0, P1, P2 classifications, response procedures, escalation paths, and a place to document your manual workarounds.

Kira: Second?

Santi: The CSV error taxonomy. Transient versus permanent versus authentication errors. How to identify them, how to handle them, when to retry.

Kira: And third?

Santi: Copy-paste alert recipes. Zapier Manager to Slack. Make webhook to Slack. n8n Error Trigger to Telegram. Just duplicate them and add your credentials.

Kira: These are the exact setups we use?

Santi: The exact setups. Well, mine are more complex now, but these are what I started with. And honestly? They catch ninety percent of issues.

Kira: Where do people get these?

Santi: Show notes. Everything's in the show notes. Including links to the specific documentation pages we referenced.

Kira: Final thoughts. What's the one thing people should do right after listening to this?

Santi: Check if you're getting error emails from your automations. Seriously. Right now. Check your spam folder. I guarantee someone listening has errors they don't know about.

Kira: Because the default notifications are going to an old email.

Santi: Or spam. Or they turned them off months ago because they were too noisy. Step one is just knowing when things fail.

Kira: And step two?

Santi: Pick your most critical automation — the one that would hurt most if it failed — and add error handling today. Not tomorrow. Today. It'll take thirty minutes and it might save your biggest client.

Kira: The eight-thousand-dollar lesson.

Santi: The eight-thousand-dollar lesson that I'm sharing so you don't have to learn it yourself.

Kira: So here's the thing — Santi lost eight thousand dollars a month because he didn't have thirty minutes of error handling set up. That's over two hundred and fifty dollars a minute of skipped setup, and that's only counting the first month.

Santi: When you put it that way, it sounds even worse.

Kira: But it's also the perfect illustration. Every automation you're running right now without proper error handling is a ticking time bomb. The question isn't if it will fail — it's whether you'll know when it does.

Santi: Look, go check your error notifications right now. I'm serious. Pause this, check your spam folder, check your Zapier dashboard, check your Make scenario history. I bet you money there's at least one error you didn't know about.

Kira: And then pick one automation — just one — and add the basics. Error handler, alert to Slack or Telegram, two retries with a delay. Thirty minutes of work that could save your biggest client.

Santi: The downloads are in the show notes. Notion runbook template, error taxonomy CSV, and the exact alert recipes we use. Everything you need to go from zero to protected.

Kira: This is The Stateless Founder. I'm Kira.

Santi: I'm Santi. Build it bulletproof, or don't build it at all.

Tags: automation error handling, Zapier, Make, n8n, monitoring, retries, alerts, incident response, nomad business, reliability