Kira: Three outages. Three different providers. Thirty-four days.
Santi: That's the number. April sixth — Claude goes down, elevated errors across login, chats, voice. April seventh — Claude again, fourteen thirty-two UTC, same thing. And five weeks before that, March fourth — OpenAI's API throws elevated error rates for thirty minutes across multiple models.
Kira: And if you're running client delivery on any one of those providers — just one — you had dead air. No fallback. No queue. Just... silence while your client's waiting on a deliverable nine time zones away.
Santi: I pulled up my logs from April seventh. My content repurposing tool makes about two hundred Claude calls a day. Between fourteen thirty-two and fifteen fifty-nine UTC — ninety minutes — every single one failed. That's roughly twelve calls that just... vanished.
Kira: Twelve calls. How many of those were client-facing?
Santi: Eight. Eight client deliverables that would've been late if I didn't already have a failover running.
Kira: And most people don't.
Santi: Most people don't. Most people are sitting on a single-vendor stack hoping Anthropic or OpenAI never has a bad Tuesday. And now Google's adding a new wrinkle — spend caps on the Gemini API that can hard-stop your keys mid-project.
Kira: So your provider goes down, or your budget runs out, and either way — your business stops.
Santi: Either way, your business stops. Unless you build the thing we're showing you today.
Santi: Every hour your AI stack is down while you're asleep is a client deliverable that doesn't ship, a lead that doesn't get qualified, revenue that evaporates before you even know there's a problem. And right now, most nomad founders have zero protection against that.
Kira: Today we're building the fix — a three-tier AI outage failover you can ship in a weekend. Hot, warm, cold. With budget guardrails so your API bill doesn't bankrupt you while the failover's doing its job.
Kira: So for anyone who wasn't refreshing status pages two weeks ago — April sixth and seventh, back-to-back days, Anthropic's Claude goes down. Elevated errors on login, chats, voice, Claude Code. The April sixth incident lasted about ninety minutes. April seventh, same symptoms starting at fourteen thirty-two UTC.
Santi: And that April seventh one hit me mid-morning in Lisbon. My content tool is firing Claude calls for three clients, and suddenly every response is an error. Rewind five weeks — March fourth — OpenAI logs elevated API error rates for thirty minutes across several models. Queued infrastructure actions all executed at the same time.
Kira: So that's two providers, three incidents, thirty-four days. And then there's the budget angle. March sixteenth, Google announces Project Spend Caps for the Gemini API. You set a monthly limit per project in AI Studio. Sounds great — until you hit the cap and your keys just stop working. A user on the Gemini subreddit set an eight-dollar cap and woke up to four twenty-nine errors the next morning.
Santi: Eight dollars. But the mechanism is real. Google's docs say there's about a ten-minute enforcement delay, so you can overshoot before it kicks in. And if your prepay credit balance hits zero at the billing account level — not the project level — all API keys across all linked projects stop simultaneously.
Kira: All of them. Dead. So you've got three failure modes. Provider outage. Provider rate spike. Budget hard-stop. Any one of these kills your async ops on a single vendor.
Santi: And if you're nine time zones from your client, you might not even know for hours.
Kira: Or four days, in your case.
Santi: That was before the failover. We're past that now.
Kira: There's also a cautionary tale on the Google Cloud subreddit — a founder reporting roughly a hundred and twenty-eight thousand dollars in unauthorized Gemini API usage. Google denied the adjustment. Now, that's anecdotal, we can't verify the full details, but the thread is real and the risk it illustrates is real. Without budget guardrails, a runaway loop can do serious financial damage.
Santi: Which is why the budget piece isn't optional in what we're building. It's baked into the same system.
Santi: Okay, so the architecture. Three tiers. Hot, warm, cold. Each one catches what the tier above it can't handle.
Kira: Walk me through hot first.
Santi: Hot is your same provider, different model or endpoint. So if you're running Claude Sonnet and it starts throwing errors, your hot failover switches to Claude Haiku on a different base URL. Same provider, same auth, different model. A lot of outages are partial — they hit one model or one endpoint but not everything.
Kira: But wait — most of these APIs are global endpoints, right? It's not like AWS where you can fail over to a different region. If Anthropic's having a bad day, does switching models actually help?
Santi: Sometimes yes, sometimes no. And that's an honest answer. There's a practitioner blog from Grizzly Peak Software that makes this exact point — most LLM APIs expose global endpoints, so a hot regional failover doesn't really exist the way it does for traditional infrastructure. But model-level failover does help for partial outages. April sixth, for example — the elevated errors were hitting login and auth. If your API calls were already authenticated, some models were still responding. So hot isn't about geography. It's about having a second route on the same provider that might still be alive.
Kira: Okay. So hot is your first line. What's warm?
Santi: Warm is a completely different provider. If Anthropic is down across the board, you route to OpenAI or Gemini. Same capability — you need a model that can handle the same task — but different infrastructure, different failure domain.
Kira: And this is where it gets tricky for people who aren't deeply technical. Because these APIs aren't identical. The request formats are different, the response schemas are different, tool calling works differently—
Santi: Right. So the way you handle that is through an OpenAI-compatible gateway layer. Tools like Open WebUI, Ollama Proxy, and various vendor gateways normalize the request and response formats so your application code doesn't have to know which provider it's talking to. You write one interface, and the gateway translates.
Kira: And those gateways are reliable?
Santi: They're another dependency, which is a fair concern. But the alternative is writing provider-specific code for every failover path, which is worse. The key is — test your warm tier before you need it. Especially tool calling and JSON mode. Those are the two things that break most often across providers.
Kira: So test it on a Tuesday, not during an outage.
Santi: Exactly. Now — the mechanism that decides when to switch from hot to warm is a circuit breaker. Martin Fowler wrote the canonical description of this pattern. Your system tracks consecutive failures to a given route. After five failures in a row — that's the threshold I use — the circuit opens and stops sending traffic to that route. After a cooldown it enters a half-open state and sends a probe request, every thirty seconds in my setup. If the probe succeeds twice, the circuit closes and traffic resumes.
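A minimal sketch of the circuit breaker Santi describes, in plain Python. The class name, method names, and defaults here are illustrative, not from Cockatiel or PyBreaker; the thresholds match the episode: open after five consecutive failures, probe after a thirty-second cooldown, close after two successful probes.

```python
import time

class CircuitBreaker:
    """Tracks consecutive failures for one route; opens after a threshold,
    probes after a cooldown, and closes again after enough good probes."""

    def __init__(self, fail_threshold=5, probe_interval=30.0, close_after=2):
        self.fail_threshold = fail_threshold  # consecutive failures before opening
        self.probe_interval = probe_interval  # seconds to wait before probing
        self.close_after = close_after        # successful probes needed to close
        self.failures = 0
        self.successes = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self, now=None):
        """Should traffic flow to this route right now?"""
        now = time.monotonic() if now is None else now
        if self.state == "closed":
            return True
        if self.state == "open" and now - self.opened_at >= self.probe_interval:
            self.state = "half-open"  # cooldown elapsed: let a probe through
            return True
        return self.state == "half-open"

    def record_success(self):
        if self.state == "half-open":
            self.successes += 1
            if self.successes >= self.close_after:
                self.state = "closed"  # route recovered: resume traffic
                self.failures = 0
                self.successes = 0
        else:
            self.failures = 0  # any success resets the failure streak

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "half-open":
            self._open(now)  # probe failed: go straight back to open
        else:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.successes = 0
```

In a real deployment you would more likely reach for Cockatiel or PyBreaker, as discussed later in the episode; this sketch just shows the state machine those libraries implement.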
Kira: So it's not just retrying blindly.
Santi: No, and that's critical. Blind retries during an outage create what Marc Brooker at AWS calls a thundering herd — every client retries at the same time, which makes the outage worse. So you add jitter to your retries. Exponential backoff with random jitter. Your first retry waits two hundred milliseconds plus some random offset. Second retry waits longer. Third retry longer still. Each one slightly randomized so all your clients aren't hammering the endpoint in sync.
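The backoff schedule Santi describes can be sketched in a few lines; the function name and the jitter bound are illustrative choices, while the 200ms base and doubling come from the episode.

```python
import random

def backoff_delays(base=0.2, factor=2.0, attempts=3, jitter=0.1):
    """Exponential backoff schedule with random jitter, in seconds.
    A 200ms base doubles each attempt; the random offset keeps many
    clients from retrying in lockstep during an outage."""
    return [base * factor**i + random.uniform(0, jitter) for i in range(attempts)]
```

Calling `backoff_delays()` yields roughly 0.2s, 0.4s, and 0.8s waits, each nudged by a random offset so retries across clients spread out instead of arriving as a thundering herd.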
Kira: I actually ran into this with my agency. Not with LLM calls, but with a webhook integration. We had a Make scenario that retried on failure, and during a Slack API hiccup, it fired the same webhook forty-seven times in two minutes. My client got forty-seven duplicate Slack messages.
Santi: Forty-seven. That's exactly the nightmare that makes clients lose trust in automation overnight. One hiccup and suddenly they're getting spammed by the system that's supposed to make their life easier.
Kira: Right. Which is why idempotency keys matter. Stripe popularized this — you attach a unique key to every mutating request, and if the same key shows up twice, the API returns the original response instead of processing it again. For LLM calls, you hash the user ID plus the job ID plus the input, and that becomes your idempotency key.
Santi: So even if your retry logic fires the same request three times, the downstream system only processes it once.
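A sketch of the idempotency-key scheme Kira describes. The in-memory dictionary stands in for a persistent response store (Redis or a database in production), and the function names are illustrative; the key derivation, hashing user ID plus job ID plus input, is the one from the episode.

```python
import hashlib
import json

# In-memory stand-in for a persistent response store; in production this
# would be Redis or a database so replays survive a process restart.
_responses = {}

def idempotency_key(user_id, job_id, payload):
    """Stable hash of user + job + input: same request, same key."""
    raw = json.dumps({"user": user_id, "job": job_id, "input": payload},
                     sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def process_once(user_id, job_id, payload, handler):
    """Run handler at most once per key; replay the stored response after."""
    key = idempotency_key(user_id, job_id, payload)
    if key not in _responses:
        _responses[key] = handler(payload)
    return _responses[key]
```

So if retry logic fires the same request three times, `handler` runs once and the other two calls get the original response back, which is exactly the Stripe-style behavior described above.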
Kira: Okay. So we've got hot, we've got warm, circuit breakers deciding when to switch, jittered retries so we're not making things worse, idempotency keys so retries are safe. What's cold?
Santi: Cold is the last resort. Everything's down — hot failed, warm failed. Cold means graceful degradation plus human in the loop. The request goes into a durable queue — Redis with BullMQ if you're in Node, or a simple Redis list if you're in Python — and a notification fires to your ops channel. Slack, email, whatever you monitor.
Kira: And this is the part I actually care about most. Because cold is where the client experience lives or dies. If hot and warm are both down, something is seriously wrong. Your client doesn't know about your architecture. They just know their deliverable is late.
Santi: Right.
Kira: So what does the client see?
Santi: They see a two-oh-two response — accepted, processing. Not an error. The system acknowledges the request, queues it, and your notification tells you or your VA that manual intervention is needed. You have a message template ready to go — we're experiencing a brief delay, your deliverable will be ready within X hours. Professional, honest, and it buys you time.
Kira: And this is the important part — that message template needs to exist before the outage. You don't want to be drafting client communications at three AM from a hostel in Chiang Mai while your queue is backing up.
Santi: No. You write it once, you put it in your runbook, and your VA knows where to find it.
Kira: I keep mine in a Notion doc that my team can access. Three templates — brief delay, extended outage, and service degradation. Each one has the Slack command to send it.
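The cold tier Santi and Kira describe reduces to a few moving parts: persist the job, alert a human, and return a two-oh-two. In this sketch a deque stands in for the Redis queue and a print stands in for the Slack or email notification; both are placeholders, not the real integrations.

```python
import json
from collections import deque

# Stand-in for a durable Redis queue (LPUSH/BRPOP or BullMQ in production).
cold_queue = deque()

def notify(message):
    """Stand-in for a Slack webhook or email to the ops channel."""
    print(f"[ops-channel] {message}")

def cold_enqueue(job):
    """Last resort: persist the job, alert a human, tell the caller
    the request was accepted rather than surfacing an error."""
    cold_queue.append(json.dumps(job))
    notify(f"AI providers down, job {job['id']} queued for manual handling")
    return {"status": 202, "body": "accepted, processing"}
```

The client-facing piece is the 202 plus the pre-written delay message from the runbook; the queue just guarantees nothing is lost while a human works the backlog.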
Kira: Now the budget piece. Because outages aren't the only thing that can stop your stack. Running out of money can too.
Santi: So Google's new Project Spend Caps are useful here, but they're not sufficient on their own. The caps are set in AI Studio under the Spend tab. You pick a monthly dollar amount per project. When you hit it, your keys stop working. But the docs flag these as experimental, and there's that ten-minute enforcement delay.
Kira: Meaning you could blow past your cap by ten minutes of traffic before it kicks in.
Santi: Right. So you need an app-level guardrail on top of the provider-level cap. A simple webhook endpoint. Your billing pipeline posts the current spend percentage. When it hits eighty percent of your monthly budget, the webhook flips a Redis flag — pause AI tasks. Your queue workers check that flag before processing any new job.
Kira: Wait, where does the spend percentage come from? Is Google sending you that data automatically?
Santi: Not directly from AI Studio — that's actually a gap in the current tooling. We couldn't find evidence of a native webhook from AI Studio for spend cap events. What you can do is set up Google Cloud Billing budget alerts, which push to Pub/Sub, and then a small Cloud Function calls your webhook. Or you run a daily cron that checks your spend and posts the value. Either way, you're not relying solely on the hard cap.
Kira: So the hard cap is your safety net. The app-level pause at eighty percent is your actual control.
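The app-level guardrail from this exchange is small enough to sketch in full. The flag dictionary stands in for the Redis flag, and the function names are illustrative; the 80 percent threshold, the worker check, and the manual resume are the mechanism as described.

```python
# In-memory stand-in for the Redis pause flag the episode describes.
_flags = {"ai_paused": False}

PAUSE_THRESHOLD = 0.80  # pause new AI jobs at 80% of the monthly budget

def spend_webhook(spent, monthly_budget):
    """Called by whatever reports spend (a billing alert via Pub/Sub and a
    Cloud Function, or a daily cron). Flips the pause flag at the threshold."""
    if monthly_budget > 0 and spent / monthly_budget >= PAUSE_THRESHOLD:
        _flags["ai_paused"] = True
    return _flags["ai_paused"]

def worker_can_run():
    """Queue workers check this before picking up any new AI job."""
    return not _flags["ai_paused"]

def resume():
    """Manual resume endpoint: clear the flag so the backlog drains."""
    _flags["ai_paused"] = False
```

The provider-level hard cap stays in place underneath this as the safety net; the flag is what you actually expect to trip.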
Santi: Exactly. Now — the obvious question. What does all this cost? Because if the failover costs more than the problem it solves, nobody's going to build it.
Kira: Right.
Santi: The code is free — open source libraries. Cockatiel for TypeScript gives you circuit breakers, retries, and timeouts. Tenacity plus PyBreaker for Python. BullMQ for the queue. All free. The infrastructure cost is a small managed Redis instance — Railway, Render, Upstash — five to fifteen dollars a month. If you're already running Redis, the incremental cost is zero. And you only pay for warm-tier calls when the hot tier is actually down.
Kira: So we're talking twenty to maybe eighty dollars a month total, and most months it's closer to the low end.
Santi: Versus the cost of a missed client deliverable. Which for me, historically, was eight thousand dollars.
Kira: The Thailand number.
Santi: The Thailand number. Even if your clients are smaller — say you're running a five-hundred-dollar-a-month engagement — one missed deadline, one "where's my report" email that you can't answer because you're asleep, and that client starts shopping.
Kira: The SOP we put together has a spreadsheet tab where you can plug in your own numbers — monthly requests, average tokens per request, token prices by provider, your Redis cost, your estimated failure rate. It's not precision modeling, it's planning math. But it makes the case pretty clearly.
Santi: So let me walk through what you're actually building. The core is a JSON routing map — a config file your app reads at boot. Three sections — hot, warm, cold. Each hot and warm entry has a provider name, a model identifier, and a base URL that pulls from environment variables. Never hardcoded.
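A minimal example of the routing map Santi describes. The provider names, model identifiers, and environment-variable names here are illustrative placeholders; the shape is what matters: three sections, with base URLs resolved from environment variables rather than hardcoded.

```json
{
  "hot": [
    {"provider": "anthropic", "model": "claude-sonnet", "base_url_env": "ANTHROPIC_BASE_URL"},
    {"provider": "anthropic", "model": "claude-haiku", "base_url_env": "ANTHROPIC_FALLBACK_URL"}
  ],
  "warm": [
    {"provider": "openai", "model": "gpt-4o-mini", "base_url_env": "OPENAI_BASE_URL"}
  ],
  "cold": {
    "queue": "ai-cold-queue",
    "notify_env": "SLACK_WEBHOOK_URL"
  }
}
```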
Kira: And the actual router — the code that reads this map and makes the calls — how big is that?
Santi: In Node with Cockatiel, the core router function is maybe forty lines. It loops through hot routes first, applies the circuit breaker and retry policy to each call. If all hot routes fail, it loops through warm routes with the same policies. If everything fails, it enqueues to the cold queue and returns a two-oh-two. The Python version is similar in size — Tenacity handles the retries with jitter, PyBreaker handles the circuit breaker, and the routing logic is a for loop over your hot list, then your warm list, then a Redis push for cold.
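The router loop Santi describes can be sketched provider-agnostically. Here `call(route, payload)` is a placeholder for the actual API call, `breakers` maps model names to circuit-breaker-like objects, and `NullBreaker` is a do-nothing stand-in; a real build would wire in Cockatiel or PyBreaker policies plus jittered retries around each call.

```python
class NullBreaker:
    """Placeholder breaker that never opens; swap in a real circuit breaker."""
    def allow_request(self):
        return True
    def record_success(self):
        pass
    def record_failure(self):
        pass

def route_request(routing_map, payload, call, breakers, cold_enqueue):
    """Try every hot route, then every warm route; fall back to cold.
    `call(route, payload)` performs the provider API call (an assumption
    here); `cold_enqueue(payload)` queues the job and returns a 202."""
    for tier in ("hot", "warm"):
        for route in routing_map[tier]:
            breaker = breakers[route["model"]]
            if not breaker.allow_request():
                continue  # circuit open: skip this route entirely
            try:
                result = call(route, payload)
                breaker.record_success()
                return {"status": 200, "body": result}
            except Exception:
                breaker.record_failure()
    return cold_enqueue(payload)  # everything failed: degrade gracefully
```

The whole decision tree is two nested loops: tiers in priority order, routes within each tier, and the cold queue as the unconditional floor.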
Kira: Okay but what happens when your warm provider has different behavior for tool calling? You're using Claude's tool use format, you fail over to OpenAI, and suddenly your function calls don't parse correctly.
Santi: Yeah, that's real. And it's the number one thing to test before you need it. The gateway layer normalizes most of it, but tool calling schemas and JSON mode behavior vary across providers. Pick one warm provider, test it against your actual prompts with tool calling enabled, and document the differences. If there are breaking differences, your warm tier for that specific task might need a simpler prompt that doesn't rely on tool calling. Degrade the capability slightly rather than fail completely.
Kira: So warm doesn't have to be identical. It just has to be good enough.
Santi: Good enough to keep the client from noticing. For most tasks — summarization, drafting, classification — totally achievable. For complex multi-step tool use, you might need to accept that warm gives you eighty percent of the capability and cold handles the rest.
Kira: And the budget webhook — that's separate?
Santi: Two endpoints. One receives the spend alert and flips the pause flag. The other is a manual resume. When you resume, the backlog drains automatically.
Kira: Okay, I want to push back on this whole thing for a second. Because I can hear someone listening right now thinking — this is a lot of infrastructure for a solo founder. Circuit breakers, Redis queues, gateway layers, budget webhooks... that's a weekend project for Santi. For someone who's not a former ML engineer, that might be a month.
Santi: ...That's fair.
Kira: And there's a real argument that this adds complexity that creates its own failure modes. Your gateway goes down. Your Redis instance has a hiccup. Your circuit breaker thresholds are miscalibrated and it's flipping to warm when it shouldn't be. You've added three new things that can break.
Santi: Yeah. I've seen that happen. I had a period where my circuit breaker was opening too aggressively — three consecutive failures instead of five — and it was routing to warm during normal API latency spikes. I was paying for OpenAI calls I didn't need.
Kira: So what's the honest answer? Who should build this and who shouldn't?
Santi: You don't build all three tiers on day one. You start with the smallest useful piece. Step one — a circuit breaker and one warm provider. That's it. No queue, no budget webhook, no cold tier. Just — if Anthropic fails five times in a row, switch to OpenAI. That's maybe twenty lines of code and zero infrastructure cost.
Kira: And that alone would have saved you the Thailand incident.
Santi: That alone would have saved me eight thousand dollars. Step two, when you're ready, is the cold queue. Add Redis, add a Slack notification. Now you have graceful degradation. Step three is the budget guardrail. And step three only matters if you're spending enough on API calls that a runaway loop could actually hurt you.
Kira: So the answer to "is this overengineering" is — the full three-tier system might be. But the first tier is table stakes.
Santi: The first tier is table stakes for anyone running client-facing AI. Period.
Kira: And for the people who aren't deeply technical — the prompt engineers, the agency operators who are using Make or n8n more than writing raw code — what's their version of this?
Santi: Same concept at a higher abstraction. Make has error handling paths. n8n has error workflows. You can build a hot-warm pattern inside those tools — primary API call, error path routes to a different provider. It won't have circuit breakers or jitter, but it'll have the basic failover. And for cold, you route to a Google Sheet or Airtable that your VA checks every morning.
Kira: Actually, that's not bad. A Google Sheet as your cold queue. Your VA processes the backlog manually when providers recover.
Santi: It's not elegant, but it works. And it passes the Lisbon Test — you can set it up from a café with sketchy wifi, and your async team can operate it without you being online.
Kira: So let's actually run the Lisbon Test on the full design. Can you deploy this from a café?
Santi: Yes — it's a JSON config, a small wrapper, and a Redis instance. Can your async team operate it without you? Yes — the cold queue notifies them, the client message templates are pre-written, and the budget pause is automatic. Does it survive bad wifi? Yes — the circuit breaker and retry logic handle intermittent connectivity on the infrastructure side, and the queue ensures nothing gets lost.
Kira: And the fifteen-minute validation — block your hot provider's domain locally, confirm warm takes over. Force five-oh-oh errors from your gateway, confirm the circuit opens and cold enqueues. Post a fake budget alert, confirm the pause flag sets and notifications fire.
Santi: If all three of those pass, you're live. And you just made your business meaningfully harder to kill.
Kira: Three outages. Three providers. Thirty-four days. That was the window. And the next window is coming — we just don't know when.
Santi: The thing that changed for me after Thailand wasn't the technology. It was the mindset. I stopped treating outages as unlikely events and started treating them as scheduled maintenance I hadn't been told about yet. Once you make that shift, building the failover isn't optional. It's just... how you build.
Kira: And you don't need the full three-tier system to start. A circuit breaker and one warm provider. That's your first weekend. That's the thing that keeps your client from getting silence when they should be getting a deliverable.
Santi: If you want to ship the whole thing — the JSON router, the Node and Python wrappers, the Redis queue with the pause flag, the budget webhook, the Lisbon Test checklist, the cost spreadsheet — grab the Nomad-Proof Model Failover SOP on the Resources page. It's built to copy and paste.
Kira: One thing to do this week. Pick your primary provider. Pick one warm alternative. Write twenty lines of failover code — or build one error path in Make. Test it. That's it. Everything else can come later, but that first layer can't wait for the next outage to remind you.
Santi: Ship it by Sunday.
Kira: See you next Tuesday.