Episode 11·June 3, 2026

Stop Selling Model Names. Sell Uptime: Multi-Provider Routing with Client-Facing SLOs

Spotify Apple Podcasts RSS Feed Open Companion Resource

Intro

This episode is for technical founders and ops-minded agency owners whose revenue depends on consistent AI output while traveling. You'll get a complete reliability stack you can ship this week: router, budgets, cache, and client-facing SLOs that turn uptime into your competitive moat.

In This Episode

Santi and Kira build a production-ready reliability stack for AI businesses. They start with the provocative claim that you should stop selling model names and start selling uptime—backed by specific SLO numbers clients can hold you to. Santi walks through a minimal two-provider router using LiteLLM, with weighted failover, latency budgets, and cost guardrails that prevent margin evaporation during outages. Kira translates the technical infrastructure into client-facing proposal language, showing how "99.5% success rate, p95 latency under 2.5 seconds" speaks to procurement teams better than "we use the best model." They cover write-through caching for travel-mode reliability, provider-side prompt caching for cost optimization, and a 30-minute Friday drill that keeps your failover paths working. The episode concludes with a complete reliability kit: SLO templates, budget sheets, router configs, and drill SOPs you can implement immediately.

Key Takeaways

Replace model-name pitches with concrete SLO promises: 99.5% success rate, p95 latency under 2.5 seconds, and cost per request under $0.015 create accountability that procurement teams trust
A minimal two-provider router with LiteLLM, weighted failover, and budget guardrails prevents both outages and margin evaporation while maintaining OpenAI-compatible endpoints
Run a 30-minute Friday drill that revokes your primary provider's API key, confirms failover performance, and ties results to your error budget—boring infrastructure that works under pressure

Timestamps

Companion Resource

guide

The Reliability SLO Kit for AI Apps: Two‑Provider Router + Budgets + 30‑Minute Failover Drill

A practical, copy‑paste kit for nomad‑run AI apps: fill a client‑safe SLO, set budget guardrails, drop in a two‑provider LiteLLM router and a travel‑mode cache, and run a 30‑minute failover drill with a clear rollback.

Google SRE Book – Service Level Objectives
sre.google
- - Google’s SRE model defines SLIs, SLOs, and error budgets; an error budget equals (1 − SLO) over a defined period.
Google SRE – Monitoring Distributed Systems
sre.google
- - The Four Golden Signals for reliability monitoring are latency, traffic, errors, and saturation; they underpin alerting and SLOs.
AWS Prescriptive Guidance – Implementing chaos engineering on AWS
docs.aws.amazon.com
- - AWS recommends structured, recurring chaos game days to verify reliability and resilience in production systems.
LiteLLM – Router / Routing & Load Balancing docs
docs.litellm.ai
- - LiteLLM Router supports multi‑provider load balancing, retries, timeouts, cooldowns, weighted failover, and budget‑aware routing.
OpenRouter – Provider Routing docs
openrouter.ai
- - OpenRouter can route to multiple underlying providers for the same model and apply routing strategies when parameters differ across providers.
Redis – Caching patterns
redis.io
- - Redis write‑through caching writes to cache and datastore synchronously to keep reads hot and data consistent.
Wikipedia – Progressive Web App / Service Worker context
en.wikipedia.org
- - Service workers enable offline‑first web apps by intercepting network requests and serving cached resources; they supersede deprecated app‑cache.
Ink & Switch – Local‑First Software essay
inkandswitch.com
- - Ink & Switch’s ‘Local‑First Software’ research recommends designing for full offline read/write with later sync as a durable UX pattern.
Anthropic Claude API docs – Prompt caching
platform.claude.com
- - Anthropic’s prompt caching allows marking stable prompt blocks; default TTL is 5 minutes with an optional 1‑hour cache at additional cost; cache hits reduce re‑encoding latency and input‑token cost.
OpenAI Status; CNBC coverage
status.openai.com
- - OpenAI suffered multiple notable outages in late 2024, including a multi‑hour API/ChatGPT incident on Dec 11, 2024 and widespread timeouts on Nov 25, 2024.
TechCrunch; TechRadar/Tom’s Guide live coverage
techcrunch.com
- - Anthropic experienced API/Console outages in 2025 and multiple Claude outages in April 2026 reported by tech media and status pages.
OpenTelemetry – GenAI semantic conventions; Langfuse docs
opentelemetry.io
- - OpenTelemetry’s Generative AI semantic conventions define standard attributes and spans for LLM calls; tools like Langfuse can ingest these for token/cost/latency tracking and alerts.
OpenAI Status page – Elevated Error Rate for ChatGPT and API (post-incident write‑up)
status.openai.com
- - OpenAI API outage (Nov 25, 2024)
- - Demonstrates that even top providers experience multi‑hour API failures; justifies cross‑provider failover.
OpenAI Status page – Increased errors for ChatGPT (Nov 7, 2024)
status.openai.com
- - OpenAI partial outage (Nov 7, 2024)
- - Shows recurring, time‑bounded incidents that can disrupt production workloads; supports the need for an explicit SLO and failover plan.
Anthropic Engineering blog – Postmortem of three recent issues (Sep 17, 2025)
anthropic.com
- - Anthropic Claude reliability incidents (Aug–Sep 2025)
- - Validates that Claude/Anthropic also experience service degradations; argues for multi‑provider routing instead of single‑vendor dependency.
TechCrunch – Anthropic reports outages (Sep 10, 2025)
techcrunch.com
- - Anthropic outage news coverage
- - Third‑party confirmation of Claude outages impacting APIs and Console; supports business case for failover.
Cloudflare engineering postmortems (2023–2025) incl. Jun 20, 2024 incident and Nov 2025 global outage coverage
blog.cloudflare.com
- - Cloudflare platform incidents impacting many downstream services
- - Illustrates that upstream internet infrastructure issues can cascade to AI APIs; motivates offline‑first caches and drill‑backed runbooks.
LiteLLM Router docs – Routing, fallbacks, cooldowns, budget routing
docs.litellm.ai
- - LiteLLM Router in production
- - A concrete, open‑source way to stand up a two‑provider router with budget/latency guardrails this week.
OpenRouter docs – Provider routing
openrouter.ai
- - OpenRouter provider‑level routing
- - Confirms that gateways can route across providers for the same model; highlights parameter handling trade‑offs.
AWS Prescriptive Guidance – Chaos engineering and game days
docs.aws.amazon.com
- - Chaos game days as resilience drills
- - Authoritative guidance for running short, recurring failover drills; maps neatly to a 30‑minute Friday drill for AI apps.
Google SRE site – Service Level Objectives; Incident Response; Wheel of Misfortune exercises
sre.google
- - SRE SLOs/error budgets and incident‑response practice
- - Industry‑standard framing for client‑facing SLOs, error budgets, and drills that keep failover paths working.
Redis docs – Write‑through and cache‑aside patterns
redis.io
- - Write‑through caching for offline‑first and rate‑limit resilience
- - Canonical caching patterns to back a travel‑mode local cache and API‑cost guardrails.
Anthropic Claude docs – Prompt caching
platform.claude.com
- - Provider‑side prompt caching to reduce latency/cost
- - Concrete API‑level optimization for large/system prompts; belongs in budget guardrails and latency SLOs.
OpenTelemetry – Generative AI semantic conventions; Langfuse docs
opentelemetry.io
- - Observability standards and tools for latency/error/cost SLOs and alerts
- - Lets teams emit consistent metrics/traces and alert on p95 latency, error rate, and spend per user/feature.

Santi: Stop telling clients which model you use. They don't care. They have never cared.

Kira: Okay — that's going to upset some people.

Santi: Good. Because I spent two years leading pitches with model names. "We use GPT-4." "We're on Claude." "We just migrated to the latest Anthropic release." And you know what happened every single time a provider went down?

Kira: The client called you.

Santi: The client called me. Not the provider. Me. November twenty-fifth, twenty twenty-four — OpenAI goes down for hours. Widespread timeouts, five-oh-three errors, the whole API is throwing cascading failures. And I'm in a café in Porto watching my Slack light up. Three clients. Same message. "Is our pipeline broken?"

Kira: And your answer was…

Santi: My answer was yes. Because I had one provider. One. No fallback. No failover. No cache. Just a direct dependency on a single API endpoint and a prayer that it would come back before my SLA window closed.

Kira: And it wasn't just OpenAI. Anthropic published a postmortem in September twenty twenty-five — three separate incidents that degraded Claude's API. TechCrunch confirmed outages hitting their console. Cloudflare had a global incident in twenty twenty-four that cascaded into half the AI services on the internet.

Santi: So here's the provocative part. The thing you should be selling is not your model. It's your uptime. It's a number. Ninety-nine point five percent of requests succeed. P95 latency under two and a half seconds. Average cost per request under a penny and a half. That's a promise a client can hold you to — and it's a promise that makes you worth more than the person who just says "we use the best model."

Kira: And you can actually back that up from a laptop in an airport?

Santi: That's what we're building today.

Kira: Here's the problem nobody talks about when they talk about LLM routing. Every provider goes down. Not maybe. Not occasionally. Regularly. And if your revenue depends on AI output — if clients are paying you for pipelines that run on inference — a single-provider architecture is a single point of failure with your name on it.

Santi: So today we're shipping the fix. A two-provider router you can stand up this week with LiteLLM, latency and cost budgets that keep your margins intact, a write-through cache for when you're working off airport wifi, and a thirty-minute failover drill you'll run Friday to prove the whole thing actually works. Plus — the part Kira's been pushing me on — client-facing SLO language you can paste into your next proposal so reliability becomes the reason they pick you.

Kira: So before we get into the build — I want to make sure everyone's on the same page about what an SLO actually is, because people confuse it with an SLA constantly.

Santi: Yeah, and the distinction matters for how you price this.

Kira: Right. So Google's SRE team — the people who literally wrote the book on this — they break it into three layers. You have an SLI, which is the measurement. Like, what's your p95 latency. Then you have the SLO, which is the target. "Ninety-five percent of requests complete in under two and a half seconds." And then the SLA is the contract — the legal commitment with consequences if you miss.

Santi: And most of us should be publishing SLOs, not SLAs. An SLO is a promise you're making to your client about how your system performs. An SLA is a legal obligation with penalties. You want the first one in your proposals. The second one is for when you've got a legal team and insurance.

Kira: And this is the important part — the SLO comes with an error budget. If your target is ninety-nine point five percent success rate over thirty days, that means you're allowed to fail on point five percent of requests. On, say, ten thousand requests a month, that's fifty allowed failures. That's your budget. You spend it on deploys, on experiments, on provider hiccups. And when it's gone, you freeze changes and stabilize.

Santi: Which is a framework most nomad founders have never even considered. We're out here shipping prompt updates on a Thursday afternoon from a hammock and hoping nothing breaks over the weekend.

Kira: Spoken from experience.

Santi: Direct experience. Multiple hammocks.

Kira: So the question becomes — how do you actually hit those numbers when you're one person, maybe two, running everything from a laptop?

Santi: Two providers. That's the starting point. Not five. Not a fancy model cascade. Two. A primary and a secondary, behind a single endpoint your app talks to. Your code never changes when you swap providers. It just calls the same URL.

Kira: Walk me through the actual setup.

Santi: So I use LiteLLM as a proxy. Open source, runs as a lightweight server. You define a config file — YAML — with two model deployments under the same alias. Your primary gets a weight of nine, your secondary gets a weight of one. Under normal conditions, almost everything goes to the primary. But if the primary starts throwing errors or timing out, LiteLLM retries in-group once, then fails over to the secondary automatically.

Kira: And the app doesn't know.

Santi: The app has no idea. It's hitting one endpoint — your proxy — and getting responses. The routing, the retries, the cooldowns, all of that happens behind the proxy. Your client code is just a standard OpenAI SDK call pointed at your own URL.

Kira: Okay but — and I know you're going to push back on this — doesn't adding a proxy add latency? You're putting another hop between your app and the provider.

Santi: It can. And this is where people get burned by gateways like OpenRouter that optimize for price or availability by default, not speed. Operators on Reddit report five to ten X variance in time-to-first-token across providers when they're not pinning. So yes — if you just throw requests at a gateway and let it pick, you might get routed to a cheaper, slower endpoint and your user experience tanks.

Kira: So what do you do?

Santi: You pin your primary. You set an explicit routing order — primary first, secondary only on failure. And you set a stream timeout. I use two seconds for time-to-first-token. If the primary doesn't start streaming in two seconds, it's a timeout, retry, failover. That way your p95 stays tight even when the primary is having a bad day.

Kira: And you keep a bypass switch.

Santi: Always. One environment variable — bypass router equals true — and your app calls the primary directly. If the proxy itself ever misbehaves, you flip that switch and you're back to single-provider in thirty seconds. That's your rollback.

Kira: So the router handles availability. What about cost? Because I've seen people set up multi-provider routing and then get surprised when their secondary provider is three times more expensive per token and their margins evaporate during a failover.

Santi: That's the budget guardrail layer. In LiteLLM you can set per-feature cost caps — maximum cost per request, maximum tokens in, maximum tokens out. If a request would exceed the cap, the router degrades gracefully instead of burning through your budget.

Kira: Degrades how?

Santi: Depends on how you configure it. You can truncate context, switch to a cheaper model alias, or return a cached response. The point is you decide in advance what happens when cost spikes — not in the moment when you're panicking.

Kira: I had a client last year — a B2B SaaS company — and they were running a content generation pipeline through a single provider. No cost caps. A prompt change accidentally doubled their average token count. They didn't notice for eleven days. By the time they caught it, they'd burned through almost two thousand dollars in unexpected API spend.

Santi: Eleven days.

Kira: Eleven days. Because they had no alerts. No per-request cost tracking. Just a monthly invoice that showed up and ruined someone's morning.

Santi: And that's exactly what the budget sheet prevents. You tag every request with tenant ID, feature, provider, input tokens, output tokens, cost. You pipe that into Langfuse or Helicone — both have free tiers — and you set three alerts. P95 latency over your target for fifteen minutes, page. Success rate below target for five minutes, page. Average cost per request over budget for fifteen minutes, page. That's it. Three alerts. You'll catch ninety percent of problems before your client does.

Kira: And the tagging is what makes the SLO enforceable. Without it, you're just guessing.

Santi: Alright — so we've got routing and we've got cost guardrails. The third piece is the cache, and this is the one that matters most when you're traveling.

Kira: Because wifi.

Santi: Because wifi. Because airport throttling. Because that café in Oaxaca where the connection drops every forty-five minutes and you're trying to demo a pipeline to a prospect.

Kira: I lost internet for three days in Guatemala once. Right before a major client deadline. Three days. And the only reason we survived is because my project management system had enough cached state that my contractors could keep working without me.

Santi: That's the principle. Ink and Switch published a research essay called "Local-First Software" — the core idea is you design for full offline read and write, then sync when connectivity returns. For AI apps, the practical version is a write-through cache. Every response from the router gets written to a local cache — Redis, SQLite, whatever — keyed on a normalized version of the prompt. Next time the same request comes in, you check the cache first. If it's fresh, you serve it instantly. No API call. No latency. No cost.

Kira: And on the client side?

Santi: Service worker. Intercepts fetch requests to your AI endpoint. If the network call fails — timeout, offline, whatever — it falls back to the cached response. Your user sees a result instead of a spinner. It's not perfect. The cache might be stale. But a slightly stale answer is infinitely better than a loading screen that never resolves.

Kira: And this is the important part — you're also saving money. Every cache hit is a request you didn't send to a provider. On high-volume features with repetitive prompts, I've seen cache hit rates above sixty percent.

Santi: Sixty?

Kira: On a content classification pipeline where the categories don't change often. The prompts are nearly identical. Normalize them, hash them, and most of them hit cache. My client's API bill dropped by more than half.

Santi: And then there's provider-side prompt caching on top of that. Anthropic lets you mark stable blocks in your prompt — system instructions, tool definitions, anything that doesn't change between requests — and they cache the encoding. Default TTL is five minutes. You can pay for a one-hour window. Cache hits reduce both latency and input token cost because the provider skips re-encoding those blocks.

Kira: So you've got two layers of caching. Your own write-through cache for full responses, and the provider's prompt cache for the expensive parts of the input.

Santi: Exactly. And both of them make your SLO easier to hit because you're reducing the number of full round-trips to the provider.

Kira: Okay — so now you've got the infrastructure. Router, budgets, cache. The piece that actually makes this a business advantage is putting it in writing for the client.

Santi: This is your territory.

Kira: It is. And I'll be direct — most founders I talk to are terrified of publishing reliability numbers because they think it creates liability. But an SLO is not a legal guarantee. It's a transparency commitment. You're saying "here's how we measure ourselves, here's what we target, and here's what we do when we miss."

Santi: And that's more than ninety-nine percent of AI agencies are offering right now.

Kira: Way more. Most proposals I review say something like "we use state-of-the-art AI models" and leave it at that. No numbers. No targets. No accountability. So when you show up with a one-pager that says "ninety-nine point five percent success rate, p95 latency under two and a half seconds, average cost per request under a penny and a half, measured over a rolling thirty-day window" — you're speaking a language that procurement teams and technical buyers actually trust.

Santi: I'll admit — I resisted this for a long time. I thought the technical build was enough. If the system works, the client's happy. Why put a number on it?

Kira: Because the number is what gets you past the technical buyer to the budget holder. The CTO understands your architecture. The VP of Operations understands "ninety-nine point five percent uptime." Different audiences, different languages.

Santi: Fair. And you've got SOW language for this?

Kira: Copy-paste ready. The kit has a clause you can drop into any proposal. It covers the SLO targets, the measurement method, the error budget policy, exclusions for scheduled maintenance, and a credit structure if you miss. Quick disclaimer — this is template language, not legal advice. Have your counsel review it before you ship it to a client. But the structure is sound and it's based on how Google's SRE team frames these commitments.

Santi: Last piece. The drill. Because none of this matters if you've never tested it under pressure.

Kira: And this is where I actually want to push you on something. Because I know your instinct is to automate the drill. Make it a cron job. Run it silently.

Santi: I mean — yeah. Why not?

Kira: Because the point of the drill isn't to test the system. It's to test you. AWS calls these chaos game days. Google calls them Wheel of Misfortune exercises. The whole idea is that a human sits in front of the dashboard, induces a failure, watches what happens, and practices the response. If you automate it, you're testing the automation. You're not testing whether you can actually respond to an incident at two AM from a hostel in Lisbon.

Santi: I've responded to incidents from worse places than a hostel in Lisbon.

Kira: I know you have. But that's exactly why you need the drill. Thirty minutes. Three roles — even if you're playing all three yourself. Drill lead runs the clock. Operator flips the switch. Scribe captures what happened. You revoke your primary provider's API key, watch the router fail over, confirm your p95 stays within target, then restore the key and verify everything's green.

Santi: And you tie the results back to your error budget. If the failover took longer than expected, or if your success rate dipped below the SLO during the drill, that's a finding. You log it, you fix it, and you run the drill again next quarter.

Kira: The kit has the full SOP. Pre-checks, three different ways to induce the failure, expected outcomes, rollback steps, and a closeout template. Thirty minutes. Every Friday until it's boring. And when it's boring, that's when you know it works.

Santi: When it's boring. That's the goal.

Kira: That's always the goal with infrastructure. If it's exciting, something's wrong.

Santi: So — back to where we started. I said stop telling clients which model you use. And I realize that sounds extreme. But the point isn't that models don't matter. The point is that your client doesn't care about your model. They care about whether the thing works when they need it to work. And if you can hand them a one-pager that says "here's my uptime target, here's my latency target, here's my cost target, and here's what I do when I miss" — you've just differentiated yourself from every other AI agency that leads with a model name and a vibe.

Kira: And the infrastructure to back it up is not a six-month project. It's a router config, three alerts, a cache layer, and a Friday drill. You can ship the first version this week. You can run the first drill this Friday. And by next Monday you'll have more operational visibility into your AI stack than most funded startups.

Santi: We put the whole thing in a kit — the SLO one-pager template, the budget guardrail sheet with alert thresholds, the router config, the cache recipe, the drill SOP with rollback steps, and the client-safe proposal language. It's on the Resources page. Grab it and ship this week.

Kira: One action. This week. Stand up the router with two providers. Set the three alerts. Run the drill Friday. That's it. Everything else — the cache, the SLO language, the observability — you layer in over the next two weeks. But the router and the drill come first.

Santi: Because the next time a provider goes down — and it will go down — you want to be the person whose system fails over silently, not the person whose Slack is on fire.

Kira: Ship the boring infrastructure. Sell the boring promise. Win the clients who care about reliability more than hype.

Santi: See you Wednesday.

Kira: See you Wednesday.

LLM routingmulti-provider routingSLOserror budgetsAI reliabilityfailover drillsLiteLLMcost guardrailstravel-mode cacheclient proposalsnomad infrastructureAI business operations