Santi: Stop telling clients which model you use. They don't care. They have never cared.
Kira: Okay — that's going to upset some people.
Santi: Good. Because I spent two years leading pitches with model names. "We use GPT-4." "We're on Claude." "We just migrated to the latest Anthropic release." And you know what happened every single time a provider went down?
Kira: The client called you.
Santi: The client called me. Not the provider. Me. November twenty-fifth, twenty twenty-four — OpenAI goes down for hours. Widespread timeouts, five-oh-three errors, the whole API is throwing cascading failures. And I'm in a café in Porto watching my Slack light up. Three clients. Same message. "Is our pipeline broken?"
Kira: And your answer was…
Santi: My answer was yes. Because I had one provider. One. No fallback. No failover. No cache. Just a direct dependency on a single API endpoint and a prayer that it would come back before my SLA window closed.
Kira: And it wasn't just OpenAI. Anthropic published a postmortem in September twenty twenty-five — three separate incidents that degraded Claude's API. TechCrunch confirmed outages hitting their console. Cloudflare had a global incident in twenty twenty-four that cascaded into half the AI services on the internet.
Santi: So here's the provocative part. The thing you should be selling is not your model. It's your uptime. It's a number. Ninety-nine point five percent of requests succeed. P95 latency under two and a half seconds. Average cost per request under a penny and a half. That's a promise a client can hold you to — and it's a promise that makes you worth more than the person who just says "we use the best model."
Kira: And you can actually back that up from a laptop in an airport?
Santi: That's what we're building today.
Kira: Here's the problem nobody talks about when they talk about LLM routing. Every provider goes down. Not maybe. Not occasionally. Regularly. And if your revenue depends on AI output — if clients are paying you for pipelines that run on inference — a single-provider architecture is a single point of failure with your name on it.
Santi: So today we're shipping the fix. A two-provider router you can stand up this week with LiteLLM, latency and cost budgets that keep your margins intact, a write-through cache for when you're working off airport wifi, and a thirty-minute failover drill you'll run Friday to prove the whole thing actually works. Plus — the part Kira's been pushing me on — client-facing SLO language you can paste into your next proposal so reliability becomes the reason they pick you.
Kira: So before we get into the build — I want to make sure everyone's on the same page about what an SLO actually is, because people confuse it with an SLA constantly.
Santi: Yeah, and the distinction matters for how you price this.
Kira: Right. So Google's SRE team — the people who literally wrote the book on this — they break it into three layers. You have an SLI, which is the measurement. Like, what's your p95 latency. Then you have the SLO, which is the target. "Ninety-five percent of requests complete in under two and a half seconds." And then the SLA is the contract — the legal commitment with consequences if you miss.
Santi: And most of us should be publishing SLOs, not SLAs. An SLO is a promise you're making to your client about how your system performs. An SLA is a legal obligation with penalties. You want the first one in your proposals. The second one is for when you've got a legal team and insurance.
Kira: And this is the important part — the SLO comes with an error budget. If your target is ninety-nine point five percent success rate over thirty days, that means you're allowed to fail on point five percent of requests. On, say, ten thousand requests a month, that's fifty allowed failures. That's your budget. You spend it on deploys, on experiments, on provider hiccups. And when it's gone, you freeze changes and stabilize.
Santi: Which is a framework most nomad founders have never even considered. We're out here shipping prompt updates on a Thursday afternoon from a hammock and hoping nothing breaks over the weekend.
Kira: Spoken from experience.
Santi: Direct experience. Multiple hammocks.
Kira: So the question becomes — how do you actually hit those numbers when you're one person, maybe two, running everything from a laptop?
Santi: Two providers. That's the starting point. Not five. Not a fancy model cascade. Two. A primary and a secondary, behind a single endpoint your app talks to. Your code never changes when you swap providers. It just calls the same URL.
Kira: Walk me through the actual setup.
Santi: So I use LiteLLM as a proxy. Open source, runs as a lightweight server. You define a config file — YAML — with two model deployments under the same alias. Your primary gets a weight of nine, your secondary gets a weight of one. Under normal conditions, almost everything goes to the primary. But if the primary starts throwing errors or timing out, LiteLLM retries in-group once, then fails over to the secondary automatically.
Kira: And the app doesn't know.
Santi: The app has no idea. It's hitting one endpoint — your proxy — and getting responses. The routing, the retries, the cooldowns, all of that happens behind the proxy. Your client code is just a standard OpenAI SDK call pointed at your own URL.
Kira: Okay but — and I know you're going to push back on this — doesn't adding a proxy add latency? You're putting another hop between your app and the provider.
Santi: It can. And this is where people get burned by gateways like OpenRouter that optimize for price or availability by default, not speed. Operators on Reddit report five to ten X variance in time-to-first-token across providers when they're not pinning. So yes — if you just throw requests at a gateway and let it pick, you might get routed to a cheaper, slower endpoint and your user experience tanks.
Kira: So what do you do?
Santi: You pin your primary. You set an explicit routing order — primary first, secondary only on failure. And you set a stream timeout. I use two seconds for time-to-first-token. If the primary doesn't start streaming in two seconds, it's a timeout, retry, failover. That way your p95 stays tight even when the primary is having a bad day.
Kira: And you keep a bypass switch.
Santi: Always. One environment variable — bypass router equals true — and your app calls the primary directly. If the proxy itself ever misbehaves, you flip that switch and you're back to single-provider in thirty seconds. That's your rollback.
Kira: So the router handles availability. What about cost? Because I've seen people set up multi-provider routing and then get surprised when their secondary provider is three times more expensive per token and their margins evaporate during a failover.
Santi: That's the budget guardrail layer. In LiteLLM you can set per-feature cost caps — maximum cost per request, maximum tokens in, maximum tokens out. If a request would exceed the cap, the router degrades gracefully instead of burning through your budget.
Kira: Degrades how?
Santi: Depends on how you configure it. You can truncate context, switch to a cheaper model alias, or return a cached response. The point is you decide in advance what happens when cost spikes — not in the moment when you're panicking.
Kira: I had a client last year — a B2B SaaS company — and they were running a content generation pipeline through a single provider. No cost caps. A prompt change accidentally doubled their average token count. They didn't notice for eleven days. By the time they caught it, they'd burned through almost two thousand dollars in unexpected API spend.
Santi: Eleven days.
Kira: Eleven days. Because they had no alerts. No per-request cost tracking. Just a monthly invoice that showed up and ruined someone's morning.
Santi: And that's exactly what the budget sheet prevents. You tag every request with tenant ID, feature, provider, input tokens, output tokens, cost. You pipe that into Langfuse or Helicone — both have free tiers — and you set three alerts. P95 latency over your target for fifteen minutes, page. Success rate below target for five minutes, page. Average cost per request over budget for fifteen minutes, page. That's it. Three alerts. You'll catch ninety percent of problems before your client does.
Kira: And the tagging is what makes the SLO enforceable. Without it, you're just guessing.
Santi: Alright — so we've got routing and we've got cost guardrails. The third piece is the cache, and this is the one that matters most when you're traveling.
Kira: Because wifi.
Santi: Because wifi. Because airport throttling. Because that café in Oaxaca where the connection drops every forty-five minutes and you're trying to demo a pipeline to a prospect.
Kira: I lost internet for three days in Guatemala once. Right before a major client deadline. Three days. And the only reason we survived is because my project management system had enough cached state that my contractors could keep working without me.
Santi: That's the principle. Ink and Switch published a research essay called "Local-First Software" — the core idea is you design for full offline read and write, then sync when connectivity returns. For AI apps, the practical version is a write-through cache. Every response from the router gets written to a local cache — Redis, SQLite, whatever — keyed on a normalized version of the prompt. Next time the same request comes in, you check the cache first. If it's fresh, you serve it instantly. No API call. No latency. No cost.
Kira: And on the client side?
Santi: Service worker. Intercepts fetch requests to your AI endpoint. If the network call fails — timeout, offline, whatever — it falls back to the cached response. Your user sees a result instead of a spinner. It's not perfect. The cache might be stale. But a slightly stale answer is infinitely better than a loading screen that never resolves.
Kira: And this is the important part — you're also saving money. Every cache hit is a request you didn't send to a provider. On high-volume features with repetitive prompts, I've seen cache hit rates above sixty percent.
Santi: Sixty?
Kira: On a content classification pipeline where the categories don't change often. The prompts are nearly identical. Normalize them, hash them, and most of them hit cache. My client's API bill dropped by more than half.
Santi: And then there's provider-side prompt caching on top of that. Anthropic lets you mark stable blocks in your prompt — system instructions, tool definitions, anything that doesn't change between requests — and they cache the encoding. Default TTL is five minutes. You can pay for a one-hour window. Cache hits reduce both latency and input token cost because the provider skips re-encoding those blocks.
Kira: So you've got two layers of caching. Your own write-through cache for full responses, and the provider's prompt cache for the expensive parts of the input.
Santi: Exactly. And both of them make your SLO easier to hit because you're reducing the number of full round-trips to the provider.
Kira: Okay — so now you've got the infrastructure. Router, budgets, cache. The piece that actually makes this a business advantage is putting it in writing for the client.
Santi: This is your territory.
Kira: It is. And I'll be direct — most founders I talk to are terrified of publishing reliability numbers because they think it creates liability. But an SLO is not a legal guarantee. It's a transparency commitment. You're saying "here's how we measure ourselves, here's what we target, and here's what we do when we miss."
Santi: And that's more than ninety-nine percent of AI agencies are offering right now.
Kira: Way more. Most proposals I review say something like "we use state-of-the-art AI models" and leave it at that. No numbers. No targets. No accountability. So when you show up with a one-pager that says "ninety-nine point five percent success rate, p95 latency under two and a half seconds, average cost per request under a penny and a half, measured over a rolling thirty-day window" — you're speaking a language that procurement teams and technical buyers actually trust.
Santi: I'll admit — I resisted this for a long time. I thought the technical build was enough. If the system works, the client's happy. Why put a number on it?
Kira: Because the number is what gets you past the technical buyer to the budget holder. The CTO understands your architecture. The VP of Operations understands "ninety-nine point five percent uptime." Different audiences, different languages.
Santi: Fair. And you've got SOW language for this?
Kira: Copy-paste ready. The kit has a clause you can drop into any proposal. It covers the SLO targets, the measurement method, the error budget policy, exclusions for scheduled maintenance, and a credit structure if you miss. Quick disclaimer — this is template language, not legal advice. Have your counsel review it before you ship it to a client. But the structure is sound and it's based on how Google's SRE team frames these commitments.
Santi: Last piece. The drill. Because none of this matters if you've never tested it under pressure.
Kira: And this is where I actually want to push you on something. Because I know your instinct is to automate the drill. Make it a cron job. Run it silently.
Santi: I mean — yeah. Why not?
Kira: Because the point of the drill isn't to test the system. It's to test you. AWS calls these chaos game days. Google calls them Wheel of Misfortune exercises. The whole idea is that a human sits in front of the dashboard, induces a failure, watches what happens, and practices the response. If you automate it, you're testing the automation. You're not testing whether you can actually respond to an incident at two AM from a hostel in Lisbon.
Santi: I've responded to incidents from worse places than a hostel in Lisbon.
Kira: I know you have. But that's exactly why you need the drill. Thirty minutes. Three roles — even if you're playing all three yourself. Drill lead runs the clock. Operator flips the switch. Scribe captures what happened. You revoke your primary provider's API key, watch the router fail over, confirm your p95 stays within target, then restore the key and verify everything's green.
Santi: And you tie the results back to your error budget. If the failover took longer than expected, or if your success rate dipped below the SLO during the drill, that's a finding. You log it, you fix it, and you run the drill again next quarter.
Kira: The kit has the full SOP. Pre-checks, three different ways to induce the failure, expected outcomes, rollback steps, and a closeout template. Thirty minutes. Every Friday until it's boring. And when it's boring, that's when you know it works.
Santi: When it's boring. That's the goal.
Kira: That's always the goal with infrastructure. If it's exciting, something's wrong.
Santi: So — back to where we started. I said stop telling clients which model you use. And I realize that sounds extreme. But the point isn't that models don't matter. The point is that your client doesn't care about your model. They care about whether the thing works when they need it to work. And if you can hand them a one-pager that says "here's my uptime target, here's my latency target, here's my cost target, and here's what I do when I miss" — you've just differentiated yourself from every other AI agency that leads with a model name and a vibe.
Kira: And the infrastructure to back it up is not a six-month project. It's a router config, three alerts, a cache layer, and a Friday drill. You can ship the first version this week. You can run the first drill this Friday. And by next Monday you'll have more operational visibility into your AI stack than most funded startups.
Santi: We put the whole thing in a kit — the SLO one-pager template, the budget guardrail sheet with alert thresholds, the router config, the cache recipe, the drill SOP with rollback steps, and the client-safe proposal language. It's on the Resources page. Grab it and ship this week.
Kira: One action. This week. Stand up the router with two providers. Set the three alerts. Run the drill Friday. That's it. Everything else — the cache, the SLO language, the observability — you layer in over the next two weeks. But the router and the drill come first.
Santi: Because the next time a provider goes down — and it will go down — you want to be the person whose system fails over silently, not the person whose Slack is on fire.
Kira: Ship the boring infrastructure. Sell the boring promise. Win the clients who care about reliability more than hype.
Santi: See you Wednesday.
Kira: See you Wednesday.