Guide

The Reliability SLO Kit for AI Apps: Two‑Provider Router + Budgets + 30‑Minute Failover Drill

A practical, copy‑paste kit for nomad‑run AI apps: fill a client‑safe SLO, set budget guardrails, drop in a two‑provider LiteLLM router and a travel‑mode cache, and run a 30‑minute failover drill with a clear rollback.

Ship a client‑facing reliability promise this week, then back it with a tiny, robust stack you can run from a café: a two‑provider router, concrete latency/cost budgets, a write‑through cache for travel‑mode, and a 30‑minute failover drill you’ll actually run. Copy what you need, fill in the brackets, and go.

How to use this kit this week (quick path)

Use this kit as a one‑page operating contract you can share with clients and your team. Fill the SLO template, set your budgets and alerts, stand up the router and cache, then run the drill on Friday. Keep it minimal: two providers only, one OpenAI‑compatible surface, and clear rollback switches.

SLO one‑pager (fill‑in template)

Copy this into your repo/wiki as slo.md. Replace [PLACEHOLDERS] with your details and publish it to your client (proposal/SOW link or changelog).

[SERVICE_NAME] — Reliability SLO (Client‑Facing)

Scope

  • Covered features: [FEATURE_A], [FEATURE_B].
  • Traffic profile: [AVG_REQUESTS_PER_DAY] requests/day; peak [PEAK_RPS] rps.
  • Measurement window: rolling [WINDOW_DAYS] days (default: 30 days).

SLIs (how we measure)

  • Success rate: 1 − (retryable_error_count + timeout_count) / total_requests at the router edge.
  • p95 end‑to‑end latency: time from client request arriving at our API edge to first token/byte returned.
  • Cost per request: provider charges plus gateway fees divided by total_requests, reported as average and p95.

Targets (SLOs)

  • Success rate: ≥ [SLO_SUCCESS_RATE]% over [WINDOW_DAYS] days. Example default: 99.5%.
  • p95 latency: ≤ [SLO_P95_LATENCY_SECONDS]s. Example default: 2.5s.
  • Avg cost/request: ≤ $[SLO_COST_AVG_USD]. Example default: 0.015.

Error budget

  • Budget = 1 − (SLO_SUCCESS_RATE/100). Over [WINDOW_DAYS] days and [EST_MONTHLY_REQUESTS] requests, allowed errors = [ERROR_BUDGET_REQS].

Measurement + tooling

  • Metrics source of truth: [OBS_TOOL] (e.g., Langfuse/Helicone + OpenTelemetry GenAI traces).
  • Time sync: UTC. Retention: [RETENTION_DAYS] days.

Burn‑rate policy (what we do when it slips)

  • Fast burn: If success‑rate burn‑rate > 2× over 1 hour, trigger incident, fail over to secondary, and freeze launches.
  • Slow burn: If monthly error budget > 50% consumed by day 15 or > 80% by day 25, reduce model cost, enable prompt caching, and review prompts/context.

Change management

  • Any model/provider change must pass the p95 latency target on 200 sample requests and keep avg cost within SLO for [TRIAL_DAYS] days before promotion.

Exclusions (transparent to client)

  • Scheduled maintenance windows (≤ [MAINT_MINUTES] minutes, [MAINT_COUNT] per quarter) announced 48h in advance.
  • Abuse or usage outside stated limits.

Contact + status

  • Incident comms: [ONCALL_CHANNEL] (e.g., status page + email list). Review cadence: monthly. —

Budget guardrails (per‑feature, per‑user, per‑tenant)

Put this next to your SLO, or keep as budgets.md. These are concrete caps and alerts that hold the line on spend without guesswork.

Per‑feature caps (defaults — tune per app)

  • [FEATURE_A]
    • Max input tokens: [MAX_INPUT_TOKENS] (cap context length).
    • Max output tokens: [MAX_OUTPUT_TOKENS].
    • Cost cap per request: $[MAX_COST_PER_REQ] (example: 0.020). Abort/degrade if exceeded.
    • Fallback rule: if stream_timeout > [TTFT_S] or retryable_error, retry in‑group once, then fail over.
  • [FEATURE_B]
    • Cache policy: write‑through; TTL [CACHE_TTL_MIN] min; cache key = normalized prompt + [TENANT_ID].
    • Prompt caching: mark stable system/tooling blocks cacheable where supported.

Per‑user/tenant caps

  • Per‑user monthly cost cap: $[USER_COST_CAP] (alert at 60/85/95%).
  • Per‑tenant monthly cost cap: $[TENANT_COST_CAP] with the same alert ladder.
  • Request rate caps: [RPS] sustained, [BURST_RPS] burst; reject with retry‑after when exceeded.

Alerts (wire to Slack/Webhook)

  • p95 latency > [P95_TARGET]s for 15 min → page.
  • Success rate < [SUCCESS_TARGET]% for 5 min → page; for 60 min → incident.
  • Cost/request avg > $[COST_TARGET] for 15 min → page; monthly spend > [BUDGET_80]% by day 20 → review; > [BUDGET_95]% any day → safeguard mode (degrade model/context).

Implementation notes

  • Tag every request with: tenant_id, user_id, feature, provider, model, input_tokens, output_tokens, latency_ms, cost_usd, cache_hit.
  • Use your router’s spend‑tracking or callbacks to stream metrics into [LANGFUSE_OR_HELICONE]; set the three alerts above on feature=* and tenant_id=* views.
  • Keep a single OpenAI‑compatible surface so the app never changes SDKs when you swap providers.

Minimal two‑provider router (LiteLLM) — copy/paste config

Drop this into litellm_config.yaml. It exposes one OpenAI‑compatible model alias that prefers your primary, then fails over cleanly. Values shown are safe defaults; adjust to your targets. See your router docs for allowed keys.

# litellm_config.yaml — two‑provider router with latency + cost guardrails
model_list:
  # Primary deployment (order=1)
  - model_name: app/chat                # Your stable alias used by clients
    litellm_params:
      model: [PROVIDER_A_MODEL_ID]      # e.g., openai/gpt‑X or anthropic/claude‑Y
      api_key: os.environ/PROVIDER_A_KEY
      api_base: os.environ/PROVIDER_A_API_BASE   # if required by provider
      order: 1
      timeout: 30                       # hard cap per call (s)
      stream_timeout: 2                 # time‑to‑first‑token cap (s)
      weight: 9                         # prefer primary under normal ops

  # Secondary deployment (order=2)
  - model_name: app/chat
    litellm_params:
      model: [PROVIDER_B_MODEL_ID]
      api_key: os.environ/PROVIDER_B_KEY
      api_base: os.environ/PROVIDER_B_API_BASE
      order: 2
      timeout: 30
      stream_timeout: 2
      weight: 1

router_settings:
  routing_strategy: simple-shuffle      # recommended default
  enable_weighted_failover: true        # retry peers before cross‑provider failover
  allowed_fails: 3                      # cooldown if &gt;3 failures/min
  cooldown_time: 30                     # seconds to cool a failing deployment
  timeout: 30                           # global guard (s)
  # optional: latency window for latency‑based groups
  # routing_groups:
  #   - group_name: latency-sensitive
  #     models: [app/chat]
  #     routing_strategy: latency-based-routing
  #     routing_strategy_args: { ttl: 60 }

litellm_settings:
  master_key: sk‑local‑dev‑key          # virtual key clients use against the proxy
  database_url: [POSTGRES_URL]          # if you use the built‑in spend tracking

Client code always points to your proxy base URL and alias:

from openai import OpenAI
client = OpenAI(api_key=&quot;sk-local-dev-key&quot;, base_url=&quot;https://router.yourdomain.com&quot;)
resp = client.chat.completions.create(model=&quot;app/chat&quot;, messages=[{&quot;role&quot;:&quot;user&quot;,&quot;content&quot;:&quot;Ping&quot;}])

Operator notes

  • Pin providers on latency‑critical paths (set order and keep routing_strategy=simple-shuffle).
  • Keep max_retries modest (2–3) and prefer fast failover to meet p95.
  • Add rpm/tpm per deployment to respect provider rate limits and avoid hot spots.

Travel‑mode write‑through cache (server + client)

Goal: keep answers hot and predictable when Wi‑Fi is shaky, and shave cost/latency.

Server‑side write‑through recipe (Node/Redis pseudo‑code)

// key = stable hash of (tenant_id + feature + normalized_prompt)
const key = `genai:${tenantId}:${feature}:${hash(normalize(prompt))}`
const cached = await redis.get(key)
if (cached &amp;&amp; isFresh(cached.ttl)) return JSON.parse(cached.value)

const result = await callRouter(prompt, opts)           // calls model alias (e.g., app/chat)
await redis.set(key, JSON.stringify(result), { EX: 3600 }) // write‑through; keep hot for 60m
return result

Browser travel‑mode (Service Worker)

self.addEventListener(&#39;fetch&#39;, (event) =&gt; {
  const url = new URL(event.request.url)
  if (url.pathname === &#39;/ai&#39;) {
    event.respondWith((async () =&gt; {
      try { return await fetch(event.request) } catch (e) {
        const cache = await caches.open(&#39;ai-cache-v1&#39;)
        const cached = await cache.match(event.request)
        if (cached) return cached
        return new Response(JSON.stringify({ error: &#39;offline&#39; }), { status: 503 })
      }
    })())
  }
})

Implementation tips

  • Normalize prompts (trim, lowercase, remove volatile IDs) so cache keys hit.
  • Cache only safe, non‑PII outputs; encrypt at rest if needed. Add cache_hit to telemetry.
  • Use provider prompt‑caching where supported for heavy, stable system/tool blocks.
  • Cap cached TTLs per feature; purge eagerly on prompt/schema changes.

30‑minute failover drill SOP + rollback plan

Run this in a staging environment wired to real providers, or in prod on a low‑risk feature flag. Keep it under 30 minutes.

Roles

  • Drill lead (DL): runs the clock and calls rollback.
  • Operator: flips flags/keys and watches dashboards.
  • Scribe: captures timings, screenshots, and diffs.

Pre‑checks (5 min)

  • Confirm you can bypass the router in one switch: BYPASS_ROUTER=true path or direct‑to‑primary client.
  • Pull current p95 latency, success rate, and cost/request baselines for [FEATURE].
  • Open on‑call channel + status note: “Failover drill (non‑customer impacting).”

Induce failure (5–10 min)

  • Option A (preferred): temporarily revoke/rotate the primary provider key in the router.
  • Option B: add an outbound firewall rule blocking egress to [PRIMARY_API_BASE] from the app for 10 minutes.
  • Option C: raise stream_timeout to an unrealistically low value (e.g., 10–50ms) on primary to force timeouts.

Expected outcome (during failure)

  • Router retries in‑group, then fails over cross‑provider within [FAILOVER_TARGET_SECONDS]s.
  • p95 latency stays ≤ [SLO_P95_LATENCY_SECONDS]s; success rate ≥ [MIN_SUCCESS_DURING_FAILOVER]%.
  • Cost/request may change; stays within budget guardrail for [FEATURE].

Rollback (2 min)

  • Restore the key/remove firewall rule; confirm health checks green.
  • Clear any manual cooldowns; redeploy router config if edited.

Verification (8–12 min)

  • Pull 15‑minute drill window metrics; annotate graphs “Failover drill [YYYY‑MM‑DD]”.
  • Run 10 requests on the bypass path; confirm parity in responses.
  • File issues for any broken env vars, stale keys, or missing alerts.

Closeout (3 min)

  • Post a 3‑line summary in on‑call channel: “TTR=[SECONDS], p95=[VALUE]s, SR=[VALUE]%.”
  • Log follow‑ups; tie to error‑budget policy if targets were missed.

Client‑safe SLO language (proposal/SOW copy)

Copy into proposals/SOWs. Keep it plain and measurable. This is not legal advice; have your counsel review.

SOW clause: Reliability SLO

  • Provider: We operate an AI inference service (the “Service”) that routes across approved model providers. We commit to the Service Level Objectives below, measured at our API edge over a rolling [WINDOW_DAYS]‑day window.
  • SLOs: Success rate ≥ [SLO_SUCCESS_RATE]%; p95 end‑to‑end latency ≤ [SLO_P95_LATENCY_SECONDS] seconds; average cost/request ≤ $[SLO_COST_AVG_USD].
  • Measurement: Metrics are collected via standardized telemetry across providers. We will provide access to monthly reports and on‑request drill results.
  • Error budget and remediation: If the error budget is exhausted within a window, we will (1) pause changes that risk reliability, (2) prioritize mitigations (provider pinning, prompt caching, context reduction), and (3) run a joint review.
  • Exclusions: Scheduled maintenance (≤ [MAINT_MINUTES] minutes, [MAINT_COUNT] per quarter, announced 48 hours in advance), third‑party abuse, or use outside agreed rate/volume limits.
  • Credits: If we miss the success‑rate SLO for a full [WINDOW_DAYS]‑day window, we will apply a service credit of [CREDIT_%]% of that month’s AI usage fees.

Proposal appendix: Operational transparency

  • We run a 30‑minute failover drill at least quarterly and on major provider changes. We share the drill summary and any reliability changes that follow.

Observability + alerts (what to send, where to watch)

Minimum wiring to make the SLOs visible and enforceable.

What to emit on every request

  • Identifiers: tenant_id, user_id, feature, request_id.
  • Provider details: provider, model, region (if applicable).
  • Performance: latency_ms_total, ttft_ms (time‑to‑first‑token), retries, failover_hops.
  • Cost: input_tokens, output_tokens, cost_usd.
  • Cache: cache_hit (true/false), cache_ttl_s.

OpenTelemetry GenAI + router callbacks

  • Add OpenTelemetry GenAI spans around your /ai handler; export to your metrics store (Grafana/Datadog/etc.).
  • Enable your router’s success callbacks to stream events to Langfuse/Helicone. Tag traces with feature and tenant_id so alerts can filter cleanly.

Alert presets (start here; tune after Week 1)

  • Latency: alert when p95(latency_ms_total) > [P95_TARGET]s for 15 minutes on feature=* and per tenant_id.
  • Success rate: alert when &lt; [SUCCESS_TARGET]% for 5 minutes (page) and 60 minutes (incident).
  • Cost: alert when avg(cost_usd) > $[COST_TARGET] for 15 minutes; monthly budget at 60/85/95%.

Runbooks to keep nearby

  • “Bypass router” switch + test client.
  • “Pin provider” instructions for hot paths.
  • “Reduce cost now” steps: enable prompt caching; trim system prompts; cap output tokens; switch to cheaper alias for non‑critical features.

Documentation hygiene

  • Put Last updated: [YYYY‑MM‑DD] on the SLO page and keep a short changelog when swapping providers or changing targets.