Nomad‑Proof Model Failover SOP (+ JSON Router & Code Snippets)
A copy‑and‑ship SOP for nomad founders to implement a 3‑tier LLM failover in a weekend: Hot (alt model/endpoint), Warm (different provider), Cold (queue + human‑in‑loop) with a budget pause tied to spend caps. Includes a JSON routing map, Node/Python wrappers, a Redis/BullMQ queue with a pause flag, and a 15‑minute “Lisbon Test.”
Use this SOP when your product or client delivery depends on LLM APIs and a single‑vendor outage or runaway bill could halt work while you’re on the road. Owners: founder/lead dev. Goal: ship a 3‑tier failover in a weekend that (a) auto‑routes requests across Hot → Warm → Cold, (b) pauses safely at a budget threshold, and (c) passes the “Lisbon Test” (works on café Wi‑Fi and async teams). Expected outcome: a working router, a cold queue with a pause flag, and a budget guardrail wired to spend caps.
1. Define scope, SLOs, and degrade modes
List the exact jobs this router must protect (e.g., summarize call notes, draft email, run RAG answer). For each, set a max time‑to‑answer (e.g., 30s) and a graceful degrade if both Hot and Warm fail (e.g., shorter context, smaller model, or queue for human review). Capture these in a short YAML so engineers and VAs know what “good enough” means.
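The scope file itself can stay tiny. As an illustrative sketch (expressed in TypeScript rather than YAML so it type-checks; the field names and helper are assumptions, not a spec), a job/SLO map might look like:

```typescript
// Illustrative sketch of the scope/SLO data the YAML would carry.
// Field names and values are assumptions for this example.
type DegradeMode = 'shorter_context' | 'smaller_model' | 'queue_for_human';

interface JobSlo {
  job: string;                // the job this router protects
  maxSecondsToAnswer: number; // hard SLO for time-to-answer
  degrade: DegradeMode;       // what "good enough" means when Hot and Warm fail
}

const slos: JobSlo[] = [
  { job: 'summarize_call_notes', maxSecondsToAnswer: 30, degrade: 'shorter_context' },
  { job: 'draft_email',          maxSecondsToAnswer: 30, degrade: 'smaller_model' },
  { job: 'rag_answer',           maxSecondsToAnswer: 45, degrade: 'queue_for_human' },
];

// Lookup helper so the router and workers share one definition of "good enough".
function sloFor(job: string): JobSlo | undefined {
  return slos.find(s => s.job === job);
}
```

A YAML version would mirror the same three fields, one entry per job.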
2. Provision credentials and environment
Create a .env (or secrets manager entries) for two Hot routes on the same provider and two Warm routes on different providers. Add Redis for queues and a pause flag. Example keys:
```
ANTHROPIC_BASE_URL=https://your-gateway.example.com/v1
ANTHROPIC_BASE_URL_2=https://your-gateway-2.example.com/v1
ANTHROPIC_KEY=sk-live-...
OPENAI_API_KEY=sk-live-...
GEMINI_OPENAI_COMPAT=https://gemini-openai-compat.example.com/v1
GEMINI_API_KEY=ya29....
REDIS_URL=redis://localhost:6379/0
BUDGET_PAUSE_AT_PERCENT=80
```

3. Author the routing map (JSON)
Create a provider‑agnostic map the app can read at boot. Treat “Hot” as alternate model/endpoint on the same provider; treat “Warm” as different provider equivalents; “Cold” defines the queue and notify targets.
```json
{
  "hot": [
    { "provider": "anthropic", "model": "sonnet-4.6", "baseUrl": "${ANTHROPIC_BASE_URL}" },
    { "provider": "anthropic", "model": "haiku-4.5", "baseUrl": "${ANTHROPIC_BASE_URL_2}" }
  ],
  "warm": [
    { "provider": "openai", "model": "gpt-4o-mini", "baseUrl": "https://api.openai.com/v1" },
    { "provider": "google", "model": "gemini-3-flash", "baseUrl": "${GEMINI_OPENAI_COMPAT}" }
  ],
  "cold": { "mode": "graceful_degrade", "queue": "ai-tasks", "notify": ["ops@yourco.com"] }
}
```

Note: OpenAI-compatible endpoints for non-OpenAI providers typically require a gateway (e.g., OpenWebUI, Ollama-Proxy, or your edge). Test response schemas, tool-calling, and JSON modes before trusting the Warm tier in production.
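At boot the app has to turn the `${VAR}` placeholders in the map into real URLs. A minimal sketch, assuming every placeholder names an environment variable (`expandEnv` and `loadRoutes` are hypothetical helpers, not part of any library); unset variables throw early so a misconfigured route fails at startup, not mid-request:

```typescript
// Expand "${VAR}" placeholders in a string using the process environment.
function expandEnv(value: string, env: Record<string, string | undefined> = process.env): string {
  return value.replace(/\$\{(\w+)\}/g, (_, name) => {
    const v = env[name];
    if (v === undefined) throw new Error(`Routing map references unset env var: ${name}`);
    return v;
  });
}

// Walk the parsed routes.json and expand every baseUrl in place.
function loadRoutes(raw: any, env?: Record<string, string | undefined>) {
  for (const tier of ['hot', 'warm'] as const) {
    for (const r of raw[tier] ?? []) r.baseUrl = expandEnv(r.baseUrl, env);
  }
  return raw;
}
```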
4. Implement the Node router with retries + circuit breaker (Hot/Warm)
Install libs: `npm i cockatiel undici`. Expected outcome: calls try Hot routes with jittered retries; on consecutive failures, the circuit opens and traffic goes to Warm; if all fail, we enqueue Cold.

```typescript
import { ConsecutiveBreaker, ExponentialBackoff, retry, handleAll, circuitBreaker, wrap, CircuitState } from 'cockatiel';
import { fetch } from 'undici';
import { enqueueCold } from './queue';

// Open after 5 consecutive failures; probe again (half-open) after 30s.
const breakerPolicy = circuitBreaker(handleAll, {
  halfOpenAfter: 30_000,
  breaker: new ConsecutiveBreaker(5),
});
// 3 attempts with exponential backoff; cockatiel applies decorrelated jitter by default.
const retryPolicy = retry(handleAll, { maxAttempts: 3, backoff: new ExponentialBackoff({ initialDelay: 200 }) });
const policy = wrap(retryPolicy, breakerPolicy);

async function callOpenAICompat(baseUrl: string, apiKey: string, body: any) {
  return fetch(`${baseUrl}/chat/completions`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
      'Idempotency-Key': body.idempotencyKey,
    },
    body: JSON.stringify(body),
  });
}

export async function routeWithFailover(body: any, routes: any) {
  for (const r of routes.hot) {
    if (breakerPolicy.state === CircuitState.Open) break; // skip Hot while the circuit is open
    try {
      const res = await policy.execute(() => callOpenAICompat(r.baseUrl, process.env.ANTHROPIC_KEY!, body));
      if (res.ok) return res;
    } catch { /* retries exhausted or circuit opened; fall through to Warm */ }
  }
  for (const r of routes.warm) {
    const key = r.provider === 'openai' ? process.env.OPENAI_API_KEY! : process.env.GEMINI_API_KEY!;
    const base = r.provider === 'openai' ? r.baseUrl : process.env.GEMINI_OPENAI_COMPAT!;
    try {
      const res = await policy.execute(() => callOpenAICompat(base, key, body));
      if (res.ok) return res;
    } catch { /* fall through to Cold */ }
  }
  await enqueueCold(body); // queue + notify human
  return new Response(JSON.stringify({ status: 'queued' }), { status: 202 });
}
```

5. Provide a Python variant for services/workers
Install: `pip install tenacity pybreaker requests redis`. Use jittered retries + a circuit breaker. Expected outcome: parity with the Node wrapper for workers or scripts.

```python
import json
import os

import pybreaker
import redis
import requests
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

r = redis.Redis.from_url(os.environ["REDIS_URL"])  # shared connection for the pause flag
breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)

@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(initial=0.2, max=2.0))
@breaker
def call_openai_compat(base_url, api_key, body):
    resp = requests.post(
        f"{base_url}/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Idempotency-Key": body.get("idempotencyKey", ""),
        },
        data=json.dumps(body),
        timeout=30,
    )
    resp.raise_for_status()  # surface HTTP errors so tenacity/pybreaker can react
    return resp
```

6. Set health probes and breaker thresholds
Start with: open after 5 consecutive failures (per route), half‑open probe every 30s, success threshold 2/2 to close. Track p50/p95 latency and error rate per route. Add a passive health score: last 20 calls with a 0/1 success array so your dashboard shows which routes are healthy at a glance.
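The passive health score can be sketched as a rolling window (class and method names are illustrative):

```typescript
// Rolling window of the last N call outcomes (1 = success, 0 = failure) per route.
class RouteHealth {
  private window: number[] = [];
  constructor(private size = 20) {}

  record(success: boolean) {
    this.window.push(success ? 1 : 0);
    if (this.window.length > this.size) this.window.shift(); // drop the oldest outcome
  }

  // Fraction of recent calls that succeeded; 1.0 when no data yet (optimistic default).
  score(): number {
    if (this.window.length === 0) return 1;
    return this.window.reduce((a, b) => a + b, 0) / this.window.length;
  }
}
```

Record an outcome after every Hot/Warm call and surface `score()` per route on your dashboard.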
7. Implement the Cold tier queue + pause flag
Install: `npm i bullmq ioredis`. Expected outcome: failed or paused traffic lands in a durable queue; workers respect a budget pause flag.

```typescript
import { Queue, Worker, JobsOptions } from 'bullmq';
import IORedis from 'ioredis';

// BullMQ workers require maxRetriesPerRequest: null on their ioredis connection.
export const conn = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });
export const aiQueue = new Queue('ai-tasks', { connection: conn });

export async function enqueueCold(payload: any) {
  const opts: JobsOptions = { attempts: 5, backoff: { type: 'exponential', delay: 1000 } };
  await aiQueue.add('llm-task', payload, opts);
}

new Worker('ai-tasks', async job => {
  // Throwing while paused leaves the job in the queue for a later attempt.
  if (await conn.get('pause_ai_tasks') === '1') throw new Error('Paused by budget guardrail');
  // process with routeWithFailover(...)
}, { connection: conn });
```

8. Add a budget guardrail tied to Spend Caps
Set a Gemini Project Spend Cap (AI Studio → Spend tab). Note: enforcement may lag by ~10 minutes and a $0 prepay balance at the billing account stops all keys. Add an app‑level early pause at N% of monthly budget (default 80%) so you never rely solely on the hard stop. Expose a webhook to flip a Redis pause flag from any billing alert pipeline.
```typescript
// Express-style webhook: billing alerts POST here to flip the Redis pause flag.
app.post('/budget/webhook', async (req, res) => {
  const { provider, percentUsed } = req.body; // e.g., { provider: 'gemini', percentUsed: 83 }
  if (provider === 'gemini' && percentUsed >= Number(process.env.BUDGET_PAUSE_AT_PERCENT || '80')) {
    await conn.set('pause_ai_tasks', '1');
  }
  res.sendStatus(204);
});

app.post('/budget/resume', async (_req, res) => {
  await conn.del('pause_ai_tasks');
  res.sendStatus(204);
});
```

Source of `percentUsed`: your billing alerts (e.g., Cloud Billing Budgets → Pub/Sub → a small function that calls this webhook) or a daily cron that checks spend and posts the value. Treat project caps as a last resort; rely on the app-level pause for graceful control.

9. Normalize request/response, idempotency, and rate limits
Wrap all mutating external calls with an `Idempotency-Key` (a hash of userID + jobID + input). Keep token budgets per provider and a per-tenant rate limiter to avoid hitting global throttles during partial outages. Document schema differences (e.g., tool-calling and JSON modes) between providers and test them behind a feature flag before enabling Warm in production.
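One way to derive that key, assuming Node's built-in crypto (the function name is illustrative); a null-byte delimiter keeps concatenation unambiguous, and sha256 hex keeps the value header-safe:

```typescript
import { createHash } from 'node:crypto';

// Stable Idempotency-Key: sha256 over userID + jobID + input.
// The "\u0000" delimiter prevents collisions like ("ab","c") vs ("a","bc").
function idempotencyKey(userId: string, jobId: string, input: string): string {
  return createHash('sha256').update([userId, jobId, input].join('\u0000')).digest('hex');
}
```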
10. Wire notifications and runbook
On Cold enqueue or budget pause: send Slack/email with job IDs, tenant, and next action (retry window, who is on call). Add a 1-page runbook: how to toggle `pause_ai_tasks`, how to move jobs from the DLQ, and who communicates with clients. At release, include a message template your VA can paste to clients if you degrade service for >15 minutes.

11. Run the Lisbon Test (15 minutes)
Simulate bad Wi-Fi and time-zone delays: (1) Block Hot provider domains locally; confirm Warm takes over within 1–2 attempts. (2) Force 500s from the gateway; confirm the circuit opens and Cold enqueues. (3) POST `/budget/webhook` with `{ "provider": "gemini", "percentUsed": 85 }`; confirm new work pauses, existing jobs finish, and notifications fire. Pass if: no request spins longer than your SLO, the budget pause is honored in <5s, and the queue/notifications give a clear handoff to a human.

12. Rollout plan and toggles
Stage → canary → full. Start with one Warm provider. Ship feature flags: `enable_warm_fallback`, `enable_cold_queue`, and per-model kill switches. Keep a `routes.local.json` for tests and a `routes.prod.json` for production; diff them in PRs so reviewers see routing changes explicitly.

13. Cost/latency spreadsheet tab (lightweight)
Create a tab with editable inputs: monthly requests, avg tokens/request, token prices by provider/model, Redis cost, failure rate estimates, retry attempts. Outputs: expected monthly cost, p95 latency with retries (assume 200ms base + jitter), and missed‑deadline risk (probability both Hot and Warm fail within SLO). Use simple formulas—this is for planning, not precision.
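The spreadsheet math can be sanity-checked in code. This sketch assumes a single blended token price and that Hot and Warm failures are independent, which they are not during correlated outages; like the tab itself, it is for planning, not precision:

```typescript
// Planning inputs: all names and the independence assumption are illustrative.
interface PlanInputs {
  monthlyRequests: number;
  avgTokensPerRequest: number;
  pricePerMillionTokens: number; // blended input+output price for the primary model
  hotFailureRate: number;        // probability one Hot attempt fails within SLO
  warmFailureRate: number;       // probability one Warm attempt fails within SLO
  retryAttempts: number;         // attempts per tier
}

function plan(i: PlanInputs) {
  const expectedMonthlyCost =
    (i.monthlyRequests * i.avgTokensPerRequest / 1_000_000) * i.pricePerMillionTokens;
  // A tier misses only if every attempt in it fails; a deadline is missed
  // only when both tiers miss (independence assumed).
  const missedDeadlineRisk =
    Math.pow(i.hotFailureRate, i.retryAttempts) * Math.pow(i.warmFailureRate, i.retryAttempts);
  return { expectedMonthlyCost, missedDeadlineRisk };
}
```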
14. Operate and review monthly
Add dashboards for error rate per route, breaker open time, queue depth, and budget % used. Run a 30‑minute retro monthly: did the router save a deadline? Did budget pauses trigger too early? Adjust breaker thresholds, Warm models, and pause percent accordingly. Keep a CHANGELOG with exact dates of external incidents to remind the team why this exists (e.g., Apr 6–7 Claude elevated errors; Mar 4 OpenAI API errors).