Nomad‑Proof Model Failover SOP (+ JSON Router & Code Snippets)

A copy‑and‑ship SOP for nomad founders to implement a 3‑tier LLM failover in a weekend: Hot (alt model/endpoint), Warm (different provider), Cold (queue + human‑in‑loop) with a budget pause tied to spend caps. Includes a JSON routing map, Node/Python wrappers, a Redis/BullMQ queue with a pause flag, and a 15‑minute “Lisbon Test.”

Use this SOP when your product or client delivery depends on LLM APIs and a single‑vendor outage or runaway bill could halt work while you’re on the road. Owners: founder/lead dev. Goal: ship a 3‑tier failover in a weekend that (a) auto‑routes requests across Hot → Warm → Cold, (b) pauses safely at a budget threshold, and (c) passes the “Lisbon Test” (works on café Wi‑Fi and async teams). Expected outcome: a working router, a cold queue with a pause flag, and a budget guardrail wired to spend caps.

  1. Define scope, SLOs, and degrade modes

    List the exact jobs this router must protect (e.g., summarize call notes, draft email, run RAG answer). For each, set a max time‑to‑answer (e.g., 30s) and a graceful degrade if both Hot and Warm fail (e.g., shorter context, smaller model, or queue for human review). Capture these in a short YAML so engineers and VAs know what “good enough” means.
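    A minimal version of that YAML (job names, thresholds, and degrade strategies here are illustrative) could look like:

```yaml
jobs:
  summarize_call_notes:
    slo_seconds: 30
    degrade: shorter_context        # trim the transcript before retrying
  draft_email:
    slo_seconds: 20
    degrade: smaller_model          # drop to a cheaper/faster model
  rag_answer:
    slo_seconds: 45
    degrade: queue_for_human_review # lands in the Cold queue
```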

  2. Provision credentials and environment

    Create a .env (or secrets manager entries) for two Hot routes on the same provider and two Warm routes on different providers. Add Redis for queues and a pause flag. Example keys:

    ANTHROPIC_BASE_URL=https://your-gateway.example.com/v1
    ANTHROPIC_BASE_URL_2=https://your-gateway-2.example.com/v1
    ANTHROPIC_KEY=sk-live-...
    OPENAI_API_KEY=sk-live-...
    GEMINI_OPENAI_COMPAT=https://gemini-openai-compat.example.com/v1
    GEMINI_API_KEY=ya29....
    REDIS_URL=redis://localhost:6379/0
    BUDGET_PAUSE_AT_PERCENT=80
    
  3. Author the routing map (JSON)

    Create a provider‑agnostic map the app can read at boot. Treat “Hot” as alternate model/endpoint on the same provider; treat “Warm” as different provider equivalents; “Cold” defines the queue and notify targets.

    {
      "hot": [
        { "provider": "anthropic", "model": "sonnet-4.6", "baseUrl": "${ANTHROPIC_BASE_URL}" },
        { "provider": "anthropic", "model": "haiku-4.5",  "baseUrl": "${ANTHROPIC_BASE_URL_2}" }
      ],
      "warm": [
        { "provider": "openai",  "model": "gpt-4o-mini",     "baseUrl": "https://api.openai.com/v1" },
        { "provider": "google",  "model": "gemini-3-flash",  "baseUrl": "${GEMINI_OPENAI_COMPAT}" }
      ],
      "cold": { "mode": "graceful_degrade", "queue": "ai-tasks", "notify": ["ops@yourco.com"] }
    }
    

    Note: OpenAI‑compatible endpoints for non‑OpenAI providers typically require a gateway (e.g., OpenWebUI, Ollama‑Proxy, or your edge). Test response schemas, tool calling, and JSON modes before trusting the Warm tier in production.
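    The `${VAR}` placeholders in the map are not standard JSON, so the app needs to expand them at boot. A minimal loader sketch (Node 18+ assumed; unset variables are left intact so a missing secret fails loudly downstream):

```typescript
import { readFileSync } from 'node:fs';

// Replace ${VAR} placeholders with values from process.env.
// Unknown variables are left as-is rather than silently blanked.
function expandEnv(raw: string): string {
  return raw.replace(/\$\{(\w+)\}/g, (match, name) => process.env[name] ?? match);
}

// Read and expand the routing map once at startup.
function loadRoutes(path: string) {
  return JSON.parse(expandEnv(readFileSync(path, 'utf8')));
}
```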

  4. Implement the Node router with retries + circuit breaker (Hot/Warm)

    Install libs: npm i cockatiel undici. Expected outcome: calls try Hot routes with jittered retries; on consecutive failures, the circuit opens and traffic goes to Warm; if all fail, we enqueue Cold.

    import { retry, handleAll, wrap, circuitBreaker, ExponentialBackoff, ConsecutiveBreaker, CircuitState } from 'cockatiel';
    import { fetch } from 'undici';
    import { enqueueCold } from './queue';
    
    // Open after 5 consecutive failures; half-open probe after 30s.
    const breakerPolicy = circuitBreaker(handleAll, { halfOpenAfter: 30_000, breaker: new ConsecutiveBreaker(5) });
    // ExponentialBackoff applies decorrelated jitter by default.
    const retryPolicy = retry(handleAll, { maxAttempts: 3, backoff: new ExponentialBackoff({ initialDelay: 200 }) });
    const policy = wrap(retryPolicy, breakerPolicy);
    
    async function callOpenAICompat(baseUrl: string, apiKey: string, body: any) {
      const res = await fetch(`${baseUrl}/chat/completions`, {
        method: 'POST',
        headers: { 'Authorization': `Bearer ${apiKey}`, 'Content-Type': 'application/json', 'Idempotency-Key': body.idempotencyKey },
        body: JSON.stringify(body),
      });
      if (!res.ok) throw new Error(`upstream ${res.status}`); // non-2xx must throw so retries/breaker count it
      return res;
    }
    
    export async function routeWithFailover(body: any, routes: any) {
      for (const r of routes.hot) {
        if (breakerPolicy.state === CircuitState.Open) break; // Hot circuit open: go straight to Warm
        try {
          return await policy.execute(() => callOpenAICompat(r.baseUrl, process.env.ANTHROPIC_KEY!, body));
        } catch { /* try the next Hot route */ }
      }
      for (const r of routes.warm) {
        const key = r.provider === 'openai' ? process.env.OPENAI_API_KEY! : process.env.GEMINI_API_KEY!;
        const base = r.provider === 'openai' ? r.baseUrl : process.env.GEMINI_OPENAI_COMPAT!;
        try {
          return await retryPolicy.execute(() => callOpenAICompat(base, key, body));
        } catch { /* try the next Warm route */ }
      }
      await enqueueCold(body); // Cold tier: queue + notify a human
      return new Response(JSON.stringify({ status: 'queued' }), { status: 202 });
    }
    
  5. Provide a Python variant for services/workers

    Install: pip install tenacity pybreaker requests redis. Use jittered retries + circuit breaker. Expected outcome: parity with Node wrapper for workers or scripts.

    import os, json
    import requests
    import redis
    import pybreaker
    from tenacity import retry, stop_after_attempt, wait_exponential_jitter
    
    r = redis.Redis.from_url(os.environ["REDIS_URL"])  # shared pause flag / queue state
    breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)
    
    @retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(initial=0.2, max=2.0))
    @breaker
    def call_openai_compat(base_url, api_key, body):
        resp = requests.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {api_key}",
                     "Content-Type": "application/json",
                     "Idempotency-Key": body.get("idempotencyKey", "")},
            data=json.dumps(body),
            timeout=30,
        )
        resp.raise_for_status()  # non-2xx must raise so tenacity/pybreaker count it
        return resp
    
  6. Set health probes and breaker thresholds

    Start with: open after 5 consecutive failures (per route), half‑open probe every 30s, success threshold 2/2 to close. Track p50/p95 latency and error rate per route. Add a passive health score: last 20 calls with a 0/1 success array so your dashboard shows which routes are healthy at a glance.
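    The passive health score can be as simple as a fixed-size window of 0/1 outcomes per route. A sketch (the window of 20 matches the text; class and method names are illustrative):

```typescript
// Rolling health per route: keep the last N call outcomes (1 = success, 0 = failure).
class RouteHealth {
  private outcomes: number[] = [];
  constructor(private windowSize = 20) {}

  record(success: boolean) {
    this.outcomes.push(success ? 1 : 0);
    if (this.outcomes.length > this.windowSize) this.outcomes.shift(); // slide the window
  }

  // Fraction of recent calls that succeeded; 1.0 with no data (optimistic start).
  score(): number {
    if (this.outcomes.length === 0) return 1.0;
    return this.outcomes.reduce((a, b) => a + b, 0) / this.outcomes.length;
  }
}
```

The dashboard then just renders `score()` per route next to p50/p95 and error rate.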

  7. Implement the Cold tier queue + pause flag

    Install: npm i bullmq ioredis. Expected outcome: failed or paused traffic lands in a durable queue; workers respect a budget pause flag.

    import { Queue, Worker, JobsOptions, DelayedError } from 'bullmq';
    import IORedis from 'ioredis';
    // BullMQ workers need a blocking-capable connection: maxRetriesPerRequest must be null.
    export const conn = new IORedis(process.env.REDIS_URL!, { maxRetriesPerRequest: null });
    export const aiQueue = new Queue('ai-tasks', { connection: conn });
    
    export async function enqueueCold(payload: any) {
      const opts: JobsOptions = { attempts: 5, backoff: { type: 'exponential', delay: 1000 } };
      await aiQueue.add('llm-task', payload, opts);
    }
    
    new Worker('ai-tasks', async (job, token) => {
      if (await conn.get('pause_ai_tasks') === '1') {
        // Park the job for 60s instead of failing it, so a pause doesn't burn retry attempts.
        await job.moveToDelayed(Date.now() + 60_000, token);
        throw new DelayedError();
      }
      // process with routeWithFailover(...)
    }, { connection: conn });
    
  8. Add a budget guardrail tied to Spend Caps

    Set a Gemini Project Spend Cap (AI Studio → Spend tab). Note: enforcement may lag by ~10 minutes, and a $0 prepaid balance on the billing account stops all keys. Add an app‑level early pause at N% of the monthly budget (default 80%) so you never rely solely on the hard stop. Expose a webhook that flips a Redis pause flag from any billing‑alert pipeline.

    // Express-style webhook
    app.post('/budget/webhook', async (req,res)=>{
      const { provider, percentUsed } = req.body; // e.g., { provider:'gemini', percentUsed:83 }
      if (provider==='gemini' && percentUsed >= Number(process.env.BUDGET_PAUSE_AT_PERCENT||'80')) {
        await conn.set('pause_ai_tasks','1');
      }
      res.sendStatus(204);
    });
    app.post('/budget/resume', async (_req,res)=>{ await conn.del('pause_ai_tasks'); res.sendStatus(204); });
    

    Source of percentUsed: your billing alerts (e.g., Cloud Billing Budgets → Pub/Sub → small function that calls this webhook) or a daily cron that checks spend and posts the value. Treat project caps as last‑resort; rely on app‑level pause for graceful control.

  9. Normalize request/response, idempotency, and rate limits

    Wrap all mutating external calls with an Idempotency-Key (hash of userID+jobID+input). Keep token budgets per provider and a per‑tenant rate limiter to avoid hitting global throttles during partial outages. Document schema differences (e.g., tool‑calling and JSON modes) between providers and test them behind a feature flag before enabling Warm in production.
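    The key derivation can be a plain SHA‑256 over the three fields (a sketch; note that `Idempotency-Key` header support varies by provider and gateway, so verify dedupe behavior per route):

```typescript
import { createHash } from 'node:crypto';

// Deterministic Idempotency-Key: the same user + job + input always hashes to the
// same key, so a retried request can be deduplicated server-side.
function idempotencyKey(userId: string, jobId: string, input: string): string {
  return createHash('sha256').update(`${userId}:${jobId}:${input}`).digest('hex');
}
```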

  10. Wire notifications and runbook

    On Cold enqueue or budget pause: send Slack/email with job IDs, tenant, and next action (retry window, who is on call). Add a 1‑page runbook: how to toggle pause_ai_tasks, how to move jobs out of the DLQ, and who communicates with clients. Before launch, prepare a message template your VA can paste to clients if you degrade service for >15 minutes.
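    A small payload builder keeps those alerts consistent across Slack and email (a sketch; field names, emoji, and wording are illustrative, not a Slack API requirement):

```typescript
// Build one alert message for either a Cold enqueue or a budget pause.
function buildColdAlert(
  event: 'cold_enqueue' | 'budget_pause',
  jobIds: string[],
  tenant: string,
  onCall: string,
) {
  return {
    text: [
      event === 'cold_enqueue' ? 'Jobs queued to Cold tier' : 'AI tasks paused by budget guardrail',
      `Tenant: ${tenant}`,
      `Jobs: ${jobIds.join(', ') || 'n/a'}`,
      event === 'cold_enqueue'
        ? `Next action: retry window open, on-call: ${onCall}`
        : `Next action: review spend, then POST /budget/resume (on-call: ${onCall})`,
    ].join('\n'),
  };
}
```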

  11. Run the Lisbon Test (15 minutes)

    Simulate bad Wi‑Fi and time‑zone delays: (1) Block Hot provider domains locally; confirm Warm takes over within 1–2 attempts. (2) Force 500s from the gateway; confirm the circuit opens and Cold enqueues. (3) POST /budget/webhook with { provider: 'gemini', percentUsed: 85 }; confirm new work pauses, existing jobs finish, and notifications fire. Pass if: no request hangs past your SLO, the budget pause is honored in <5s, and the queue/notifications give a clear handoff to a human.

  12. Rollout plan and toggles

    Stage → canary → full. Start with one Warm provider. Ship feature flags: enable_warm_fallback, enable_cold_queue, and per‑model kill‑switches. Keep a routes.local.json for tests and routes.prod.json for production; diff them in PR so reviewers see routing changes explicitly.

  13. Cost/latency spreadsheet tab (lightweight)

    Create a tab with editable inputs: monthly requests, avg tokens/request, token prices by provider/model, Redis cost, failure rate estimates, retry attempts. Outputs: expected monthly cost, p95 latency with retries (assume 200ms base + jitter), and missed‑deadline risk (probability both Hot and Warm fail within SLO). Use simple formulas—this is for planning, not precision.
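    The two core formulas reduce to a few lines (a sketch; it assumes per-attempt failures are independent, which undercounts correlated outages, so treat the risk number as a floor):

```typescript
// P(request misses SLO) ≈ P(all Hot attempts fail) × P(all Warm attempts fail).
function missedDeadlineRisk(pHotFail: number, pWarmFail: number, attempts: number): number {
  return Math.pow(pHotFail, attempts) * Math.pow(pWarmFail, attempts);
}

// Expected monthly token cost for one provider/model at a per-million-token price.
function expectedMonthlyCost(requests: number, tokensPerRequest: number, pricePerMillionTokens: number): number {
  return (requests * tokensPerRequest / 1_000_000) * pricePerMillionTokens;
}
```

Example: at a 10% Hot failure rate, 5% Warm failure rate, and 3 attempts per tier, the independent-failure risk is about 1.25e-7 per request.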

  14. Operate and review monthly

    Add dashboards for error rate per route, breaker open time, queue depth, and budget % used. Run a 30‑minute retro monthly: did the router save a deadline? Did budget pauses trigger too early? Adjust breaker thresholds, Warm models, and pause percent accordingly. Keep a CHANGELOG with exact dates of external incidents to remind the team why this exists (e.g., Apr 6–7 Claude elevated errors; Mar 4 OpenAI API errors).