Santi: Your prompts are documented. You've got a Notion page, maybe a Google Doc, maybe a whole folder. Every workflow has instructions. Your contractors can find them.
Kira: And your prompts are going to break.
Santi: Both of those things are true at the same time. That's the problem.
Kira: When did you figure that out?
Santi: February. OpenAI retired GPT-4o from ChatGPT — along with 4.1, 4.1 mini, o4-mini — February thirteenth, twenty twenty-six. Hard cutoff. And I had a content repurposing workflow that was pinned to 4o behavior. Not pinned in the SOP — pinned in my head. The SOP just said "use GPT-4o" with no version date, no fallback, no — nothing.
Kira: So the model disappears and your doc is still sitting there saying "use GPT-4o" like nothing happened.
Santi: Like a restaurant menu for a place that closed. My contractor in Manila runs the workflow, gets completely different outputs, and messages me at three AM Lisbon time asking what changed.
Kira: And you didn't know either.
Santi: I didn't know yet. I was asleep. She spent two hours trying to debug a prompt that wasn't broken — the model underneath it was just gone.
Kira: That gap — between "we have docs" and "our docs survive change" — is where most nomad teams are operating right now. You've got prompts scattered across Notion pages and Slack threads and your contractor's personal notes, and none of it tells anyone which model version it was written for, what to do when that model gets deprecated, or how to roll back when something breaks at two AM in a time zone you're not in.
Santi: So today we're building the fix. One AI SOP template — not a generic SOP, an AI-first SOP — with versioned prompts, pinned model tags, guardrails, a rollback plan, and a thirty-day change review that catches provider deprecations before they catch you. You'll walk away with the exact schema and a Notion template you can duplicate today.
Kira: So let's start with why normal SOPs don't work for AI workflows. Because most of us have SOPs. We're not undocumented. The problem is what's missing.
Santi: Right. A traditional SOP says "do this, then this, then this." Step one, step two, step three. And that's fine for a process where the tool doesn't change underneath you. But an AI workflow has a dependency that no other business process has — the model itself is a moving target. OpenAI, Anthropic, Google, Azure — they all publish deprecation schedules. Claude three Opus retired January fifth this year. Vertex AI deprecated their generative AI SDK module last June with a hard retirement date of June twenty twenty-six. These aren't hypotheticals. These are calendar events.
Kira: And the SOP doesn't know about any of them.
Santi: The SOP doesn't know. It says "use Claude" or "use GPT-4o" and it has no concept of version, no concept of expiration, no fallback. So when the model changes — and it will change — your contractor is stuck.
Kira: This happened to one of my contractors last month. She runs a client onboarding sequence that uses an LLM to draft welcome emails. The prompt was tuned for a specific model's tone. When the model updated, the emails started coming out weirdly formal. Like a bank writing to a customer. And she didn't flag it because the SOP said "run the prompt" and she ran the prompt. The output was different, but the SOP didn't tell her what "right" looked like.
Santi: And that's the first thing an AI-first SOP fixes. You pin the model and version — not just "GPT-4o" but the specific release date. OpenAI tags them, Anthropic tags them, everyone tags them. Your SOP header says the provider, the model, the version date. So when something changes, you know exactly what you were running before.
Kira: Okay, but that's just one field. What else goes in this header?
Santi: The header is your metadata block. Owner name, backup owner — critical for async teams — status, whether it's draft or approved or deprecated. Then the version number for the SOP itself, the model tag, and a temperature band.
Kira: Wait — explain temperature band for people who haven't touched that setting.
Santi: Temperature controls how random the model's output is. Zero is deterministic — same input, nearly same output every time. Two is maximum randomness. For a refund triage workflow where you need consistent, policy-compliant responses, you want low — zero to point two. For a creative content draft, you might go medium, point three to point six. The SOP should declare which band this workflow lives in, because if someone cranks the temperature on a compliance task, you get wildly different outputs and no one knows why.
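[Show notes: a minimal sketch of the header block Santi describes, laid out as a Python dict so it can live in a Notion code block or a repo. Every field name and value here is illustrative, not a fixed schema.]

```python
# Illustrative SOP header block (field names and values are examples, not a required schema)
sop_header = {
    "workflow": "content-repurposing",
    "owner": "Santi",
    "backup_owner": "Kira",
    "status": "approved",               # draft | approved | deprecated
    "sop_version": "2.1.0",             # SemVer for the SOP document itself
    "provider": "openai",
    "model_tag": "gpt-4o-2024-08-06",   # pinned model with its release date, not just "GPT-4o"
    "temperature_band": (0.0, 0.2),     # low band for consistent, policy-compliant output
    "review_cadence_days": 30,          # the thirty-day change review discussed later
}
```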
Kira: So the header alone already has more information than most people's entire SOP.
Santi: And we haven't gotten to the steps yet.
Kira: Walk me through the steps section, because this is where it gets real for my team.
Santi: Each step that involves a model call gets its own prompt key — a unique name — and its own version number. And this is the important part — you version prompts the same way developers version software. Semantic Versioning. SemVer. Major dot minor dot patch.
Kira: Major, minor, patch — what triggers each one?
Santi: If you change the output shape — like the prompt used to return plain text and now it returns JSON — that's a major bump. If you change the instructions but the output contract stays the same, that's minor. If you fix a typo or tweak a threshold, that's a patch. The point is that anyone looking at the version number knows whether this change could break downstream systems.
Kira: So if I see a prompt go from one-dot-two to two-dot-zero, I know something structural changed and I need to check everything that depends on it.
Santi: Exactly. And you pair that version with the model tag and — if you're running evals — a dataset hash. So the full label reads as your prompt key, at version one-dot-three-dot-two, then a hash sign, then the model tag, plus the dataset hash. It's ugly. But it's unambiguous. Six months from now, you can look at that label and know exactly what was running.
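[Show notes: a sketch of the full label format Santi spells out, written as a tiny helper; the function name and the example values are made up for illustration.]

```python
def prompt_label(prompt_key: str, semver: str, model_tag: str, dataset_hash: str = "") -> str:
    """Build the label described above: prompt-key@version#model-tag+dataset-hash."""
    label = f"{prompt_key}@{semver}#{model_tag}"
    return f"{label}+{dataset_hash}" if dataset_hash else label

# Ugly but unambiguous, e.g. "welcome-email-draft@1.3.2#gpt-4o-2024-08-06+a41f9c"
print(prompt_label("welcome-email-draft", "1.3.2", "gpt-4o-2024-08-06", "a41f9c"))
```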
Kira: And the inputs and outputs — those get schemas?
Santi: They should. Even if it's just a description of what fields go in and what fields come out. For teams that are more technical, you can use JSON Schema. For everyone else, a simple table works — field name, type, required or optional, description. The point is that your contractor doesn't have to guess what the prompt expects.
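[Show notes: a minimal JSON-Schema-style sketch of one step's input contract, written as a Python dict; the step and field names are invented for illustration.]

```python
# Input contract for a hypothetical "welcome email draft" step (fields are illustrative)
welcome_email_input_schema = {
    "type": "object",
    "required": ["customer_name", "plan", "signup_date"],
    "properties": {
        "customer_name": {"type": "string", "description": "Customer's display name"},
        "plan":          {"type": "string", "description": "Plan purchased, e.g. 'starter'"},
        "signup_date":   {"type": "string", "description": "ISO date, e.g. '2026-02-14'"},
        "notes":         {"type": "string", "description": "Optional context from the sales call"},
    },
}
```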
Kira: This is where I want to push back a little, because I can already hear people thinking — this is a lot of overhead for a three-person team.
Santi: It can be. And I'll be honest — if you're solo and you're the only person touching your prompts, some of this is overkill. But the moment you have one contractor, one collaborator, one person who isn't you — the overhead of not having this is worse. It's the three AM message from Manila. It's the two hours of debugging a prompt that wasn't broken.
Kira: And the overhead scales with the tool you choose. You don't need PromptLayer or Humanloop on day one.
Santi: No. And this is where I think people get stuck — they see "prompt versioning" and they think they need a platform. You don't. If you're three people or fewer, a Notion database works. You set up a database with properties for owner, status, SemVer, model tag, last edited time. Notion gives you page history for diffs. You add a button that stamps a changelog entry when you publish a new version. That's it. That's your prompt registry.
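[Show notes: the registry Kira and Santi describe lives in Notion, but the idea is tool-agnostic. A rough sketch of one registry row plus a changelog stamp in plain Python; property names mirror the setup described, values are examples.]

```python
from datetime import date

# One prompt-registry row (mirrors the Notion properties: owner, status, SemVer, model tag, last edited)
registry_row = {
    "prompt_key": "welcome-email-draft",
    "owner": "Maria",
    "status": "approved",
    "semver": "1.3.2",
    "model_tag": "gpt-4o-2024-08-06",
    "last_edited": str(date.today()),
}

changelog = []

def publish(row: dict, change_type: str, summary: str) -> None:
    """Stamp a changelog entry every time a new prompt version goes live."""
    changelog.append({
        "prompt_key": row["prompt_key"],
        "version": row["semver"],
        "model_tag": row["model_tag"],
        "change_type": change_type,   # major | minor | patch
        "summary": summary,
        "date": str(date.today()),
    })

publish(registry_row, "minor", "Softened tone after onboarding emails started reading too formal")
```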
Kira: I actually set this up for my agency two months ago. Took maybe forty-five minutes. And the thing that surprised me was how much my contractors liked it. They said it was the first time they could look at a prompt and know whether it was current.
Santi: That's the win. It's not about the tool — it's about the metadata.
Kira: Now, for bigger teams — what changes?
Santi: When you have non-technical editors who need to safely change prompts without breaking production, that's when dedicated tools earn their cost. PromptLayer gives you a prompt registry with release labels, rollback, analytics. Speak — the language learning app — used it to scale from one market to eleven in a year. They trained non-technical content teams to version and edit prompts without engineers redeploying anything.
Kira: Non-technical teams editing prompts safely. That's the dream for agency operators.
Santi: Humanloop does something similar but with a dot-prompt file format that syncs to Git. So if you're code-first, that's your path. But — and this is a real caveat — Humanloop's own docs flagged a platform sunset notice in twenty twenty-five. Which is actually the perfect illustration of why your versioning scheme needs to live in your SOP, not just in the tool.
Kira: Because the tool can disappear too.
Santi: The tool can disappear. The SemVer convention, the model tags, the changelog — those survive any platform migration. That's the layer you protect.
Kira: Okay, so we've got the header, the versioned steps, the inputs and outputs. What about when things go wrong?
Santi: Every AI-first SOP needs a failure modes section. And this isn't theoretical — OWASP published their Top Ten for LLM Applications, version two, in twenty twenty-five. It's a catalog of the specific ways LLM workflows break. Prompt injection, data leakage, unsandboxed tool use. Your SOP should name the failure modes that apply to this specific workflow and document what catches them.
Kira: Give me a concrete example.
Santi: Say you've got a support triage workflow. The model reads a customer message and drafts a reply. Failure mode one — the customer embeds instructions in their message that hijack the prompt. That's injection. Your mitigation is a strong system message plus input sanitization. Failure mode two — the model leaks another customer's data in the response. Mitigation is a PII scrub on the output. You write these down. You attach guardrail policy IDs if you're using something like AWS Bedrock Guardrails or NeMo Guardrails. And you version those policies too.
Kira: And the temperature band matters here — you're not running a support triage at temperature point eight.
Santi: No. That's how you get a refund bot that starts improvising poetry. Low temperature, strict system message, output validation. The SOP declares all of it.
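[Show notes: a rough sketch of how the failure-modes section for the support triage example could be written down as data, plus a toy output check; the guardrail policy IDs and the regex are placeholders, not a real guardrail configuration.]

```python
import re

# Failure modes for the support-triage step, written as data (policy IDs are placeholders)
failure_modes = [
    {
        "name": "prompt injection via customer message",
        "mitigation": "strict system message + input sanitization",
        "guardrail_policy_id": "triage-input-policy-v3",
    },
    {
        "name": "PII leakage in the drafted reply",
        "mitigation": "PII scrub on the output before it reaches the send queue",
        "guardrail_policy_id": "triage-output-policy-v3",
    },
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def draft_looks_safe(reply: str) -> bool:
    """Toy output check: flag any draft that contains an email address it shouldn't."""
    return EMAIL.search(reply) is None
```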
Kira: So the SOP is built. Prompts are versioned. Guardrails are documented. Now — how do you keep it alive? Because the whole point of this episode is that models change on a schedule you don't control.
Santi: Thirty-day recurring review. Put it on your calendar. And here's what you check — OpenAI's deprecations page, Azure's model retirement tables, Anthropic's deprecation docs, Vertex AI's deprecation page. Four bookmarks. You open them, you check whether any model you're using has a new retirement date or a replacement recommendation.
Kira: And if something's flagged?
Santi: You pull the affected SOPs, rerun your evals on the replacement model — even if it's just five test cases — and if the outputs hold, you update the model tag and bump the version. If they don't hold, you've got time to patch the prompt before the deadline hits. That's the whole point of the cadence — you find out in week one, not on retirement day.
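[Show notes: a sketch of the thirty-day check as a tiny script. The retirement dates are typed in by hand after reading the four provider deprecation pages; nothing here calls a provider API, and the dates shown are examples.]

```python
from datetime import date, timedelta

# Retirement dates entered by hand during the monthly review (example values only)
retirements = {
    "gpt-4o-2024-08-06": date(2026, 2, 13),
    "claude-3-opus-20240229": date(2026, 1, 5),
}

pinned_models = ["gpt-4o-2024-08-06", "claude-3-opus-20240229"]
warning_window = timedelta(days=60)

for model in pinned_models:
    retire_on = retirements.get(model)
    if retire_on and retire_on - date.today() <= warning_window:
        print(f"[REVIEW] {model} retires {retire_on}: rerun the eval set on the replacement, "
              f"then bump the model tag and SOP version if outputs hold.")
```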
Kira: Meticulate — the company that scaled to one-point-five million LLM requests — they did something similar. They tagged every call by function and model in PromptLayer, so when a prompt regressed, they could search failing runs, find the version that worked, and roll back. The versioned workflow was what let them hotfix in hours instead of days.
Santi: And that's at scale. For a three-person team, the same principle applies — you just do it in Notion instead of PromptLayer. The review takes twenty minutes. The alternative is finding out at three AM from a confused contractor.
Kira: Or from a client. On LinkedIn. In front of eleven thousand followers.
Santi: We're not doing that callback again.
Kira: We're absolutely doing that callback. Because that's the cost. The SOP isn't overhead — it's insurance. And the thirty-day review is the premium.
Santi: And the premium is twenty minutes a month. That's it. Twenty minutes to check four pages, rerun a handful of test cases, and update a version number. If that feels like too much overhead, you're underpricing the cost of breakage.
Kira: And this is the important part — the review also catches pricing changes. Anthropic adjusts token pricing, OpenAI shifts rate limits, and if you're not checking monthly, your margins drift without you noticing. We talked about that in the cost meter episode. This is the same discipline applied to your prompts.
Santi: Same discipline, different surface. Monitor your spend, monitor your models, monitor your prompts. Thirty days. Every time.
Kira: So here's where we started — Santi's contractor in Manila, three AM, debugging a prompt that wasn't broken because the model underneath it had been retired. And the SOP sitting there like nothing happened.
Santi: That doesn't happen anymore. Not because I got smarter — because the SOP got smarter. The header tells you what model you're running. The version number tells you what changed. The failure modes tell you what to watch for. And the thirty-day review tells you when to check. It's not complicated. It's just... complete.
Kira: And if you want to skip the part where you build this from scratch — we put together the AI-First SOP template on the Resources page. It's a Notion template. Duplicate it, fill in your model tag and your inputs, and you've got versioned prompts with a built-in changelog by the end of today. Three worked examples are already in there — automation, content, and support.
Santi: One thing to do this week. Pick your most critical AI workflow — the one that would hurt the most if it broke tomorrow — and build its SOP first. Just one. Pin the model version, write the failure modes, set the thirty-day review date. That's it. You can do the rest later. Start with the one that scares you.
Kira: That's the move. Start with the one that scares you.
Santi: See you Wednesday.
Kira: See you Wednesday.