Santi: Your prompts are documented. You've got a Notion page, maybe a Google Doc, maybe a whole folder. Every workflow has instructions. Your contractors can find them.
Kira: And your prompts are going to break.
Santi: Both of those things are true at the same time. That's the problem.
Kira: When did you figure that out?
Santi: February. OpenAI retired GPT-4o from ChatGPT — along with 4.1, 4.1 mini, o4-mini — February thirteenth, twenty twenty-six. Hard cutoff. And I had a content repurposing workflow that was pinned to 4o behavior. Not pinned in the SOP — pinned in my head. The SOP just said "use GPT-4o" with no version date, no fallback, no — nothing.
Kira: So the model disappears and your doc is still sitting there saying "use GPT-4o" like nothing happened.
Santi: Like a restaurant menu for a place that closed. My contractor in Manila runs the workflow, gets completely different outputs, and messages me at three AM Lisbon time asking what changed.
Kira: And you didn't know either.
Santi: I didn't know yet. I was asleep. She spent two hours trying to debug a prompt that wasn't broken — the model underneath it was just gone.
Kira: That gap — between "we have docs" and "our docs survive change" — is where most nomad teams are operating right now. You've got prompts scattered across Notion pages and Slack threads and your contractor's personal notes, and none of it tells anyone which model version it was written for, what to do when that model gets deprecated, or how to roll back when something breaks at two AM in a time zone you're not in.
Santi: So today we're building the fix. One AI SOP template — not a generic SOP, an AI-first SOP — with versioned prompts, pinned model tags, guardrails, a rollback plan, and a thirty-day change review that catches provider deprecations before they catch you. You'll walk away with the exact schema and a Notion template you can duplicate today.
Kira: So let's start with why normal SOPs don't work for AI workflows. Because most of us have SOPs. We're not undocumented. The problem is what's missing.
Santi: Right. A traditional SOP says "do this, then this, then this." Step one, step two, step three. And that's fine for a process where the tool doesn't change underneath you. But an AI workflow has a dependency that no other business process has — the model itself is a moving target. OpenAI, Anthropic, Google, Azure — they all publish deprecation schedules. Claude three Opus retired January fifth this year. Vertex AI deprecated their generative AI SDK module last June with a hard retirement date of June twenty twenty-six. These aren't hypotheticals. These are calendar events.
Kira: And the SOP doesn't know about any of them.
Santi: The SOP doesn't know. It says "use Claude" or "use GPT-4o" and it has no concept of version, no concept of expiration, no fallback. So when the model changes — and it will change — your contractor is stuck.
Kira: This happened to one of my contractors last month. She runs a client onboarding sequence that uses an LLM to draft welcome emails. The prompt was tuned for a specific model's tone. When the model updated, the emails started coming out weirdly formal. Like a bank writing to a customer. And she didn't flag it because the SOP said "run the prompt" and she ran the prompt. The output was different, but the SOP didn't tell her what "right" looked like.
Santi: And that's the first thing an AI-first SOP fixes. You pin the model and version — not just "GPT-4o" but the specific release date. OpenAI tags them, Anthropic tags them, everyone tags them. Your SOP header says the provider, the model, the version date. So when something changes, you know exactly what you were running before.
Kira: Okay, but that's just one field. What else goes in this header?
Santi: The header is your metadata block. Owner name, backup owner — critical for async teams — status, whether it's draft or approved or deprecated. Then the version number for the SOP itself, the model tag, and a temperature band.
Kira: Wait — explain temperature band for people who haven't touched that setting.
Santi: Temperature controls how random the model's output is. Zero is deterministic — same input, nearly same output every time. Two is maximum randomness. For a refund triage workflow where you need consistent, policy-compliant responses, you want low — zero to point two. For a creative content draft, you might go medium, point three to point six. The SOP should declare which band this workflow lives in, because if someone cranks the temperature on a compliance task, you get wildly different outputs and no one knows why.
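[Show notes: a minimal sketch of the header block Santi describes, laid out as a Python dict so it can live in a Notion code block or a repo. Every field name and value here is illustrative, not a fixed schema.]

```python
# Illustrative SOP header block (field names and values are examples, not a required schema)
sop_header = {
    "workflow": "content-repurposing",
    "owner": "Santi",
    "backup_owner": "Kira",
    "status": "approved",               # draft | approved | deprecated
    "sop_version": "2.1.0",             # SemVer for the SOP document itself
    "provider": "openai",
    "model_tag": "gpt-4o-2024-08-06",   # pinned model with its release date, not just "GPT-4o"
    "temperature_band": (0.0, 0.2),     # low band for consistent, policy-compliant output
    "review_cadence_days": 30,          # the thirty-day change review discussed later
}
```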
Kira: So the header alone already has more information than most people's entire SOP.
Santi: And we haven't gotten to the steps yet.
Kira: Walk me through the steps section, because this is where it gets real for my team.
Santi: Each step that involves a model call gets its own prompt key — a unique name — and its own version number. And this is the important part — you version prompts the same way developers version software. Semantic Versioning. SemVer. Major dot minor dot patch.
Kira: Major, minor, patch — what triggers each one?
Santi: If you change the output shape — like the prompt used to return plain text and now it returns JSON — that's a major bump. If you change the instructions but the output contract stays the same, that's minor. If you fix a typo or tweak a threshold, that's a patch. The point is that anyone looking at the version number knows whether this change could break downstream systems.
Kira: So if I see a prompt go from one-dot-two to two-dot-zero, I know something structural changed and I need to check everything that depends on it.
Santi: Exactly. And you pair that version with the model tag and — if you're running evals — a dataset hash. So the full label reads as your prompt key, at version one-dot-three-dot-two, then a hash sign, then the model tag, plus the dataset hash. It's ugly. But it's unambiguous. Six months from now, you can look at that label and know exactly what was running.
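[Show notes: a sketch of the full label format Santi spells out, written as a tiny helper; the function name and the example values are made up for illustration.]

```python
def prompt_label(prompt_key: str, semver: str, model_tag: str, dataset_hash: str = "") -> str:
    """Build the label described above: prompt-key@version#model-tag+dataset-hash."""
    label = f"{prompt_key}@{semver}#{model_tag}"
    return f"{label}+{dataset_hash}" if dataset_hash else label

# Ugly but unambiguous, e.g. "welcome-email-draft@1.3.2#gpt-4o-2024-08-06+a41f9c"
print(prompt_label("welcome-email-draft", "1.3.2", "gpt-4o-2024-08-06", "a41f9c"))
```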
Kira: And the inputs and outputs — those get schemas?
Santi: They should. Even if it's just a description of what fields go in and what fields come out. For teams that are more technical, you can use JSON Schema. For everyone else, a simple table works — field name, type, required or optional, description. The point is that your contractor doesn't have to guess what the prompt expects.
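[Show notes: a minimal JSON-Schema-style sketch of one step's input contract, written as a Python dict; the step and field names are invented for illustration.]

```python
# Input contract for a hypothetical "welcome email draft" step (fields are illustrative)
welcome_email_input_schema = {
    "type": "object",
    "required": ["customer_name", "plan", "signup_date"],
    "properties": {
        "customer_name": {"type": "string", "description": "Customer's display name"},
        "plan":          {"type": "string", "description": "Plan purchased, e.g. 'starter'"},
        "signup_date":   {"type": "string", "description": "ISO date, e.g. '2026-02-14'"},
        "notes":         {"type": "string", "description": "Optional context from the sales call"},
    },
}
```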
Kira: This is where I want to push back a little, because I can already hear people thinking — this is a lot of overhead for a three-person team.
Santi: It can be. And I'll be honest — if you're solo and you're the only person touching your prompts, some of this is overkill. But the moment you have one contractor, one collaborator, one person who isn't you — the overhead of not having this is worse. It's the three AM message from Manila. It's the two hours of debugging a prompt that wasn't broken.
Kira: And the overhead scales with the tool you choose. You don't need PromptLayer or Humanloop on day one.
Santi: No. And this is where I think people get stuck — they see "prompt versioning" and they think they need a platform. You don't. If you're three people or fewer, a Notion database works. You set up a database with properties for owner, status, SemVer, model tag, last edited time. Notion gives you page history for diffs. You add a button that stamps a changelog entry when you publish a new version. That's it. That's your prompt registry.
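[Show notes: the registry Kira and Santi describe lives in Notion, but the idea is tool-agnostic. A rough sketch of one registry row plus a changelog stamp in plain Python; property names mirror the setup described, values are examples.]

```python
from datetime import date

# One prompt-registry row (mirrors the Notion properties: owner, status, SemVer, model tag, last edited)
registry_row = {
    "prompt_key": "welcome-email-draft",
    "owner": "Maria",
    "status": "approved",
    "semver": "1.3.2",
    "model_tag": "gpt-4o-2024-08-06",
    "last_edited": str(date.today()),
}

changelog = []

def publish(row: dict, change_type: str, summary: str) -> None:
    """Stamp a changelog entry every time a new prompt version goes live."""
    changelog.append({
        "prompt_key": row["prompt_key"],
        "version": row["semver"],
        "model_tag": row["model_tag"],
        "change_type": change_type,   # major | minor | patch
        "summary": summary,
        "date": str(date.today()),
    })

publish(registry_row, "minor", "Softened tone after onboarding emails started reading too formal")
```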
Kira: I actually set this up for my agency two months ago. Took maybe forty-five minutes. And the thing that surprised me was how much my contractors liked it. They said it was the first time they could look at a prompt and know whether it was current.
Santi: That's the win. It's not about the tool — it's about the metadata.
Kira: Now, for bigger teams — what changes?
Santi: When you have non-technical editors who need to safely change prompts without breaking production, that's when dedicated tools earn their cost. PromptLayer gives you a prompt registry with release labels, rollback, analytics. Speak — the language learning app — used it to scale from one market to eleven in a year. They trained non-technical content teams to version and edit prompts without engineers redeploying anything.
Kira: Non-technical teams editing prompts safely. That's the dream for agency operators.
Santi: Humanloop does something similar but with a dot-prompt file format that syncs to Git. So if you're code-first, that's your path. But — and this is a real caveat — Humanloop's own docs flagged a platform sunset notice in twenty twenty-five. Which is actually the perfect illustration of why your versioning scheme needs to live in your SOP, not just in the tool.
Kira: Because the tool can disappear too.
Santi: The tool can disappear. The SemVer convention, the model tags, the changelog — those survive any platform migration. That's the layer you protect.
Kira: Okay, so we've got the header, the versioned steps, the inputs and outputs. What about when things go wrong?
Santi: Every AI-first SOP needs a failure modes section. And this isn't theoretical — OWASP published their Top Ten for LLM Applications, version two, in twenty twenty-five. It's a catalog of the specific ways LLM workflows break. Prompt injection, data leakage, unsandboxed tool use. Your SOP should name the failure modes that apply to this specific workflow and document what catches them.
Kira: Give me a concrete example.
Santi: Say you've got a support triage workflow. The model reads a customer message and drafts a reply. Failure mode one — the customer embeds instructions in their message that hijack the prompt. That's injection. Your mitigation is a strong system message plus input sanitization. Failure mode two — the model leaks another customer's data in the response. Mitigation is a PII scrub on the output. You write these down. You attach guardrail policy IDs if you're using something like AWS Bedrock Guardrails or NeMo Guardrails. And you version those policies too.
Kira: And the temperature band matters here — you're not running a support triage at temperature point eight.
Santi: No. That's how you get a refund bot that starts improvising poetry. Low temperature, strict system message, output validation. The SOP declares all of it.
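[Show notes: a rough sketch of how the failure-modes section for the support triage example could be written down as data, plus a toy output check; the guardrail policy IDs and the regex are placeholders, not a real guardrail configuration.]

```python
import re

# Failure modes for the support-triage step, written as data (policy IDs are placeholders)
failure_modes = [
    {
        "name": "prompt injection via customer message",
        "mitigation": "strict system message + input sanitization",
        "guardrail_policy_id": "triage-input-policy-v3",
    },
    {
        "name": "PII leakage in the drafted reply",
        "mitigation": "PII scrub on the output before it reaches the send queue",
        "guardrail_policy_id": "triage-output-policy-v3",
    },
]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def draft_looks_safe(reply: str) -> bool:
    """Toy output check: flag any draft that contains an email address it shouldn't."""
    return EMAIL.search(reply) is None
```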
Kira: So the SOP is built. Prompts are versioned. Guardrails are documented. Now — how do you keep it alive? Because the whole point of this episode is that models change on a schedule you don't control.
Santi: Thirty-day recurring review. Put it on your calendar. And here's what you check — OpenAI's deprecations page, Azure's model retirement tables, Anthropic's deprecation docs, Vertex AI's deprecation page. Four bookmarks. You open them, you check whether any model you're using has a new retirement date or a replacement recommendation.
Kira: And if something's flagged?
Santi: You pull the affected SOPs, rerun your evals on the replacement model — even if it's just five test cases — and if the outputs hold, you update the model tag and bump the version. If they don't hold, you've got time to patch the prompt before the deadline hits. That's the whole point of the cadence — you find out in week one, not on retirement day.
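[Show notes: a sketch of the thirty-day check as a tiny script. The retirement dates are typed in by hand after reading the four provider deprecation pages; nothing here calls a provider API, and the dates shown are examples.]

```python
from datetime import date, timedelta

# Retirement dates entered by hand during the monthly review (example values only)
retirements = {
    "gpt-4o-2024-08-06": date(2026, 2, 13),
    "claude-3-opus-20240229": date(2026, 1, 5),
}

pinned_models = ["gpt-4o-2024-08-06", "claude-3-opus-20240229"]
warning_window = timedelta(days=60)

for model in pinned_models:
    retire_on = retirements.get(model)
    if retire_on and retire_on - date.today() <= warning_window:
        print(f"[REVIEW] {model} retires {retire_on}: rerun the eval set on the replacement, "
              f"then bump the model tag and SOP version if outputs hold.")
```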
Kira: Meticulate — the company that scaled to one-point-five million LLM requests — they did something similar. They tagged every call by function and model in PromptLayer, so when a prompt regressed, they could search failing runs, find the version that worked, and roll back. The versioned workflow was what let them hotfix in hours instead of days.
Santi: And that's at scale. For a three-person team, the same principle applies — you just do it in Notion instead of PromptLayer. The review takes twenty minutes. The alternative is finding out at three AM from a confused contractor.
Kira: Or from a client. On LinkedIn. In front of eleven thousand followers.
Santi: We're not doing that callback again.
Kira: We're absolutely doing that callback. Because that's the cost. The SOP isn't overhead — it's insurance. And the thirty-day review is the premium.
Santi: And the premium is twenty minutes a month. That's it. Twenty minutes to check four pages, rerun a handful of test cases, and update a version number. If that feels like too much overhead, you're underpricing the cost of breakage.
Kira: And this is the important part — the review also catches pricing changes. Anthropic adjusts token pricing, OpenAI shifts rate limits, and if you're not checking monthly, your margins drift without you noticing. We talked about that in the cost meter episode. This is the same discipline applied to your prompts.
Santi: Same discipline, different surface. Monitor your spend, monitor your models, monitor your prompts. Thirty days. Every time.
Kira: So here's where we started — Santi's contractor in Manila, three AM, debugging a prompt that wasn't broken because the model underneath it had been retired. And the SOP sitting there like nothing happened.
Santi: That doesn't happen anymore. Not because I got smarter — because the SOP got smarter. The header tells you what model you're running. The version number tells you what changed. The failure modes tell you what to watch for. And the thirty-day review tells you when to check. It's not complicated. It's just... complete.
Kira: And if you want to skip the part where you build this from scratch — we put together the AI-First SOP template on the Resources page. It's a Notion template. Duplicate it, fill in your model tag and your inputs, and you've got versioned prompts with a built-in changelog by the end of today. Three worked examples are already in there — automation, content, and support.
Santi: One thing to do this week. Pick your most critical AI workflow — the one that would hurt the most if it broke tomorrow — and build its SOP first. Just one. Pin the model version, write the failure modes, set the thirty-day review date. That's it. You can do the rest later. Start with the one that scares you.
Kira: That's the move. Start with the one that scares you.
Santi: See you Wednesday.
Kira: See you Wednesday.