AI‑First SOP (Notion) — Versioned Prompts + 30‑Day Change‑Control
A vendor‑agnostic Notion template that turns your prompts into a maintainable SOP: versioned steps, pinned model tags, guardrails, monitoring, a rollback plan, and a 30‑day change‑control loop—plus three filled mini‑examples (automation, content, support).
Duplicate this page for every workflow that uses an LLM. Treat your prompts like code: pin a model+version, use SemVer for changes, add guardrails, monitor runs, and keep a rolling 30‑day review so provider deprecations don’t break delivery. Fill in [BRACKETS] and replace example text with your own. Keep this page tidy—contractors should be able to ship safely in under 10 minutes.
1) Page header — canonical metadata
Paste this block at the top of every SOP page.
- SOP Title: [SOP NAME]
- Owner: [OWNER NAME]
- Backup Owner: [BACKUP OWNER]
- Status: [DRAFT | APPROVED | DEPRECATED]
- SemVer: [MAJOR.MINOR.PATCH] (e.g., 1.3.2)
- Model Tag: [PROVIDER:MODEL:VERSION] (e.g., openai:gpt-4o:2026-01-20)
- Temperature Band: [LOW (0.0–0.2) | MED (0.3–0.6) | HIGH (0.7–1.0)]
- Dataset/Test Hash: [DATASET_SHA or N/A]
- Last Updated: [YYYY‑MM‑DD]
- Next Review (30‑day): [YYYY‑MM‑DD]
- Prompt Registry ID(s): [PROMPT_KEY(S)]
- Guardrails Policy ID+Version: [POLICY_ID@v]
- Monitoring — Trace: [TRACE_DASHBOARD_URL]
- Monitoring — Evals: [EVAL_DASHBOARD_URL]
- Cost Target: [$MAX_PER_RUN] | Latency Target: [MS]
- Rollback Target Version: [vMAJOR.MINOR.PATCH]
- Links: [DATASET LINK] · [CHANGE‑LOG DB] · [RUNBOOK]
Tip: In Notion, add database properties for each line. Use a Formula for Next Review: dateAdd(prop("Last edited time"), 30, "days").
2) Preconditions (green‑light check)
Complete before any run.
- Access: [TOOLS/ACCOUNTS NEEDED] (e.g., API keys, doc repo, CRM view)
- Data scope: [WHAT DATA CAN THE MODEL SEE?]
- Ground truth: [LINKS TO POLICIES / STYLE GUIDES / KB]
- Privacy: [PII HANDLED? YES/NO] · Anonymization: [ON/OFF]
- Guardrails attached: [POLICY_ID@v] (e.g., input/output filters, tool allow‑list)
- Test fixtures loaded: [YES/NO] · Dataset hash: [HASH]
- Failure budget: [MAX % ERROR or MAX ESCALATIONS]
- Approval path: [REVIEWER] for [RISKY STEPS]
- Rollback ready: [LAST GOOD VERSION] restored? [YES/NO]
3) Inputs — strict contracts (schemas + examples)
Define exactly what the model receives. Keep shapes stable across versions.
- Input Type: [NAME]
- Contract (JSON Schema):
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "[INPUT_NAME]",
  "type": "object",
  "required": ["[FIELD_1]", "[FIELD_2]"],
  "properties": {
    "[FIELD_1]": {"type": "string", "description": "[DESC]"},
    "[FIELD_2]": {"type": "array", "items": {"type": "string"}},
    "[OPTIONAL]": {"type": "number", "minimum": 0}
  }
}
```
- Example payload:
{"[FIELD_1]":"[VALUE]","[FIELD_2]":["a","b"],"[OPTIONAL]":3}
4) Steps — with versioned prompts
Document every model call as a step with a versioned prompt. If a step changes output shape or safety guarantees, bump MAJOR. If instructions or constraints change but the output contract stays stable, bump MINOR. If only wording, typos, or small threshold tweaks land with no change in behavioral intent, bump PATCH.
Step Template (copy for each step):
- Step #: [N] — [STEP NAME]
- Prompt Key: [PROMPT_KEY]
- Current Version: v[MAJOR.MINOR.PATCH]
- Model Tag: [PROVIDER:MODEL:VERSION]
- Parameters: temperature=[X], top_p=[X], max_tokens=[X]
- System Message:
[CONCISE ROLE + RULES]
- Output MUST be valid JSON matching the provided schema.
- Refuse if input contains [DISALLOWED] or asks for [SENSITIVE_ACTION].
- Use source‑of‑truth links: [KB_LINKS].
- User Template:
Task: [TASK]
Inputs (JSON):
```json
[INPUT_PAYLOAD]
```
Quality gates: [BULLETS]
- Expected Output Schema: [LINK TO SECTION 5]
- Validators: [REGEX/SCHEMA/EVAL RULES]
- Guardrails attached: [POLICY_ID@v] (e.g., PII scrub, tool allow‑list)
- Observability: Trace tag(s) = [SOP_NAME], [PROMPT_KEY], [MODEL_TAG]
- Fallback: [RULES WHEN STEP FAILS]
- Tests to run before publish: [N FIXTURES] passing ≥[THRESHOLD]%
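To make the last bullet executable, here is a sketch of a pre-publish gate. It assumes a fixture directory of JSON files shaped like {"input": ..., "expected_status": ...} and a hypothetical `run_step` wrapper around your pinned model call; adapt both to your harness:
```python
# Sketch of the pre-publish gate: run every fixture through the step and block
# publishing when the pass rate falls below threshold.
import json
from pathlib import Path

PASS_THRESHOLD = 0.95  # your [THRESHOLD]% as a fraction

def run_step(payload: dict) -> dict:
    """Hypothetical: call the pinned model+prompt and return parsed JSON output."""
    raise NotImplementedError

def prepublish_gate(fixture_dir: str) -> bool:
    fixtures = sorted(Path(fixture_dir).glob("*.json"))
    passed = 0
    for path in fixtures:
        case = json.loads(path.read_text())
        try:
            out = run_step(case["input"])
            passed += out.get("status") == case["expected_status"]
        except Exception:
            pass  # any crash counts as a failed fixture
    rate = passed / len(fixtures) if fixtures else 0.0
    print(f"{passed}/{len(fixtures)} fixtures pass ({rate:.0%})")
    return rate >= PASS_THRESHOLD
```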
5) Expected outputs — schemas + quality gates
State the exact shape of successful outputs and how you’ll check them.
- Output Type: [NAME]
- Contract (JSON Schema):
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "[OUTPUT_NAME]",
  "type": "object",
  "required": ["status", "data"],
  "properties": {
    "status": {"type": "string", "enum": ["ok", "needs_review", "refused"]},
    "data": {"type": "object"},
    "notes": {"type": "string"}
  }
}
```
- Acceptance checks:
- JSON validates against schema
- Forbidden phrases not present: [LIST]
- Score on eval suite ≥ [THRESHOLD]
- Latency ≤ [MS] and cost ≤ [$]
- Human spot‑check sample size: [N] items per [M] days
6) Known failure modes (and mitigations)
List the predictable ways this can go wrong and how you’ll catch them.
- Injection/Goal drift → Mitigation: strong system message; strip/neutralize untrusted input; attach guardrails [POLICY_ID@v]
- Data leakage (PII/keys) → Mitigation: PII scrub rule; output filter; test fixtures with red‑team prompts
- Hallucinated policy → Mitigation: cite‑required with KB link; eval that fails on missing citations
- Tool misuse (unsafe actions) → Mitigation: tool allow‑list; require human approval for [ACTIONS]
- JSON breakage → Mitigation: response_format=json; repair parser with retry and schema validation (see the sketch below)
- Model churn regressions → Mitigation: pin [MODEL_TAG]; monthly review; back‑compat fixtures
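A sketch of the repair-with-retry mitigation, assuming a hypothetical `call_model` wrapper around your pinned [MODEL_TAG] and the `jsonschema` package. The loop re-asks the model with the validator's error messages appended:
```python
# Sketch of the repair parser: parse, validate against the output schema, and
# on failure re-ask with the validator errors until retries run out.
import json
from jsonschema import Draft202012Validator

def call_model(messages: list[dict]) -> str:
    """Hypothetical: return the raw completion text for a message list."""
    raise NotImplementedError

def parse_with_repair(messages: list[dict], schema: dict, retries: int = 1) -> dict:
    validator = Draft202012Validator(schema)
    convo = list(messages)
    for _ in range(retries + 1):
        raw = call_model(convo)
        try:
            obj = json.loads(raw)
            errors = [e.message for e in validator.iter_errors(obj)]
            if not errors:
                return obj
            problem = "; ".join(errors)
        except json.JSONDecodeError as exc:
            problem = str(exc)
        convo += [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": f"Invalid output ({problem}). "
                                        "Return ONLY corrected JSON matching the schema."},
        ]
    raise ValueError("output failed schema validation after retries")
```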
7) Guardrails — what’s enforced where
Attach and reference guardrails you operate.
- Input filters: [PII, secrets, URLs from unknown domains]
- Output filters: [PII removal, profanity, unsupported claims]
- Policy map: [DOC LINK] → [RULE NAME] → [ENFORCEMENT POINT]
- Tool allow‑list: [TOOLS AVAILABLE TO THE AGENT]
- Human‑in‑the‑loop: [WHEN] → [WHO APPROVES]
- Runtime IDs: Guardrail Policy [POLICY_ID@v] (record here)
Note: If you switch stacks later (e.g., different guardrail vendor), keep the same policy names and versions in this SOP so diffs stay readable.
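Two of these enforcement points run as plain code on any stack. A sketch with illustrative regexes (not a complete PII policy) and a hypothetical allow-list:
```python
# Sketch of two enforcement points: a regex scrub for obvious PII/secrets on
# input, and a hard allow-list check before any tool call.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "api_key": re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{16,}\b"),
}
TOOL_ALLOW_LIST = {"search_kb", "create_draft_reply"}  # your [TOOLS] list

def scrub_input(text: str) -> str:
    """Replace matches with typed redaction markers before the model sees them."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{name.upper()}]", text)
    return text

def check_tool_call(tool_name: str) -> None:
    """Raise instead of letting the agent invoke an unlisted tool."""
    if tool_name not in TOOL_ALLOW_LIST:
        raise PermissionError(f"tool '{tool_name}' is not on the allow-list")
```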
8) Monitoring hooks — traces, evals, alerts
Make regressions easy to see and cheap to fix.
- Tracing: tag every call with [SOP_NAME], [PROMPT_KEY], [MODEL_TAG]
- Evals: [WEEKLY | NIGHTLY] run on [N] fixtures; track pass‑rate trend
- Alerts: error rate > [X%], eval drop > [Y pts], latency > [Z ms], cost/run > [$]
- Dashboards: [TRACE_DASHBOARD_URL] · [EVAL_DASHBOARD_URL]
- Sampling: human review [N] items per [M] days; rotate reviewers across time zones
- Post‑incident note: use the Revision Log entry template (Section 10)
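The alert thresholds above reduce to a small comparison function. A sketch with the bracketed placeholders made concrete for illustration only:
```python
# Sketch of the alert check: compare the latest run metrics with the thresholds.
THRESHOLDS = {
    "error_rate": 0.05,    # [X%]
    "eval_drop_pts": 3.0,  # [Y pts]
    "latency_ms": 4000,    # [Z ms]
    "cost_per_run": 0.10,  # [$]
}

def check_alerts(metrics: dict, baseline_eval_score: float) -> list[str]:
    """Return one alert string per breached threshold; empty means healthy."""
    alerts = []
    if metrics["error_rate"] > THRESHOLDS["error_rate"]:
        alerts.append(f"error rate {metrics['error_rate']:.1%}")
    drop = baseline_eval_score - metrics["eval_score"]
    if drop > THRESHOLDS["eval_drop_pts"]:
        alerts.append(f"eval score dropped {drop:.1f} pts")
    if metrics["latency_ms"] > THRESHOLDS["latency_ms"]:
        alerts.append(f"latency {metrics['latency_ms']} ms")
    if metrics["cost_per_run"] > THRESHOLDS["cost_per_run"]:
        alerts.append(f"cost ${metrics['cost_per_run']:.2f}/run")
    return alerts
```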
9) Rollback plan — hotfix path
When something breaks or a provider change lands, do this in order.
- Freeze — set Status: [DEGRADED]. Route outputs to [SAFE MODE/HUMAN REVIEW].
- Identify — compare failing traces against last good [MODEL_TAG] + [PROMPT_VERSION].
- Roll back — set [PROMPT_KEY] to [ROLLBACK TARGET]. Verify on fixtures (≥[THRESHOLD]% pass). Set Status: [STABLE].
- Patch — create v[PATCH+1] with minimal fix. Rerun evals. If output shape changed, bump MINOR/MAJOR and update Section 5.
- Document — add a Revision Log entry (Section 10). Link traces/evals.
- Prevent — add a new fixture capturing the failure. Update failure modes if needed.
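A sketch of the Freeze and Roll back steps against a hypothetical in-memory prompt registry; your registry, status store, and fixture runner will differ:
```python
# Sketch: freeze, flip the active pointer back to last good, verify on fixtures.
REGISTRY = {"cs_refund_triage": {"active": "1.2.4", "last_good": "1.2.3"}}
STATUS = {}
PASS_THRESHOLD = 0.95  # your [THRESHOLD]% as a fraction

def rollback(prompt_key: str, fixture_pass_rate) -> None:
    STATUS[prompt_key] = "DEGRADED"        # Freeze: route outputs to safe mode
    entry = REGISTRY[prompt_key]
    entry["active"] = entry["last_good"]   # Roll back the active version pointer
    rate = fixture_pass_rate(prompt_key)   # Verify on fixtures before unfreezing
    if rate >= PASS_THRESHOLD:
        STATUS[prompt_key] = "STABLE"
    else:
        raise RuntimeError(f"rollback target failing too ({rate:.0%}); stay in safe mode")
```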
10) Revision log — single source of truth
Keep one row per change. If you use a Notion DB, make these columns.
- Date: [YYYY‑MM‑DD]
- Editor: [NAME]
- Prompt Key: [PROMPT_KEY]
- From → To: [vOLD] → [vNEW]
- Change Type: [MAJOR | MINOR | PATCH]
- Reason/Trigger: [BUG | PROVIDER DEPRECATION | POLICY UPDATE | COST]
- Tests Run: [N]/[TOTAL], Pass‑rate: [%]
- Model Tag: [PROVIDER:MODEL:VERSION]
- Links: [DIFF] · [TRACE VIEW] · [EVAL RUN]
- Rollback Plan Tested: [YES/NO]
- Notes: [WHAT WE LEARNED]
Example entry:
2026‑05‑08 | Kira | support_refund_triage | 1.2.3→1.2.4 | PATCH | JSON repair retry | 48/48 pass | openai:gpt‑4o:2026‑01‑20 | diff/123 · trace/q=refund · eval/456 | rollback tested | tightened refusal rule for out‑of‑scope asks
11) Naming + versioning convention (SemVer + model tag + dataset hash)
Use this everywhere names appear (page title, Prompt Key, datasets, commits).
- Prompt Key: [TEAM]_[DOMAIN]_[TASK] (lowercase, snake_case). Example: cs_refund_triage
- Version: MAJOR.MINOR.PATCH
  - MAJOR: output shape/contract or guardrail guarantees change
  - MINOR: instructions/constraints change; contract stable
  - PATCH: small fixes; no behavioral intent change
- Model Tag: [PROVIDER]:[MODEL]:[RELEASE_DATE] (from vendor docs)
- Dataset/Test Hash: first 8 hex chars of the dataset SHA‑256 (e.g., ds:7fa91b2c)
- Full label (recommended): [PROMPT_KEY]@v[SEMVER]#[MODEL_TAG]+[DATASET_HASH]
  - Example: cs_refund_triage@v1.2.4#openai:gpt‑4o:2026‑01‑20+ds:7fa91b2c
Rule: Never change prompt text without bumping a version. If you change the model, update the Model Tag and rerun evals before publish.
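A sketch of a label helper that follows this convention exactly, including the first-8-hex-chars dataset hash; standard library only:
```python
# Sketch: derive the dataset hash and assemble the full label per the convention.
import hashlib
import re

def dataset_hash(path: str) -> str:
    """First 8 hex chars of the dataset file's SHA-256, prefixed `ds:`."""
    with open(path, "rb") as f:
        return "ds:" + hashlib.sha256(f.read()).hexdigest()[:8]

def full_label(prompt_key: str, semver: str, model_tag: str, ds_hash: str) -> str:
    assert re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)*", prompt_key), "snake_case only"
    assert re.fullmatch(r"\d+\.\d+\.\d+", semver), "MAJOR.MINOR.PATCH"
    return f"{prompt_key}@v{semver}#{model_tag}+{ds_hash}"

print(full_label("cs_refund_triage", "1.2.4", "openai:gpt-4o:2026-01-20", "ds:7fa91b2c"))
# -> cs_refund_triage@v1.2.4#openai:gpt-4o:2026-01-20+ds:7fa91b2c
```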
12) Standing 30‑day change‑control — runbook
Run this every 30 days or when a provider posts a deprecation/changelog item.
- Watchlist (record links in the SOP DB): [OPENAI DEPRECATIONS] · [AZURE OPENAI MODEL RETIREMENTS] · [ANTHROPIC MODEL DEPRECATIONS] · [VERTEX AI DEPRECATIONS] · [OPENAI CHANGELOG]
- Calendar: recurring “LLM Change‑Review” event → Owner
- Before the review:
- Export current pass‑rate, latency, and cost/run
- Check deprecations for your Model Tag(s)
- If a change affects you, create a migration plan and a target date
- During the review:
- Re‑run evals on [N] fixtures
- Spot‑check [M] live outputs
- Decide: [KEEP AS‑IS] | [PATCH/MINOR] | [MAJOR MIGRATION]
- After the review:
- Update Model Tag if needed; run full tests
- Add a Revision Log entry and update Last Updated (auto)
- Confirm Next Review date populated by Formula
13) Mini‑example — Automation (invoice categorization)
A filled example you can adapt. Goal: classify invoices into categories and flag outliers.
Header quickfill:
- SOP Title: Finance — Invoice Categorization
- Owner: [OWNER]
- SemVer: 1.0.0 · Model Tag: [PROVIDER:MODEL:VERSION]
- Temperature Band: LOW (0.1) · Dataset Hash: ds:1a2b3c4d
Inputs schema:
```json
{
  "type": "object",
  "required": ["vendor", "line_items", "total", "currency"],
  "properties": {
    "vendor": {"type": "string"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["desc", "amount"],
        "properties": {"desc": {"type": "string"}, "amount": {"type": "number"}}
      }
    },
    "total": {"type": "number"},
    "currency": {"type": "string", "minLength": 3, "maxLength": 3}
  }
}
```
Step 1 — fin_invoice_categorize v1.0.0
- System:
You are a strict accounting classifier. Map each line item to one of the allowed categories. Output valid JSON only.
Allowed categories: ["software","contractor","travel","office","other"].
Refuse if currency not provided.
- User:
Task: Categorize invoice line items.
Inputs (JSON):
```json
{"vendor":"Acme SaaS","line_items":[{"desc":"Pro plan","amount":49}],"total":49,"currency":"USD"}
```
Quality gates: must sum to total; categories must be from allow‑list.
- Output schema (excerpt):
```json
{"status":"ok","data":{"categories":[{"desc":"Pro plan","category":"software","amount":49}]},"notes":""}
```
- Validators: sum(line_items.amount) == total; categories in allow‑list (see the sketch below)
- Monitoring: Trace tag fin_invoice_categorize@v1.0.0#[MODEL_TAG]
- Failure modes: currency missing → refuse; unknown vendor → still categorize
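Both validators are deterministic, so they can run as a plain post-check on the model output. A sketch:
```python
# Sketch of the two validators: amounts sum to the invoice total, and every
# category comes from the allow-list defined in the system message.
ALLOWED_CATEGORIES = {"software", "contractor", "travel", "office", "other"}

def validate_categorization(invoice: dict, output: dict) -> list[str]:
    problems = []
    categories = output["data"]["categories"]
    if round(sum(c["amount"] for c in categories), 2) != round(invoice["total"], 2):
        problems.append("categorized amounts do not sum to the invoice total")
    off_list = {c["category"] for c in categories} - ALLOWED_CATEGORIES
    if off_list:
        problems.append(f"categories outside allow-list: {sorted(off_list)}")
    return problems
```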
14) Mini‑example — Content (podcast → LinkedIn post)
Turn a transcript into a concise LinkedIn post with a safety gate for claims.
Header quickfill:
- SOP Title: Content — Episode → LinkedIn Post
- Owner: [OWNER]
- SemVer: 1.1.0 · Model Tag: [PROVIDER:MODEL:VERSION]
- Temperature Band: MED (0.5) · Dataset Hash: ds:9e8d7c6b
Inputs schema (excerpt):
{"type":"object","required":["title","transcript","url"],"properties":{"title":{"type":"string"},"transcript":{"type":"string"},"url":{"type":"string"}}}
Step 1 — content_linkedin_post v1.1.0
- System:
You write plainspoken operator posts. 120–180 words. No emojis. Include a single numeric insight if present. Link at the end.
If a sentence includes an unverified claim, add "(source in comments)".
- User:
Task: Draft a LinkedIn post from this episode.
Inputs (JSON): {"title":"[EP TITLE]","transcript":"[RAW TEXT]","url":"[LP URL]"}
Quality gates: 1 idea, 1 number, clear CTA to template link.
- Output schema (excerpt):
{"status":"ok","data":{"post":"[TEXT]","cta":"[URL]"}}
- Guardrails: profanity filter; URL allow‑list
- Monitoring: sample 5 posts/month for human edit time < 3 minutes
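The deterministic gates (word-count band, no emojis, link at the end) can be checked before a human ever sees the draft. A sketch; the emoji character class is a rough illustration, not exhaustive:
```python
# Sketch of the deterministic quality gates for the LinkedIn post step.
import re

EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def check_post(post: str, cta_url: str) -> list[str]:
    problems = []
    words = len(post.split())
    if not 120 <= words <= 180:
        problems.append(f"word count {words} outside 120-180")
    if EMOJI.search(post):
        problems.append("emoji present")
    if not post.rstrip().endswith(cta_url):
        problems.append("CTA link is not at the end")
    return problems
```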
15) Mini‑example — Support (refund triage)
Draft a safe first reply and route edge cases to a human.
Header quickfill:
- SOP Title: Support — Refund Triage
- Owner: [OWNER]
- SemVer: 1.2.4 · Model Tag: [PROVIDER:MODEL:VERSION]
- Temperature Band: LOW (0.2) · Dataset Hash: ds:7fa91b2c
Inputs schema (excerpt):
{"type":"object","required":["message","order_id","customer_tier"],"properties":{"message":{"type":"string"},"order_id":{"type":"string"},"customer_tier":{"type":"string","enum":["free","pro","enterprise"]}}}
Step 1 — cs_refund_triage v1.2.4
- System:
You are a policy‑bound support agent. Obey the refund policy. If request is outside policy or order is older than [DAYS], set status to "needs_review" and explain why.
- User:
Task: Draft a first reply following policy.
Inputs (JSON): {"message":"[TEXT]","order_id":"[ID]","customer_tier":"pro"}
Quality gates: policy link cited; tone neutral; no offers beyond policy.
- Output schema (excerpt):
{"status":"ok","data":{"reply":"[TEXT]","policy_links":["[URL]"]},"notes":"route to human if needs_review"}
- Guardrails: PII scrub; forbidden offers list; refusal template
- Fallback: set status "needs_review" if policy ambiguous
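The fallback reduces to a small routing rule. A sketch, assuming the order date is available and with [DAYS] set to 30 for illustration:
```python
# Sketch of the routing rule: anything marked needs_review (or refused), or any
# order older than the policy window, goes to a human before the reply sends.
from datetime import date, timedelta

POLICY_WINDOW_DAYS = 30  # illustrative value for [DAYS]

def route(output: dict, order_date: date) -> str:
    too_old = date.today() - order_date > timedelta(days=POLICY_WINDOW_DAYS)
    if output.get("status") != "ok" or too_old:
        return "human_review"
    return "send_reply"
```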