Episode 9 · April 20, 2026

Ship a Weekend Voice Concierge: 5 Intents, Spend Guards, and Zero Surprises

Intro

This episode is for nomad founders and agencies who are tired of missing high-intent leads because they're asleep in different time zones. You'll get a complete blueprint for a bounded voice concierge that handles five specific intents with bulletproof guardrails, plus the exact cost math and compliance scripts to deploy safely this weekend.

In This Episode

Kira opens with a painful story: losing a $2,500 lead at 2 AM in Mexico City because her phone had "business hours." Santi explains why OpenAI's March 2026 Realtime API pricing changes make bounded voice agents financially viable for solo operators, then walks through the three-piece architecture: Twilio Programmable Voice routing to OpenAI Realtime with function calls to your calendar and CRM. They scope the agent to exactly five intents — book, reschedule, qualify, deflect pricing, and voicemail — and build in strict guardrails including consent scripts, spend caps, and automatic escalation to humans when confidence drops below 60%. The episode includes real case studies from The Melting Pot ($250k in after-hours bookings) and small businesses seeing 25% revenue lifts, plus a complete cost breakdown comparing Realtime API to split STT/TTS stacks.

Key Takeaways

A five-intent voice concierge on OpenAI's mini model costs under roughly 35 cents per US-to-US call, audio plus telephony, so it can capture leads worth thousands without requiring you to answer your phone at 3 AM.
Deflecting pricing questions to a texted link with a callback offer converts better, in Santi's experience, than training the agent to quote packages, and it removes the hallucination risk while keeping prospects engaged.
The Melting Pot reported $250k in after-hours bookings over six months using a bounded agent with just four intents, evidence that a tight scope can beat a complex general-purpose agent.

Companion Resource

Weekend Voice Concierge Starter Pack (5‑Intent Blueprint + Spend Guard)

Copy‑paste templates to deploy a five‑intent, after‑hours voice concierge with strict handoffs and a spend guard. Includes call‑flow JSON, Make/Zapier wiring, consent/escalation copy, and a metrics dashboard you can use on day one.

Sources

OpenAI API Pricing
developers.openai.com
- OpenAI lists gpt-realtime-1.5 audio pricing at $32 per 1M input audio tokens and $64 per 1M output audio tokens; gpt-realtime-mini is $10 (input) and $20 (output) per 1M audio tokens.

OpenAI Docs – Managing costs (Realtime)
developers.openai.com
- OpenAI’s Realtime cost model defines audio tokenization as 1 token per 100 ms for user (input) audio and 1 token per 50 ms for assistant (output) audio.

OpenAI API Pricing + Managing costs
developers.openai.com
- Derived per‑minute estimate for gpt-realtime-1.5 audio: input ≈$0.0192/min (600 tokens/min × $32/1M); output ≈$0.0768/min (1200 tokens/min × $64/1M); combined ≈$0.096/min excluding text tokens.

OpenAI API Pricing – Transcription models
developers.openai.com
- OpenAI lists gpt‑4o‑transcribe with an estimated cost of $0.006 per minute for transcription (non‑Realtime).

AWS Amazon Polly Pricing
aws.amazon.com
- Amazon Polly pricing: Standard voices $4/1M chars, Neural $16/1M chars; Amazon’s own examples equate 1M characters to ~23h 8m (~720 chars/min).

Derived from AWS Amazon Polly examples
aws.amazon.com
- Using Amazon’s 1M chars ≈ 23h 8m reference, approximate Neural TTS is ≈$0.0115/min ($16/1,388 min), and Standard ≈$0.0029/min.

ElevenLabs API Pricing
elevenlabs.io
- ElevenLabs API lists pay‑as‑you‑go TTS at $0.05 per 1K characters (Flash/Turbo) and $0.10 per 1K (Multilingual v2/v3).

Derived from ElevenLabs pricing + AWS char/min example
elevenlabs.io
- Using AWS’s ~720 chars/min yardstick, ElevenLabs Flash/Turbo ≈$0.036/min and Multilingual v2/v3 ≈$0.072/min (TTS only).

Google Cloud Speech‑to‑Text Pricing
cloud.google.com
- Google Speech‑to‑Text V2 pricing (Standard recognition): $0.016/min up to 500k minutes per month, with tiered discounts at higher volumes.

Twilio Voice Pricing (US) and Voice Pricing API docs
twilio.com
- Twilio’s Programmable Voice rates vary by country, direction, and number type; Twilio provides a Pricing API to fetch account‑specific, real‑time voice minute rates by country.

PulseSignal – OpenAI Price Change History
getpulsesignal.com
- OpenAI changelog and third‑party price trackers confirm recent pricing page updates; PulseSignal shows OpenAI pricing page last verified Apr 12, 2026 and notes items with effective changes around Mar 31, 2026.

Digital Media Law Project (Recording Phone Calls and Conversations)
dmlp.org
- Recording consent laws vary; federal law is one‑party consent, but some U.S. states require all‑party consent. Best practice for recorded business calls is to announce recording and proceed only after notice/consent.

PolyAI customer story: The Melting Pot
poly.ai
- The Melting Pot (US fondue franchise) deployed a PolyAI voice agent integrated with OpenTable across locations.
- Shows a bounded, bookings-first voice concierge driving real revenue and handling core intents (create/modify/cancel reservations; FAQs) after hours.

Goodcall case study (vendor)
goodcall.com
- Bye Junk and multiple SMBs using the Goodcall AI phone assistant.
- Demonstrates that simple, bounded intents (answering, booking, basic Q&A) can quickly translate into measurable revenue for small service businesses.

PolyAI Restaurants page (case study index)
poly.ai
- Restaurant sector outcomes across brands (Fogo de Chão, Côte Brasserie, Big Table Group).
- Corroborates that well-scoped booking/FAQ flows can achieve high automation and material revenue; supports the five-intent scope for this episode.

Kira: It was two AM in Mexico City. I'm half asleep, phone buzzes — missed call from a US number. Then another one. Then a WhatsApp message: "Hey, I tried calling about your AI content audit. Guess you're closed?"

Santi: Closed.

Kira: Closed. I don't close. I'm a one-person operation running async across three continents. But my phone number has business hours apparently, because I'm a human who sleeps.

Santi: How much was that lead worth?

Kira: If it converted? Probably a twenty-five-hundred-dollar engagement. Maybe more — she mentioned a team of twelve. But by the time I called back at eight AM, she'd already booked a discovery call with someone else. Someone whose phone answered at two AM.

Santi: Not someone. Something.

Kira: Right. An AI voice agent. She told me — "Oh, I called another agency and their system just... handled it. Booked me right in." And I'm lying there thinking, I lost a twenty-five-hundred-dollar client to a robot receptionist while I was asleep.

Santi: And you probably could have built that robot in a weekend.

Kira: That's the part that stung. Because I looked into it — the pricing changed at the end of March. OpenAI updated their Realtime API rates, and suddenly the math works for people like us. Solo operators. Small agencies. Not just enterprise call centers with six-figure budgets.

Santi: The math works if you scope it right. And that's the part everyone gets wrong — they try to build a general-purpose phone agent that handles everything, and it costs a fortune and sounds terrible. When all you actually need is five things.

Kira: So that's what we built. And today we're handing you the whole thing.

Santi: A five-intent AI voice agent — book, reschedule, qualify, deflect pricing questions, and take voicemails — with a spend guard that caps every call and every month so you never wake up to a surprise bill. Plus the consent scripts, the escalation logic, and the exact per-minute cost math so you know what this thing costs before your first caller dials in.

Santi: Okay so before we build anything — what's the actual damage from missed calls? Because I think people underestimate this.

Kira: They do. Run your own numbers. If you're a service business or an agency and you miss one qualified call per night — just one — and your average engagement is, say, fifteen hundred dollars...

Santi: That's forty-five thousand a month in leads that called you and got nothing.

Kira: And those aren't cold leads. Those are people who picked up a phone and dialed your number. That's the highest-intent action a prospect can take.

Santi: Right. So the question isn't "should I answer after-hours calls." The question is "what's the cheapest way to answer them without hiring a night shift."

Kira: Or without setting an alarm for three AM every time a prospect's in a different time zone.

Santi: Which, if you're a nomad, is always. So — the architecture. This is simpler than people think. You need three pieces. Twilio Programmable Voice handles the phone line — it receives the call and routes it to a webhook. That webhook connects to OpenAI's Realtime API, which is doing the actual conversation — listening, thinking, responding, all in one stream. And then the Realtime session makes function calls out to your calendar and your CRM to actually do things. Book a slot. Create a lead. Send a confirmation text.
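
A minimal sketch of the first hop Santi describes, where Twilio answers and hands the call to your server. The episode only says the call is routed "to a webhook"; using Twilio's Media Streams TwiML (`<Connect><Stream>`) is an assumption here, and `build_stream_twiml` and the `wss://` URL are placeholder names. The bridge that relays audio between Twilio and the Realtime session is not shown.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(websocket_url: str) -> str:
    """Return TwiML telling Twilio to stream call audio to a WebSocket bridge."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    # The URL is a placeholder for your own bridge server.
    ET.SubElement(connect, "Stream", {"url": websocket_url})
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # Your voice webhook would return this string with an XML content type.
    print(build_stream_twiml("wss://example.com/media-stream"))
```

Your webhook would return this document when Twilio requests instructions for an incoming call; everything after that happens over the stream.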

Kira: Walk me through what the caller actually experiences though. Because "webhook to Realtime API" means nothing to the person dialing in.

Santi: Fair. So — phone rings. Twilio picks up instantly. The caller hears a greeting within about a second. Something like "Thanks for calling, this line uses AI to assist and is recorded for quality — do I have your permission to continue?" They say yes, and now they're talking to the agent. The agent figures out what they want — do they want to book, reschedule, ask about pricing, whatever — and handles it. If it can't figure out what they want, or if the caller sounds frustrated, it says "let me connect you to a person" and transfers the call.

Kira: And this is the important part — it only does five things. Not fifty. Not "whatever the caller asks." Five.

Santi: Five. Book an appointment. Reschedule an existing one. Qualify a new lead by asking a few targeted questions. Deflect pricing inquiries to your pricing page with a text link. And if it's after hours or something goes sideways, take a voicemail.
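
One way to keep the scope to exactly five intents is a closed enum with a router that escalates anything it does not recognize. This is a sketch, not the Starter Pack's actual call-flow: the handler names are hypothetical, and in a real build the intent label would come from the Realtime session's function call.

```python
from enum import Enum


class Intent(Enum):
    BOOK = "book"
    RESCHEDULE = "reschedule"
    QUALIFY = "qualify"
    DEFLECT_PRICING = "deflect_pricing"
    VOICEMAIL = "voicemail"


# Hypothetical handler names; swap in your own calendar and CRM functions.
HANDLERS = {
    Intent.BOOK: "create_booking",
    Intent.RESCHEDULE: "move_booking",
    Intent.QUALIFY: "create_lead",
    Intent.DEFLECT_PRICING: "text_pricing_link",
    Intent.VOICEMAIL: "record_voicemail",
}


def route(intent_label: str) -> str:
    """Map an intent label to a handler; anything outside the five escalates."""
    try:
        intent = Intent(intent_label)
    except ValueError:
        return "escalate_to_human"
    return HANDLERS[intent]
```

Because the enum is closed, a caller asking for something outside the five (a price quote, for example) falls through to the human handoff instead of being improvised.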

Kira: I want to push on the pricing deflection because I think that's where most people would try to get clever. You could train the agent to quote prices, right? Walk through packages?

Santi: You could. And you'd regret it.

Kira: Why?

Santi: Because pricing conversations are where AI voice agents sound the worst. The caller asks a follow-up, the agent hallucinates a discount that doesn't exist, and now you've got a prospect showing up to a call expecting a rate you never offered. The deflection is the smart move — "We publish our pricing online, I'll text you the link right now, and I can book a quick call if you want to talk specifics." Done. Three sentences. No hallucination risk.

Kira: And you've actually converted people off that deflection?

Santi: More than off the long pricing conversation, honestly. Because the text link arrives while they're still on the phone. They're looking at real numbers immediately. And the callback offer means they don't feel dismissed.

Kira: Okay so what does this actually cost per call? Because I've seen people in builder forums saying the Realtime API is still too expensive for small operators.

Santi: It depends on which model you run. OpenAI's Realtime API bills audio by tokens: one token per hundred milliseconds of input audio, one token per fifty milliseconds of output. So for the full model, gpt-realtime-1.5, you're looking at roughly nine-point-six cents per minute of audio. For the mini model, about three cents per minute.
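
The per-minute figures can be reproduced from the token rules Santi quotes and the rates listed in the episode's sources. The sketch assumes a full minute of input audio and a full minute of output audio are both billed, which is the same simplification behind the "combined" estimate; rates change, so check the current pricing page before relying on these constants.

```python
# Token rates from the Realtime cost docs cited in the show notes.
INPUT_TOKENS_PER_MIN = 60_000 // 100   # 1 token per 100 ms -> 600
OUTPUT_TOKENS_PER_MIN = 60_000 // 50   # 1 token per 50 ms -> 1200

# USD per 1M audio tokens (input, output), as listed in the show notes.
AUDIO_RATES = {
    "gpt-realtime-1.5": (32.0, 64.0),
    "gpt-realtime-mini": (10.0, 20.0),
}


def audio_cost_per_minute(model: str) -> float:
    """Audio-only cost in USD for one minute of input plus one of output."""
    input_rate, output_rate = AUDIO_RATES[model]
    total = INPUT_TOKENS_PER_MIN * input_rate + OUTPUT_TOKENS_PER_MIN * output_rate
    return total / 1_000_000


if __name__ == "__main__":
    for name in AUDIO_RATES:
        print(name, round(audio_cost_per_minute(name), 4))
```

Text tokens and telephony minutes are not included, so treat the result as a floor rather than the full per-minute cost.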

Kira: Three cents a minute.

Santi: Audio only. You've still got text tokens on top of that, and telephony — Twilio charges per minute too, and that varies wildly by country. You have to pull your specific rates from Twilio's Pricing API at build time. Don't assume a flat global number.

Kira: So a typical call — what, three minutes? Four?

Santi: With five bounded intents and a max of eight turns, most calls land between two and four minutes. On the mini model, that's six to twelve cents of audio cost per call. Add your telephony minutes on top. For a US-to-US call, you're probably under thirty-five cents total.

Kira: Thirty-five cents to book a fifteen-hundred-dollar client.

Santi: And that's why the spend guard matters. You set a per-call cap — say thirty-five cents — and a monthly cap. If a call runs long or something weird happens, the system disconnects or escalates to a human before the bill spirals. The OpenAI docs actually explain why this matters — later turns in a Realtime session cost more because the entire conversation history gets re-sent with each response. So turn eight costs more than turn two.

Kira: Which is why you cap at eight turns.

Santi: Exactly. Eight turns handles booking, rescheduling, qualification — all of it. If someone needs more than eight turns, they need a human anyway.
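
A spend guard along these lines could be sketched as below. The 35-cent per-call cap and the eight-turn cap come from the conversation; the $50 monthly cap and the action names are placeholders, and `call_cost` is assumed to be your own running estimate of audio plus telephony for the call in progress.

```python
from dataclasses import dataclass


@dataclass
class SpendGuard:
    per_call_cap: float = 0.35   # USD, from the episode
    monthly_cap: float = 50.00   # USD, placeholder: pick your own
    max_turns: int = 8           # from the episode
    month_spend: float = 0.0     # running total of closed calls

    def check(self, call_cost: float, turns: int) -> str:
        """Decide what to do after each turn of the current call."""
        if self.month_spend + call_cost >= self.monthly_cap:
            return "stop_monthly_cap"
        if call_cost >= self.per_call_cap:
            return "escalate_call_cap"
        if turns >= self.max_turns:
            return "escalate_turn_cap"
        return "continue"

    def close_call(self, call_cost: float) -> None:
        """Add a finished call's cost to the monthly total."""
        self.month_spend += call_cost
```

Calling `check` after every turn means a runaway conversation is caught at the next response rather than on the invoice.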

Kira: Now — I know some people are going to hear "nine cents a minute" for the full model and think that's too rich. What's the alternative?

Santi: You split the stack. Separate speech-to-text, separate language model, separate text-to-speech. Google's Speech-to-Text runs about one-point-six cents per minute. Amazon Polly Neural for TTS is roughly one-point-two cents per minute. So your audio layer drops to under three cents a minute combined, before the language model's own cost. But now you're managing three services instead of one, you're handling the latency between them yourself, and your time-to-first-audio is going to be slower.
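
The split-stack audio estimate can be worked through from the figures in the show-notes sources. This sketch covers speech-to-text and text-to-speech only; the language model's token cost and telephony are left out, and the Polly number leans on Amazon's own 1M characters ≈ 23h 8m example.

```python
GOOGLE_STT_PER_MIN = 0.016                  # USD per minute, Standard recognition
POLLY_NEURAL_PER_MILLION_CHARS = 16.0       # USD per 1M characters
POLLY_MINUTES_PER_MILLION_CHARS = 23 * 60 + 8   # Amazon's example: 1,388 minutes


def polly_neural_per_minute() -> float:
    """Approximate Neural TTS cost in USD per minute of speech."""
    return POLLY_NEURAL_PER_MILLION_CHARS / POLLY_MINUTES_PER_MILLION_CHARS


def split_stack_audio_per_minute() -> float:
    """STT plus TTS in USD per minute, before any language-model cost."""
    return GOOGLE_STT_PER_MIN + polly_neural_per_minute()


if __name__ == "__main__":
    print(round(polly_neural_per_minute(), 4))
    print(round(split_stack_audio_per_minute(), 4))
```

The combined figure lands a little under three cents a minute, which is why the comparison only looks dramatic against the full Realtime model.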

Kira: Which matters when someone's on the phone waiting for a response.

Santi: It matters a lot. The Realtime API handles the whole loop — listen, think, speak — in one stream. Sub-two-second response times. With a split stack, you're stitching that together yourself and praying the latency stays under two seconds.

Kira: So the tradeoff is: Realtime costs more per minute but ships faster and sounds better. Split stack costs less but you're building plumbing instead of building your business.

Santi: For a weekend build? Realtime on the mini model. Three cents a minute. Ship it, validate it, optimize later if volume demands it.

Kira: Alright — consent. This is the part that makes me nervous, and I think it should make everyone nervous. You're recording phone calls with an AI. That's legally sensitive territory.

Santi: It is. And I want to be clear — we're not lawyers, this isn't legal advice. But the baseline best practice, according to the Digital Media Law Project, is to announce recording and get acknowledgment before proceeding. Federal law in the US is one-party consent, but a bunch of states require all-party consent. So the safe default is: announce it, ask for permission, and if they say no, offer to text them a booking link or transfer to a human.

Kira: And that consent line is literally the first thing the agent says. Before anything else.

Santi: Before anything else. "This line uses AI to assist and is recorded for quality. Do I have your permission to continue?" If they say no, you pivot. If they say yes, you proceed. One retry if they don't respond clearly, then escalate.
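
The consent gate Santi outlines (yes proceeds, no pivots, one retry on an unclear answer, then escalate) can be sketched as a small decision function. The keyword matching is only a stand-in for whatever the model or your speech layer actually returns, and the action names are hypothetical.

```python
import re

CONSENT_PROMPT = (
    "This line uses AI to assist and is recorded for quality. "
    "Do I have your permission to continue?"
)
MAX_RETRIES = 1  # one retry on an unclear answer, per the episode

YES_WORDS = {"yes", "yeah", "yep", "sure", "okay", "ok"}
NO_WORDS = {"no", "nope"}


def classify_reply(reply: str) -> str:
    """Crude yes/no/unclear classifier; a placeholder for real understanding."""
    words = set(re.findall(r"[a-z']+", reply.lower()))
    said_yes = bool(words & YES_WORDS)
    said_no = bool(words & NO_WORDS)
    if said_yes and not said_no:
        return "yes"
    if said_no and not said_yes:
        return "no"
    return "unclear"


def consent_step(reply: str, retries_used: int) -> str:
    """Return the next action after the caller answers the consent prompt."""
    verdict = classify_reply(reply)
    if verdict == "yes":
        return "proceed"
    if verdict == "no":
        return "offer_link_or_transfer"
    return "retry" if retries_used < MAX_RETRIES else "escalate_to_human"
```

Keeping this gate outside the model's free-form conversation makes it harder for a confused exchange to slide past the consent step.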

Kira: What about do-not-call? If someone says "take me off your list"?

Santi: Immediate flag. The agent says "I've marked your number as do-not-contact, you won't receive further calls or texts from us." And your system tags that number in the CRM so it never gets dialed again.
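
A do-not-contact flag can be as simple as a normalized set that every outbound call or text checks first. In this sketch an in-memory set stands in for the CRM tag, and the normalization is deliberately naive: it assumes numbers already include a country code.

```python
import re

DNC_CONFIRMATION = (
    "I've marked your number as do-not-contact, "
    "you won't receive further calls or texts from us."
)


def normalize_number(raw: str) -> str:
    """Strip formatting so the same number always compares equal."""
    return "+" + re.sub(r"\D", "", raw)


class DoNotContactList:
    """In-memory stand-in for a do-not-contact tag in your CRM."""

    def __init__(self) -> None:
        self._numbers: set[str] = set()

    def add(self, raw_number: str) -> str:
        """Flag the number and return the line the agent should say."""
        self._numbers.add(normalize_number(raw_number))
        return DNC_CONFIRMATION

    def is_blocked(self, raw_number: str) -> bool:
        """Check before any outbound call or text."""
        return normalize_number(raw_number) in self._numbers
```

The important part is the check on the outbound side: flagging the number only helps if every dialer and texting workflow consults the same list.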

Kira: And if you skip that step, you're not just being rude — you're potentially violating federal regulations.

Santi: Correct. The other guardrail people miss is language fallback. If your caller starts speaking Spanish and your agent only handles English, you need a graceful pivot. Not a crash. Not silence. A "puedo ayudar en español" or a transfer.

Kira: So does this actually work at scale? Because we're talking about a weekend build for nomad founders — but is there evidence that bounded voice agents drive real revenue?

Santi: The Melting Pot — the fondue restaurant chain — deployed a voice agent through PolyAI scoped to reservations. Create, modify, cancel, answer FAQs. That's it. Four intents. In six months, they reported two hundred and fifty thousand dollars in revenue from after-hours bookings alone. Sixty-eight percent of reservation calls fully automated.

Kira: Two hundred and fifty thousand. From calls that would have gone to voicemail.

Santi: From calls that would have gone to voicemail. And on the small business side, Goodcall published a case study — a junk removal company using their AI phone assistant saw a twenty-five percent monthly revenue lift and over twenty-five hundred dollars in bookings in thirty days. Just from answering the calls they were already missing.

Kira: And both of those are bounded agents. Not general-purpose "ask me anything" systems. Booking. Scheduling. Qualification. That's the pattern.

Santi: That's the pattern. You don't need your AI voice agent to discuss your company's origin story or debate pricing tiers. You need it to do five things perfectly and hand off everything else.

Kira: Last piece — how do you monitor this when you're in a different time zone every week?

Santi: Every call logs to your workspace — I use Notion, but Airtable works too. Call ID, duration, which intent fired, how many turns, estimated audio cost, whether it escalated, and the outcome — did it book, create a lead, take a voicemail, or bail. You build a simple dashboard that shows you booking rate, escalation rate, average cost per call, and average time-to-first-audio.
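
The log record and the four dashboard numbers Santi lists could look roughly like this. Field names are assumptions, and `first_audio_sec` is added to the record because the dashboard needs it; the outcome labels mirror the ones mentioned in the conversation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallLog:
    call_id: str
    duration_sec: float
    intent: str
    turns: int
    audio_cost_usd: float
    escalated: bool
    outcome: str            # "booked", "lead", "voicemail", or "bailed"
    first_audio_sec: float  # time until the caller first hears the agent


def dashboard_metrics(calls: list[CallLog]) -> dict[str, float]:
    """Aggregate a batch of call logs into the four headline numbers."""
    if not calls:
        return {}
    total = len(calls)
    return {
        "booking_rate": sum(c.outcome == "booked" for c in calls) / total,
        "escalation_rate": sum(c.escalated for c in calls) / total,
        "avg_cost_usd": mean([c.audio_cost_usd for c in calls]),
        "avg_first_audio_sec": mean([c.first_audio_sec for c in calls]),
    }
```

Whether the rows live in Notion or Airtable, exporting them into a structure like this keeps the weekly numbers reproducible.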

Kira: And you're checking that... when?

Santi: Monday mornings. Ten minutes. If your escalation rate spikes above twenty percent, something's wrong with your prompts or your intent triggers. If your average cost per call is creeping up, your conversations are running too long — tighten the turn cap or simplify the qualification questions. If booking rate drops, check your calendar integration.
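
The Monday check can be reduced to three rules. Only the twenty percent escalation threshold comes from the episode; comparing cost and booking rate against the previous week is an assumption made here to give "creeping up" and "drops" a concrete meaning, and the dictionary keys are hypothetical.

```python
ESCALATION_ALERT = 0.20  # threshold from the episode


def weekly_review(current: dict[str, float], previous: dict[str, float]) -> list[str]:
    """Compare this week's metrics to last week's and list what to look at."""
    notes = []
    if current["escalation_rate"] > ESCALATION_ALERT:
        notes.append("Escalation rate above 20%: review prompts and intent triggers.")
    if current["avg_cost_usd"] > previous["avg_cost_usd"]:
        notes.append("Cost per call rising: tighten the turn cap or simplify qualification.")
    if current["booking_rate"] < previous["booking_rate"]:
        notes.append("Booking rate down: check the calendar integration.")
    return notes


if __name__ == "__main__":
    last_week = {"booking_rate": 0.40, "escalation_rate": 0.10, "avg_cost_usd": 0.10}
    this_week = {"booking_rate": 0.30, "escalation_rate": 0.25, "avg_cost_usd": 0.14}
    for note in weekly_review(this_week, last_week):
        print(note)
```

An empty list means nothing needs attention, which is what keeps the review to ten minutes.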

Kira: Ten minutes a week to manage a system that's answering your phone twenty-four seven.

Santi: That passes the Lisbon Test. You can do that from a café with one bar of wifi.

Kira: You know what gets me? That lead I lost at two AM — she wasn't even a hard conversion. She wanted to buy. She called me. All I needed was something that picked up the phone and said "when works for you?"

Santi: And now you have it.

Kira: And now anyone listening has it. Five intents. Consent up front. Spend guard on every call. Escalation to a human the moment something goes sideways. That's the whole system. Not an open-ended agent that tries to be everything — a bounded concierge that does five things and does them at three AM while you're asleep in whatever city you woke up in this week.

Santi: The Weekend Voice Concierge Starter Pack is on the Resources page — call-flow JSON, the Make and Zapier blueprint, spend-guard script, consent copy, and the metrics dashboard. Everything we walked through today, ready to paste and deploy.

Kira: Your one homework assignment — before you touch any of that — pull your Twilio per-country rates using their Pricing API. Because your telephony costs depend entirely on where your callers are, and if you skip that step, your spend guard is guessing.

Santi: Do that first. Then ship the concierge. Then check your dashboard Monday morning.

Kira: That's it for this one. I'm Kira.

Santi: Santi. Go build something that answers the phone.

AI voice agent, OpenAI Realtime API, Twilio, nomad business, call automation, lead capture, weekend build, spend guards, compliance, booking automation