Kira: It was two AM in Mexico City. I'm half asleep, phone buzzes — missed call from a US number. Then another one. Then a WhatsApp message: "Hey, I tried calling about your AI content audit. Guess you're closed?"
Santi: Closed.
Kira: Closed. I don't close. I'm a one-person operation running async across three continents. But my phone number has business hours apparently, because I'm a human who sleeps.
Santi: How much was that lead worth?
Kira: If it converted? Probably a twenty-five-hundred-dollar engagement. Maybe more — she mentioned a team of twelve. But by the time I called back at eight AM, she'd already booked a discovery call with someone else. Someone whose phone answered at two AM.
Santi: Not someone. Something.
Kira: Right. An AI voice agent. She told me — "Oh, I called another agency and their system just... handled it. Booked me right in." And I'm lying there thinking, I lost a twenty-five-hundred-dollar client to a robot receptionist while I was asleep.
Santi: And you probably could have built that robot in a weekend.
Kira: That's the part that stung. Because I looked into it — the pricing changed at the end of March. OpenAI updated their Realtime API rates, and suddenly the math works for people like us. Solo operators. Small agencies. Not just enterprise call centers with six-figure budgets.
Santi: The math works if you scope it right. And that's the part everyone gets wrong — they try to build a general-purpose phone agent that handles everything, and it costs a fortune and sounds terrible. When all you actually need is five things.
Kira: So that's what we built. And today we're handing you the whole thing.
Santi: A five-intent AI voice agent — book, reschedule, qualify, deflect pricing questions, and take voicemails — with a spend guard that caps every call and every month so you never wake up to a surprise bill. Plus the consent scripts, the escalation logic, and the exact per-minute cost math so you know what this thing costs before your first caller dials in.
Santi: Okay so before we build anything — what's the actual damage from missed calls? Because I think people underestimate this.
Kira: They do. Run your own numbers. If you're a service business or an agency and you miss one qualified call per night — just one — and your average engagement is, say, fifteen hundred dollars...
Santi: That's forty-five thousand a month, more than half a million a year, in leads that called you and got nothing.
Kira: And those aren't cold leads. Those are people who picked up a phone and dialed your number. That's the highest-intent action a prospect can take.
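One way to run those numbers, as a rough sketch. It assumes one missed qualified call per night and a fifteen-hundred-dollar average engagement, as in the example above, and it counts every missed call at full value, which overstates the loss because not every lead converts. The function name and the 30-night month are illustrative choices.

```python
def missed_call_value(calls_per_night: float, avg_engagement_usd: float, nights: int) -> float:
    """Face value of the engagements behind missed calls over a number of nights."""
    return calls_per_night * avg_engagement_usd * nights


monthly = missed_call_value(1, 1500, 30)    # one call per night for 30 nights
yearly = missed_call_value(1, 1500, 365)    # the same rate over a full year

print(f"30 nights:  ${monthly:,.0f}")
print(f"365 nights: ${yearly:,.0f}")
```

Swap in your own call volume and engagement size; multiplying the result by a realistic close rate gives a more conservative figure.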
Santi: Right. So the question isn't "should I answer after-hours calls." The question is "what's the cheapest way to answer them without hiring a night shift."
Kira: Or without setting an alarm for three AM every time a prospect's in a different time zone.
Santi: Which, if you're a nomad, is always. So — the architecture. This is simpler than people think. You need three pieces. Twilio Programmable Voice handles the phone line — it receives the call and routes it to a webhook. That webhook connects to OpenAI's Realtime API, which is doing the actual conversation — listening, thinking, responding, all in one stream. And then the Realtime session makes function calls out to your calendar and your CRM to actually do things. Book a slot. Create a lead. Send a confirmation text.
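A minimal sketch of the first hop in that architecture: the response a voice webhook could return so that Twilio streams the call audio to a WebSocket service. The `<Connect><Stream>` TwiML verbs are Twilio's; the helper name and the `wss://example.com/voice-bridge` address are placeholders for a relay you would run yourself to pass audio to the Realtime session. The relay and the calendar and CRM function calls are not shown.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(stream_url: str) -> str:
    """Return TwiML telling Twilio to open a two-way media stream to stream_url."""
    if not stream_url.startswith("wss://"):
        raise ValueError("Media stream URL should be a secure WebSocket (wss://)")
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": stream_url})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


# Placeholder URL: replace with the WebSocket endpoint of your own relay service.
twiml = build_stream_twiml("wss://example.com/voice-bridge")
print(twiml)
```

Your web framework of choice would return this string with an XML content type from the URL configured as the phone number's voice webhook.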
Kira: Walk me through what the caller actually experiences though. Because "webhook to Realtime API" means nothing to the person dialing in.
Santi: Fair. So — phone rings. Twilio picks up instantly. The caller hears a greeting within about a second. Something like "Thanks for calling, this line uses AI to assist and is recorded for quality — do I have your permission to continue?" They say yes, and now they're talking to the agent. The agent figures out what they want — do they want to book, reschedule, ask about pricing, whatever — and handles it. If it can't figure out what they want, or if the caller sounds frustrated, it says "let me connect you to a person" and transfers the call.
Kira: And this is the important part — it only does five things. Not fifty. Not "whatever the caller asks." Five.
Santi: Five. Book an appointment. Reschedule an existing one. Qualify a new lead by asking a few targeted questions. Deflect pricing inquiries to your pricing page with a text link. And if it's after hours or something goes sideways, take a voicemail.
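The five-intent boundary can be sketched as a closed set plus a router that hands off anything outside it. This is an illustration, not the pack's call-flow JSON: the label strings, the `route` function, and the `caller_frustrated` flag are hypothetical names, and the sketch assumes some upstream step has already turned the caller's request into a label.

```python
from enum import Enum
from typing import Optional


class Intent(Enum):
    BOOK = "book"
    RESCHEDULE = "reschedule"
    QUALIFY = "qualify"
    DEFLECT_PRICING = "deflect_pricing"
    VOICEMAIL = "voicemail"


ESCALATE = "escalate_to_human"


def route(label: Optional[str], caller_frustrated: bool = False) -> str:
    """Map a classifier label to one of the five intents, or hand off to a person."""
    if caller_frustrated or label is None:
        return ESCALATE
    try:
        return Intent(label.strip().lower()).value
    except ValueError:
        # Anything outside the five intents is not improvised; it is escalated.
        return ESCALATE


print(route("book"))
print(route("tell me your origin story"))
```

Keeping the set closed is the design choice here: adding a sixth capability means adding an enum member on purpose, not letting the agent drift into it.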
Kira: I want to push on the pricing deflection because I think that's where most people would try to get clever. You could train the agent to quote prices, right? Walk through packages?
Santi: You could. And you'd regret it.
Kira: Why?
Santi: Because pricing conversations are where AI voice agents sound the worst. The caller asks a follow-up, the agent hallucinates a discount that doesn't exist, and now you've got a prospect showing up to a call expecting a rate you never offered. The deflection is the smart move — "We publish our pricing online, I'll text you the link right now, and I can book a quick call if you want to talk specifics." Done. Three sentences. No hallucination risk.
Kira: And you've actually converted people off that deflection?
Santi: More than off the long pricing conversation, honestly. Because the text link arrives while they're still on the phone. They're looking at real numbers immediately. And the callback offer means they don't feel dismissed.
Kira: Okay so what does this actually cost per call? Because I've seen people in builder forums saying the Realtime API is still too expensive for small operators.
Santi: It depends on which model you run. OpenAI's Realtime API bills audio by tokens — one token per hundred milliseconds of input audio, one token per fifty milliseconds of output. So for the full model, gpt-realtime-one-point-five, you're looking at roughly nine-point-six cents per minute of audio. For the mini model — about three cents per minute.
Kira: Three cents a minute.
Santi: Audio only. You've still got text tokens on top of that, and telephony — Twilio charges per minute too, and that varies wildly by country. You have to pull your specific rates from Twilio's Pricing API at build time. Don't assume a flat global number.
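The per-minute figures can be reproduced from the token rates, as a sketch. The token durations (100 ms per input token, 50 ms per output token) come from the conversation. The per-million-token prices below are ASSUMPTIONS chosen because they reproduce the quoted 9.6 and 3 cents; check the current OpenAI pricing page before relying on them. The sketch also assumes each call minute bills a full minute of input and a full minute of output audio, and it ignores text tokens and telephony.

```python
INPUT_TOKENS_PER_MIN = 60_000 // 100   # one token per 100 ms of caller audio = 600
OUTPUT_TOKENS_PER_MIN = 60_000 // 50   # one token per 50 ms of agent audio = 1200

# ASSUMED USD prices per one million audio tokens: (input, output).
# Not taken from the conversation; verify against current published pricing.
ASSUMED_PRICES = {
    "full": (32.00, 64.00),
    "mini": (10.00, 20.00),
}


def audio_cost_per_minute(model: str) -> float:
    """Audio-only cost in USD for one minute, billing input and output in full."""
    in_price, out_price = ASSUMED_PRICES[model]
    tokens_cost = INPUT_TOKENS_PER_MIN * in_price + OUTPUT_TOKENS_PER_MIN * out_price
    return tokens_cost / 1_000_000


for model in ASSUMED_PRICES:
    print(f"{model}: {audio_cost_per_minute(model) * 100:.1f} cents per minute")
```

Because output tokens are both twice as frequent and, under these assumed prices, twice as expensive, the agent's own speech dominates the bill, which is one more argument for short scripted replies.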
Kira: So a typical call — what, three minutes? Four?
Santi: With five bounded intents and a max of eight turns, most calls land between two and four minutes. On the mini model, that's six to twelve cents of audio cost per call. Add your telephony minutes on top. For a US-to-US call, you're probably under thirty-five cents total.
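A small per-call estimator for that range, as a sketch. The three-cent audio rate is the mini-model figure from the conversation. The telephony rate is a made-up placeholder, not a Twilio price, and text tokens are left out, so treat the totals as illustrative only.

```python
MINI_AUDIO_RATE_USD = 0.03  # per minute, the mini-model figure quoted above


def estimate_call_cost(minutes: float, telephony_rate_usd: float,
                       audio_rate_usd: float = MINI_AUDIO_RATE_USD) -> dict:
    """Split a call's estimated cost into audio and telephony (text tokens excluded)."""
    audio = minutes * audio_rate_usd
    telephony = minutes * telephony_rate_usd
    return {
        "audio": round(audio, 4),
        "telephony": round(telephony, 4),
        "total": round(audio + telephony, 4),
    }


# PLACEHOLDER per-minute telephony rate: replace with the rate for your callers' country.
PLACEHOLDER_TELEPHONY_RATE_USD = 0.01

for minutes in (2, 4):
    print(minutes, "min:", estimate_call_cost(minutes, PLACEHOLDER_TELEPHONY_RATE_USD))
```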
Kira: Thirty-five cents to book a fifteen-hundred-dollar client.
Santi: And that's why the spend guard matters. You set a per-call cap — say thirty-five cents — and a monthly cap. If a call runs long or something weird happens, the system disconnects or escalates to a human before the bill spirals. The OpenAI docs actually explain why this matters — later turns in a Realtime session cost more because the entire conversation history gets re-sent with each response. So turn eight costs more than turn two.
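Why later turns cost more can be shown with a toy model. It assumes every turn adds the same number of tokens to the history and that the whole history is processed again on each response, with no trimming or caching discounts. The 200-token figure is invented for illustration; real turns vary.

```python
def context_tokens_per_turn(tokens_added_per_turn: int, turns: int) -> list:
    """Context size processed on each turn if the full history is re-sent every time."""
    return [tokens_added_per_turn * turn for turn in range(1, turns + 1)]


# 200 tokens per turn is an illustrative number, not a measurement.
per_turn = context_tokens_per_turn(tokens_added_per_turn=200, turns=8)

print("context per turn:", per_turn)
print("turn 8 vs turn 2:", per_turn[7] / per_turn[1])
print("total processed over 8 turns:", sum(per_turn))
```

Under these assumptions the per-turn cost grows linearly and the running total grows roughly with the square of the turn count, so a cap on turns limits cost more sharply than a cap on minutes alone.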
Kira: Which is why you cap at eight turns.
Santi: Exactly. Eight turns handles booking, rescheduling, qualification — all of it. If someone needs more than eight turns, they need a human anyway.
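A minimal sketch of a spend guard with the three limits discussed: a per-call cap, a monthly cap, and a turn cap. It is not the starter pack's script. The class and method names are hypothetical, the fifty-dollar monthly cap is an invented default, and the running call cost is assumed to come from your own estimator.

```python
from dataclasses import dataclass


@dataclass
class SpendGuard:
    per_call_cap_usd: float = 0.35   # the example cap from the conversation
    monthly_cap_usd: float = 50.00   # hypothetical default; set your own
    max_turns: int = 8
    month_spend_usd: float = 0.0

    def check(self, call_cost_usd: float, turns_completed: int) -> str:
        """Return 'continue', or the name of the limit that was hit."""
        if self.month_spend_usd + call_cost_usd >= self.monthly_cap_usd:
            return "monthly_cap"
        if call_cost_usd >= self.per_call_cap_usd:
            return "per_call_cap"
        if turns_completed >= self.max_turns:
            return "turn_cap"
        return "continue"

    def close_call(self, call_cost_usd: float) -> None:
        """Add a finished call's cost to the month-to-date total."""
        self.month_spend_usd += call_cost_usd


guard = SpendGuard()
print(guard.check(call_cost_usd=0.10, turns_completed=3))
print(guard.check(call_cost_usd=0.10, turns_completed=8))
```

Any result other than `continue` would trigger the disconnect or the handoff to a human; calling `check` after every turn keeps the overshoot to at most one turn's cost.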
Kira: Now, I know some people are going to hear "almost ten cents a minute" for the full model and think that's too rich. What's the alternative?
Santi: You split the stack. Separate speech-to-text, separate language model, separate text-to-speech. Google's Speech-to-Text runs about one-point-six cents per minute. Amazon Polly Neural for TTS is roughly one-point-two cents per minute. So your audio layer drops to about two-point-eight cents combined, with the language model's tokens billed on top. But now you're managing three services instead of one, you're handling the latency between them yourself, and your time-to-first-audio is going to be slower.
Kira: Which matters when someone's on the phone waiting for a response.
Santi: It matters a lot. The Realtime API handles the whole loop — listen, think, speak — in one stream. Sub-two-second response times. With a split stack, you're stitching that together yourself and praying the latency stays under two seconds.
Kira: So the tradeoff is: Realtime costs more per minute but ships faster and sounds better. Split stack costs less but you're building plumbing instead of building your business.
Santi: For a weekend build? Realtime on the mini model. Three cents a minute. Ship it, validate it, optimize later if volume demands it.
Kira: Alright — consent. This is the part that makes me nervous, and I think it should make everyone nervous. You're recording phone calls with an AI. That's legally sensitive territory.
Santi: It is. And I want to be clear — we're not lawyers, this isn't legal advice. But the baseline best practice, according to the Digital Media Law Project, is to announce recording and get acknowledgment before proceeding. Federal law in the US is one-party consent, but a bunch of states require all-party consent. So the safe default is: announce it, ask for permission, and if they say no, offer to text them a booking link or transfer to a human.
Kira: And that consent line is literally the first thing the agent says. Before anything else.
Santi: Before anything else. "This line uses AI to assist and is recorded for quality. Do I have your permission to continue?" If they say no, you pivot. If they say yes, you proceed. One retry if they don't respond clearly, then escalate.
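The consent gate described here (yes proceeds, no pivots, one retry on an unclear answer, then escalate) can be sketched as a small decision function. The keyword lists and return labels are illustrative assumptions, and keyword matching is deliberately crude; a production flow would classify the reply more carefully.

```python
import re

CONSENT_PROMPT = ("This line uses AI to assist and is recorded for quality. "
                  "Do I have your permission to continue?")

YES_WORDS = {"yes", "yeah", "yep", "sure", "okay", "ok"}
NO_WORDS = {"no", "nope", "nah"}


def classify_reply(reply: str) -> str:
    """Very rough yes/no/unclear classification based on single keywords."""
    words = set(re.findall(r"[a-z']+", reply.lower()))
    if words & NO_WORDS:
        return "no"       # a refusal wins if both kinds of word appear
    if words & YES_WORDS:
        return "yes"
    return "unclear"


def consent_step(reply: str, retries_used: int, max_retries: int = 1) -> str:
    """Decide the next action after the caller answers the consent prompt."""
    answer = classify_reply(reply)
    if answer == "yes":
        return "proceed"
    if answer == "no":
        return "offer_link_or_transfer"
    if retries_used < max_retries:
        return "repeat_consent_prompt"
    return "escalate"


print(consent_step("Yes, go ahead", retries_used=0))
print(consent_step("Sorry, what?", retries_used=1))
```

Treating a mixed answer as a refusal is the conservative choice: the cost of wrongly pivoting is a texted booking link, while the cost of wrongly proceeding is recording someone who declined.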
Kira: What about do-not-call? If someone says "take me off your list"?
Santi: Immediate flag. The agent says "I've marked your number as do-not-contact, you won't receive further calls or texts from us." And your system tags that number in the CRM so it never gets dialed again.
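A sketch of the do-not-contact flag, using an in-memory set as a stand-in for the CRM tag. The class is hypothetical, and the normalization assumes numbers always arrive with their country code; the phone numbers are fictional.

```python
import re

DNC_CONFIRMATION = ("I've marked your number as do-not-contact, "
                    "you won't receive further calls or texts from us.")


class DoNotContactList:
    """In-memory stand-in for a do-not-contact tag stored in a CRM."""

    def __init__(self) -> None:
        self._blocked = set()

    @staticmethod
    def normalize(number: str) -> str:
        # Keep digits only so "+1 (415) 555-0100" and "14155550100" compare equal.
        return re.sub(r"\D", "", number)

    def add(self, number: str) -> None:
        self._blocked.add(self.normalize(number))

    def is_blocked(self, number: str) -> bool:
        return self.normalize(number) in self._blocked


dnc = DoNotContactList()
dnc.add("+1 (415) 555-0100")
print(dnc.is_blocked("14155550100"))
print(dnc.is_blocked("+1 415 555 0199"))
```

The check belongs in front of every outbound call and text, including automated confirmations, so a flagged number cannot be reached through a side path.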
Kira: And if you skip that step, you're not just being rude — you're potentially violating federal regulations.
Santi: Correct. The other guardrail people miss is language fallback. If your caller starts speaking Spanish and your agent only handles English, you need a graceful pivot. Not a crash. Not silence. A "puedo ayudar en español" or a transfer.
Kira: So does this actually work at scale? Because we're talking about a weekend build for nomad founders — but is there evidence that bounded voice agents drive real revenue?
Santi: The Melting Pot — the fondue restaurant chain — deployed a voice agent through PolyAI scoped to reservations. Create, modify, cancel, answer FAQs. That's it. Four intents. In six months, they reported two hundred and fifty thousand dollars in revenue from after-hours bookings alone. Sixty-eight percent of reservation calls fully automated.
Kira: Two hundred and fifty thousand. From calls that would have gone to voicemail.
Santi: From calls that would have gone to voicemail. And on the small business side, Goodcall published a case study — a junk removal company using their AI phone assistant saw a twenty-five percent monthly revenue lift and over twenty-five hundred dollars in bookings in thirty days. Just from answering the calls they were already missing.
Kira: And both of those are bounded agents. Not general-purpose "ask me anything" systems. Booking. Scheduling. Qualification. That's the pattern.
Santi: That's the pattern. You don't need your AI voice agent to discuss your company's origin story or debate pricing tiers. You need it to do five things perfectly and hand off everything else.
Kira: Last piece — how do you monitor this when you're in a different time zone every week?
Santi: Every call logs to your workspace — I use Notion, but Airtable works too. Call ID, duration, which intent fired, how many turns, estimated audio cost, whether it escalated, and the outcome — did it book, create a lead, take a voicemail, or bail. You build a simple dashboard that shows you booking rate, escalation rate, average cost per call, and average time-to-first-audio.
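The log record and the four dashboard numbers can be sketched like this. Field names are illustrative, not the starter pack's schema, and `time_to_first_audio_sec` is added to the record on the assumption that it is captured per call, since the dashboard averages it. The sample rows are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallLog:
    call_id: str
    duration_sec: float
    intent: str
    turns: int
    est_audio_cost_usd: float
    escalated: bool
    outcome: str                      # "booked", "lead", "voicemail", or "bailed"
    time_to_first_audio_sec: float    # assumed to be measured per call


def dashboard(calls: list) -> dict:
    """Aggregate call logs into the four headline metrics."""
    total = len(calls)
    if total == 0:
        return {}
    return {
        "booking_rate": sum(c.outcome == "booked" for c in calls) / total,
        "escalation_rate": sum(c.escalated for c in calls) / total,
        "avg_cost_per_call_usd": mean(c.est_audio_cost_usd for c in calls),
        "avg_time_to_first_audio_sec": mean(c.time_to_first_audio_sec for c in calls),
    }


# Invented sample rows, for illustration only.
sample = [
    CallLog("c1", 150, "book", 5, 0.08, False, "booked", 1.0),
    CallLog("c2", 200, "qualify", 7, 0.10, False, "lead", 1.2),
    CallLog("c3", 90, "book", 4, 0.05, True, "bailed", 0.8),
    CallLog("c4", 180, "reschedule", 6, 0.09, False, "booked", 1.0),
]
print(dashboard(sample))
```

Note that the cost average here covers estimated audio cost only; add telephony per call if you want the figure to match the spend guard's view.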
Kira: And you're checking that... when?
Santi: Monday mornings. Ten minutes. If your escalation rate spikes above twenty percent, something's wrong with your prompts or your intent triggers. If your average cost per call is creeping up, your conversations are running too long — tighten the turn cap or simplify the qualification questions. If booking rate drops, check your calendar integration.
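The Monday review can be reduced to three checks, sketched below. The twenty-percent escalation threshold is from the conversation. Comparing cost and booking rate against the previous week is an assumption about what "creeping up" and "drops" mean; the function and key names are hypothetical.

```python
def monday_checks(this_week: dict, last_week: dict,
                  escalation_threshold: float = 0.20) -> list:
    """Return a list of warnings based on this week's metrics versus last week's."""
    warnings = []
    if this_week["escalation_rate"] > escalation_threshold:
        warnings.append("Escalation rate above threshold: "
                        "review prompts and intent triggers.")
    if this_week["avg_cost_per_call_usd"] > last_week["avg_cost_per_call_usd"]:
        warnings.append("Average cost per call is rising: "
                        "tighten the turn cap or simplify qualification questions.")
    if this_week["booking_rate"] < last_week["booking_rate"]:
        warnings.append("Booking rate dropped: check the calendar integration.")
    return warnings


# Invented numbers, for illustration only.
last_week = {"escalation_rate": 0.10, "avg_cost_per_call_usd": 0.08, "booking_rate": 0.50}
this_week = {"escalation_rate": 0.25, "avg_cost_per_call_usd": 0.11, "booking_rate": 0.40}

for warning in monday_checks(this_week, last_week):
    print(warning)
```

A strict week-over-week comparison will flag small random swings; a tolerance band or a multi-week average would make the cost and booking checks quieter.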
Kira: Ten minutes a week to manage a system that's answering your phone twenty-four seven.
Santi: That passes the Lisbon Test. You can do that from a café with one bar of wifi.
Kira: You know what gets me? That lead I lost at two AM — she wasn't even a hard conversion. She wanted to buy. She called me. All I needed was something that picked up the phone and said "when works for you?"
Santi: And now you have it.
Kira: And now anyone listening has it. Five intents. Consent up front. Spend guard on every call. Escalation to a human the moment something goes sideways. That's the whole system. Not an open-ended agent that tries to be everything — a bounded concierge that does five things and does them at three AM while you're asleep in whatever city you woke up in this week.
Santi: The Weekend Voice Concierge Starter Pack is on the Resources page — call-flow JSON, the Make and Zapier blueprint, spend-guard script, consent copy, and the metrics dashboard. Everything we walked through today, ready to paste and deploy.
Kira: Your one homework assignment — before you touch any of that — pull your Twilio per-country rates using their Pricing API. Because your telephony costs depend entirely on where your callers are, and if you skip that step, your spend guard is guessing.
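A sketch of that homework step using only the standard library. The endpoint path and the response fields (`inbound_call_prices`, `number_type`, `current_price`) follow Twilio's Pricing API for voice as best recalled, so treat them as assumptions and confirm against the current reference before use. The sample payload is invented and its prices are not real rates.

```python
import base64
import json
import urllib.request

# Assumed endpoint shape; confirm the version and path in Twilio's Pricing API docs.
PRICING_URL = "https://pricing.twilio.com/v2/Voice/Countries/{iso}"


def fetch_voice_pricing(iso_country: str, account_sid: str, auth_token: str) -> dict:
    """Fetch voice pricing for one country using HTTP basic auth."""
    request = urllib.request.Request(PRICING_URL.format(iso=iso_country.upper()))
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    request.add_header("Authorization", f"Basic {credentials}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def inbound_rates(pricing: dict) -> dict:
    """Map each inbound number type to its current per-minute price as a float."""
    return {
        entry["number_type"]: float(entry["current_price"])
        for entry in pricing.get("inbound_call_prices", [])
    }


# Invented payload in the assumed response shape; the prices are placeholders.
sample_payload = {
    "iso_country": "US",
    "inbound_call_prices": [
        {"number_type": "local", "base_price": "0.0100", "current_price": "0.0100"},
        {"number_type": "toll free", "base_price": "0.0200", "current_price": "0.0200"},
    ],
}
print(inbound_rates(sample_payload))
```

Feeding the fetched rate for each caller country into the spend guard replaces the flat global number with the one that actually applies.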
Santi: Do that first. Then ship the concierge. Then check your dashboard Monday morning.
Kira: That's it for this one. I'm Kira.
Santi: Santi. Go build something that answers the phone.