I PUT A PYTHON VOICE AGENT IN A CLOUDFLARE CONTAINER. TURN-TAKING WAS THE WHOLE FIGHT.
A phone rings. The caller talks. Transcribing them is the easy part. The fight is knowing when they’re done. I built a production phone agent that gets this right. A Pipecat pipeline running in a Cloudflare Container, a local turn-detection model loaded per call, scaled to zero the moment the line goes dead.
This is the heavy architecture. A real Python process. A real audio framework. A model pre-baked into the image. It buys you turn-taking that feels human, and it costs you a Docker image, a cold-start dance, and operational weight.
Here’s every millisecond I fought for. And here’s what Cloudflare and Deepgram have shipped since that turns half of it into a platform call.
Three runtimes, one phone call
A single call touches three runtimes handing audio and control signals to each other.
PSTN / cell
     │
┌────▼───────┐   WebSocket (μ-law 8kHz)
│   Twilio   │──────────────────────────────┐
│   Media    │                              │
│  Streams   │                              ▼
└────────────┘            ┌──────────────────────────────────┐
                          │ Cloudflare Worker (Durable       │
                          │ Object: VoiceContainer)          │
                          │ lifecycle + proxy + control      │
                          └─────────────────┬────────────────┘
                                            │ proxied WS / HTTP
                        ┌───────────────────▼───────────────────┐
                        │ Cloudflare Container (Python 3.12)    │
                        │ FastAPI + uvicorn                     │
                        │ 8080 = audio · 9090 = control         │
                        │ Pipecat: in → Deepgram STT →          │
                        │ HybridTurn → ElevenLabs TTS → out     │
                        └───────────────────────────────────────┘
The split is the design. The Worker plus Durable Object is the always-cheap control plane. It owns the container’s lifecycle, proxies Twilio’s media WebSocket inward, and runs a separate control channel for the kill switch and turn events. It stays responsive even while audio I/O saturates the container.
The container is the body. A Python process importing Pipecat, holding a turn-detection model in memory, streaming audio. Containers bill on active CPU and scale to zero, so an idle agent costs nothing. The container has a per-call lifetime. The Worker boots it when a call needs to happen, lets it idle two minutes after the call ends, then SIGTERMs the process and scales to zero.
Pipecat is the orchestration framework. It models the call as a graph of frame processors and handles the fiddly real-time plumbing (serialization, interruptions, sample-rate conversion) so I don’t.
The container is the body, the Durable Object is the brainstem
VoiceContainer extends Container<Env>, bound in wrangler.toml.
[[containers]]
class_name = "VoiceContainer"
image = "./Dockerfile"
instance_type = "standard-4" # 4 vCPU / 12 GiB / 20 GB disk
max_instances = 1
sleepAfter = '2m' # scale to zero 2 minutes after the call
That max_instances = 1 was conservative when I wrote it. Custom instance types now let you name an exact vCPU/memory/disk shape, bounded by standard-4 (4 vCPU, 12 GiB, 20 GB). CPU billing switched to active usage only late last year, so a container that’s booted but idle between audio frames isn’t burning vCPU-seconds the way it used to. The economics of “one container per call” make more sense now.
The interesting code is everything startCall does to survive a cold start.
await this.startAndWaitForPorts({
  ports: [this.defaultPort, this.controlPort], // 8080, 9090
  cancellationOptions: {
    instanceGetTimeoutMS: 60_000,
    portReadyTimeoutMS: 60_000, // default ~20s is too short
  },
});
await this.waitForHealthy(config.callId); // 30s deadline, poll /health @ 400ms
Two cold-start lessons are baked in here. First, ports take longer than the default to bind. Importing Pipecat and loading the turn model pushes startup past the 20s default, so I raise it to 60s. Second, a bound port doesn’t mean the app is serving. The DO polls GET /health every 400ms until it returns {ok: true, ready: true}. Without it, Twilio connects and the caller starts talking before uvicorn is actually accepting the media stream. Only after /health passes and the control WebSocket connects does startCall return.
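The readiness loop is simple enough to sketch. The real one lives in the Durable Object, so treat this as a Python rendering of the same idea rather than the code that ships. The names wait_for_healthy and probe are hypothetical; only the 30s deadline, the 400ms interval and the {ok, ready} body come from the setup described above.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_healthy(
    probe: Callable[[], Awaitable[dict]],
    deadline_s: float = 30.0,
    interval_s: float = 0.4,
) -> bool:
    """Poll `probe` until it reports ok and ready, or the deadline passes.

    `probe` stands in for GET /health. It is a hypothetical callable that
    returns the parsed JSON body, or raises on a transport error.
    """
    give_up_at = time.monotonic() + deadline_s
    while time.monotonic() < give_up_at:
        try:
            body = await probe()
            if body.get("ok") and body.get("ready"):
                return True
        except Exception:
            # Refused connections are expected while the app is still booting.
            pass
        await asyncio.sleep(interval_s)
    return False
```

A bound port only gets you past the first failure mode here (the exception). The ready flag is what separates "process is up" from "pipeline can take audio".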
One more sharp edge. Worker secrets can’t be injected as container environment variables. The runtime rejects it. So the DO collects them and ships them in the body of an /init POST, and the Python side loads them into os.environ before any call work begins.
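On the Python side that handoff can be very small. This is a sketch under assumptions: the "secrets" key, the payload shape and the function name apply_init_secrets are all made up for illustration, and the real /init handler may look different.

```python
import os
from collections.abc import Mapping, MutableMapping


def apply_init_secrets(
    payload: Mapping[str, object],
    environ: MutableMapping[str, str] = os.environ,
) -> list[str]:
    """Copy string secrets from an /init request body into the environment.

    Returns only the names that were set, so logs never need the values.
    The "secrets" key is an assumed payload shape, not a documented one.
    """
    secrets = payload.get("secrets") or {}
    if not isinstance(secrets, Mapping):
        raise ValueError("'secrets' must be an object of name -> value")
    applied = []
    for name, value in secrets.items():
        if not isinstance(name, str) or not isinstance(value, str) or not value:
            continue  # skip empty or non-string entries instead of failing the call
        environ[name] = value
        applied.append(name)
    return sorted(applied)
```

The important part is ordering. This has to run before any provider client is constructed, because most SDKs read their API key from the environment once, at construction time.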
The codec trap nobody warns you about
This is the part I got wrong initially.
| Hop | Format |
|---|---|
| Twilio → container | μ-law (G.711 PCMU), 8 kHz, 20 ms frames (160 bytes) |
| Inside the pipeline (STT input) | linear PCM16, 16 kHz |
| TTS output → Twilio | μ-law, 8 kHz |
The trick is that you don’t hand-configure the codec anywhere. Pipecat’s TwilioFrameSerializer.deserialize() already converts incoming μ-law to PCM and emits frames at the pipeline’s sample rate. I learned this the expensive way. Forcing encoding="mulaw" / sample_rate=8000 on Deepgram made it misread the PCM bytes and produce no transcript at all. Not a bad transcript. None. The fix was to leave Deepgram on its defaults and let the serializer do its job.
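To make the mismatch concrete, here is roughly what the μ-law to PCM step does. This is a standalone sketch of standard G.711 μ-law expansion, not Pipecat's code, and you should not need it in a pipeline where the serializer already runs. The function names are mine.

```python
import struct


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def ulaw_frame_to_pcm16(frame: bytes) -> bytes:
    """Decode a mu-law frame to little-endian PCM16 at the same sample rate."""
    samples = [ulaw_byte_to_pcm16(b) for b in frame]
    return struct.pack(f"<{len(samples)}h", *samples)
```

A 160-byte Twilio frame comes out as 320 bytes of PCM16, still at 8 kHz. Resampling to the pipeline rate is a separate step. Once the bytes are already PCM, telling the transcriber they are μ-law makes it expand them a second time, which is how you end up with noise instead of speech.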
On the output side, one line earns its keep.
task = PipelineTask(pipeline, params=PipelineParams(audio_out_sample_rate=8000))
The default is 24 kHz. Forcing 8 kHz means ElevenLabs streams pcm_8000 (one-third the bytes) so the first audio byte arrives faster, and Twilio’s normal 24k→8k resample becomes a no-op. Twilio plays back at 8 kHz anyway, so there’s no real fidelity loss. STT input stays at 16 kHz on purpose, to protect transcription quality.
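The "one-third the bytes" figure is plain arithmetic, assuming mono 16-bit PCM on the TTS stream. A quick check, with a helper name of my own:

```python
BYTES_PER_PCM16_SAMPLE = 2  # 16-bit samples, mono assumed


def pcm16_bytes_per_second(sample_rate_hz: int) -> int:
    """Raw PCM16 throughput for one mono channel."""
    return sample_rate_hz * BYTES_PER_PCM16_SAMPLE


default_rate = pcm16_bytes_per_second(24_000)  # 48,000 bytes per second
forced_rate = pcm16_bytes_per_second(8_000)    # 16,000 bytes per second
ratio = forced_rate / default_rate             # one third
```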
Turn-taking is the whole game
Real conversation isn’t “wait for silence, then reply.” turn_processor.py is the most interesting file in the project, and it’s where a voicebot becomes something that feels like a person.
End-of-turn detection uses a local model, not just silence. Smart-Turn makes a semantic judgment. Is the speaker pausing to think, or actually done?
def load_smart_turn():
    from pipecat.audio.turn.smart_turn.local_smart_turn_v3 import LocalSmartTurnAnalyzerV3
    from pipecat.audio.turn.smart_turn.base_smart_turn import SmartTurnParams
    return LocalSmartTurnAnalyzerV3(params=SmartTurnParams(stop_secs=0.7))
It runs on CPU at ~65 ms per inference on my standard-4 container, only that fast because the weights are pre-fetched at Docker build time (more on that below). Pipecat’s own benchmark for Smart-Turn v3 is 12 ms on a modern CPU. My number is higher because a shared container core isn’t a benchmark rig, but it’s well inside the budget. The model is small. Whisper Tiny base, a linear classifier head, about 8M parameters. Since I shipped this, a v3.1 dropped with meaningfully better accuracy on English and Spanish, so re-baking the image is a free upgrade. If the model fails to load, the code silently falls back to Silero VAD. stop_secs=0.7 keeps replies snappy. The 3.0s default felt sluggish.
Even with a turn model, the processor adds a text-level safety net, commit now versus hold for more.
HOLD_SECONDS = 1.8  # max silence before committing a trailing-off turn

def _looks_complete(text: str) -> bool:
    t = (text or "").strip()
    if not t:
        return True
    last = re.sub(r"[^a-z']", "", t.split()[-1].lower())
    if last in _NEVER_FINAL:  # "and", "or", "but", "the", "um", ...
        return False
    return bool(_TERMINAL_PUNCT.search(t))  # ends with . ! ?
Ends in . ! ? → commit, respond now. Trails off on “…and” or filler → hold up to 1.8s, and if the caller resumes, merge the held text with the new final instead of firing two turns. That 1.8s has a tuning history written in the comments. 1.5 → 3.0 (cut paced callers off) → 1.2 (too aggressive) → 1.8 (the sweet spot). You don’t guess these numbers. You earn them on real calls.
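The hold-and-merge behavior fits in a small state machine. What follows is a sketch, not the contents of turn_processor.py: the HoldAndMerge class and its method names are hypothetical, _NEVER_FINAL holds only the words shown in the comment above, and the punctuation regex is my guess at _TERMINAL_PUNCT. Time is passed in explicitly so the logic is testable without a clock.

```python
import re
from typing import Optional

HOLD_SECONDS = 1.8
# Assumed contents: only the members the comment above names.
_NEVER_FINAL = {"and", "or", "but", "the", "um"}
_TERMINAL_PUNCT = re.compile(r"[.!?]\s*$")  # assumed definition


def looks_complete(text: str) -> bool:
    t = (text or "").strip()
    if not t:
        return True
    last = re.sub(r"[^a-z']", "", t.split()[-1].lower())
    if last in _NEVER_FINAL:
        return False
    return bool(_TERMINAL_PUNCT.search(t))


class HoldAndMerge:
    """Hypothetical holder: commit complete finals, hold trailing-off ones."""

    def __init__(self, hold_seconds: float = HOLD_SECONDS) -> None:
        self.hold_seconds = hold_seconds
        self._held: Optional[str] = None
        self._held_at = 0.0

    def on_final(self, text: str, now: float) -> Optional[str]:
        """Feed a final transcript; return a turn to commit, or None to hold."""
        merged = f"{self._held} {text}".strip() if self._held else text.strip()
        if not merged:
            return None
        if looks_complete(merged):
            self._held = None
            return merged
        self._held, self._held_at = merged, now  # restart the hold window
        return None

    def on_tick(self, now: float) -> Optional[str]:
        """Commit the held text once the hold window has elapsed."""
        if self._held is not None and now - self._held_at >= self.hold_seconds:
            turn, self._held = self._held, None
            return turn
        return None
```

Merging before re-checking completeness is the piece that prevents two turns from firing. "I need a table and" followed by "two chairs." commits once, as one sentence.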
Barge-in fires on an interim transcript, ~100 to 300 ms in, not on the final, so the cutoff feels instant.
if (self._agent_speaking() and not self._interrupted
        and not self._opener_active and not _is_backchannel(frame.text)):
    self._interrupted = True
    await self.broadcast_interruption()  # flush TTS + Twilio output
    if self._current_turn and not self._current_turn.done():
        self._current_turn.cancel()  # kill the in-flight LLM stream
A backchannel set ({"mm", "mhm", "yeah", "right", "ok", "gotcha", ...}) is excluded. Those mean “I’m listening,” not “stop talking,” and get dropped while the agent speaks.
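A backchannel check along these lines is enough to get started. This is a sketch: the set holds only the words listed above (the real one is longer), and the "every word must be a backchannel" rule is my assumption about how multi-word interims like "yeah, right" should be treated.

```python
import re

# Only the members listed above; the real set is longer.
_BACKCHANNELS = {"mm", "mhm", "yeah", "right", "ok", "gotcha"}


def is_backchannel(text: str) -> bool:
    """True when every word in an interim transcript is a listening noise."""
    words = re.findall(r"[a-z']+", (text or "").lower())
    return bool(words) and all(w in _BACKCHANNELS for w in words)
```

Note that empty text returns False here, so a caller using this sketch would want to ignore empty interims before reaching the barge-in check.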
There’s more in here than I can fit. An opener gate that swallows the caller’s “hello?” while the greeting plays, then answers it after a 2.5s grace window. A 6s dead-air timer that nudges (“Still there? No rush”). A single-threaded asyncio turn worker pulling from an unbounded queue, so audio never blocks on the LLM and a turn that throws gets logged and skipped instead of crashing the call. And a small delight. A regex that respells “555-555-5555” digit-by-digit so TTS doesn’t say “five billion,” while leaving prices and years alone.
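The phone-number respelling is the easiest of these to show. A minimal sketch, assuming North American 3-3-4 grouping with "-", "." or a space as separators; the pattern and function name are mine, not the project's.

```python
import re

# Assumed pattern: 3-3-4 digit groups, not glued to other digits, "$" or ".".
_PHONE = re.compile(r"(?<![\d$.])(\d{3})[-. ](\d{3})[-. ](\d{4})(?!\d)")


def respell_phone_numbers(text: str) -> str:
    """Rewrite phone numbers as spaced digits so TTS reads them one by one."""

    def spell(match: re.Match) -> str:
        # "555" -> "5 5 5"; groups are joined with commas for a short pause.
        return ", ".join(" ".join(group) for group in match.groups())

    return _PHONE.sub(spell, text)
```

Requiring all three groups is what leaves years and prices alone: "2025" and "$19.99" never match a 3-3-4 shape.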
Here’s the currency twist. I built this entire turn-taking apparatus on top of Deepgram nova-3, which is a plain transcription model. It tells you what was said, not whether the turn is over. This spring Deepgram shipped Flux, a conversational speech-recognition model with end-of-turn detection integrated into the model itself. Accurate turn decisions in under 400 ms, at the STT layer. That’s a chunk of my HybridTurnProcessor collapsed into the transcriber. I haven’t swapped it in (I like owning the hold-and-merge and backchannel logic), but the next person building this gets turn detection for free in the same WebSocket that gives them text.
The LLM is a streaming fallback ladder
The LLM runs through Cloudflare AI Gateway with a provider ladder.
_SUBSTANTIVE_LADDER = [
    ("anthropic", "claude-haiku-4-5-20251001"),
    ("openai", "gpt-4o-mini"),
    ("workers-ai", "llama-3.3-70b"),
]
Primary is Claude Haiku 4.5, still Anthropic’s fast/cheap workhorse as of today, so no change there. On failure it drops to GPT-4o-mini, then Llama 3.3 70B as a last resort. Each fallback drops a Sentry breadcrumb so degraded calls are traceable. Every request sends cf-aig-collect-log-payload: false so transcripts aren’t persisted in the Gateway, and the HTTP client has an 8s timeout. A timeout just trips the ladder.
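The ladder itself reduces to a loop with a timeout. This sketch assumes a hypothetical call(provider, model) coroutine that returns the reply text; complete_with_ladder and on_fallback are illustrative names, and on_fallback is where something like a Sentry breadcrumb would go.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, Tuple

Provider = Tuple[str, str]  # (provider, model), as in _SUBSTANTIVE_LADDER


async def complete_with_ladder(
    ladder: Sequence[Provider],
    call: Callable[[str, str], Awaitable[str]],
    timeout_s: float = 8.0,
    on_fallback: Callable[[Provider, Exception], None] = lambda rung, err: None,
) -> str:
    """Try each rung in order; a timeout or error moves to the next one."""
    last_error: Optional[Exception] = None
    for provider, model in ladder:
        try:
            return await asyncio.wait_for(call(provider, model), timeout=timeout_s)
        except Exception as err:  # includes the timeout raised by wait_for
            last_error = err
            on_fallback((provider, model), err)  # e.g. record a breadcrumb
    raise RuntimeError("every provider in the ladder failed") from last_error
```

Catching Exception rather than BaseException matters: task cancellation (a barge-in killing the in-flight turn) still propagates instead of being mistaken for a provider failure.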
The primary provider is streamed, so TTS starts mid-reply.
async for delta in transport.post_stream(primary, ...):
    if first and delta:
        first = False
        await on_first_token()  # latency marker
    buf += delta
    buf = await _emit_sentences(buf, _say)  # speak each finished sentence
A sentence splitter emits each complete sentence as it forms, and TTS begins speaking sentence 1 while the LLM is still writing sentence 2. The fallback rule is careful. If streaming fails before any audio was spoken, drop to the non-streaming ladder. If it fails after the agent already started talking, raise instead. Never double-speak.
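Both halves of that, the sentence emitter and the never-double-speak rule, can be sketched together. Assumptions are labeled: the sentence boundary is a naive "punctuation followed by whitespace" regex that does not handle abbreviations, and speak_reply, stream and fallback are hypothetical names rather than the project's.

```python
import re
from typing import AsyncIterator, Awaitable, Callable

# Assumed boundary: ., ! or ? followed by whitespace. Abbreviations are not handled.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


async def emit_sentences(buf: str, say: Callable[[str], Awaitable[None]]) -> str:
    """Speak every complete sentence in `buf`; return the unfinished remainder."""
    parts = _SENTENCE_END.split(buf)
    for sentence in parts[:-1]:
        if sentence.strip():
            await say(sentence.strip())
    return parts[-1]


async def speak_reply(
    stream: AsyncIterator[str],
    say: Callable[[str], Awaitable[None]],
    fallback: Callable[[], Awaitable[str]],
) -> None:
    """Stream a reply sentence by sentence, falling back only if nothing was spoken."""
    spoke = False

    async def tracked_say(sentence: str) -> None:
        nonlocal spoke
        spoke = True
        await say(sentence)

    buf = ""
    try:
        async for delta in stream:
            buf = await emit_sentences(buf + delta, tracked_say)
        if buf.strip():
            await tracked_say(buf.strip())  # flush the last sentence at end of stream
    except Exception:
        if spoke:
            raise  # audio already went out: never double-speak
        await say(await fallback())  # nothing spoken yet: safe to use the ladder
```

The final flush is easy to forget. A reply that ends in "?" with no trailing whitespace never crosses a boundary, so without it the last sentence would sit in the buffer unspoken.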
Control plane is best-effort, audio lives
Audio runs on port 8080. A separate control channel runs on port 9090, so safety signals never compete with audio. Every turn, before generating a reply, the processor polls a kill switch.
action = await self.control.poll_killswitch(self.call_id, ti)
if action != "none":
    if action != "hard":
        await self.speak(_CLOSING_LINE)  # warm = graceful goodbye line
    return  # skip the LLM call entirely
none → continue. warm → speak a closing line, then stop. hard → stop immediately. Everything here is best-effort. The poll has a 1.5s timeout, and on timeout or any transport error it returns "none" and the call proceeds. If the control channel never connects, the pipeline runs with a NullControl() where every operation is a safe no-op. Turn events fire off the critical path with asyncio.create_task. Awaiting them added ~0.13s in front of every spoken reply.
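The best-effort contract is easiest to see as a wrapper. In this sketch poll is a hypothetical coroutine that asks the control channel for an action, and send_turn_event is an invented method name standing in for whatever the real control object exposes. Treating an unrecognized action as "none" is my assumption.

```python
import asyncio
from typing import Awaitable, Callable

VALID_ACTIONS = {"none", "warm", "hard"}


async def poll_killswitch_best_effort(
    poll: Callable[[], Awaitable[str]], timeout_s: float = 1.5
) -> str:
    """Ask the control plane for an action; any failure means "carry on"."""
    try:
        action = await asyncio.wait_for(poll(), timeout=timeout_s)
    except Exception:  # timeout or transport error
        return "none"
    return action if action in VALID_ACTIONS else "none"


class NullControl:
    """Stand-in used when the control channel never connected: all no-ops."""

    async def poll_killswitch(self, call_id: str, turn_index: int) -> str:
        return "none"

    async def send_turn_event(self, event: dict) -> None:  # hypothetical method
        return None
```

Because NullControl has the same shape as the real control object, the turn processor never branches on "is control connected". It calls the same methods and gets harmless answers.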
Everything is latency-obsessed, everything degrades safely
The whole system is two ideas applied relentlessly. Shave milliseconds everywhere. Never let a non-audio failure kill the call.
| Guard | Value | Purpose |
|---|---|---|
| Port-ready timeout | 60 s | Cold start, Pipecat import + model load |
| Health poll | 30 s / 400 ms | Confirm uvicorn is truly serving |
| Smart-Turn stop_secs | 0.7 s | Snappy end-of-turn |
| HOLD_SECONDS | 1.8 s | Merge trailing-off speech |
| Kill-switch poll | 1.5 s → none | Never block on control plane |
| LLM request | 8 s → ladder | Bound provider latency |
| Call TTL | 900 s | Hard call cap |
| sleepAfter | 2 min | Scale container to zero |
And the degradation ladder. Turn model fails to load → Silero VAD. LLM provider fails → next rung of the ladder. Control channel fails → NullControl. Audio always lives. The one non-obvious line in the whole Dockerfile is the model pre-fetch. Baking the Smart-Turn weights into the image at build time is what makes a local turn model viable at conversational latency. PYTHONUNBUFFERED=1 matters too, because the container lives seconds-to-minutes and unbuffered stdout is the only way logs reliably surface in wrangler tail.
What shipped underneath me
When I started, this heavy architecture was the only way to get a local turn model and a real audio framework onto Cloudflare. That’s no longer true. At Agents Week this April, Cloudflare shipped Realtime Agents. Workers AI now hosts Deepgram Nova-3, Flux, Aura, Whisper-large-v3-turbo and Llama 3.3 70B with WebSocket inference, paired with Cloudflare Realtime for WebRTC transport, and it launched with @cf/pipecat-ai/smart-turn-v2 running on the edge. Voice-to-voice round trips land in the ~800 ms range when every hop terminates in the same data center.
Read that against my stack. A turn model on the edge. STT and TTS on Workers AI. A runtime that orchestrates the pipeline. That’s the hard half of what I’m running in a Python container, now offered as platform primitives. No image, no cold-start choreography, no weights to pre-fetch.
I’m not tearing my caller down. I own the codec path, the hold-and-merge tuning, the opener gate, the backchannel filter, and owning them is the point when you need them to behave exactly your way. But the gap between “I built the whole pipeline” and “I called the runtime” is closing fast.
The point
A real framework plus a real model buys you real turn-taking. Smart-Turn for semantic end-of-turn, sentence-streamed TTS, backchannel filtering, opener gating, merge-on-trailing-off. That’s the difference between a voicebot and something that feels like a person, and Pipecat’s frame graph makes the orchestration tractable. Containers plus Durable Objects gives you a clean control/data split. The DO is the cheap, always-up brainstem. The container is the per-call body that scales to zero. The cost is operational weight. A Docker image, a Python runtime, cold-start choreography, a model to pre-fetch. That cost is shrinking. Container limits jumped 15x, custom instance types arrived, billing got cheaper, and the platform is absorbing turn detection at both the STT layer (Deepgram Flux) and the runtime layer (Cloudflare Realtime Agents). The heavy bet still wins when you need to own every millisecond. It just stopped being the only bet. We have options now.
This is the first teardown in the voice-agent series: the heavy architecture, with a dedicated audio framework and a local turn model. The next one, part 2 of 3, builds the same phone agent without the container, purely on the Cloudflare edge. Stay tuned.