How We Built a $0.02/min Voice AI Agent
When we set out to build Gabster's voice AI capabilities, we had one goal: deliver the same quality as premium voice platforms at a fraction of the cost. Most competitors charge $0.10-0.15 per minute. We wanted to hit $0.02.
This is the technical story of how we achieved it.
The Problem with Existing Voice AI Stacks
Most voice AI platforms are assembled from expensive, best-in-class components:
- Twilio for telephony - $0.013/min just for the phone connection
- OpenAI Whisper API for speech-to-text - $0.006/min
- GPT-4 for reasoning - variable but expensive
- ElevenLabs for text-to-speech - $0.18 per 1,000 characters
Before any margin, these components alone cost $0.08-0.12 per minute. Add infrastructure, monitoring, and a healthy margin, and you're easily at $0.14+/min.
We asked: what if we optimized each layer from first principles?
Layer 1: Telephony with Telnyx
The first win came from choosing Telnyx over Twilio. While Twilio dominates the market, Telnyx offers comparable quality at dramatically lower prices.
| Provider | Voice API | Notes |
|---|---|---|
| Twilio | $0.013/min | Industry standard |
| Telnyx | $0.002/min | Owns their network |
The difference? Telnyx owns their own global carrier network instead of leasing capacity. This vertical integration translates directly to lower costs.
From a developer experience standpoint, Telnyx is equally good - maybe better. Their WebRTC and SIP support is excellent, and their documentation is comprehensive.
// Initiating an outbound call with Telnyx is straightforward
app.post('/voice/outbound', async (req, res) => {
const call = await telnyx.calls.create({
connection_id: process.env.TELNYX_CONNECTION_ID,
to: req.body.to,
from: process.env.TELNYX_PHONE_NUMBER,
webhook_url: 'https://api.gabster.link/voice/events'
});
res.sendStatus(200); // acknowledge the webhook so Telnyx doesn't retry
});

Layer 2: Speech-to-Text on Cloudflare
For speech recognition, we use Cloudflare Workers AI with Deepgram's nova-3 model. This is a relatively new offering that provides fast, accurate transcription at edge locations worldwide.
Cost: $0.0052 per minute of audio
The key advantage isn't just price - it's latency. Because Cloudflare runs inference at edge locations, the audio doesn't need to travel to a central data center. This reduces round-trip latency, which is critical for natural-sounding conversations.
// Transcribe audio chunk using Workers AI
const transcription = await env.AI.run(
'@cf/deepgram/nova-3',
{
audio: audioBuffer,
language: 'en'
}
);

Layer 3: AI Reasoning with Llama 3.3 70B
For the "thinking" part of the voice agent, we use Meta's Llama 3.3 70B model running on Cloudflare Workers AI. This is where things get interesting.
GPT-4 would give us slightly better reasoning, but at 10-20x the cost. For voice agents, the difference is rarely noticeable because:
- Conversations are inherently simpler than complex text tasks
- Context windows are smaller (you're not processing documents)
- Speed matters more than perfect accuracy
Llama 3.3 70B hits the sweet spot. It's fast (critical for voice latency), cost-effective, and more than capable of handling customer support conversations.
Cost: ~$0.0015 per minute of conversation
const response = await env.AI.run(
'@cf/meta/llama-3.3-70b-instruct-fp8-fast',
{
messages: [
{ role: 'system', content: agentSystemPrompt },
...conversationHistory,
{ role: 'user', content: transcribedText }
],
stream: true,
max_tokens: 256 // Voice responses should be concise
}
);

Layer 4: Turn Detection
One of the trickiest parts of voice AI is knowing when the user has finished speaking. Too aggressive, and you interrupt them. Too passive, and there are awkward silences.
We use Cloudflare's smart-turn-v2 model, which is specifically trained to detect conversation turn-taking patterns.
Cost: $0.0003 per minute
This tiny model runs continuously during the call, analyzing audio features to predict when the user is done speaking. It considers:
- Silence duration
- Pitch patterns (voices typically fall at the end of sentences)
- Speech rate changes
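As a sketch of how these signals combine - assuming the model is exposed in the Workers AI catalog as `@cf/pipecat-ai/smart-turn-v2` and returns a probability-style score (check the catalog for the exact id and output shape), with illustrative thresholds rather than tuned production values:

```javascript
// Decide whether the caller has finished speaking, combining the
// turn-detection model's score with a simple silence-based fallback.
// Thresholds are illustrative, not tuned production values.
function isEndOfTurn(turnProbability, silenceMs) {
  if (silenceMs > 1200) return true;   // hard fallback: long silence always ends the turn
  if (silenceMs < 200) return false;   // almost certainly still mid-utterance
  return turnProbability > 0.7;        // in between, trust the model
}

// Inside the Worker, the score would come from the model, e.g.:
// const prediction = await env.AI.run(
//   '@cf/pipecat-ai/smart-turn-v2',
//   { audio: audioChunk }
// );
// isEndOfTurn(prediction.probability, silenceMs);
```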
Layer 5: Text-to-Speech with Deepgram Aura
For converting the AI's response back to speech, we use Deepgram's Aura model via Cloudflare.
Cost: $0.015 per 1,000 characters (roughly $0.011 per minute)
The quality is excellent - natural-sounding, with good prosody. We stream the audio as it's generated, so the user starts hearing the response while the AI is still generating the rest.
// Stream TTS as it's generated
const audioStream = await env.AI.run(
'@cf/deepgram/aura-1',
{
text: aiResponse,
speaker: 'asteria' // voice selection; see the model docs for available speakers
}
);
// Pipe the audio to the caller. streamAudioToCall is a placeholder
// helper: Telnyx's speak command accepts text, not audio, so generated
// audio is delivered over its media streaming interface instead.
await streamAudioToCall(callId, audioStream);

The Full Stack: Cost Breakdown
When you add it all up:
| Component | Provider | Cost/min |
|---|---|---|
| Telephony | Telnyx | $0.002 |
| Speech-to-Text | Cloudflare (Deepgram) | $0.0052 |
| AI Reasoning | Cloudflare (Llama 3.3) | $0.0015 |
| Turn Detection | Cloudflare | $0.0003 |
| Text-to-Speech | Cloudflare (Deepgram) | $0.011 |
| Total | | ~$0.020 |
These components sum to roughly $0.02 per minute, which is exactly where our price lands.
Compare this to the $0.14+ competitors charge. That's a 7x cost reduction with no compromise in quality.
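A quick sketch that sanity-checks the arithmetic, using the component prices from the table:

```javascript
// Per-minute component costs from the table above (USD).
const COSTS_PER_MIN = {
  telephony: 0.002,      // Telnyx
  stt: 0.0052,           // Deepgram nova-3 via Cloudflare
  llm: 0.0015,           // Llama 3.3 70B via Cloudflare
  turnDetection: 0.0003, // smart-turn-v2
  tts: 0.011,            // Deepgram Aura via Cloudflare
};

// Total cost of a call of the given length, before margin.
function costPerCall(minutes) {
  const perMin = Object.values(COSTS_PER_MIN).reduce((a, b) => a + b, 0);
  return perMin * minutes;
}

console.log(costPerCall(5).toFixed(3)); // a 5-minute call
```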
Architecture: Putting It Together
The full architecture runs entirely on Cloudflare's edge network:
Phone Call (Telnyx)
↓
Cloudflare Worker (Edge)
├── STT: Deepgram nova-3
├── AI: Llama 3.3 70B
├── Turn Detection: smart-turn-v2
└── TTS: Deepgram Aura
↓
Phone Call (Telnyx)

Everything runs at the edge location closest to the caller. There's no central server bottleneck. This architecture naturally scales - Cloudflare handles thousands of concurrent calls without any capacity planning on our part.
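Stitched together, one conversational turn through these layers looks roughly like this inside the Worker - a sketch using the model ids above, with response shapes simplified for readability (the real calls stream rather than return whole payloads):

```javascript
// One conversational turn: caller audio in, agent audio out.
// Runs inside a Cloudflare Worker with an AI binding (env.AI).
async function handleTurn(env, audioBuffer, history, systemPrompt) {
  // 1. Speech-to-text at the edge
  const { text } = await env.AI.run('@cf/deepgram/nova-3', {
    audio: audioBuffer,
    language: 'en',
  });

  // 2. Reasoning with Llama 3.3 70B
  const { response } = await env.AI.run(
    '@cf/meta/llama-3.3-70b-instruct-fp8-fast',
    { messages: buildMessages(systemPrompt, history, text), max_tokens: 256 }
  );

  // 3. Text-to-speech on the generated reply
  return env.AI.run('@cf/deepgram/aura-1', { text: response });
}

// Pure helper: assemble the chat history for the LLM call.
function buildMessages(systemPrompt, history, userText) {
  return [
    { role: 'system', content: systemPrompt },
    ...history,
    { role: 'user', content: userText },
  ];
}
```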
Latency Considerations
For voice AI, latency is everything. Humans notice delays of more than 300ms in conversations. Our target was sub-500ms from user speech ending to AI speech beginning.
By running everything at the edge, we hit that target consistently:
- Turn detection: ~50ms
- Speech-to-text: ~150ms
- LLM first token: ~100ms
- TTS first audio: ~100ms
Total: ~400ms - fast enough to feel natural.
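Keeping that budget as data makes it easy to check each stage against the 500ms target (the numbers mirror the list above):

```javascript
// Per-stage latency budget (ms), from the measurements above.
const LATENCY_BUDGET_MS = {
  turnDetection: 50,
  stt: 150,
  llmFirstToken: 100,
  ttsFirstAudio: 100,
};

// Sum the stage latencies.
function totalLatency(budget) {
  return Object.values(budget).reduce((a, b) => a + b, 0);
}

// Does the end-to-end pipeline fit inside the target?
function withinTarget(budget, targetMs = 500) {
  return totalLatency(budget) <= targetMs;
}
```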
Trade-offs We Made
To hit this price point, we made conscious trade-offs:
Llama vs GPT-4: We use Llama 3.3 70B instead of GPT-4. For complex reasoning tasks, GPT-4 is better. For voice customer support conversations, Llama is more than sufficient - and 10x cheaper.
Deepgram vs ElevenLabs: ElevenLabs has slightly more natural voices. Deepgram is good enough for business use cases and significantly cheaper.
Telnyx vs Twilio: Twilio has better brand recognition. Telnyx has comparable quality at 1/6th the price.
For premium users who want the absolute best, we offer GPT-4 and ElevenLabs as paid upgrades. But 95% of users don't need them.
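One way to wire that upgrade path is a per-tier model map. The identifiers below are illustrative, and the premium providers are called through their own APIs rather than Workers AI:

```javascript
// Per-tier model selection (identifiers illustrative).
const TIER_MODELS = {
  standard: {
    llm: '@cf/meta/llama-3.3-70b-instruct-fp8-fast', // Workers AI
    tts: '@cf/deepgram/aura-1',                      // Workers AI
  },
  premium: {
    llm: 'gpt-4',                  // via the OpenAI API
    tts: 'eleven_multilingual_v2', // via the ElevenLabs API
  },
};

// Unknown tiers fall back to the standard stack.
function modelsForTier(tier) {
  return TIER_MODELS[tier] ?? TIER_MODELS.standard;
}
```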
What We Learned
Building this stack taught us several things:
- Most AI costs are margin, not compute. The actual inference cost for voice AI is low. Platforms charge premiums because they can, not because they need to.
- Cloudflare Workers AI is underrated. Running inference at the edge with predictable pricing and no capacity planning is a game-changer.
- Telnyx is a hidden gem. They own their network and pass savings to customers. Twilio's premium is mostly brand tax.
- Latency beats accuracy for voice. Users prefer a slightly less perfect response that comes quickly over a perfect response that takes 2 seconds.
Try It Yourself
We built Gabster because we believe every business should have access to AI voice agents, not just enterprises with big budgets.
At $0.02/minute, a 5-minute customer support call costs $0.10. That's the price of a support email, but with real-time voice interaction.
Ready to try it? Sign up for free and deploy your first voice agent in minutes.
Ready to Build Your AI Agent?
Start free. Deploy in minutes. No credit card required.
Get Started Free