
How We Built a $0.02/min Voice AI Agent

Gabster Team · 12 min read

When we set out to build Gabster's voice AI capabilities, we had one goal: deliver the same quality as premium voice platforms at a fraction of the cost. Most competitors charge $0.10-0.15 per minute. We wanted to hit $0.02.

This is the technical story of how we achieved it.

The Problem with Existing Voice AI Stacks

Most voice AI platforms are assembled from expensive, best-in-class components:

  • Twilio for telephony - $0.013/min just for the phone connection
  • OpenAI Whisper API for speech-to-text - $0.006/min
  • GPT-4 for reasoning - variable but expensive
  • ElevenLabs for text-to-speech - $0.18 per 1,000 characters

Before any margin, these components alone cost $0.08-0.12 per minute. Add infrastructure, monitoring, and a healthy margin, and you're easily at $0.14+/min.
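To see where the money goes, look at the TTS line alone. A rough per-character-to-per-minute conversion (the ~150 words/min speaking rate and ~5 characters/word are our own ballpark assumptions, not provider figures):

```javascript
// Rough per-minute cost of character-priced TTS, assuming a typical
// speaking rate of ~150 words/min and ~5 characters per word.
function ttsCostPerMinute(pricePerThousandChars, wordsPerMinute = 150, charsPerWord = 5) {
  const charsPerMinute = wordsPerMinute * charsPerWord; // ~750 chars/min
  return (charsPerMinute / 1000) * pricePerThousandChars;
}

// ElevenLabs at $0.18 per 1,000 characters:
ttsCostPerMinute(0.18); // ≈ $0.135/min - TTS alone blows past our target
```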

We asked: what if we optimized each layer from first principles?

Layer 1: Telephony with Telnyx

The first win came from choosing Telnyx over Twilio. While Twilio dominates the market, Telnyx offers comparable quality at dramatically lower prices.

Provider   Voice API rate   Notes
Twilio     $0.013/min       Industry standard
Telnyx     $0.002/min       Owns their network

The difference? Telnyx owns their own global carrier network instead of leasing capacity. This vertical integration translates directly to lower costs.

From a developer experience standpoint, Telnyx is equally good - maybe better. Their WebRTC and SIP support is excellent, and their documentation is comprehensive.

// Placing an outbound call with Telnyx is straightforward
app.post('/voice/outbound', async (req, res) => {
  const call = await telnyx.calls.create({
    connection_id: process.env.TELNYX_CONNECTION_ID,
    to: req.body.to,
    from: process.env.TELNYX_PHONE_NUMBER,
    webhook_url: 'https://api.gabster.link/voice/events'
  });
  res.json({ call_control_id: call.data.call_control_id });
});
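The `webhook_url` above receives call lifecycle events. A minimal sketch of how we route them - the event names follow Telnyx's Call Control API, but the routing table and action names here are our own illustration:

```javascript
// Map Telnyx Call Control webhook events to the action our session
// logic should take. Pure function so the decision is easy to test;
// the actual dispatch (answer, start media, teardown) happens elsewhere.
function routeCallEvent(event) {
  const { event_type, payload } = event.data;
  switch (event_type) {
    case 'call.initiated': return { action: 'answer', id: payload.call_control_id };
    case 'call.answered':  return { action: 'start_media', id: payload.call_control_id };
    case 'call.hangup':    return { action: 'cleanup', id: payload.call_control_id };
    default:               return { action: 'ignore', id: payload.call_control_id };
  }
}

// Express wiring (Telnyx retries delivery unless it gets a 2xx):
// app.post('/voice/events', (req, res) => {
//   dispatch(routeCallEvent(req.body));
//   res.sendStatus(200);
// });
```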

Layer 2: Speech-to-Text on Cloudflare

For speech recognition, we use Cloudflare Workers AI with Deepgram's nova-3 model. This is a relatively new offering that provides fast, accurate transcription at edge locations worldwide.

Cost: $0.0052 per minute of audio

The key advantage isn't just price - it's latency. Because Cloudflare runs inference at edge locations, the audio doesn't need to travel to a central data center. This reduces round-trip latency, which is critical for natural-sounding conversations.

// Transcribe audio chunk using Workers AI
const transcription = await env.AI.run(
  '@cf/deepgram/nova-3',
  {
    audio: audioBuffer,
    language: 'en'
  }
);
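The `audioBuffer` comes from the call's media stream, which Telnyx delivers as JSON frames with base64-encoded payloads over a WebSocket. A sketch of collecting those frames into a buffer for the STT call - the frame shape follows Telnyx's streaming format, while the buffering policy (accumulate until the turn ends) is ours:

```javascript
// Collect base64 audio payloads from media-stream frames into one
// Buffer we can hand to the STT model. Non-media frames (marks,
// stream start/stop events) are skipped.
function collectAudio(frames) {
  const chunks = frames
    .filter((f) => f.event === 'media')
    .map((f) => Buffer.from(f.media.payload, 'base64'));
  return Buffer.concat(chunks);
}
```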

Layer 3: AI Reasoning with Llama 3.3 70B

For the "thinking" part of the voice agent, we use Meta's Llama 3.3 70B model running on Cloudflare Workers AI. This is where things get interesting.

GPT-4 would give us slightly better reasoning, but at 10-20x the cost. For voice agents, the difference is rarely noticeable because:

  1. Conversations are inherently simpler than complex text tasks
  2. Context windows are smaller (you're not processing documents)
  3. Speed matters more than perfect accuracy

Llama 3.3 70B hits the sweet spot. It's fast (critical for voice latency), cost-effective, and more than capable of handling customer support conversations.

Cost: ~$0.0015 per minute of conversation

const response = await env.AI.run(
  '@cf/meta/llama-3.3-70b-instruct-fp8-fast',
  {
    messages: [
      { role: 'system', content: agentSystemPrompt },
      ...conversationHistory,
      { role: 'user', content: transcribedText }
    ],
    stream: true,
    max_tokens: 256  // Voice responses should be concise
  }
);
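With `stream: true`, tokens arrive incrementally, so we don't wait for the full reply before speaking. We cut the stream at sentence boundaries and hand each complete sentence to TTS as it lands - a minimal sketch of that splitter (the regex-based boundary rule is a simplification of what production code needs):

```javascript
// Accumulates streamed tokens and emits complete sentences as soon as
// a terminator (. ! ?) followed by whitespace arrives, so TTS can start
// while the LLM is still generating.
function makeSentenceSplitter() {
  let buffer = '';
  return (token) => {
    buffer += token;
    const done = [];
    let m;
    while ((m = buffer.match(/^([\s\S]*?[.!?])\s+/))) {
      done.push(m[1]);
      buffer = buffer.slice(m[0].length);
    }
    return done;
  };
}

const push = makeSentenceSplitter();
push('Hi there. How can');  // → ['Hi there.']
push(' I help? ');          // → ['How can I help?']
```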

Layer 4: Turn Detection

One of the trickiest parts of voice AI is knowing when the user has finished speaking. Too aggressive, and you interrupt them. Too passive, and there are awkward silences.

We use Cloudflare's smart-turn-v2 model, which is specifically trained to detect conversation turn-taking patterns.

Cost: $0.0003 per minute

This tiny model runs continuously during the call, analyzing audio features to predict when the user is done speaking. It considers:

  • Silence duration
  • Pitch patterns (voices typically fall at the end of sentences)
  • Speech rate changes
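Wiring this into the audio loop looks roughly like the sketch below. The model identifier is Cloudflare's published one; the output field name and the 0.7 confidence threshold are our assumptions for illustration, not values from this post:

```javascript
// Run turn detection on the latest audio window and decide whether to
// respond. The probability field name is an assumption - check the
// model's output schema; the threshold is a tuning choice.
async function userFinishedSpeaking(env, audioChunk) {
  const result = await env.AI.run('@cf/pipecat-ai/smart-turn-v2', {
    audio: audioChunk,
  });
  return isTurnComplete(result.probability, 0.7);
}

// Kept separate so the decision rule is trivially testable.
function isTurnComplete(probability, threshold) {
  return probability >= threshold;
}
```

Tuning the threshold trades interruptions against awkward pauses: raise it and the agent waits longer before speaking, lower it and it jumps in sooner.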

Layer 5: Text-to-Speech with Deepgram Aura

For converting the AI's response back to speech, we use Deepgram's Aura model via Cloudflare.

Cost: $0.015 per 1,000 characters (roughly $0.011 per minute)

The quality is excellent - natural-sounding, with good prosody. We stream the audio as it's generated, so the user starts hearing the response while the AI is still generating the rest.

// Stream TTS as it's generated
const audioStream = await env.AI.run(
  '@cf/deepgram/aura-1',
  {
    text: aiResponse,
    speaker: 'asteria'  // Female voice
  }
);

// Telnyx's speak action takes text (and runs its own TTS), so instead we
// write the raw audio back over the call's bidirectional media WebSocket
for await (const chunk of audioStream) {
  mediaSocket.send(JSON.stringify({
    event: 'media',
    media: { payload: Buffer.from(chunk).toString('base64') }
  }));
}

The Full Stack: Cost Breakdown

When you add it all up:

Component        Provider                 Cost/min
Telephony        Telnyx                   $0.002
Speech-to-Text   Cloudflare (Deepgram)    $0.0052
AI Reasoning     Cloudflare (Llama 3.3)   $0.0015
Turn Detection   Cloudflare               $0.0003
Text-to-Speech   Cloudflare (Deepgram)    $0.011
Total                                     ~$0.020

Infrastructure and monitoring add a fraction of a cent on top, landing us at ~$0.02 per minute.

Compare this to the $0.14+ competitors charge. That's a 7x cost reduction with no compromise in quality.
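Summing the per-component figures as a quick sanity check on the arithmetic:

```javascript
// Per-minute cost of each pipeline stage, as listed above.
const costPerMin = {
  telephony: 0.002,      // Telnyx
  stt: 0.0052,           // Deepgram nova-3
  reasoning: 0.0015,     // Llama 3.3 70B
  turnDetection: 0.0003, // smart-turn-v2
  tts: 0.011,            // Deepgram Aura
};

const total = Object.values(costPerMin).reduce((a, b) => a + b, 0);
// total ≈ 0.020 → ~$0.02/min before infrastructure overhead
```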

Architecture: Putting It Together

The full architecture runs entirely on Cloudflare's edge network:

Phone Call (Telnyx)
    ↓
Cloudflare Worker (Edge)
    ├── STT: Deepgram nova-3
    ├── AI: Llama 3.3 70B
    ├── Turn Detection: smart-turn-v2
    └── TTS: Deepgram Aura
    ↓
Phone Call (Telnyx)

Everything runs at the edge location closest to the caller. There's no central server bottleneck. This architecture naturally scales - Cloudflare handles thousands of concurrent calls without any capacity planning on our part.
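Inside a single Worker invocation, the stages chain together roughly as follows. This is a simplified sketch - streaming, barge-in, and error handling are omitted, the model IDs are the ones used throughout this post, and the response field names (`text`, `response`) are assumptions to check against each model's output schema:

```javascript
// One conversational turn: caller audio in, agent audio out.
async function handleTurn(env, audioBuffer, history, systemPrompt) {
  // 1. Speech-to-text at the edge (transcript field name assumed)
  const { text } = await env.AI.run('@cf/deepgram/nova-3', { audio: audioBuffer });

  // 2. Reasoning over the conversation so far
  const reply = await env.AI.run('@cf/meta/llama-3.3-70b-instruct-fp8-fast', {
    messages: [
      { role: 'system', content: systemPrompt },
      ...history,
      { role: 'user', content: text },
    ],
    max_tokens: 256,
  });

  // 3. Text-to-speech, streamed back to the caller via Telnyx
  return env.AI.run('@cf/deepgram/aura-1', { text: reply.response });
}
```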

Latency Considerations

For voice AI, latency is everything. Humans notice delays of more than 300ms in conversations. Our target was sub-500ms from user speech ending to AI speech beginning.

By running everything at the edge, we hit that target consistently:

  • Turn detection: ~50ms
  • Speech-to-text: ~150ms
  • LLM first token: ~100ms
  • TTS first audio: ~100ms

Total: ~400ms - fast enough to feel natural.

Trade-offs We Made

To hit this price point, we made conscious trade-offs:

Llama vs GPT-4: We use Llama 3.3 70B instead of GPT-4. For complex reasoning tasks, GPT-4 is better. For voice customer support conversations, Llama is more than sufficient - and 10x cheaper.

Deepgram vs ElevenLabs: ElevenLabs has slightly more natural voices. Deepgram is good enough for business use cases and significantly cheaper.

Telnyx vs Twilio: Twilio has better brand recognition. Telnyx has comparable quality at 1/6th the price.

For premium users who want the absolute best, we offer GPT-4 and ElevenLabs as paid upgrades. But 95% of users don't need them.

What We Learned

Building this stack taught us several things:

  1. Most AI costs are margin, not compute. The actual inference cost for voice AI is low. Platforms charge premiums because they can, not because they need to.
  2. Cloudflare Workers AI is underrated. Running inference at the edge with predictable pricing and no capacity planning is a game-changer.
  3. Telnyx is a hidden gem. They own their network and pass savings to customers. Twilio's premium is mostly brand tax.
  4. Latency beats accuracy for voice. Users prefer a slightly less perfect response that comes quickly over a perfect response that takes 2 seconds.

Try It Yourself

We built Gabster because we believe every business should have access to AI voice agents, not just enterprises with big budgets.

At $0.02/minute, a 5-minute customer support call costs $0.10. That's the price of a support email, but with real-time voice interaction.

Ready to try it? Sign up for free and deploy your first voice agent in minutes.

Ready to Build Your AI Agent?

Start free. Deploy in minutes. No credit card required.

Get Started Free