Price a voice agent before you build it.
Compare STT, TTS, and LLM providers at your real call volume. See cost per minute, cost per call, and where the money goes.
- No signup
- Edit every rate
- Includes hosting and latency
01 Providers
Pick your stack
02 Volume and call shape
Tell us what a call looks like
03 Rate overrides
Override any rate
Defaults load from the provider you picked. Edit any line to model a custom contract or volume discount.
04 Result
Your cost breakdown
Per minute
$0.0166
Per call
$0.083
Per month
$415
Tokens per call
Input
10,238
Output
488
Total
10,726
Input tokens grow with every turn because conversation history is replayed each time.
Planning estimate. Real bills shift with retries, volume discounts, and silence trimming.
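As a sanity check, the token counts above can be reproduced from the input and output token formulas in the methodology section. The inputs in this sketch (150 words per minute, 1.3 tokens per word, 4 turns per minute, a 5-minute call, and a 50% agent speech share) are assumptions: they were chosen because they reproduce the displayed figures, and they are not confirmed to be the calculator's defaults. The function names are illustrative.

```python
import math

# Assumed inputs (not confirmed defaults); picked because they reproduce
# the token counts shown in the result panel.
WORDS_PER_MIN = 150
TOKENS_PER_WORD = 1.3
TURNS_PER_MIN = 4
CALL_LENGTH_MIN = 5
AGENT_SPEECH_SHARE = 0.5


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with .5 always rounding up."""
    return math.floor(x + 0.5)


def input_tokens(wpm, tokens_per_word, turns_per_min, length_min):
    """Tokens per turn times the triangular number of turns, N(N+1)/2."""
    tokens_per_turn = wpm * tokens_per_word / turns_per_min
    turns = turns_per_min * length_min
    return tokens_per_turn * turns * (turns + 1) / 2


def output_tokens(wpm, tokens_per_word, agent_share, length_min):
    """Linear in call length: only the agent's share of speech is generated."""
    return wpm * tokens_per_word * agent_share * length_min


inp = round_half_up(
    input_tokens(WORDS_PER_MIN, TOKENS_PER_WORD, TURNS_PER_MIN, CALL_LENGTH_MIN)
)
out = round_half_up(
    output_tokens(WORDS_PER_MIN, TOKENS_PER_WORD, AGENT_SPEECH_SHARE, CALL_LENGTH_MIN)
)
print(f"Input {inp:,}  Output {out:,}  Total {inp + out:,}")
# Input 10,238  Output 488  Total 10,726
```

Under these assumptions a 5-minute call has 20 turns, so each turn's 48.75 tokens are replayed 210 times in total. The total here is the sum of the two rounded counts, which is also an assumption about how the panel rounds.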
05 Latency
Where the milliseconds go
Voice-to-voice latency is the only number your caller feels. Tune each stage to see what moves the total and what your stack needs to stay under 500ms.
Voice-to-voice latency
06 Stage timings
Per-stage breakdown
Input path
AI processing
Output path
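A latency budget across these three paths can be sketched as a simple sum. The stage names below follow the calculator's total latency formula; the grouping of stages into paths and every millisecond value are hypothetical placeholders, not measurements. The rating bands are the ones given in the FAQ (under 200 ms, 200 to 500 ms, over 500 ms).

```python
# Hypothetical per-stage timings in milliseconds. Stage names come from the
# calculator's latency formula; the values and the grouping are assumptions.
INPUT_PATH = {
    "mic": 10, "opus encode": 5, "network": 20,
    "packet": 10, "jitter": 20, "opus decode": 2,
}
AI_PROCESSING = {
    "transcription": 150, "LLM": 250,
    "sentence aggregation": 30, "TTS": 100,
}
OUTPUT_PATH = {
    "opus encode": 5, "packet": 10, "network": 20,
    "jitter": 20, "opus decode": 2, "speaker": 10,
}


def total_latency_ms(*paths: dict) -> int:
    """Voice-to-voice latency is the sum of every stage on every path."""
    return sum(sum(path.values()) for path in paths)


def rating(total_ms: float) -> str:
    """Classify a total using the bands quoted in the FAQ."""
    if total_ms < 200:
        return "natural"
    if total_ms <= 500:
        return "acceptable for transactional calls"
    return "slow"


total = total_latency_ms(INPUT_PATH, AI_PROCESSING, OUTPUT_PATH)
for name, path in [("Input path", INPUT_PATH),
                   ("AI processing", AI_PROCESSING),
                   ("Output path", OUTPUT_PATH)]:
    print(f"{name}: {sum(path.values())} ms")
print(f"Total: {total} ms ({rating(total)})")
```

With these placeholder values, AI processing dominates the total, so the transport stages would be a poor first target for tuning.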
07 Methodology
How the numbers are built
Every line above is auditable. Below is the math, in the order the calculator runs it.
A typical voice agent costs $0.05 to $0.30 per minute, driven mainly by three lines: speech-to-text (STT), the language model (LLM), and text-to-speech (TTS). The LLM is usually the largest line because input tokens grow with conversation history. Hosting is a fourth line: a small per-minute charge that drops as concurrency per vCPU rises.
Speech-to-text
Provider rate times minutes of audio processed in the call.
STT cost = call length (min) × rate per minute
Text-to-speech
Characters the agent generates times the per-character rate.
TTS cost = words × characters/word × agent speech share × rate per character
LLM input
Conversation history grows each turn, so input tokens scale quadratically with call length.
Input tokens = (words/min × tokens/word ÷ turns/min) × (turns/min × length) × (turns/min × length + 1) ÷ 2
LLM output
Output grows linearly with how much the agent speaks.
Output tokens = words/min × tokens/word × agent speech share × call length
LLM cost
Input and output tokens billed at their respective rates.
LLM cost = input tokens × input rate + output tokens × output rate
Hosting
vCPU minute rate divided by the agents you can run on each vCPU.
Hosting cost = (vCPU rate × call length) ÷ agents per vCPU
Cost per call
Everything a single call burns across all four lines.
Per call = STT + LLM + TTS + hosting
Monthly cost
Per-call cost scaled by monthly volume, plus any fixed hosting.
Monthly = per call × monthly calls + fixed hosting
08 Latency
Total voice-to-voice latency
End-to-end delay from the caller speaking to the agent voice arriving back. Every stage in the latency panel adds to this total.
Total = mic + opus encode + network + packet + jitter + opus decode + transcription + LLM + sentence aggregation + TTS + opus encode + packet + network + jitter + opus decode + speaker
How it works
The calculator takes published per-minute (STT), per-character (TTS), and per-token (LLM) rates from major providers, then scales them by call volume, conversation length, words per minute, and turn rate. Input tokens use a quadratic growth model that mirrors how conversation history accumulates each turn. Hosting cost is the per-vCPU-minute rate divided by the number of concurrent agents per vCPU.
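The whole pipeline can be sketched end to end in a few lines. This is a minimal illustration of the methodology formulas, not the calculator's actual code. Every rate below is a hypothetical placeholder rather than a provider price, the call-shape defaults are assumptions, and "words" in the TTS formula is interpreted here as words per minute times call length.

```python
from dataclasses import dataclass


@dataclass
class CallShape:
    # Assumed call shape; edit to match your own traffic.
    length_min: float = 5.0
    words_per_min: float = 150.0
    tokens_per_word: float = 1.3
    chars_per_word: float = 5.0
    turns_per_min: float = 4.0
    agent_speech_share: float = 0.5


@dataclass
class Rates:
    # Hypothetical placeholder rates in USD, NOT real provider pricing.
    stt_per_min: float = 0.004
    tts_per_char: float = 0.00002
    llm_input_per_token: float = 0.15e-6
    llm_output_per_token: float = 0.60e-6
    vcpu_per_min: float = 0.001
    agents_per_vcpu: float = 8.0


def cost_per_call(shape: CallShape, rates: Rates) -> dict:
    """Apply the four cost lines in the order the methodology lists them."""
    words = shape.words_per_min * shape.length_min  # assumed meaning of "words"
    turns = shape.turns_per_min * shape.length_min
    tokens_per_turn = shape.words_per_min * shape.tokens_per_word / shape.turns_per_min

    stt = shape.length_min * rates.stt_per_min
    tts = words * shape.chars_per_word * shape.agent_speech_share * rates.tts_per_char
    input_tokens = tokens_per_turn * turns * (turns + 1) / 2  # quadratic in turns
    output_tokens = (shape.words_per_min * shape.tokens_per_word
                     * shape.agent_speech_share * shape.length_min)
    llm = (input_tokens * rates.llm_input_per_token
           + output_tokens * rates.llm_output_per_token)
    hosting = rates.vcpu_per_min * shape.length_min / rates.agents_per_vcpu

    return {"stt": stt, "tts": tts, "llm": llm, "hosting": hosting,
            "total": stt + tts + llm + hosting}


def monthly_cost(per_call: float, monthly_calls: int, fixed_hosting: float = 0.0) -> float:
    """Per-call cost scaled by volume, plus any fixed hosting."""
    return per_call * monthly_calls + fixed_hosting


costs = cost_per_call(CallShape(), Rates())
for line, value in costs.items():
    print(f"{line:>8}: ${value:.6f}")
print(f" monthly: ${monthly_cost(costs['total'], 5000):.2f}")
```

Because the rates are placeholders, the split between lines in this output says nothing about real provider economics; swap in verified rates before reading anything into it.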
Need a real number for your stack?
A 30-minute call with a RaftLabs founder turns this estimate into a build plan: provider shortlist, latency budget, and a fixed-price scope.
Voice AI cost questions
Sizing a voice agent rollout, validated against real production deployments.
- The calculator uses published per-minute, per-character, and per-token rates from each provider and applies industry-standard call shape assumptions. Real bills shift with volume discounts, custom contracts, retries, silence trimming, and barge-in handling. Treat the output as a planning range, not a quote.
- Four lines: speech-to-text per minute, LLM tokens (input plus output), text-to-speech per character, and hosting per vCPU minute. Conversation length is the biggest multiplier because input tokens grow with history. Concurrency per vCPU controls hosting cost per call.
- Under 200ms voice-to-voice feels natural. 200 to 500ms is acceptable for transactional calls. Over 500ms feels slow and callers start talking over the agent. The latency panel above shows where time is spent across input, AI processing, and output paths.
- Not yet. The calculator runs in your browser and resets on reload. Screenshot the breakdown or copy the inputs into your own sheet for now.
- Rates are hardcoded against published pricing at the time of last update. Always verify the live rate with each provider before committing budget. Every input is editable, so you can override any rate that has moved.
- Input tokens are everything the model reads: system prompt, tools, conversation history, and the new user turn. Output tokens are what the model writes back. Input grows quadratically with conversation length because history is replayed each turn; output grows linearly.
- Lead with the constraint: cost, latency, voice quality, or compliance. For cost, pair Deepgram Nova or Whisper for STT with Flash V2.5 or Aura-2 for TTS and GPT-4o Mini or Llama for the LLM. For latency, pick the fastest STT and TTS even at a higher rate. For voice quality, ElevenLabs Multilingual carries the most expressive output.
- Concurrent voice sessions a single vCPU can serve before performance degrades. Higher concurrency cuts hosting cost per call but raises latency variance under load. Most production stacks run 4 to 16 agents per vCPU depending on STT and barge-in load.
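The effect of concurrency on hosting cost follows directly from the hosting formula in the methodology. The sketch below sweeps the 4 to 16 agents per vCPU range mentioned in the last answer; the vCPU rate and the call length are hypothetical placeholders, and the function name is illustrative.

```python
VCPU_RATE_PER_MIN = 0.001  # hypothetical USD per vCPU-minute, not a real price
CALL_LENGTH_MIN = 5        # assumed call length in minutes


def hosting_cost_per_call(vcpu_rate_per_min: float,
                          call_length_min: float,
                          agents_per_vcpu: float) -> float:
    """Hosting cost = (vCPU rate x call length) / agents per vCPU."""
    return vcpu_rate_per_min * call_length_min / agents_per_vcpu


for agents in (4, 8, 16):
    cost = hosting_cost_per_call(VCPU_RATE_PER_MIN, CALL_LENGTH_MIN, agents)
    print(f"{agents:>2} agents per vCPU: ${cost:.6f} per call")
```

Each doubling of concurrency halves the hosting line. The formula does not capture the latency variance that higher concurrency introduces under load, so that trade-off has to be measured separately.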