How accurate are these voice AI cost estimates?

The calculator uses published per-minute, per-character, and per-token rates from each provider and applies industry-standard call shape assumptions. Real bills shift with volume discounts, custom contracts, retries, silence trimming, and barge-in handling. Treat the output as a planning range, not a quote.

What drives the total cost of a voice agent?

Four lines: speech-to-text per minute, LLM tokens (input plus output), text-to-speech per character, and hosting per vCPU minute. Conversation length is the biggest multiplier because input tokens grow with history. Concurrency per vCPU controls hosting cost per call.

What latency target should a voice agent hit?

Under 200ms voice-to-voice feels natural. 200 to 500ms is acceptable for transactional calls. Over 500ms feels slow and callers start talking over the agent. The latency panel below shows where time is spent across input, AI processing, and output paths.

Can I save or export the calculation?

Not yet. The calculator runs in your browser and resets on reload. Screenshot the breakdown or copy the inputs into your own sheet for now.

How often is provider pricing updated?

Rates are hardcoded against published pricing at the time of last update. Always verify the live rate with each provider before committing budget. Every input is editable, so you can override any rate that has moved.

What is the difference between input and output tokens?

Input tokens are everything the model reads: system prompt, tools, conversation history, and the new user turn. Output tokens are what the model writes back. Input grows quadratically with conversation length because history is replayed each turn; output grows linearly.

How do I pick the best provider mix?

Lead with the constraint: cost, latency, voice quality, or compliance. For cost, pair Deepgram Nova or Whisper for STT with Flash V2.5 or Aura-2 for TTS and GPT-4o Mini or Llama for the LLM. For latency, pick the fastest STT and TTS even at a higher rate. For voice quality, ElevenLabs Multilingual carries the most expressive output.

What does agents per vCPU mean?

The number of concurrent voice sessions a single vCPU can serve before performance degrades. Higher concurrency cuts hosting cost per call but raises latency variance under load. Most production stacks run 4 to 16 agents per vCPU depending on STT and barge-in load.

Price a voice agent before you build it.

Compare STT, TTS, and LLM providers at your real call volume. See cost per minute, cost per call, and where the money goes.

No signup
Edit every rate
Includes hosting and latency

01 Providers

Pick your stack

LLM

Speech-to-text (STT)

Text-to-speech (TTS)

02 Volume and call shape

Tell us what a call looks like

Monthly calls

Average call length (minutes)

Concurrent agents per vCPU

Words per minute (speaking pace)

Turns per minute

Tokens per word

Characters per word

Share of speech produced by the agent: 50%

Caller speaks more / Agent speaks more

03 Rate overrides

Override any rate

Defaults load from the provider you picked. Edit any line to model a custom contract or volume discount.

STT cost per minute ($)

TTS cost per character ($)

LLM input cost per token ($)

LLM output cost per token ($)

vCPU cost per minute ($)

Fixed monthly hosting ($)

04 Result

Your cost breakdown

Per minute

$0.0166

Per call

$0.083

Per month

$415

Speech-to-text: 36.1%
LLM: 36.7%
Text-to-speech: 27.1%
Hosting: 0.1%

Speech-to-text: $0.0300
LLM: $0.0305
Text-to-speech: $0.0225
Hosting: $0.0001
Per call: $0.0831

Tokens per call

Input

10,238

Output

488

Total

10,726

Input tokens grow with every turn because conversation history is replayed each time.

Planning estimate. Real bills shift with retries, volume discounts, and silence trimming.

05 Latency

Where the milliseconds go

Voice-to-voice latency is the only number your caller feels. Tune each stage to see what moves the total and what your stack needs to stay under 500ms.

Voice-to-voice latency

825ms (Sluggish)

Scale: 0ms to 1000ms

Input path: 135ms (16%)

AI processing: 555ms (67%)

Output path: 135ms (16%)

Latency guide: Natural ≤ 200ms, Acceptable 201–500ms, Sluggish > 500ms

06 Stage timings

Per-stage breakdown

Input path

Mic input

Opus encoding

Network transit

Packet handling

Jitter buffer

Opus decoding

AI processing

Transcription (STT)

LLM inference

Sentence aggregation

Text-to-speech (TTS)

Output path

Opus encoding

Packet handling

Network transit

Jitter buffer

Opus decoding

Speaker output

07 Methodology

How the numbers are built

Every line above is auditable. Below is the math, in the order the calculator runs it.

A typical voice agent costs $0.05 to $0.30 per minute, driven by three lines: speech-to-text (STT), the language model (LLM), and text-to-speech (TTS). The LLM is usually the largest line because input tokens grow with conversation history. Hosting adds a small per-minute charge that drops as concurrency per vCPU rises.

Speech-to-text

Provider rate times minutes of audio processed in the call.

STT cost = call length (min) × rate per minute

Text-to-speech

Characters the agent generates times the per-character rate.

TTS cost = words/min × call length × characters/word × agent speech share × rate per character
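A minimal Python sketch of the TTS line, assuming the word count is words per minute times call length. The function name and the sample inputs (150 words per minute, 6 characters per word, $0.00001 per character) are illustrative assumptions, not the calculator's confirmed defaults.

```python
def tts_cost(words_per_min: float, call_minutes: float, chars_per_word: float,
             agent_share: float, rate_per_char: float) -> float:
    """Cost of synthesising only the agent's share of the words in a call."""
    total_words = words_per_min * call_minutes
    # Only the agent's speech is sent to the TTS provider.
    agent_chars = total_words * chars_per_word * agent_share
    return agent_chars * rate_per_char


# Hypothetical inputs: a 5-minute call where the agent speaks half the time.
example_tts = tts_cost(150, 5, 6, 0.5, 0.00001)
print(f"${example_tts:.4f}")  # prints $0.0225
```

With these assumed inputs the result happens to match the $0.0225 text-to-speech line in the breakdown, but other combinations of characters per word and rate would give the same figure.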

LLM input

Conversation history grows each turn, so input tokens scale quadratically with call length.

Input tokens = (words/min × tokens/word ÷ turns/min) × (turns/min × length) × (turns/min × length + 1) ÷ 2
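To see where the quadratic term comes from, the sketch below compares the closed form above with a turn-by-turn replay in which every turn re-reads the full history. The function names and the sample inputs (150 words per minute, 1.3 tokens per word, 4 turns per minute, 5 minutes) are assumptions chosen for illustration; system prompt and tool tokens are ignored, as in the formula.

```python
def input_tokens_closed_form(words_per_min: float, tokens_per_word: float,
                             turns_per_min: float, call_minutes: float) -> float:
    """Closed form: tokens per turn times the triangular number of turns."""
    tokens_per_turn = words_per_min * tokens_per_word / turns_per_min
    turns = turns_per_min * call_minutes
    return tokens_per_turn * turns * (turns + 1) / 2


def input_tokens_by_replay(words_per_min: float, tokens_per_word: float,
                           turns_per_min: float, call_minutes: float) -> float:
    """Same quantity computed by replaying the growing history on every turn."""
    tokens_per_turn = words_per_min * tokens_per_word / turns_per_min
    turns = round(turns_per_min * call_minutes)
    history = 0.0
    total = 0.0
    for _ in range(turns):
        history += tokens_per_turn  # the new turn joins the history
        total += history            # the model reads the whole history again
    return total


closed = input_tokens_closed_form(150, 1.3, 4, 5)
replayed = input_tokens_by_replay(150, 1.3, 4, 5)
doubled = input_tokens_closed_form(150, 1.3, 4, 10)
print(f"{closed:,.1f} {replayed:,.1f} ratio when length doubles: {doubled / closed:.2f}")
```

Under these assumptions both approaches give roughly 10,237.5 input tokens, in line with the 10,238 shown in the tokens per call panel, and doubling the call length multiplies input tokens by about 3.9 rather than 2.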

LLM output

Output grows linearly with how much the agent speaks.

Output tokens = words/min × tokens/word × agent speech share × call length

LLM cost

Input and output tokens billed at their respective rates.

LLM cost = input tokens × input rate + output tokens × output rate
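A short sketch of the two LLM formulas above. The per-token rates ($2.50 and $10.00 per million tokens) are hypothetical placeholders rather than any provider's confirmed price, and the token counts are taken from the tokens per call panel.

```python
def output_tokens(words_per_min: float, tokens_per_word: float,
                  agent_share: float, call_minutes: float) -> float:
    """Output grows linearly: only the agent's share of speech is generated."""
    return words_per_min * tokens_per_word * agent_share * call_minutes


def llm_cost(in_tokens: float, out_tokens: float,
             input_rate: float, output_rate: float) -> float:
    """Input and output tokens are billed at separate per-token rates."""
    return in_tokens * input_rate + out_tokens * output_rate


# Hypothetical rates, expressed per token.
INPUT_RATE = 2.50 / 1_000_000
OUTPUT_RATE = 10.00 / 1_000_000

out = output_tokens(150, 1.3, 0.5, 5)             # about 487.5 tokens
example_llm = llm_cost(10_238, 488, INPUT_RATE, OUTPUT_RATE)
print(f"${example_llm:.4f}")  # prints $0.0305
```

Even though output tokens cost four times as much per token in this example, input still accounts for most of the LLM line because there are about twenty times more input tokens.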

Hosting

vCPU minute rate divided by the agents you can run on each vCPU.

Hosting cost = (vCPU rate × call length) ÷ agents per vCPU
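The effect of concurrency on the hosting line can be sketched as follows. The vCPU rate of $0.0002 per minute is a hypothetical value, and the function name is illustrative.

```python
def hosting_cost(vcpu_rate_per_min: float, call_minutes: float,
                 agents_per_vcpu: int) -> float:
    """Per-call hosting: the vCPU minutes a call occupies, shared across agents."""
    if agents_per_vcpu <= 0:
        raise ValueError("agents_per_vcpu must be a positive integer")
    return vcpu_rate_per_min * call_minutes / agents_per_vcpu


# Hypothetical vCPU rate; the sweep covers the 4 to 16 range quoted for production stacks.
for agents in (4, 8, 16):
    print(agents, f"${hosting_cost(0.0002, 5, agents):.6f}")
```

Going from 4 to 16 agents per vCPU cuts the per-call hosting cost to a quarter, which is why this line tends to stay small next to STT, LLM, and TTS.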

Cost per call

Everything a single call burns across all four lines.

Per call = STT + LLM + TTS + hosting

Monthly cost

Per-call cost scaled by monthly volume, plus any fixed hosting.

Monthly = per call × monthly calls + fixed hosting
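Putting the formulas together, here is a rough end-to-end sketch in Python. Every default in `CallInputs` is an assumption back-solved to land near the figures shown in the result panel; none of them is confirmed as the calculator's actual default, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CallInputs:
    # Assumed values for illustration only, not confirmed calculator defaults.
    monthly_calls: int = 5000
    call_minutes: float = 5.0
    words_per_min: float = 150.0
    turns_per_min: float = 4.0
    tokens_per_word: float = 1.3
    chars_per_word: float = 6.0
    agent_share: float = 0.5
    agents_per_vcpu: int = 10
    stt_rate_per_min: float = 0.006
    tts_rate_per_char: float = 0.00001
    llm_input_rate: float = 2.50 / 1_000_000
    llm_output_rate: float = 10.00 / 1_000_000
    vcpu_rate_per_min: float = 0.0002
    fixed_monthly_hosting: float = 0.0


def cost_breakdown(c: CallInputs) -> dict:
    """Apply the methodology formulas in the order they are listed."""
    stt = c.call_minutes * c.stt_rate_per_min

    total_words = c.words_per_min * c.call_minutes
    tts = total_words * c.chars_per_word * c.agent_share * c.tts_rate_per_char

    tokens_per_turn = c.words_per_min * c.tokens_per_word / c.turns_per_min
    turns = c.turns_per_min * c.call_minutes
    input_tokens = tokens_per_turn * turns * (turns + 1) / 2
    output_tokens = c.words_per_min * c.tokens_per_word * c.agent_share * c.call_minutes
    llm = input_tokens * c.llm_input_rate + output_tokens * c.llm_output_rate

    hosting = c.vcpu_rate_per_min * c.call_minutes / c.agents_per_vcpu

    per_call = stt + llm + tts + hosting
    monthly = per_call * c.monthly_calls + c.fixed_monthly_hosting
    return {
        "stt": stt, "llm": llm, "tts": tts, "hosting": hosting,
        "per_call": per_call,
        "per_minute": per_call / c.call_minutes,
        "monthly": monthly,
    }


breakdown = cost_breakdown(CallInputs())
for line, value in breakdown.items():
    print(f"{line:>10}: ${value:,.4f}")
```

With the assumed inputs this gives about $0.0831 per call and about $415 per month. Overriding a single field, for example `CallInputs(agents_per_vcpu=4)`, shows how one rate or call-shape change moves the total.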

08 Latency

Total voice-to-voice latency

End-to-end delay from the caller speaking to the agent voice arriving back. Every stage in the latency panel adds to this total.

Total = mic + opus encode + network + packet + jitter + opus decode + transcription + LLM + sentence aggregation + TTS + opus encode + packet + network + jitter + opus decode + speaker
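A small sketch of the latency sum, combined with the thresholds from the latency guide. The individual stage timings below are placeholders chosen only so that each path adds up to the totals shown in the latency panel (135ms, 555ms, 135ms); they are not measured values.

```python
# (path, stage, milliseconds); all stage timings are hypothetical placeholders.
STAGES = [
    ("input", "mic input", 40),
    ("input", "opus encoding", 20),
    ("input", "network transit", 10),
    ("input", "packet handling", 5),
    ("input", "jitter buffer", 40),
    ("input", "opus decoding", 20),
    ("ai", "transcription (STT)", 150),
    ("ai", "LLM inference", 300),
    ("ai", "sentence aggregation", 25),
    ("ai", "text-to-speech (TTS)", 80),
    ("output", "opus encoding", 20),
    ("output", "packet handling", 5),
    ("output", "network transit", 10),
    ("output", "jitter buffer", 40),
    ("output", "opus decoding", 20),
    ("output", "speaker output", 40),
]


def path_totals(stages):
    """Sum stage timings per path (input, ai, output)."""
    totals = {}
    for path, _stage, ms in stages:
        totals[path] = totals.get(path, 0) + ms
    return totals


def classify(total_ms: int) -> str:
    """Bands from the latency guide: <=200 natural, 201-500 acceptable, >500 sluggish."""
    if total_ms <= 200:
        return "Natural"
    if total_ms <= 500:
        return "Acceptable"
    return "Sluggish"


totals = path_totals(STAGES)
total_ms = sum(totals.values())
print(totals, total_ms, classify(total_ms))
```

In this example AI processing is about two thirds of the total, so trimming STT, LLM, or TTS time moves the result far more than tuning the network or codec stages.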

How it works

The calculator takes published per-minute (STT), per-character (TTS), and per-token (LLM) rates from major providers, as of the last pricing update, then scales them against call volume, conversation length, words per minute, and turn rate. Input tokens use a quadratic growth model that mirrors how conversation history accumulates each turn. Hosting cost is the per-vCPU-minute rate divided by concurrent agents per vCPU.

Related tools

Planning

AI Project Cost Estimator

Stop guessing. Know your AI budget in 2 minutes.

Strategy

AI Use Case Finder

Find the AI project that pays for itself first.

Planning

Build vs Buy Calculator

The real cost of building in-house (most teams miss 40%).

Need a real number for your stack?

A 30-minute call with a RaftLabs founder turns this estimate into a build plan: provider shortlist, latency budget, and a fixed-price scope.

Voice AI cost questions

Sizing a voice agent rollout, validated against real production deployments.
