I’m evaluating the economics of a Hebrew voice agent and want to understand how billing works when using ElevenLabs v3.
If I have a 2-minute phone call where the AI only speaks for 30 seconds total and spends the rest of the time listening to the caller, how is TTS billed?
Specifically:
Is ElevenLabs TTS charged based on the total call duration, or only on the number of seconds of audio generated by the AI?
If the AI speaks for 30 seconds during a 2-minute call, would I be billed for 30 seconds of ElevenLabs TTS or the full 2 minutes?
Does the same billing model apply when using a cloned ElevenLabs voice?
Is there a way to see a breakdown of costs by Voice Infrastructure, LLM usage, and TTS usage for a specific call?
For a structured intake flow where the AI mostly listens and only asks a few questions, what is the best way to estimate the true per-call cost?
Hey @beckermanyaakov, short answer: all voice-side components bill on total connected call duration, not on the seconds of TTS audio generated. A 2-minute call where the agent speaks for 30 seconds is billed as 2 minutes on Voice Infrastructure, TTS, and LLM. This is because STT and the voice pipeline stay active and listening for the whole call, even during silence.
Specifically:
TTS (ElevenLabs v3): billed on full call duration, so the 30 seconds of speech in your example is billed as 2 minutes. The same rate applies to cloned ElevenLabs voices; there is no premium over stock ElevenLabs voices.
Per-call cost breakdown: open any call in Dashboard → Call History — the detail panel shows the cost split across Voice Infrastructure, TTS, LLM, telephony, and KB.
Estimating per-call cost: use the calculator on the Retell AI pricing page ("AI Phone Agent Pricing") and pick your TTS / LLM / telephony components to get a combined $/min, then multiply by your expected average call duration. For an intake flow where the agent mostly listens, the dominant variable is total call duration, not speak ratio.
Only exception worth flagging: on a transferred call, the AI voice agent fee stops at the moment of transfer; only telephony continues for the remainder of the bridged leg.
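If it helps to sanity-check the numbers, here is a minimal Python sketch of the estimate described above: per-minute rates multiplied by connected duration, with the agent-side fees stopping at the transfer point. Everything in it is an assumption for illustration: the `RatesPerMinute` and `estimate_call_cost` names are hypothetical, the rates are placeholders rather than real prices, KB is left out, and it prorates by the second, which may not match how the invoice actually rounds.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RatesPerMinute:
    """Hypothetical USD-per-minute rates; copy real values from the pricing calculator."""
    voice_infra: float
    tts: float
    llm: float
    telephony: float


def estimate_call_cost(
    rates: RatesPerMinute,
    total_seconds: float,
    transfer_at_seconds: Optional[float] = None,
) -> Dict[str, float]:
    """Estimate the cost of one call from its connected duration.

    Agent-side components (voice infra, TTS, LLM) are charged on connected
    time, not on how long the agent actually speaks. If the call is
    transferred, those components stop at the transfer point and only
    telephony keeps running. Per-second proration is an assumption.
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")

    agent_seconds = total_seconds
    if transfer_at_seconds is not None:
        # Clamp the transfer point into the range [0, total_seconds].
        agent_seconds = min(max(transfer_at_seconds, 0.0), total_seconds)

    agent_minutes = agent_seconds / 60.0
    total_minutes = total_seconds / 60.0

    breakdown = {
        "voice_infra": rates.voice_infra * agent_minutes,
        "tts": rates.tts * agent_minutes,
        "llm": rates.llm * agent_minutes,
        "telephony": rates.telephony * total_minutes,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Placeholder rates only, NOT real prices.
PLACEHOLDER_RATES = RatesPerMinute(voice_infra=0.05, tts=0.04, llm=0.03, telephony=0.01)

# A 2-minute intake call: the 30 seconds of agent speech never enters the formula.
intake_call = estimate_call_cost(PLACEHOLDER_RATES, total_seconds=120)

# The same call transferred to a human after 60 seconds.
transferred_call = estimate_call_cost(
    PLACEHOLDER_RATES, total_seconds=120, transfer_at_seconds=60
)
```

Note that speak time is not a parameter at all, which is the point: shortening the call (or transferring earlier) moves the estimate, while making the agent talk less does not. Compare the result against the per-call breakdown in Call History to calibrate the placeholder rates.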