Based on the available documentation, here’s how latency works in Retell AI’s architecture:
Where Latency Happens in Voice Automation:(1)
- Speech Recognition Processing (ASR) - Time taken to transcribe spoken words into text(1)
- Response Generation (NLG or LLM) - Time to understand and craft an appropriate, contextual reply(1)
- Speech Synthesis (TTS) - Time to turn the generated text back into spoken words(1)
- Network Transmission - Delays caused by sending audio and data between systems, especially in cloud setups(1)
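The stages above are largely sequential, so end-to-end turn latency is roughly their sum. The sketch below illustrates that budget; the per-stage timings are hypothetical examples for illustration, not Retell AI measurements:

```python
# Illustrative latency budget for one conversational turn.
# All stage timings below are hypothetical, not measured Retell AI values.
STAGE_LATENCY_MS = {
    "asr": 300,      # speech recognition: transcribe audio to text
    "llm": 700,      # response generation: produce the reply text
    "tts": 250,      # speech synthesis: turn reply text into audio
    "network": 150,  # transmission between components / cloud services
}

def total_turn_latency(stages: dict[str, int]) -> int:
    """Sequential stages mean end-to-end latency is approximately their sum."""
    return sum(stages.values())

if __name__ == "__main__":
    print(f"Estimated turn latency: {total_turn_latency(STAGE_LATENCY_MS)} ms")
```

Because the stages add up, shaving a little from each one usually matters more than optimizing a single stage in isolation.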
Regarding TTS Provider Traffic:
The documentation indicates that Retell AI uses a distributed architecture with edge deployment to minimize latency.(2) The platform processes speech through multiple stages, and yes, there is network communication between Retell’s servers and TTS providers.(1)
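Since network hops to TTS providers are part of the path, it can help to rule out raw network latency from your own environment. A minimal sketch, using TCP connect time as a rough proxy (the endpoint shown is a placeholder, not a real Retell or TTS provider host):

```python
import socket
import time

def tcp_connect_latency_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Measure the TCP handshake time to a host as a rough network-latency proxy."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; we only care about setup time
    return (time.perf_counter() - start) * 1000

# Hypothetical host; substitute your actual TTS provider's API endpoint.
# print(f"{tcp_connect_latency_ms('api.example-tts.com'):.1f} ms")
```

If connect times from your region are consistently high, the bottleneck may be network distance rather than any per-provider processing, which would explain why switching TTS providers changed little.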
Additional Troubleshooting Steps:
Since you’ve already tried different TTS providers with no significant change, consider:
- Check if you’re using features marked as adding latency in your configuration(3)
- Verify your LLM response time is in the normal range (500ms-900ms)(3)
- Review your denoising mode setting - switching between “noise-cancellation” and “noise-and-background-speech-cancellation” can impact performance(4)
- Check your STT mode - you can choose between “fast” and “accurate” modes(4)
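The checklist above can be captured in a small configuration sketch. The field names (`stt_mode`, `denoising_mode`) and the 500-900ms normal range follow the documentation cited in this answer, but verify the exact schema against Retell AI's API reference before use:

```python
# Hypothetical agent-configuration fragment illustrating the settings above.
# Field names and values mirror the cited documentation; confirm against
# Retell AI's API reference before relying on them.
agent_config = {
    # "fast" trades some accuracy for lower latency; "accurate" does the opposite.
    "stt_mode": "fast",
    # The heavier "noise-and-background-speech-cancellation" mode can add latency.
    "denoising_mode": "noise-cancellation",
}

def llm_latency_in_normal_range(latency_ms: int) -> bool:
    """Check an observed LLM response time against the 500-900 ms normal range."""
    return 500 <= latency_ms <= 900
```

Measuring your LLM response time against that range first helps isolate whether the delay comes from response generation or from the speech pipeline around it.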
For detailed investigation of your specific issue, contact support with your call IDs through the Discord community or visit Contact Sales.(3)