Issues with Portuguese (PT-PT) Voice Agent: Latency, Audio Recognition, and Email Capture

Hello,

I’m currently experiencing several issues with a voice agent built using Conversation Flow, and I would appreciate your help in understanding whether these are configuration issues or platform limitations.

Here are the main problems we are facing:

  1. Voice latency / delayed responses
    The agent is configured in European Portuguese (PT-PT). In many calls, the agent takes some time to respond.

  2. Poor speech recognition, especially with speakerphone
    When the caller is using speakerphone, the agent frequently fails to understand what is being said. In many cases, the user needs to speak very loudly or almost shout to be recognized.
    This sometimes also happens without speakerphone, even when speaking at a normal volume and in a quiet environment.

  3. Major difficulties capturing email addresses
    Email capture by voice is very unreliable:

    • The agent struggles to understand emails correctly

    • It often skips important parts like “@” or “.” (dots) when confirming the email

    • Sometimes it spells the email incorrectly

    • Even though the conversation is in Portuguese, the agent occasionally switches to English words or responses, for example reading “@” as “at”

    • In some cases, parts of the email are simply ignored or skipped

These issues make it very difficult to collect accurate information (especially email addresses) and significantly impact call quality.

Could you please advise on:

  • Recommended settings for PT-PT voice and ASR

  • Whether there are known limitations or issues with Portuguese (Portugal)

  • Any upcoming improvements related to latency or audio recognition

Thank you in advance for your support.

Best regards,
Sérgio Silva.


Hello Sérgio,

Thank you for reaching out with these detailed observations about your European Portuguese (PT-PT) voice agent.

Known Issues with European Portuguese

Your experience with PT-PT aligns with documented challenges. Users have reported that European Portuguese performs poorly, with voices consistently sounding like Brazilian Portuguese rather than European Portuguese(1). The only decent results reported so far have come from Realtime LLM voices, which mostly speak proper PT-EU but tend to switch to Brazilian Portuguese at the end of a conversation when saying “goodbye”(1).

To improve PT-EU performance, it’s recommended to use Realtime LLM voices and add a prompt like “Respond in European Portuguese” at the start and end to avoid accent switching(1). For best results, clone a native PT-EU speaker’s voice(1).
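As a minimal sketch of the "start and end" idea, the snippet below wraps a base prompt with the same language instruction on both sides. The function name, the constant, and the sample base prompt are hypothetical and only illustrate the pattern; this is not a Retell AI API, and the resulting text would still need to be pasted into whatever prompt field your agent uses.

```python
# Hypothetical helper: repeats the language instruction before and after
# the base prompt so it is present at both the start and the end.
LANGUAGE_INSTRUCTION = "Respond in European Portuguese."


def wrap_prompt(base_prompt: str, instruction: str = LANGUAGE_INSTRUCTION) -> str:
    """Return the base prompt with the instruction at the start and the end."""
    body = base_prompt.strip()
    return f"{instruction}\n\n{body}\n\n{instruction}"


# Illustrative base prompt (an assumption, not taken from the original agent).
prompt = wrap_prompt("You are a booking assistant. Collect the caller's name and email.")
print(prompt)
```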

Speech Recognition and ASR Challenges

Your issues with speech recognition, especially regarding email addresses and phone numbers, reflect broader ASR challenges. Retell AI has acknowledged working on improvements for phone number and email address recognition(2). Users have reported getting names wrong, phone numbers wrong, and words dropping out during transcription(2).
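Until recognition improves, one possible workaround is to post-process the transcript yourself. The sketch below assumes the caller dictates the address word by word ("arroba" for "@", "ponto" for "."), rebuilds and validates the address, and produces a Portuguese read-back string so the confirmation does not skip the symbols. The token table, function names, and the simple validation pattern are assumptions for illustration, not part of any Retell AI feature.

```python
import re
from typing import Optional

# Assumed mapping from spoken words to email symbols. English variants are
# included because the transcript may contain them.
SPOKEN_TOKENS = {
    "arroba": "@",
    "at": "@",
    "ponto": ".",
    "dot": ".",
    "hífen": "-",
    "hifen": "-",
    "underscore": "_",
}

# Deliberately simple shape check; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[a-z0-9._%+-]+@[a-z0-9-]+(\.[a-z0-9-]+)+$")


def normalize_spoken_email(transcript: str) -> Optional[str]:
    """Join dictated words into an address; return None if it looks invalid."""
    words = transcript.lower().split()
    candidate = "".join(SPOKEN_TOKENS.get(word, word) for word in words)
    return candidate if EMAIL_RE.match(candidate) else None


def speakable_email(email: str) -> str:
    """Spell the symbols out in Portuguese for the confirmation turn."""
    return email.replace("@", " arroba ").replace(".", " ponto ")


email = normalize_spoken_email("sergio ponto silva arroba exemplo ponto pt")
print(email)
if email is not None:
    print(speakable_email(email))
```

Returning None when the result does not look like an address gives the flow a clear signal to ask the caller to repeat it instead of storing a partial value.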

ASR quality directly impacts every part of the AI voice agent experience(3). If the transcription is inaccurate, even the most advanced AI systems will misunderstand the user’s intent and deliver poor results(3). Strong ASR delivers faster, more accurate conversations and better intent recognition(3).

Recent ASR Improvements

Retell AI has rolled out a major upgrade to its Automatic Speech Recognition (ASR) engine, bringing sharper transcription, stronger intent detection, and more reliable call outcomes(4). The new ASR now supports 22+ new languages, bringing the total language count to 50+(4).

For European languages in Accurate mode, Word Error Rate has been cut by 7–10 points(4). This results in clearer transcripts even in noisy environments, more reliable intent capture for complex responses, and fewer repair turns(4).
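To check whether such gains show up on your own PT-PT calls, you can measure Word Error Rate against a hand-corrected transcript. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words); the sample sentences are made up for illustration. A reduction "by points" is read here as percentage points, for example from 20% to 12%, which is an interpretation rather than something the announcement spells out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts: the recognizer dropped the word "arroba".
reference = "o meu email é sergio arroba exemplo ponto pt"
hypothesis = "o meu email é sergio exemplo ponto pt"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```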

Latency Considerations

Latency refers to the time delay between a user’s action and the system’s response(5). In AI voice interactions, it’s measured in milliseconds and can make or break the perceived quality of an AI-driven call experience(5). Humans expect near-instantaneous responses, usually within 300-500 milliseconds(5).

Retell AI achieves sub-600ms latency performance, which is crucial for natural conversation flow(6). This low latency ensures that customers experience minimal delays during interactions(6).

Latency can occur at multiple points: Speech Recognition Processing (ASR), Response Generation, Speech Synthesis (TTS), and Network Transmission(5). Optimization strategies include using ultra-fast ASR and TTS engines, deploying AI models closer to the customer’s location, pre-loading likely responses, and optimizing API integrations(5).
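To see which of these stages dominates in your calls, it can help to lay the per-turn timings out as a simple budget. The sketch below adds up the four stages and compares the total with a 600 ms target; the class, field names, and sample timings are hypothetical, and it assumes the stages run one after another, whereas a streaming pipeline may overlap them.

```python
from dataclasses import dataclass


@dataclass
class TurnLatency:
    """Per-turn timings in milliseconds (sample values are illustrative)."""
    asr_ms: float
    response_generation_ms: float
    tts_ms: float
    network_ms: float

    def stages(self) -> dict:
        return {
            "asr": self.asr_ms,
            "response_generation": self.response_generation_ms,
            "tts": self.tts_ms,
            "network": self.network_ms,
        }

    def total_ms(self) -> float:
        # Assumes strictly sequential stages, so the total is a plain sum.
        return sum(self.stages().values())

    def slowest_stage(self) -> str:
        stages = self.stages()
        return max(stages, key=stages.get)

    def over_budget(self, budget_ms: float = 600.0) -> bool:
        return self.total_ms() > budget_ms


turn = TurnLatency(asr_ms=150, response_generation_ms=300, tts_ms=120, network_ms=80)
print(turn.total_ms(), turn.slowest_stage(), turn.over_budget())
```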

Recommendations

Based on the available information:

  1. Use Realtime LLM voices for PT-EU and add explicit prompts to maintain European Portuguese throughout the conversation(1)
  2. Consider voice cloning with a native PT-EU speaker for optimal results(1)
  3. Be aware that email and phone number recognition is an acknowledged area for improvement(2)
  4. Monitor ASR performance as Retell AI continues to enhance transcription accuracy(4)

For additional support with your specific configuration, I recommend:
