Persistent issue - gpt-realtime doesn't detect short phrases like "yes"

I keep having a serious issue with the gpt-realtime model.

Often, but not always, when a user says a short phrase like “yes”, the model doesn’t catch it. This results in repeated turns like “Do you want this?” - “Yes” - “Please tell me if you want this” - “YES” - “Are you there?” - “YES!!” - etc., where the user gets increasingly frustrated.

I’ve tried setting up an identical agent directly with the OpenAI Realtime API, and I cannot reproduce the issue there. Thus, I suspect the issue is somewhere in the Retell implementation of the speech-to-speech gpt-realtime model.

Please help, this is a major problem for my business.


Your issue with gpt-realtime not catching short phrases like “yes” has been reported by other users.(1) A Discord community member described the exact same problem: the Retell agent “almost never understands short phrases like ‘yes’ and ‘yes please’”, while the same agent implemented directly via OpenAI’s API never misses these phrases.(1)

When this was reported to Retell support, they attempted to reproduce the issue but indicated “it’s working fine” and asked what model was being used.(1) The user confirmed they were using GPT Realtime.(1)

Available realtime (speech-to-speech) models in Retell AI:

According to the API documentation, you can select from these speech-to-speech models using the s2s_model parameter:(2)(3)(4)

  • gpt-4o-realtime
  • gpt-4o-mini-realtime
  • gpt-realtime

The realtime APIs are supported for Single and Multi prompt agent types.(5)
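
As a rough illustration of the s2s_model setting above, the sketch below checks a model name against the three values listed and builds a settings fragment. Only the parameter name and the three model names come from the documentation cited above; the helper name build_s2s_settings and the shape of the returned dictionary are assumptions for illustration, not Retell's actual request format.

```python
import json

# Model names as listed in the answer above.
S2S_MODELS = ("gpt-4o-realtime", "gpt-4o-mini-realtime", "gpt-realtime")


def build_s2s_settings(s2s_model: str) -> dict:
    """Return a settings fragment for the chosen speech-to-speech model.

    Hypothetical helper: the surrounding payload structure is an assumption.
    """
    if s2s_model not in S2S_MODELS:
        raise ValueError(
            f"unknown s2s_model {s2s_model!r}; expected one of {S2S_MODELS}"
        )
    return {"s2s_model": s2s_model}


if __name__ == "__main__":
    print(json.dumps(build_s2s_settings("gpt-realtime")))
```

Validating the name up front makes it easier to confirm which exact model an agent is configured with before comparing behavior across variants.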

Potential factors affecting speech detection:

Retell AI’s turn-taking model combines Voice Activity Detection (VAD) with contextual analysis to determine when a user has finished speaking.(6) The system uses VAD to identify speech presence, then applies turn-taking logic to analyze prosodic features like pitch, intonation, and timing.(6)

Retell AI’s turn-taking model is designed to prevent interruptions by recognizing subtle cues like tone shifts, pauses, and sentence patterns.(6) However, this contextual approach means the system waits to confirm that the user has actually finished talking, which could potentially cause issues with very short utterances.(6)
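
To make the short-utterance concern concrete, here is a toy sketch of one common VAD post-processing step: discarding speech runs shorter than a minimum duration. This is not Retell's implementation, and nothing in the sources says Retell applies such a filter; the function detect_segments, the 20 ms frame size, and the thresholds are all assumptions chosen only to show how a brief “yes” could be dropped while a longer sentence passes.

```python
from typing import List, Tuple


def detect_segments(frames: List[bool], min_speech_frames: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech runs that are at least
    min_speech_frames long. Shorter runs are discarded as noise."""
    segments = []
    start = None
    # A trailing False acts as a sentinel so a run at the end is closed.
    for i, is_speech in enumerate(frames + [False]):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_speech_frames:
                segments.append((start, i))
            start = None
    return segments


if __name__ == "__main__":
    # Assumed 20 ms frames: a 300 ms "yes" is 15 frames, a 1.5 s sentence is 75.
    yes = [False] * 10 + [True] * 15 + [False] * 10
    sentence = [False] * 10 + [True] * 75 + [False] * 10

    print(detect_segments(yes, min_speech_frames=20))       # short run is dropped
    print(detect_segments(yes, min_speech_frames=10))       # short run is kept
    print(detect_segments(sentence, min_speech_frames=20))  # long run is kept
```

If something like this were in play, the symptom would match the report: the same short phrase is caught or missed depending on how long the speaker happens to hold it.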

Recommended next steps:

Since this appears to be a known issue that Retell support is aware of,(1) I recommend:

  1. Contact Retell support directly with specific call examples showing the issue
  2. Verify which exact s2s_model you’re using (gpt-4o-realtime, gpt-4o-mini-realtime, or gpt-realtime)
  3. Test with different realtime model variants to see if the issue persists
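
When comparing model variants, it may help to log each test call and compute a miss rate per model, so the report to support contains numbers rather than impressions. The sketch below assumes you record each attempt by hand as (model, phrase, detected); the function name miss_rate_by_model and the record format are hypothetical, and the rows in the usage example are placeholders, not measurements.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def miss_rate_by_model(records: Iterable[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Return the fraction of attempts per model where the phrase was not detected.

    Each record is (model name, phrase spoken, whether the agent reacted to it).
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for model, _phrase, detected in records:
        totals[model] += 1
        if not detected:
            misses[model] += 1
    return {model: misses[model] / totals[model] for model in totals}


if __name__ == "__main__":
    # Placeholder rows for illustration only; replace with your own test log.
    log = [
        ("gpt-realtime", "yes", False),
        ("gpt-realtime", "yes please", True),
        ("gpt-4o-realtime", "yes", True),
    ]
    for model, rate in miss_rate_by_model(log).items():
        print(f"{model}: {rate:.0%} missed")
```

Attaching call IDs alongside each logged miss would give support the specific examples requested in step 1.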
