After a user turn, the agent starts speaking, cuts off partway through its utterance, pauses, and then plays the same utterance again from the beginning. The chunk that plays before the cutoff varies (sometimes a single mora, sometimes two or three words), but the pattern is consistent: a partial start, a 1–2 second gap, and then a full restart of the utterance.
We have already tried lowering interruption_sensitivity (currently 0.5); the behavior still reproduces. There is no actual user speech during the gap, so this does not appear to be an interruption being correctly detected.
This is reproducible across Cartesia (Aiko, Yuki) and Minimax (Yumi) voices, so it does not appear to be voice-provider-specific.
Reference call
call_06933d2379c5902fe880d2b5504
- Agent: agent_9fbf26727e8e36c3ad8e50298e (voiceai_stg_3d5a9af7)
- Voice: minimax-Yumi · voice_speed: 1.3 · voice_temperature: 0.8
- Language: multi
- Conversation flow: conversation_flow_01a359b84c29
- LLM: GPT-5.1 (Retell-managed)
- interruption_sensitivity: 0.5, enable_backchannel: false
Transcript with timing (Retell-provided)
[ 14.45s-> 15.25s] agent: チェ
<-── 1.18s gap (no user speech) ──->
[ 16.43s-> 18.43s] agent: チェックイ
[ 18.43s-> 24.62s] agent: ン時間のご変更でございますね。ただいま、チェックイン時間の
変更可否をお調べいたしますので、少々お待ちいただけますでしょうか。
[ 47.71s-> 49.95s] user: の電話番号を教えてもらえますか。
[ 51.65s-> 52.85s] agent: チェッ
[ 52.85s-> 55.65s] agent: クイン時間変更
[ 55.65s-> 61.30s] agent: のご相談先のお電話番号でございますね。いま担当窓口の
連絡先をお調べいたしますので、少々お待ちいただけますでしょうか。
In both turns, the agent begins its reply, stops mid-word, pauses, and then plays the same utterance from the beginning. There is no user speech in the gap.
The chunk that plays before the cutoff is short here, but in other calls two or three full words play before the stop-and-restart.
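To make the pattern easier to check across calls, here is a minimal Python sketch that scans a timed transcript in the format shown above and flags agent segments that restart after a silent gap. It is our own helper, not a Retell API: the names parse_segments and find_restarts, the 0.5s min_gap threshold, and the "next segment starts with the previous segment's text" heuristic are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Matches lines such as "[ 14.45s-> 15.25s] agent: チェ"
SEGMENT_RE = re.compile(r"\[\s*([\d.]+)s->\s*([\d.]+)s\]\s*(agent|user):\s*(.*)")


@dataclass
class Segment:
    start: float
    end: float
    speaker: str
    text: str


def parse_segments(transcript: str) -> list[Segment]:
    """Parse timed transcript lines; wrapped text is appended to the previous segment."""
    segments: list[Segment] = []
    for raw in transcript.splitlines():
        line = raw.strip()
        m = SEGMENT_RE.match(line)
        if m:
            segments.append(Segment(float(m[1]), float(m[2]), m[3], m[4].strip()))
        elif segments and line and not line.startswith("<"):
            # Continuation of a wrapped utterance (gap annotations start with "<").
            segments[-1].text += line
    return segments


def find_restarts(segments: list[Segment], min_gap: float = 0.5):
    """Return (partial, restart, gap) for consecutive agent segments where the
    second begins after a silent gap and repeats the first one's text."""
    hits = []
    for prev, cur in zip(segments, segments[1:]):
        if prev.speaker != "agent" or cur.speaker != "agent":
            continue  # a user segment in between would break adjacency
        gap = round(cur.start - prev.end, 2)
        if gap >= min_gap and cur.text.startswith(prev.text):
            hits.append((prev, cur, gap))
    return hits


SAMPLE = """\
[ 14.45s-> 15.25s] agent: チェ
<-── 1.18s gap (no user speech) ──->
[ 16.43s-> 18.43s] agent: チェックイ
[ 18.43s-> 24.62s] agent: ン時間のご変更でございますね。ただいま、チェックイン時間の
変更可否をお調べいたしますので、少々お待ちいただけますでしょうか。
[ 47.71s-> 49.95s] user: の電話番号を教えてもらえますか。
[ 51.65s-> 52.85s] agent: チェッ
[ 52.85s-> 55.65s] agent: クイン時間変更
"""

restarts = find_restarts(parse_segments(SAMPLE))
for partial, restart, gap in restarts:
    print(f"{gap}s gap, then restart: {partial.text!r} -> {restart.text!r}")
```

On the excerpt above this flags the first turn (a 1.18s gap between チェ and チェックイ). The second turn's segments are contiguous in the transcript timestamps, so this check alone would not flag it; the restart there would have to be confirmed from the call recording.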
Our agent design (relevant context)
- Conversation-flow based (not single-prompt). 5 nodes: greeting → rag_lookup (function) → rag_response (conversation) → anything_else (conversation) → farewell.
- One custom tool: rag_lookup_<supplier_id> — POST to our backend’s RAG endpoint, 15s timeout.
- Speak-during-execution enabled on the tool node.
- language: “multi”, normalize_for_speech: false, enable_backchannel: false.
- interruption_sensitivity: 0.5 (we’ve already tried tuning this).
What we’d like to know
Why does the agent start an utterance, stop partway through, and then replay the same utterance from the beginning when there is no user speech in the gap? It looks as if Retell is either detecting a phantom interruption or hitting some other condition that aborts the audio stream, and then regenerating the full utterance for replay. Because the pattern still reproduces at interruption_sensitivity 0.5 and across both Cartesia and Minimax voices, we don’t believe the root cause lies in the voice provider or in interruption-sensitivity tuning.