Severity: High — caller hears English system-prompt text in the middle of a Japanese conversation, including the literal tool ID.
Summary
When an agent invokes a custom tool with speak_during_execution: true, the instruction template that tells the LLM what filler to say is concatenated into the agent’s spoken output instead of being consumed as a prompt. The English system text then plays out loud through TTS, sometimes twice in the same turn.
Reproduction
- Agent ID: agent_2c260cc96dc5143125a5664790
- Conversation flow: conversation_flow_963b78c1fdbd (single custom tool: rag_lookup_<supplier_id>)
- Voice: Cartesia (Aiko)
- LLM: Retell-managed (GPT-5.1)
- Language: multi
- speak_during_execution: true on the tool node
- Call type in our reproduction: web call
- Caller utterances: Japanese
- Frequency: at least 2 of ~8 test calls in one test session
Reference calls
Example 1 — short leak
call_69cf7c1215dfefe7eb4005e9ab1
https://dashboard.retellai.com/call-history?history=call_69cf7c1215dfefe7eb4005e9ab1
Excerpt of what the caller heard (one continuous agent utterance):
▎ お子様が0歳で、プールをご利用いただけるかどうかでございますね。こちらを確認いたしますので、少々お待ちくださいませ。
▎ Now you are invoking function.
▎ お子様が0歳でプールをご利用いただけるかについてでございますね。こちらを確認いたしますので、少々お待ちくださいませ。
The English string “Now you are invoking function.” is spoken aloud, sandwiched between two paraphrases of the same Japanese filler line.
Example 2 — full instruction-template leak
call_7dec8e72f3a888115a1aaef8ac6
https://dashboard.retellai.com/call-history?history=call_7dec8e72f3a888115a1aaef8ac6
Excerpt:
▎ レストランでの喫煙についてでございますね。館内のルールをお調べいたしますので、少々お待ちくださいませ。
▎ Now you are invoking function tool rag_lookup_b28fdb84-b7e8-5f97-bab4-324df5d6e43f, and you should come up with what to say to user during execution given this instruction: Acknowledge the caller’s question by briefly paraphrasing what
▎ they asked about, then let them know you’re looking it up. Keep it natural and conversational. Example: “Let me check on [topic] for you, one moment please.”
▎ レストランでの喫煙についてでございますね。館内のルールをお調べいたしますので、少々お待ちくださいませ。
▎ Now you are invoking function tool rag_lookup_b28fdb84-b7e8-5f97-bab4-324df5d6e43f, and you should come up with what to say to user during execution given this instruction: …
In this call the leak is the complete speak_during_execution instruction template, including the literal tool ID and the prompt’s own example sentence (“Let me check on [topic] for you, one moment please.”). It plays through twice in the same utterance.
What we’d expect
The speak_during_execution instruction is a system-level directive telling the LLM what filler to produce; only the LLM’s response to that instruction (the filler itself) should reach the TTS. The instruction text and the tool ID should never be spoken.
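To make this expectation checkable against call transcripts, here is a minimal sketch of a leak detector. It is our own triage helper and an assumption on our side, not part of any Retell SDK: the names find_leaks and is_clean are hypothetical, the marker phrase is taken from the two reference calls above, and the tool ID pattern assumes the rag_lookup_ prefix followed by a UUID, as seen in Example 2.

```python
import re

# Marker phrase observed at the start of the leaked text in both reference calls.
LEAK_MARKER = re.compile(r"Now you are invoking function", re.IGNORECASE)

# Assumed tool ID shape: "rag_lookup_" followed by a UUID (8-4-4-4-12 hex digits).
TOOL_ID = re.compile(r"rag_lookup_[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def find_leaks(utterance: str) -> dict:
    """Count leak markers and collect distinct tool IDs in one agent utterance."""
    return {
        "marker_count": len(LEAK_MARKER.findall(utterance)),
        "tool_ids": sorted(set(TOOL_ID.findall(utterance))),
    }


def is_clean(utterance: str) -> bool:
    """True when the utterance contains neither the marker phrase nor a tool ID."""
    result = find_leaks(utterance)
    return result["marker_count"] == 0 and not result["tool_ids"]
```

Running find_leaks over the agent utterance from Example 1 should report one marker and no tool IDs; the utterance from Example 2 should report two markers and one distinct tool ID. The check only flags the two strings we have actually observed, so other leaked prompt text would need additional patterns.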
What we’d like investigated
- Why is the speak_during_execution instruction text being forwarded to TTS along with the LLM’s filler response?
- Is this consistent across LLM versions (GPT-5.1 vs. GPT-4.1)? We suspect, but have not confirmed, that this regressed with GPT-5.1.
- Is there a way to suppress the instruction echo — e.g. a flag on the tool, or a known-good prompt template — until the upstream fix lands?