Is there any guidance anywhere regarding improving transcription accuracy with custom settings?
Changing the endpointing slider does not reflect accurate changes in the latency number at the top of the dashboard.
Yes, there is guidance for improving transcription accuracy with custom settings.(1)(2)(3)
Custom Transcription Configuration
You can configure custom transcription settings using the custom_stt_config field, which allows you to specify the provider and endpointing settings.(3)
The configuration includes:(3)
- provider: Choose between providers like "azure"
- endpointing_ms: Set the endpointing value in milliseconds

Transcription Mode Options
Retell AI offers two main transcription modes to balance accuracy and latency:(2)
From benchmarking, both modes have similar WER (Word Error Rate).(2) The main difference lies in capturing entities like numbers and dates.(2) If your use case relies heavily on capturing these entities well, you should use the optimize for accuracy mode.(2) Otherwise, you can use the optimize for speed mode for best latency.(2)
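As a concrete sketch of the configuration described above (the field names stt_mode, custom_stt_config, provider, and endpointing_ms come from the cited docs; the surrounding payload shape and the values are assumptions, not official defaults):

```python
# Hypothetical agent-config fragment for custom STT settings.
# Field names are those cited in this thread; the values are illustrative only.
agent_config = {
    "stt_mode": "custom",
    "custom_stt_config": {
        "provider": "azure",     # or "deepgram", depending on your language
        "endpointing_ms": 200,   # longer values wait more before finalizing
    },
}
print(agent_config["custom_stt_config"]["endpointing_ms"])  # 200
```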
Additional Accuracy Improvements
For special words or domain-specific terms, you can use Boosted Keywords:(1)
If transcription errors occur due to background noise or speech, you can adjust the denoising mode setting.(1)
If sentences are being cut off prematurely, you can switch the transcription mode to optimize for accuracy.(1)
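As a sketch, a boosted-keywords list might be attached to the agent configuration like this (the boosted_keywords field name comes from the cited docs; the example keywords are made up, and the 100-entry cap is the limit mentioned elsewhere in this thread):

```python
# Hypothetical sketch: biasing the transcriber toward domain-specific terms.
# "boosted_keywords" is the field named in the cited docs; the keywords
# themselves are invented examples. The docs mention support for up to 100.
boosted_keywords = ["Retell", "Fetherling", "Elm Street Clinic"]
assert len(boosted_keywords) <= 100

agent_config = {"boosted_keywords": boosted_keywords}
print(len(agent_config["boosted_keywords"]))  # 3
```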
Regarding your specific issue with the endpointing slider not reflecting accurate changes in the latency number at the top of the dashboard, I don’t have information in the available sources that directly addresses this dashboard behavior issue.
Hello @mikefmail
For custom transcription, it depends on which language you want to use; some languages are better with Deepgram and others with Azure. I would set endpointing to 0.2s and test both providers to see what works best for you.
I understand you’re looking for guidance on custom transcription settings in Retell AI. Based on the available documentation, I can provide some context about transcription configuration.
Retell AI supports custom STT (Speech-to-Text) configuration through the custom_stt_config object(1)(2)(3). When you set stt_mode to "custom", you can configure the custom_stt_config with a provider field (such as "azure") and an endpointing_ms parameter(2).
The endpointing_ms parameter in the custom STT configuration controls the endpointing behavior for speech recognition(1)(2)(3). However, the specific optimal values for different providers and languages aren’t detailed in the official documentation I have access to.
Regarding ASR quality, Retell AI has acknowledged that ASR quality directly impacts every part of the AI voice agent experience(4). Strong ASR delivers faster, more accurate conversations and better intent recognition(4).
For improving transcription accuracy, you can also use boosted_keywords to provide a customized list of keywords to bias the transcriber model, making certain words more likely to get transcribed correctly(1)(5)(2)(3). This is commonly used for names, brands, and street names(1).
Hey! Thanks for the reply! I’m still a little unsure - I’m working with the English language. If I select “Optimize for accuracy”, the tooltip says Deepgram 300ms and the latency reporter at the top of the screen says Transcription 70-250ms. However, if I use “customize settings” and set it to 400ms, the latency is reported as 20-50ms. This is confusing, and I’m just not sure how to set custom settings to get better transcription results than the “Optimize for accuracy” defaults.
The confusion you’re experiencing is common - the latency shown at the top is overall pipeline latency (ASR + LLM + TTS), not just the ASR setting you changed(1). Even if you adjust the transcription config, the displayed latency may not shift much unless the change directly impacts the full processing chain(1).
For improving transcription accuracy specifically with English, the documentation recommends using Boosted Keywords(2). You can add custom keywords to enhance the model’s vocabulary, with support for up to 100 custom keywords(2). This is particularly helpful for specific words or domain-specific terms that might be missing from transcripts(2).
Additionally, if transcription quality suffers because sentences are being cut off (the transcriber finalizes a sentence before it should), you can switch the transcription mode to optimize for accuracy(2).
The endpointing setting you’re adjusting controls when the system decides the speaker has finished, but it primarily affects transcription quality rather than the overall latency metric displayed at the top(3). The displayed latency encompasses the entire conversation pipeline, so ASR endpointing changes alone won’t dramatically shift that number(3).
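To make the distinction concrete, the dashboard figure roughly reflects a sum over pipeline stages, so a change to one stage moves the total far less than it moves that stage alone (the component numbers below are made-up examples, not Retell measurements):

```python
# Illustrative arithmetic only: the displayed latency covers the whole
# ASR + LLM + TTS pipeline, so adjusting one component barely shifts
# the total, proportionally. Values are invented for illustration.
stage_latency_ms = {"asr": 50, "llm": 400, "tts": 150}
total_ms = sum(stage_latency_ms.values())
print(total_ms)  # 600

# Doubling the ASR-side delay changes the total by only ~8%.
stage_latency_ms["asr"] = 100
print(sum(stage_latency_ms.values()))  # 650
```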
Hello @mike.fetherling,
Thanks for pointing that out. It seems like a UI issue where the estimated transcription latency is not updating, and I’ve reported it to the team.
In general, increasing the endpointing value will result in better accuracy but higher latency, so test it and find the best balance for your use case.
You can also use boosted keywords for domain-specific terms.