NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents.
Here’s a voice agent built with this new model. It finalizes transcription in 24ms, and total voice-to-voice inference time is under 500ms.
This agent actually uses *three* NVIDIA open source models:
- Nemotron Speech ASR
- Nemotron 3 Nano 30B in a 4-bit quant (released in December)
- A preview checkpoint of the upcoming Magpie text-to-speech model
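Why the 4-bit quant matters for running locally: a rough back-of-the-envelope sketch, assuming the LLM has roughly 30 billion parameters and counting weights only. The function name and the parameter count are illustrative assumptions, and the estimate ignores the KV cache, activations, and the per-group scale metadata that real quantization formats add.

```python
def quantized_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# Assumption: ~30 billion parameters (nominal model size, not an exact count).
PARAMS = 30e9

print(quantized_weight_gb(PARAMS, 4))   # 4-bit weights
print(quantized_weight_gb(PARAMS, 16))  # 16-bit weights, for comparison
```

At 4 bits the weights come to about 15 GB versus about 60 GB at 16 bits, which is the difference between fitting and not fitting in the 32 GB of memory on a single RTX 5090.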
These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.)
The code for this agent is open source too, of course. You can deploy it to production with
and
cloud, or run locally on an
DGX Spark or RTX 5090.