Building voice agents with Nvidia open models

https://news.ycombinator.com/rss Hits: 10
Summary

How to Build Ultra-low-latency Voice Agents With NVIDIA Cache-aware Streaming ASR

This post accompanies the launch of NVIDIA Nemotron Speech ASR on Hugging Face. Read the full model announcement here.

In this post, we’ll build a voice agent using three NVIDIA open models. This voice agent leverages the new streaming ASR model, Pipecat’s low-latency voice agent building blocks, and some fun code experiments to optimize all three models for very fast response times.

All the code for the post is here in this GitHub repository.

You can clone the repo and run this voice agent:

- Scalably for multi-user workloads on the Modal cloud platform.
- On an NVIDIA DGX Spark or RTX 5090 for single-user, local development and experimentation.

Feel free to just jump over to the code. Or read on for technical notes about building fast voice agents and the NVIDIA open models.

The state of voice AI agents in 2026

Voice agent deployments are growing by leaps and bounds across a wide range of use cases. For example, we’re seeing voice agents used at scale today in:

- Customer support
- Answering the phone for small businesses (for example, restaurants)
- User research
- Outbound phone calls to prepare patients for healthcare appointments
- Validation workflows for loan applications
- And many, many other scenarios

Both startups and large, established companies are building voice agents that are successful in real-world deployments. The best voice agents today achieve very high “task completed” success metrics and customer satisfaction scores.

Voice AI architecture

As is the case with everything in AI, voice agent technology is evolving rapidly.

Today, there are two ways to build voice agents.

- Most production voice agents use specialized models together in a pipeline – a speech-to-text model, a text-mode LLM, and a text-to-speech model.
- Voice agent developers are beginning to experiment with new speech-to-speech models that take voice input directly and output audio instead of text.

On the left, a block diagram of a voic...
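
To make the three-model pipeline described above concrete, here is a minimal sketch in Python of how a speech-to-text stage, a text-mode LLM stage, and a text-to-speech stage chain together for one conversational turn. It is illustrative only and is not the code from the post's repository: the class name `CascadedVoicePipeline` and the three stub functions are hypothetical stand-ins, not NVIDIA or Pipecat APIs, and a real agent would stream audio incrementally rather than process a whole utterance at once.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadedVoicePipeline:
    """Hypothetical container for the three stages of a cascaded voice agent."""

    transcribe: Callable[[bytes], str]   # speech-to-text stage
    generate: Callable[[str], str]       # text-mode LLM stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def respond(self, audio_in: bytes) -> bytes:
        # Run the stages in order: audio -> text -> reply text -> audio.
        text_in = self.transcribe(audio_in)
        text_out = self.generate(text_in)
        return self.synthesize(text_out)


# Stub stages (assumptions for illustration): "audio" here is just UTF-8 bytes.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadedVoicePipeline(fake_stt, fake_llm, fake_tts)
reply = pipeline.respond(b"hello")
```

Because each stage is just a callable, any one of them can be swapped for a real model client without touching the other two, which is the main practical appeal of the pipeline approach over a single speech-to-speech model.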

First seen: 2026-01-07 18:44

Last seen: 2026-01-08 03:46