Ask HN: What's the current best local/open speech-to-speech setup?

I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).

Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc.), but I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.

What are people actually using in 2026 if they want open + local voice?

- Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together?
- If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?
- What’s the most “works today” combo on a single GPU?
- Bonus: rough numbers people see for mic → first audio back.

Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
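
For concreteness, this is roughly the naive “glued together” baseline I mean (a sketch, not something I’m claiming is the right answer): blocking mic capture → faster-whisper for ASR → a local LLM behind an Ollama endpoint → Piper on the CLI for TTS. The model names, voice file, and aplay playback are placeholders I picked for illustration, and there’s deliberately no streaming or barge-in here, which is exactly the part I’m asking about.

    # Naive turn-based voice loop: record -> faster-whisper ASR -> Ollama LLM -> Piper TTS.
    # No streaming, no barge-in; model names / voice file / aplay are placeholder assumptions.
    import subprocess
    import requests
    import sounddevice as sd
    from faster_whisper import WhisperModel

    SAMPLE_RATE = 16000
    asr = WhisperModel("small", device="cuda", compute_type="float16")

    def record_utterance(seconds=5):
        """Block while recording a fixed-length utterance from the default mic."""
        audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                       channels=1, dtype="float32")
        sd.wait()
        return audio.squeeze()

    def transcribe(audio):
        """Run faster-whisper over the captured audio and join the segments."""
        segments, _info = asr.transcribe(audio, language="en")
        return " ".join(seg.text for seg in segments).strip()

    def ask_llm(prompt):
        """Single non-streaming completion from a local Ollama server."""
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
                          timeout=120)
        r.raise_for_status()
        return r.json()["response"]

    def speak(text, voice="en_US-lessac-medium.onnx"):
        """Synthesize with the Piper CLI (text on stdin) and play the wav with aplay."""
        subprocess.run(["piper", "--model", voice, "--output_file", "reply.wav"],
                       input=text.encode(), check=True)
        subprocess.run(["aplay", "reply.wav"], check=True)

    if __name__ == "__main__":
        while True:
            heard = transcribe(record_utterance())
            if not heard:
                continue
            print("you:", heard)
            reply = ask_llm(heard)
            print("assistant:", reply)
            speak(reply)

Even this naive version “works” for testing; the question is what people use to collapse the ASR → LLM → TTS latency into something that feels realtime and can be interrupted mid-reply.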
