A 30B Qwen Model Walks into a Raspberry Pi and Runs in Real Time

https://news.ycombinator.com/rss

Summary

For this release, we optimize for what people actually experience when they run a model: fast, high-quality responses on a specific target device. We use ShapeLearn, our bitlength learning method, to choose weight datatypes for Qwen3-30B-A3B-Instruct-2507 that maximize performance in terms of tokens per second (TPS) and output quality, with one practical constraint: the model must fit comfortably in the available memory. Once it fits, making the file smaller isn't a goal by itself. We only shrink it further when doing so also improves the tradeoff people actually care about: speed vs. quality.

Approaching bitlength learning this way matters because in llama.cpp, "fewer bits" doesn't automatically mean "more speed." Different quantization formats can trigger different kernels and overheads, and on some GPUs, going to lower bitlengths can even make inference slower, despite using less memory.

Bottom line: treat memory as a budget to meet, then optimize what matters most, TPS and quality.

CPUs

On CPUs, reducing the footprint via shorter bitlengths affects the TPS/accuracy tradeoff as one would expect: once the model fits, a smaller footprint tends to increase TPS in a fairly monotonic way. If datatypes are selected correctly, you can trade a bit of quality for speed predictably, which makes it much easier to pick a point on the curve that matches your constraints.

We'll start with the most memory-constrained CPU case (Raspberry Pi 5 16GB), where "fits in RAM" is the limiting factor, then move to an Intel i7 with 64GB, where everything fits.

Raspberry Pi 5

The figure below shows TPS vs. normalized accuracy for the models that fit in RAM on the Raspberry Pi 5 16GB.

[Figure: Raspberry Pi 5: Tokens per second vs. quality (bubble size = model footprint)]

Notably, sustaining 8.5 TPS at 92%+ of baseline accuracy with a 30B model on a Raspberry Pi reshapes expectations for Pi-class systems.
Overall, the trend shows that ShapeLearn consistently pr...
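The memory-budget-first selection rule described in the summary can be sketched in a few lines of Python. This is a hypothetical illustration, not ShapeLearn and not the authors' tooling: the `Variant` fields, the 2 GB headroom, the weights-only footprint estimate (parameters × average bits per weight ÷ 8), and every number in the usage example are assumptions chosen for demonstration, not measurements from the article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    """A quantized build of one model (all fields hypothetical/illustrative)."""
    name: str
    params: float            # total parameter count
    bits_per_weight: float   # average bits per weight across all tensors
    tps: float               # tokens per second measured on the target device
    accuracy: float          # quality normalized to the unquantized baseline (1.0 = baseline)

    def footprint_gb(self) -> float:
        # Weights only: params * bits / 8 bytes. Ignores KV cache,
        # activations, and runtime buffers, which is why fits() adds headroom.
        return self.params * self.bits_per_weight / 8 / 1e9


def fits(v: Variant, ram_gb: float, headroom_gb: float) -> bool:
    # "Fits comfortably": reserve headroom for the OS, KV cache, and runtime.
    return v.footprint_gb() + headroom_gb <= ram_gb


def pareto_front(variants: List[Variant]) -> List[Variant]:
    # Keep variants that no other variant matches or beats on both TPS and
    # accuracy while being strictly better on at least one of them.
    front = []
    for v in variants:
        dominated = any(
            o.tps >= v.tps and o.accuracy >= v.accuracy
            and (o.tps > v.tps or o.accuracy > v.accuracy)
            for o in variants
        )
        if not dominated:
            front.append(v)
    return sorted(front, key=lambda v: v.tps)


def pick(variants: List[Variant], ram_gb: float, headroom_gb: float,
         min_accuracy: float) -> Optional[Variant]:
    # 1) Memory is a hard budget: discard anything that does not fit.
    candidates = [v for v in variants if fits(v, ram_gb, headroom_gb)]
    # 2) Among what fits, only frontier points are worth considering.
    front = pareto_front(candidates)
    # 3) Take the fastest frontier point that meets the quality floor.
    acceptable = [v for v in front if v.accuracy >= min_accuracy]
    return max(acceptable, key=lambda v: v.tps, default=None)


# Hypothetical variants of a ~30B-parameter model; the numbers are made up.
variants = [
    Variant("variant-a", 30e9, 8.5, 3.0, 1.00),
    Variant("variant-b", 30e9, 4.5, 5.5, 0.98),
    Variant("variant-c", 30e9, 3.5, 7.0, 0.95),
    Variant("variant-d", 30e9, 2.8, 9.0, 0.88),
    Variant("variant-e", 30e9, 3.5, 6.5, 0.93),
]

best = pick(variants, ram_gb=16.0, headroom_gb=2.0, min_accuracy=0.92)
print(best.name if best else "nothing fits")  # prints: variant-c
```

Filtering by memory first and only then comparing speed and quality mirrors the framing above: bits are spent to meet the budget, and among the builds that fit, a lower-bit variant is preferred only when it sits on the TPS/quality frontier and clears the quality floor. Here variant-d is faster but falls below the hypothetical 0.92 floor, and variant-e is dominated by variant-c.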

First seen: 2026-01-06 21:38

Last seen: 2026-01-07 18:44
