Compiling models to megakernels

https://news.ycombinator.com/rss Hits: 3
Summary

Luminal is an inference compiler, and as such we're interested in driving inference right up to the physical limits of the hardware. Inference has two fundamental limitations: compute (FLOPs) and memory bandwidth (TB/s). Increasing either requires buying much more expensive hardware, so we want to make sure we're using all the compute and bandwidth we already have available to us! This basically boils down to: anytime the GPU is not loading data, we're wasting bandwidth, and anytime the GPU is not computing, we're wasting compute.

Let's look at a typical timeline of executing a transformer layer:

[Figure: A simplified view of a typical transformer forward pass]

We see two problems immediately:

1. Every time we finish one kernel and start another, the GPU sits idle while the CPU launches the next kernel.
2. Some streaming multiprocessors (SMs) in the GPU finish their work early and sit idle while other SMs finish the remaining work.

Kernel launch overhead is well known and can be partially mitigated with techniques like CUDA Graphs on Nvidia GPUs. This isn't perfect, though, as Hazy Research demonstrated in their original megakernel post: a dummy kernel that does no work ordinarily takes 2.1 microseconds to launch, and even with CUDA Graphs enabled it still takes 1.3 microseconds!

The second issue is also a well-known phenomenon called wave quantization, which occurs when a kernel's work cannot be evenly distributed across all SMs, leaving some SMs to finish early and stall while others lag behind to finish the kernel. Depending on the total runtime of the kernels and the shape of the work, these gaps can become very significant!

Due to the nature of the tensor computations we're interested in, we don't actually have to wait for a full synchronization to begin the next op. Take a tiled matmul, for example:

[Figure: Data access patterns of a tiled matmul]

This operation does not need to wait for all of tensor A or all of tensor B to begin computing, since it only consumes a stripe of tiles from both A and B. So long as that st...
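As a concrete illustration of the launch-overhead mitigation mentioned above, here is a minimal sketch of capturing repeated kernel launches into a CUDA graph and replaying it. This is generic CUDA code, not Luminal's implementation; the empty dummy_kernel, the launch count, and the launch configuration are illustrative assumptions.

// Hedged sketch: capture repeated launches of an empty kernel into a CUDA
// graph, then replay the graph. Illustrates the launch-overhead mitigation
// the post mentions; kernel, counts, and shapes are illustrative only.
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}  // does no work, like the post's example

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture 100 launches into a graph instead of issuing them one by one.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 100; ++i) {
        dummy_kernel<<<1, 32, 0, stream>>>();
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);  // available since CUDA 11.4

    // One replay launches the whole captured sequence, amortizing per-kernel
    // CPU launch cost; as Hazy Research measured, it is reduced, not removed.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}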
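The wave quantization stall described above comes down to simple arithmetic. The sketch below is an illustrative host-side calculation under assumed numbers (a 132-SM GPU and a 200-block launch are examples, not figures from the post).

// Hedged sketch of wave quantization arithmetic: if a kernel's blocks don't
// divide evenly across SMs, the last "wave" leaves some SMs idle until the
// kernel finishes. All numbers below are illustrative assumptions.
#include <cstdio>

int main() {
    const int num_sms = 132;     // e.g. an H100-class GPU
    const int num_blocks = 200;  // blocks launched by the kernel
    // Assume one resident block per SM per wave for simplicity.
    const int full_waves = num_blocks / num_sms;              // 1
    const int tail_blocks = num_blocks % num_sms;             // 68
    const double tail_util = 100.0 * tail_blocks / num_sms;   // ~51.5%
    printf("%d full wave(s), then a tail wave using %d/%d SMs (%.1f%%)\n",
           full_waves, tail_blocks, num_sms, tail_util);
    // During that tail wave the remaining SMs sit idle until the kernel ends:
    // the stall the post calls wave quantization.
    return 0;
}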
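To make the tiled-matmul access pattern concrete, here is a minimal, generic shared-memory tiled matmul kernel (a sketch, not Luminal's code; the TILE size and matrix shapes are assumptions). The block computing output tile (by, bx) reads only row stripe by of A and column stripe bx of B, which is why it does not depend on the whole of either input being ready.

// Hedged sketch of a shared-memory tiled matmul, C (M x N) = A (M x K) * B (K x N),
// all row-major. Each block touches one row stripe of A and one column stripe of B.
#include <cuda_runtime.h>

#define TILE 32

__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // stays inside one row stripe of A
    int col = blockIdx.x * TILE + threadIdx.x;  // stays inside one column stripe of B
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}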

First seen: 2026-01-26 06:56

Last seen: 2026-01-26 08:57