Compiling models to megakernels

https://news.ycombinator.com/rss Hits: 3
Summary

Luminal is an inference compiler, and as such we're interested in driving inference right up to the physical limits of the hardware. Inference has two fundamental limitations: compute (FLOPs) and memory bandwidth (TB/s). Increasing either requires buying much more expensive hardware, so we want to make sure we're using all the compute and bandwidth we already have available to us! This basically boils down to: anytime the GPU is not loading data, we're wasting bandwidth, and anytime the GPU is not computing, we're wasting compute.

Let's look at a typical timeline of executing a transformer layer:

[Figure: A simplified view of a typical transformer forward pass]

We see two problems immediately:

1. Every time we finish one kernel and start another, the GPU sits idle while the CPU launches the next kernel.
2. Some streaming multiprocessors (SMs) in the GPU finish their work early and sit idle while other SMs finish the remaining work.

Kernel launch overhead is well known and can be partially mitigated with techniques like CUDA Graphs on Nvidia GPUs. This isn't perfect, though, as Hazy Research demonstrated in their original megakernel post: a dummy kernel that does no work ordinarily takes 2.1 microseconds to launch, and even with CUDA Graphs enabled it still takes 1.3 microseconds!

The second issue is also a well-known phenomenon called wave quantization, which occurs when a kernel's work cannot be evenly distributed across all SMs, leaving some SMs to finish early and stall while others lag behind to finish the kernel. Depending on the total runtime of the kernels and the shape of the work, these gaps can become very significant!

Due to the nature of the tensor computations we're interested in, we don't actually have to wait for a full synchronization to begin the next op. Take a tiled matmul, for example:

[Figure: Data access patterns of a tiled matmul]

This operation does not need to wait for all of tensor A or all of tensor B to begin computing, since it only consumes a stripe of tiles from both A and B. So long as that st...
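As a concrete illustration of the launch-overhead mitigation mentioned above, here is a minimal sketch of capturing repeated kernel launches into a CUDA graph and replaying it. This is generic CUDA code, not Luminal's implementation; the empty dummy_kernel, the launch count, and the launch configuration are illustrative assumptions.

// Hedged sketch: capture repeated launches of an empty kernel into a CUDA
// graph, then replay the graph. Illustrates the launch-overhead mitigation
// the post mentions; kernel, counts, and shapes are illustrative only.
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}  // does no work, like the post's example

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture 100 launches into a graph instead of issuing them one by one.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 100; ++i) {
        dummy_kernel<<<1, 32, 0, stream>>>();
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);  // available since CUDA 11.4

    // One replay launches the whole captured sequence, amortizing per-kernel
    // CPU launch cost; as Hazy Research measured, it is reduced, not removed.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}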
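The wave quantization stall described above comes down to simple arithmetic. The sketch below is an illustrative host-side calculation under assumed numbers (a 132-SM GPU and a 200-block launch are examples, not figures from the post).

// Hedged sketch of wave quantization arithmetic: if a kernel's blocks don't
// divide evenly across SMs, the last "wave" leaves some SMs idle until the
// kernel finishes. All numbers below are illustrative assumptions.
#include <cstdio>

int main() {
    const int num_sms = 132;     // e.g. an H100-class GPU
    const int num_blocks = 200;  // blocks launched by the kernel
    // Assume one resident block per SM per wave for simplicity.
    const int full_waves = num_blocks / num_sms;              // 1
    const int tail_blocks = num_blocks % num_sms;             // 68
    const double tail_util = 100.0 * tail_blocks / num_sms;   // ~51.5%
    printf("%d full wave(s), then a tail wave using %d/%d SMs (%.1f%%)\n",
           full_waves, tail_blocks, num_sms, tail_util);
    // During that tail wave the remaining SMs sit idle until the kernel ends:
    // the stall the post calls wave quantization.
    return 0;
}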
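To make the tiled-matmul access pattern concrete, here is a minimal, generic shared-memory tiled matmul kernel (a sketch, not Luminal's code; the TILE size and matrix shapes are assumptions). The block computing output tile (by, bx) reads only row stripe by of A and column stripe bx of B, which is why it does not depend on the whole of either input being ready.

// Hedged sketch of a shared-memory tiled matmul, C (M x N) = A (M x K) * B (K x N),
// all row-major. Each block touches one row stripe of A and one column stripe of B.
#include <cuda_runtime.h>

#define TILE 32

__global__ void tiled_matmul(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // stays inside one row stripe of A
    int col = blockIdx.x * TILE + threadIdx.x;  // stays inside one column stripe of B
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}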

First seen: 2026-01-26 06:56

Last seen: 2026-01-26 08:57