Batmobile: 10-20x Faster CUDA Kernels for Equivariant Graph Neural Networks

https://news.ycombinator.com/rss Hits: 7
Summary

Custom CUDA kernels that eliminate the computational bottlenecks in spherical harmonics and tensor product operations - the core primitives of equivariant GNNs like MACE, NequIP, and Allegro.

The Problem: Equivariant GNNs Are Beautiful but Slow

Equivariant graph neural networks have revolutionized atomistic machine learning. Models like MACE, NequIP, and Allegro achieve state-of-the-art accuracy in molecular dynamics simulations, materials property prediction, and drug discovery. Their secret: they respect the fundamental symmetries of physical systems - rotation, translation, and reflection invariance.

But this mathematical elegance comes at a computational cost. The operations that make these models work - spherical harmonics and Clebsch-Gordan tensor products - are expensive. A single MACE layer can spend 80% of its forward-pass time in these two operations.

This matters for real applications. Molecular dynamics simulations run billions of timesteps. Battery materials discovery screens millions of candidates. Drug binding affinity predictions evaluate thousands of poses. When each forward pass takes milliseconds instead of microseconds, these workflows become impractical.

Understanding the Bottleneck

To understand why equivariant GNNs are slow, we need to understand what they compute.

Spherical Harmonics: Encoding Directions

When two atoms interact, the direction of their bond matters. A carbon-carbon bond pointing "up" is physically different from one pointing "right" - and our neural network needs to know this.

Spherical harmonics (Y_lm) provide a mathematically principled way to encode 3D directions.
Given a unit vector (x, y, z), spherical harmonics compute a set of features that transform predictably under rotation:

L=0: 1 component (scalar, rotationally invariant)
L=1: 3 components (vector, transforms like a 3D vector)
L=2: 5 components (transforms like a symmetric traceless matrix)
L=3: 7 components ...
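The layout above can be sketched in a few lines of Python. This is a minimal, unnormalized illustration of the Cartesian forms up to L=2; the ordering and scaling here are assumptions for clarity, not the convention of any particular library (real implementations such as e3nn apply specific normalizations), and the function name is hypothetical:

```python
# Illustrative, unnormalized Cartesian forms of the real spherical
# harmonics up to L=2 for a unit vector (x, y, z). Ordering and scaling
# are assumptions for clarity, not any library's actual convention.
def sph_harm_features(x, y, z):
    l0 = [1.0]                        # L=0: 1 component, rotation invariant
    l1 = [x, y, z]                    # L=1: 3 components, a 3-vector
    l2 = [x * y, y * z,               # L=2: 5 components, the independent
          3 * z * z - 1.0,            #      entries of a symmetric
          x * z, x * x - y * y]       #      traceless matrix
    return l0, l1, l2

# Each degree L contributes 2L + 1 components.
l0, l1, l2 = sph_harm_features(0.0, 0.0, 1.0)
print([len(v) for v in (l0, l1, l2)])  # [1, 3, 5]
```

Note how the L=0 feature never changes under rotation, while the L=1 features are just the vector's own components - this predictable transformation behavior is exactly what makes the encoding equivariant.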

First seen: 2026-01-21 11:39

Last seen: 2026-01-21 17:40