AVX-512: First Impressions on Performance and Programmability


Summary

This is my attempt to explore the SIMD paradigm. I come at it as someone who has worked with other parallelization models: threads, distributed systems, and GPUs. SIMD has been my one blind spot. For a while I was okay with that. My tech diet hasn’t been very kind to SIMD, AVX-512 in particular. There were reports about CPU heating and downclocking (probably not true anymore), and when the hardware did seem to work as promised, taking advantage of it from software wasn’t straightforward (this one is probably still true).

My goal here is two-fold:

1) Performance: How much scaling we can actually get from all these extra lanes with reasonable development effort. Ideally, it should be 16x for single-precision, since a 512-bit register holds sixteen 32-bit floats.

2) Programmability: Contrasting the SIMD way of thinking about parallel programs with SIMT (Single Instruction Multiple Threads), specifically CUDA. (SPMD is probably a better term, but I’ll stick with SIMT here.)

Benchmark Problem

Finding a good problem for this is actually not that trivial. The number of problems that 1) can be meaningfully accelerated by SIMD and 2) can be quickly explained in a blog post is not very large. The issue is memory, which is often the bottleneck for interesting problems, but ideal SIMD speedup can only come from problems that are compute bound.

Here’s arguably the most well-known example people use to introduce SIMD. It also features in an interestingly titled talk from CppNow that I recently found, “How to Leverage SIMD Intrinsics for Massive Slowdowns”:

    #include <cstddef>

    // Element-wise multiply-add: out[i] = a[i] * x[i] + b[i]
    void axpy_scalar(const float *a, const float *x, const float *b,
                     float *out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = a[i] * x[i] + b[i];
        }
    }

The video talks about how explicit vectorization using intrinsics can lead to a slowdown compared to auto-vectorization, due to things like loop unrolling, ILP etc. The problem is, it’s just a bad, bad example to talk about SIMD at all, regardless of whether it’s explicit or auto-vectorized.
Take a guess: if we completely disable vectoriza...
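For readers who have not seen explicit vectorization before, here is a rough sketch of what an intrinsics version of the loop above might look like. It is not taken from the article or the talk: the name axpy_avx512 is made up for illustration, and the preprocessor guard is an assumption added so the same file still compiles (as plain scalar code) when AVX-512F is not enabled.

```cpp
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical explicit-SIMD counterpart of axpy_scalar (name is illustrative).
void axpy_avx512(const float *a, const float *x, const float *b,
                 float *out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX512F__)
    // Main loop: one 512-bit register holds 16 floats, so step by 16.
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);  // unaligned loads
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        // Fused multiply-add: va * vx + vb in a single instruction.
        _mm512_storeu_ps(out + i, _mm512_fmadd_ps(va, vx, vb));
    }
#endif
    // Scalar tail for the leftover elements (or the whole array when
    // the build does not target AVX-512F).
    for (; i < n; ++i) {
        out[i] = a[i] * x[i] + b[i];
    }
}
```

With GCC or Clang, the vector path is only compiled in when the target enables AVX-512F (for example via -mavx512f). Note that each iteration issues three loads and one store for a single fused multiply-add, which is worth keeping in mind given the earlier remark that memory is often the bottleneck.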

First seen: 2026-01-19 03:29

Last seen: 2026-01-19 06:30
