Keeping 20k GPUs Healthy

Summary

Modal runs a globally distributed, autoscaling GPU worker pool by sourcing compute from all the cloud giants: AWS, GCP, Azure, OCI. We’ve scaled the worker pool to well over 20,000 concurrent GPUs and launched over four million cloud instances in the last couple of years. At this scale, you see almost every GPU reliability problem there is.

Today, we’re sharing our GPU reliability system as both a demonstration of our commitment to Modal customers and as a guide for fellow travelers renting hyperscaler or neocloud cards. It’s dangerous to go alone! Take this.

This post starts with cloud instance type testing and selection. Perhaps surprisingly, there are significant performance and reliability differences between the cloud hyperscalers. We then discuss machine image preparation and instance boot checks. Next we cover the passive and active GPU healthchecking performed throughout the life of each instance. Finally we discuss observability and support, which become crucial when a GPU reliability issue slips by our automated healthchecking systems.

We’ve chosen not to refer to cloud providers directly, but instead give them anonymized A, B, C, D identifiers. If you want to know who’s who, track the clues or buy us a beer sometime.

Instance type testing and selection

Let’s start with cloud instance type reliability. The hyperscalers are significantly differentiated at the instance type level. To stick specifically to reliability-related differences, we’ve observed that:

Cloud A has the simplest and most reliable instance launch API. If you request a bare-metal (BM) instance or VM and get an HTTP 201 back, 99.6% of the time it will boot, and boot relatively quickly (2-3 minutes).

Cloud A runs H100s which perform 50% worse on StableDiffusion text2img compared with C and D.

Cloud C ran their H100s too hot, sometimes reaching over 90°C, for a few months in 2025. FLOP/s performance degrades starting as low as the mid-70s Celsius.

Cloud C has 228 MiB more reserved H100 memory than the others.
Thus, ...
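
To make the temperature and reserved-memory observations concrete, here is a minimal Python sketch of how such readings might be classified. It is not Modal's healthchecking code. The names `GpuReading`, `classify_temperature`, `excess_reserved_mib`, and `parse_reading` are hypothetical, and the 75°C cutoff is an assumed stand-in for "the mid-70s Celsius". Only the 90°C figure comes from the text above. The input line shape (index, temperature, reserved MiB) is assumed to resemble what `nvidia-smi --query-gpu=index,temperature.gpu,memory.reserved --format=csv,noheader,nounits` prints.

```python
from dataclasses import dataclass

# Assumed thresholds. The post only says FLOP/s degrades "as low as the
# mid-70s Celsius" and that one cloud's cards sometimes exceeded 90°C.
DEGRADED_TEMP_C = 75  # assumption: stand-in for "mid-70s"
HOT_TEMP_C = 90       # from the post


@dataclass
class GpuReading:
    """One sample for a single GPU (hypothetical structure)."""
    index: int
    temperature_c: float
    reserved_mib: int


def parse_reading(csv_line: str) -> GpuReading:
    """Parse a line of the form 'index, temperature, reserved_mib'."""
    index, temp, reserved = (field.strip() for field in csv_line.split(","))
    return GpuReading(int(index), float(temp), int(reserved))


def classify_temperature(temp_c: float) -> str:
    """Return 'hot', 'degraded', or 'ok' for a temperature in Celsius."""
    if temp_c > HOT_TEMP_C:
        return "hot"
    if temp_c >= DEGRADED_TEMP_C:
        return "degraded"
    return "ok"


def excess_reserved_mib(reading: GpuReading, baseline_mib: int) -> int:
    """Reserved memory above a fleet baseline that the caller supplies."""
    return reading.reserved_mib - baseline_mib
```

A caller would parse one line per GPU, flag anything not classified as `ok`, and compare reserved memory against whatever baseline it has measured on its other instance types. The post does not give an absolute baseline, so the sketch leaves it as a parameter.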
