vLLM large scale serving: DeepSeek 2.2k tok/s/H200 with wide-EP

https://news.ycombinator.com/rss Hits: 2
Summary

Introduction

In v0.11.0, the last code from the vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM's community of 1,969 contributors, who authored over 950 commits in the past month (as of 12/18/25). These efforts have been validated by vLLM's inclusion in the SemiAnalysis open source InferenceMax performance benchmarks. In addition, vLLM is proud to be trusted in production by teams at Meta, LinkedIn, Red Hat, Mistral, and Hugging Face.

DeepSeek-style disaggregated serving and sparse mixture-of-experts (MoE) model deployments remain state-of-the-art for high-performance LLM inference. This article outlines the key optimizations the vLLM team has built to push throughput even further (an illustrative configuration sketch appears after this summary):

- Async scheduling
- Dual-batch overlap
- Disaggregated serving
- CUDA graph mode FULL_AND_PIECEWISE
- DeepGEMM enabled by default
- DeepEP kernels integration
- Expert parallel load balancing
- SiLU kernel for DeepSeek-R1

For further reference, we recommend the excellent writeups by the llm-d, PyTorch, Dynamo, and Anyscale teams on large-scale serving, disaggregated serving, distributed inference, and wide-EP using vLLM.

Results

Recent community benchmarks on a CoreWeave H200 cluster, connected over InfiniBand with ConnectX-7 NICs, now show a sustained throughput of 2.2k tokens/s per H200 GPU in production-like, multi-node deployments. This marks a significant increase over earlier benchmarks, which showed ~1.5k tokens/s per GPU. The gain is a direct result of ongoing optimization work, including kernel improvements (silu-mul-quant fusion, CUTLASS QKV kernels, TP attention bug fixes) and the implementation of Dual-Batch Overlap (DBO) for decode. This performance allows operators to realize immediate benefits by consolidating workloads and reducing the number of replicas needed for a target QPS, ultimately lowering cost per token.

[Prefill and decode results charts omitted from this summary.]

Key Components

Wide-EP ...
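
Most of the optimizations listed above are exposed through vLLM engine arguments and environment variables. The following is a minimal, hypothetical sketch of a wide-EP style launch using vLLM's Python API: flag names vary across vLLM versions, and the model, parallelism sizes, and backend choices below are illustrative assumptions, not the configuration used in the benchmark.

```python
# Hypothetical wide-EP launch sketch; every value is an assumption, not the
# benchmarked setup. Check the vLLM docs for the flags in your version.
import os

# Route MoE all-to-all traffic through the DeepEP kernels (assumed backend name).
os.environ["VLLM_ALL2ALL_BACKEND"] = "deepep_low_latency"
# DeepGEMM is enabled by default in recent releases; set explicitly for clarity.
os.environ["VLLM_USE_DEEP_GEMM"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # illustrative model choice
    data_parallel_size=8,             # one attention DP rank per GPU (assumption)
    enable_expert_parallel=True,      # spread MoE experts across all ranks
    enable_eplb=True,                 # expert-parallel load balancing
    enable_dbo=True,                  # dual-batch overlap (flag name is an assumption)
    async_scheduling=True,            # overlap CPU scheduling with GPU execution
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```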
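
Disaggregated serving splits prefill and decode across separate vLLM instances that exchange KV cache over the network. A minimal engine-side sketch follows, assuming the NIXL-based connector; the external router that splits prompt and decode traffic between instances is a separate component and is not shown.

```python
# Hypothetical engine-side configuration for prefill/decode disaggregation.
# Connector names and roles are assumptions and differ by vLLM version.
from vllm import LLM
from vllm.config import KVTransferConfig

# Each prefill or decode instance is launched with a KV connector so that
# computed KV cache blocks can be transferred between instances over RDMA.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",   # illustrative model choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="NixlConnector",  # assumed NIXL-based RDMA connector
        kv_role="kv_both",             # instance may both send and receive KV
    ),
)
```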

First seen: 2026-01-14 01:07

Last seen: 2026-01-14 02:08