vLLM large scale serving: DeepSeek 2.2k tok/s/H200 with wide-EP

https://news.ycombinator.com/rss Hits: 2
Summary

Introduction

In v0.11.0, the last code from the vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM's community of 1,969 contributors, who authored over 950 commits in the past month (as of 12/18/25). These efforts have been validated by vLLM's inclusion in the SemiAnalysis open source InferenceMax performance benchmarks. In addition, vLLM is proud to be trusted in production by teams at Meta, LinkedIn, Red Hat, Mistral, and Hugging Face.

DeepSeek-style disaggregated serving and sparse mixture-of-experts (MoE) model deployments remain state-of-the-art for high-performance LLM inference. This article outlines the key optimizations the vLLM team has built to push throughput even further (an illustrative configuration sketch appears after this summary):

- Async scheduling
- Dual-batch overlap
- Disaggregated serving
- CUDA graph mode FULL_AND_PIECEWISE
- DeepGEMM enabled by default
- DeepEP kernels integration
- Expert parallel load balancing
- SiLU kernel for DeepSeek-R1

For further reference, we recommend the excellent writeups by the llm-d, PyTorch, Dynamo, and Anyscale teams on large-scale serving, disaggregated serving, distributed inference, and wide-EP using vLLM.

Results

Recent community benchmarks on a CoreWeave H200 cluster, connected over InfiniBand with ConnectX-7 NICs, now show a sustained throughput of 2.2k tokens/s per H200 GPU in production-like, multi-node deployments. This marks a significant increase over earlier benchmarks, which showed ~1.5k tokens/s per GPU. The gain is a direct result of ongoing optimization work, including kernel improvements (silu-mul-quant fusion, CUTLASS QKV kernels, TP attention bug fixes) and the implementation of Dual-Batch Overlap (DBO) for decode. This performance allows operators to realize immediate benefits by consolidating workloads and reducing the number of replicas needed for a target QPS, ultimately lowering cost per token.

[Prefill and decode results charts omitted from this summary.]

Key Components

Wide-EP ...
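
Most of the optimizations listed above are exposed through vLLM engine arguments and environment variables. The following is a minimal, hypothetical sketch of a wide-EP style launch using vLLM's Python API: flag names vary across vLLM versions, and the model, parallelism sizes, and backend choices below are illustrative assumptions, not the configuration used in the benchmark.

```python
# Hypothetical wide-EP launch sketch; every value is an assumption, not the
# benchmarked setup. Check the vLLM docs for the flags in your version.
import os

# Route MoE all-to-all traffic through the DeepEP kernels (assumed backend name).
os.environ["VLLM_ALL2ALL_BACKEND"] = "deepep_low_latency"
# DeepGEMM is enabled by default in recent releases; set explicitly for clarity.
os.environ["VLLM_USE_DEEP_GEMM"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # illustrative model choice
    data_parallel_size=8,             # one attention DP rank per GPU (assumption)
    enable_expert_parallel=True,      # spread MoE experts across all ranks
    enable_eplb=True,                 # expert-parallel load balancing
    enable_dbo=True,                  # dual-batch overlap (flag name is an assumption)
    async_scheduling=True,            # overlap CPU scheduling with GPU execution
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```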
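
Disaggregated serving splits prefill and decode across separate vLLM instances that exchange KV cache over the network. A minimal engine-side sketch follows, assuming the NIXL-based connector; the external router that splits prompt and decode traffic between instances is a separate component and is not shown.

```python
# Hypothetical engine-side configuration for prefill/decode disaggregation.
# Connector names and roles are assumptions and differ by vLLM version.
from vllm import LLM
from vllm.config import KVTransferConfig

# Each prefill or decode instance is launched with a KV connector so that
# computed KV cache blocks can be transferred between instances over RDMA.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",   # illustrative model choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="NixlConnector",  # assumed NIXL-based RDMA connector
        kv_role="kv_both",             # instance may both send and receive KV
    ),
)
```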

First seen: 2026-01-14 01:07

Last seen: 2026-01-14 02:08