Three types of LLM workloads and how to serve them

Summary

We hold this truth to be self-evident: not all workloads are created equal. But for large language models, this truth is far from universally acknowledged. Most organizations building LLM applications get their AI from an API, and these APIs hide the varied costs and engineering trade-offs of distinct workloads behind deceptively flat per-token pricing.

The truth, however, will out. The era of model API dominance is ending, thanks to excellent work on open source models by DeepSeek and Alibaba Qwen (eroding the benefits of proprietary model APIs like OpenAI's) and excellent work on open source inference engines like vLLM and SGLang (eroding the benefits of open model APIs powered by proprietary inference engines).

Engineers who want to take advantage of this technological change must understand their workloads in greater detail to properly architect and optimize their systems. In this document, we'll walk through the workloads and requirements we've seen in the market, working with leading organizations deploying inference to production at scale. We'll explain the challenges LLM engineers face when building for these workloads and how they solve them. And we'll share a bit about how you can implement those solutions on our cloud platform.

The breakdown: offline, online, and semi-online

Gallia est omnis divisa in partes tres. ("All Gaul is divided into three parts.") - G. Julius Caesar, De Bello Gallico

In the more mature world of databases, there is a well-known split between transaction processing (OLTP, think "shopping carts") and analytical processing (OLAP, think "Year Wrapped"). In between are hybrid workloads (HTAP) with the characteristics of both. A similar three-part division has helped us organize LLM workloads (the first two are sketched in code below):

- offline or analytical workloads, which operate in batch mode, write to data stores asynchronously, and demand throughput above all else
- online or interactive workloads, which operate in streaming mode, communicate synchronously with humans, and demand low latency
- and semi-online workloads, which combine the characteristics of both
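For the offline case, throughput-oriented engines expose a batch interface: you hand over every prompt at once and let the engine schedule them to keep the GPU saturated. Here is a minimal sketch using vLLM's offline LLM API; the Qwen model name and the toy prompts are illustrative assumptions, not a prescription.

    # Offline / analytical: submit a whole batch and optimize tokens per second.
    from vllm import LLM, SamplingParams

    # Load an open model into the engine once; vLLM batches and schedules internally.
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model choice
    params = SamplingParams(temperature=0.0, max_tokens=256)

    prompts = [f"Summarize document #{i} in one sentence." for i in range(10_000)]

    # generate() blocks until the whole batch finishes. No human is waiting on any
    # single response, so we care about aggregate throughput, not time-to-first-token.
    outputs = llm.generate(prompts, params)

    # Results land in a data store asynchronously, off any user-facing path.
    results = [{"prompt": o.prompt, "completion": o.outputs[0].text} for o in outputs]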
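For the online case, the same engine typically runs as a long-lived server (e.g. vllm serve <model>), and clients stream tokens over its OpenAI-compatible API so a human sees output the moment the first token arrives. A minimal client-side sketch, assuming a vLLM server is already listening on localhost:8000:

    # Online / interactive: stream tokens to a waiting human; latency is king.
    from openai import OpenAI

    # vLLM (and SGLang) expose OpenAI-compatible endpoints; no real key is needed locally.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # must match the model the server loaded
        messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
        stream=True,  # deltas arrive as they are generated
    )

    # Print each delta immediately: time-to-first-token and inter-token latency are
    # what the user perceives, not total throughput.
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)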
