<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>vLLM Blog</title>
    <link>https://vllm.ai/blog</link>
    <description>Technical articles, release announcements, model guides, and community updates from the vLLM project.</description>
    <language>en-us</language>
    <lastBuildDate>Sun, 05 Apr 2026 00:16:27 GMT</lastBuildDate>
    <atom:link href="https://vllm.ai/blog/rss.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models</title>
      <link>https://vllm.ai/blog/gemma4</link>
      <guid isPermaLink="true">https://vllm.ai/blog/gemma4</guid>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
      <description>With the debut of Gemma 4, vLLM introduces immediate support for Google&apos;s most sophisticated open model lineup, spanning multiple hardware backends, with first-ever Day 0 support on Google TPUs,...</description>
      <category>model-support</category>
      <dc:creator>Google Team</dc:creator>
    </item>
    <item>
      <title>Extracting hidden states from vLLM</title>
      <link>https://vllm.ai/blog/extract-hidden-states</link>
      <guid isPermaLink="true">https://vllm.ai/blog/extract-hidden-states</guid>
      <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
      <description>PR #33736 (included in vllm&gt;=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and...</description>
      <category>speculative-decoding</category>
      <dc:creator>Fynn Schmitt-Ulms</dc:creator>
    </item>
    <item>
      <title>Model Runner V2: A Modular and Faster Core for vLLM</title>
      <link>https://vllm.ai/blog/mrv2</link>
      <guid isPermaLink="true">https://vllm.ai/blog/mrv2</guid>
      <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
      <description>We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...</description>
      <category>performance</category>
      <category>engineering</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM</title>
      <link>https://vllm.ai/blog/p-eagle</link>
      <guid isPermaLink="true">https://vllm.ai/blog/p-eagle</guid>
      <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
      <description>EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...</description>
      <category>performance</category>
      <category>speculative-decoding</category>
      <dc:creator>Amazon and NVIDIA Team</dc:creator>
    </item>
    <item>
      <title>Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM</title>
      <link>https://vllm.ai/blog/nemotron-3-super</link>
      <guid isPermaLink="true">https://vllm.ai/blog/nemotron-3-super</guid>
      <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
      <description>We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM.</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain</title>
      <link>https://vllm.ai/blog/v0.2-vllm-sr-athena-release</link>
      <guid isPermaLink="true">https://vllm.ai/blog/v0.2-vllm-sr-athena-release</guid>
      <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
      <description>Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>vLLM Triton Attention Backend Deep Dive</title>
      <link>https://vllm.ai/blog/vllm-triton-backend-deep-dive</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-triton-backend-deep-dive</guid>
      <pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate>
      <description>This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....</description>
      <category>performance</category>
      <category>triton</category>
      <category>attention</category>
      <dc:creator>vLLM Team at IBM Research</dc:creator>
    </item>
    <item>
      <title>Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm</title>
      <link>https://vllm.ai/blog/rocm-attention-backend</link>
      <guid isPermaLink="true">https://vllm.ai/blog/rocm-attention-backend</guid>
      <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
      <description>For a long time, enabling AMD support meant &quot;porting&quot;, i.e., just making the code run. That era is over.</description>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>AMD and Embedded LLM</dc:creator>
    </item>
    <item>
      <title>Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock</title>
      <link>https://vllm.ai/blog/multi-lora</link>
      <guid isPermaLink="true">https://vllm.ai/blog/multi-lora</guid>
      <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
      <description>Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...</description>
      <category>performance</category>
      <dc:creator>Danielle Maddix Robinson, Florian Saupe, George Novack, Haipeng Li, Mani Kumar Adari, Xiang Song, Yu Gong (AWS AI Team)</dc:creator>
    </item>
    <item>
      <title>DeepSeek-V3.2 on GB300: Performance Breakthrough</title>
      <link>https://vllm.ai/blog/gb300-deepseek</link>
      <guid isPermaLink="true">https://vllm.ai/blog/gb300-deepseek</guid>
      <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
      <description>DeepSeek-V3.2 (NVFP4 + TP2) has been run successfully and smoothly on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...</description>
      <category>hardware</category>
      <category>quantization</category>
      <category>performance</category>
      <dc:creator>The DaoCloud and vLLM team</dc:creator>
    </item>
    <item>
      <title>Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)</title>
      <link>https://vllm.ai/blog/dsr1-gb200-part1</link>
      <guid isPermaLink="true">https://vllm.ai/blog/dsr1-gb200-part1</guid>
      <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
      <description>Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA&apos;s GB200 platform. This blog...</description>
      <category>large-scale-serving</category>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>Meta and NVIDIA Team</dc:creator>
    </item>
    <item>
      <title>GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier</title>
      <link>https://vllm.ai/blog/gpt-oss-optimizations</link>
      <guid isPermaLink="true">https://vllm.ai/blog/gpt-oss-optimizations</guid>
      <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
      <description>TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA&apos;s Blackwell GPUs. Through deep...</description>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>The vLLM and NVIDIA team</dc:creator>
    </item>
    <item>
      <title>Streaming Requests &amp; Realtime API in vLLM</title>
      <link>https://vllm.ai/blog/streaming-realtime</link>
      <guid isPermaLink="true">https://vllm.ai/blog/streaming-realtime</guid>
      <pubDate>Sat, 31 Jan 2026 00:00:00 GMT</pubDate>
      <description>Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...</description>
      <category>multimodal</category>
      <dc:creator>Meta, Mistral AI, and the vLLM Team</dc:creator>
    </item>
    <item>
      <title>Building Mixture-of-Models on AMD GPUs with vLLM-SR</title>
      <link>https://vllm.ai/blog/mom-on-amd-gpu</link>
      <guid isPermaLink="true">https://vllm.ai/blog/mom-on-amd-gpu</guid>
      <pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate>
      <description>We are working on building the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems.</description>
      <category>hardware</category>
      <category>ecosystem</category>
      <dc:creator>The AMD and vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput</title>
      <link>https://vllm.ai/blog/kv-offloading-connector</link>
      <guid isPermaLink="true">https://vllm.ai/blog/kv-offloading-connector</guid>
      <pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate>
      <description>In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...</description>
      <category>performance</category>
      <dc:creator>Or Ozeri, Danny Harnik (vLLM Team at IBM Research)</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router v0.1 Iris: The First Major Release</title>
      <link>https://vllm.ai/blog/vllm-sr-iris</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-sr-iris</guid>
      <pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate>
      <description>vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers</title>
      <link>https://vllm.ai/blog/introducing-vllm-playground</link>
      <guid isPermaLink="true">https://vllm.ai/blog/introducing-vllm-playground</guid>
      <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
      <description>As a passionate vLLM community member who wants to see vLLM thrive and reach even more developers, I&apos;m excited to announce vLLM Playground – a modern, feature-rich web interface for managing and...</description>
      <category>frontend</category>
      <category>ecosystem</category>
      <dc:creator>micytao</dc:creator>
    </item>
    <item>
      <title>Announcing vllm.ai Website and Some Community Updates</title>
      <link>https://vllm.ai/blog/vllm-ai-website</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-ai-website</guid>
      <pubDate>Sat, 27 Dec 2025 00:00:00 GMT</pubDate>
      <description>For a long time, vllm.ai simply redirected to the vLLM GitHub page. Thanks to our community, we now have a brand-new vllm.ai website, drawing inspiration from the PyTorch website.</description>
      <category>community</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>vLLM-Omni Diffusion Cache Acceleration</title>
      <link>https://vllm.ai/blog/vllm-omni-diffusion-cache-acceleration</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-omni-diffusion-cache-acceleration</guid>
      <pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate>
      <description>We are thrilled to announce a major performance update for vLLM-Omni.</description>
      <category>multimodal</category>
      <category>performance</category>
      <category>ecosystem</category>
      <dc:creator>vLLM-Omni Team</dc:creator>
    </item>
    <item>
      <title>vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP</title>
      <link>https://vllm.ai/blog/large-scale-serving</link>
      <guid isPermaLink="true">https://vllm.ai/blog/large-scale-serving</guid>
      <pubDate>Wed, 17 Dec 2025 00:00:00 GMT</pubDate>
      <description>In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...</description>
      <category>large-scale-serving</category>
      <category>performance</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>AMD × vLLM Semantic Router: Building the System Intelligence Together</title>
      <link>https://vllm.ai/blog/vllm-sr-amd</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-sr-amd</guid>
      <pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate>
      <description>Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...</description>
      <category>hardware</category>
      <category>ecosystem</category>
      <dc:creator>The AMD and vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM</title>
      <link>https://vllm.ai/blog/run-nvidia-nemotron-3-nano</link>
      <guid isPermaLink="true">https://vllm.ai/blog/run-nvidia-nemotron-3-nano</guid>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <description>Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>Encoder Disaggregation for Scalable Multimodal Model Serving</title>
      <link>https://vllm.ai/blog/vllm-epd</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-epd</guid>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <description>Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...</description>
      <category>multimodal</category>
      <category>large-scale-serving</category>
      <dc:creator>Multimodality Workstream @ vLLM</dc:creator>
    </item>
    <item>
      <title>Token-Level Truth: Real-Time Hallucination Detection for Production LLMs</title>
      <link>https://vllm.ai/blog/halugate</link>
      <guid isPermaLink="true">https://vllm.ai/blog/halugate</guid>
      <pubDate>Sun, 14 Dec 2025 00:00:00 GMT</pubDate>
      <description>Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Diving into speculative decoding training support for vLLM with Speculators v0.3.0</title>
      <link>https://vllm.ai/blog/speculators-v030</link>
      <guid isPermaLink="true">https://vllm.ai/blog/speculators-v030</guid>
      <pubDate>Sat, 13 Dec 2025 00:00:00 GMT</pubDate>
      <description>Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...</description>
      <category>speculative-decoding</category>
      <category>ecosystem</category>
      <dc:creator>Fynn Schmitt-Ulms, Helen Zhao, Rahul Tuli and Dipika Sikka (Red Hat AI Model Optimization Team)</dc:creator>
    </item>
    <item>
      <title>vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving</title>
      <link>https://vllm.ai/blog/vllm-router-release</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-router-release</guid>
      <pubDate>Sat, 13 Dec 2025 00:00:00 GMT</pubDate>
      <description>Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...</description>
      <category>large-scale-serving</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor</title>
      <link>https://vllm.ai/blog/intel-autoround-llmc</link>
      <guid isPermaLink="true">https://vllm.ai/blog/intel-autoround-llmc</guid>
      <pubDate>Tue, 09 Dec 2025 00:00:00 GMT</pubDate>
      <description>Achieve faster, more efficient LLM serving without sacrificing accuracy!</description>
      <category>quantization</category>
      <category>hardware</category>
      <category>ecosystem</category>
      <dc:creator>Intel Neural Compressor Team, Red Hat AI Model Optimization Team</dc:creator>
    </item>
    <item>
      <title>Tracing Hanging and Complicated GPU Kernels Down To The Source Code</title>
      <link>https://vllm.ai/blog/improved-cuda-debugging</link>
      <guid isPermaLink="true">https://vllm.ai/blog/improved-cuda-debugging</guid>
      <pubDate>Wed, 03 Dec 2025 00:00:00 GMT</pubDate>
      <description>Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...</description>
      <category>developer</category>
      <dc:creator>Kaichao You (vLLM)</dc:creator>
    </item>
    <item>
      <title>Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving</title>
      <link>https://vllm.ai/blog/vllm-omni</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-omni</guid>
      <pubDate>Sun, 30 Nov 2025 00:00:00 GMT</pubDate>
      <description>We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.</description>
      <category>multimodal</category>
      <category>ecosystem</category>
      <dc:creator>vLLM-Omni Team</dc:creator>
    </item>
    <item>
      <title>Streamlined multi-node serving with Ray symmetric-run</title>
      <link>https://vllm.ai/blog/ray-symmetric-run</link>
      <guid isPermaLink="true">https://vllm.ai/blog/ray-symmetric-run</guid>
      <pubDate>Sat, 22 Nov 2025 00:00:00 GMT</pubDate>
      <description>Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...</description>
      <category>large-scale-serving</category>
      <dc:creator>Richard Liaw (Anyscale/Ray), Kaichao You (vLLM)</dc:creator>
    </item>
    <item>
      <title>Building Clean, Maintainable vLLM Modifications Using the Plugin System</title>
      <link>https://vllm.ai/blog/vllm-plugin-system</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-plugin-system</guid>
      <pubDate>Thu, 20 Nov 2025 00:00:00 GMT</pubDate>
      <description>Source: https://github.com/vllm-project/vllm-ascend</description>
      <category>developer</category>
      <dc:creator>Dhruvil Bhatt (AWS SageMaker)</dc:creator>
    </item>
    <item>
      <title>Docker Model Runner Integrates vLLM for High-Throughput Inferencing</title>
      <link>https://vllm.ai/blog/docker-model-runner-vllm</link>
      <guid isPermaLink="true">https://vllm.ai/blog/docker-model-runner-vllm</guid>
      <pubDate>Wed, 19 Nov 2025 00:00:00 GMT</pubDate>
      <description>Today, we&apos;re excited to announce that Docker Model Runner now integrates the vLLM inference engine and safetensors models, unlocking high-throughput AI inference with the same Docker tooling you...</description>
      <dc:creator>Docker Team</dc:creator>
    </item>
    <item>
      <title>Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale</title>
      <link>https://vllm.ai/blog/signal-decision</link>
      <guid isPermaLink="true">https://vllm.ai/blog/signal-decision</guid>
      <pubDate>Wed, 19 Nov 2025 00:00:00 GMT</pubDate>
      <description>The earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach where user queries are classified into one of 14 MMLU domain categories, and then...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Shared Memory IPC Caching: Accelerating Data Transfer in LLM Inference Systems</title>
      <link>https://vllm.ai/blog/shm-ipc-cache</link>
      <guid isPermaLink="true">https://vllm.ai/blog/shm-ipc-cache</guid>
      <pubDate>Thu, 13 Nov 2025 00:00:00 GMT</pubDate>
      <description>Introducing Shared Memory IPC Caching — a high-performance caching mechanism contributed by Cohere to the vLLM project. By bypassing redundant inter-process communication and keeping large...</description>
      <category>performance</category>
      <category>multimodal</category>
      <dc:creator>Donglu Wang (Cohere)</dc:creator>
    </item>
    <item>
      <title>Fast and Affordable LLM Serving on Intel Arc Pro B-Series GPUs with vLLM</title>
      <link>https://vllm.ai/blog/intel-arc-pro-b</link>
      <guid isPermaLink="true">https://vllm.ai/blog/intel-arc-pro-b</guid>
      <pubDate>Tue, 11 Nov 2025 00:00:00 GMT</pubDate>
      <description>Intel® Arc™ Pro B-Series GPUs deliver powerful AI capabilities with a focus on accessibility and exceptional price-to-performance ratios. Their large memory capacity and scalability...</description>
      <category>hardware</category>
      <dc:creator>Intel vLLM Team</dc:creator>
    </item>
    <item>
      <title>No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan</title>
      <link>https://vllm.ai/blog/bitwise-consistent-train-inference</link>
      <guid isPermaLink="true">https://vllm.ai/blog/bitwise-consistent-train-inference</guid>
      <pubDate>Mon, 10 Nov 2025 00:00:00 GMT</pubDate>
      <description>We demonstrate an open-source bitwise consistent on-policy RL run with TorchTitan as the training engine and vLLM as the inference engine. Built on top of vLLM&apos;s recent work on batch-invariant...</description>
      <category>performance</category>
      <dc:creator>vLLM and TorchTitan Teams</dc:creator>
    </item>
    <item>
      <title>Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM</title>
      <link>https://vllm.ai/blog/run-multimodal-reasoning-agents-nvidia-nemotron</link>
      <guid isPermaLink="true">https://vllm.ai/blog/run-multimodal-reasoning-agents-nvidia-nemotron</guid>
      <pubDate>Fri, 31 Oct 2025 00:00:00 GMT</pubDate>
      <description>We are excited to release NVIDIA Nemotron Nano 2 VL, supported by vLLM. This open vision language model (VLM) is built for video understanding and document intelligence.</description>
      <category>model-support</category>
      <category>multimodal</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2&apos;s Tool-Calling on vLLM</title>
      <link>https://vllm.ai/blog/Kimi-K2-Accuracy</link>
      <guid isPermaLink="true">https://vllm.ai/blog/Kimi-K2-Accuracy</guid>
      <pubDate>Tue, 28 Oct 2025 00:00:00 GMT</pubDate>
      <description>TL;DR: For best compatibility with vLLM, use Kimi K2 models whose chat templates were updated after commit 94a4053eb8863059dd8afc00937f054e1365abbd (Kimi-K2-0905) or commit...</description>
      <category>model-support</category>
      <category>developer</category>
      <dc:creator>Linian Wang (Peking University)</dc:creator>
    </item>
    <item>
      <title>From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA</title>
      <link>https://vllm.ai/blog/semantic-router-modular</link>
      <guid isPermaLink="true">https://vllm.ai/blog/semantic-router-modular</guid>
      <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
      <description>Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number...</description>
      <category>ecosystem</category>
      <dc:creator>Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)</dc:creator>
    </item>
    <item>
      <title>Zero-Reload Model Switching with vLLM Sleep Mode</title>
      <link>https://vllm.ai/blog/sleep-mode</link>
      <guid isPermaLink="true">https://vllm.ai/blog/sleep-mode</guid>
      <pubDate>Sun, 26 Oct 2025 00:00:00 GMT</pubDate>
      <description>The multi-model serving problem: You have two LLMs that each fit on your GPU, but not both at once. Traditional solutions force a bad tradeoff:</description>
      <category>performance</category>
      <dc:creator>Embedded LLM</dc:creator>
    </item>
    <item>
      <title>Now Serving NVIDIA Nemotron with vLLM</title>
      <link>https://vllm.ai/blog/now_serving_nvidia_nemotron_with_vllm</link>
      <guid isPermaLink="true">https://vllm.ai/blog/now_serving_nvidia_nemotron_with_vllm</guid>
      <pubDate>Thu, 23 Oct 2025 00:00:00 GMT</pubDate>
      <description>Agentic AI systems, capable of reasoning, planning, and taking autonomous actions, are powering the next leap in developer applications. To build these systems, developers need tools that are...</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL</title>
      <link>https://vllm.ai/blog/agent-lightning</link>
      <guid isPermaLink="true">https://vllm.ai/blog/agent-lightning</guid>
      <pubDate>Wed, 22 Oct 2025 00:00:00 GMT</pubDate>
      <description>TL;DR: Agents often call LLMs via OpenAI‑compatible endpoints, which previously returned only string-based inputs and outputs. In agent RL, this can lead to inconsistencies between training and...</description>
      <dc:creator>The Agent Lightning (AGL) Team</dc:creator>
    </item>
    <item>
      <title>vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU</title>
      <link>https://vllm.ai/blog/vllm-tpu</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-tpu</guid>
      <pubDate>Thu, 16 Oct 2025 00:00:00 GMT</pubDate>
      <description>vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin unifying JAX and PyTorch under a single lowering path. It is not only faster than the previous generation...</description>
      <category>hardware</category>
      <dc:creator>Google Team</dc:creator>
    </item>
    <item>
      <title>SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference</title>
      <link>https://vllm.ai/blog/blackwell-inferencemax</link>
      <guid isPermaLink="true">https://vllm.ai/blog/blackwell-inferencemax</guid>
      <pubDate>Thu, 09 Oct 2025 00:00:00 GMT</pubDate>
      <description>Over the past several months, we’ve been collaborating closely with NVIDIA to unlock the full potential of their latest NVIDIA Blackwell GPU architecture (B200/GB200) for large language model...</description>
      <category>hardware</category>
      <category>performance</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action</title>
      <link>https://vllm.ai/blog/deepseek-v3-2</link>
      <guid isPermaLink="true">https://vllm.ai/blog/deepseek-v3-2</guid>
      <pubDate>Mon, 29 Sep 2025 00:00:00 GMT</pubDate>
      <description>We are excited to announce Day 0 support for DeepSeek-V3.2-Exp, featuring DeepSeek Sparse Attention (DSA) (paper) designed for long context tasks. In this post, we showcase how to use this model...</description>
      <category>model-support</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>The First vLLM Meetup in Korea</title>
      <link>https://vllm.ai/blog/vllm-meetup</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-meetup</guid>
      <pubDate>Tue, 16 Sep 2025 00:00:00 GMT</pubDate>
      <description>The first vLLM meetup in Korea was held on August 19, 2025, in Seoul, hosted by Rebellions and Red Hat with support from PyTorch Korea User Group and SqueezeBits.</description>
      <category>community</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency</title>
      <link>https://vllm.ai/blog/qwen3-next</link>
      <guid isPermaLink="true">https://vllm.ai/blog/qwen3-next</guid>
      <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
      <description>We’re excited to announce that vLLM now supports Qwen3-Next, the latest generation of foundation models from the Qwen team. Qwen3-Next introduces a hybrid architecture with extreme efficiency for...</description>
      <category>model-support</category>
      <dc:creator>The vLLM Team</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router: Next Phase in LLM inference</title>
      <link>https://vllm.ai/blog/semantic-router</link>
      <guid isPermaLink="true">https://vllm.ai/blog/semantic-router</guid>
      <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
      <description>Over the past year, hybrid reasoning and automatic routing have increasingly defined progress in large-model infrastructure—shifting the debate from raw scale to per-token efficiency, latency...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Inside vLLM: Anatomy of a High-Throughput LLM Inference System</title>
      <link>https://vllm.ai/blog/anatomy-of-vllm</link>
      <guid isPermaLink="true">https://vllm.ai/blog/anatomy-of-vllm</guid>
      <pubDate>Fri, 05 Sep 2025 00:00:00 GMT</pubDate>
      <description>In this post, I&apos;ll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular I&apos;ll be doing a breakdown...</description>
      <category>large-scale-serving</category>
      <category>speculative-decoding</category>
      <dc:creator>Aleksa Gordic</dc:creator>
    </item>
    <item>
      <title>Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM</title>
      <link>https://vllm.ai/blog/beyond-text-generation</link>
      <guid isPermaLink="true">https://vllm.ai/blog/beyond-text-generation</guid>
      <pubDate>Fri, 05 Sep 2025 00:00:00 GMT</pubDate>
      <description>Until recently, generative AI infrastructure has been tightly coupled with autoregressive text generation models that produce output token-by-token, typically in the form of natural language. vLLM...</description>
      <category>multimodal</category>
      <dc:creator>Christian Pinto (IBM Research Europe - Dublin), Michele Gazzetti (IBM Research Europe - Dublin), Michael Johnston (IBM Research Europe - Dublin), Maximilien Philippe Marie de Bayser (IBM Research - Brazil)</dc:creator>
    </item>
  </channel>
</rss>