<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>vLLM Blog</title>
    <link>https://vllm.ai/blog</link>
    <description>Technical articles, release announcements, model guides, and community updates from the vLLM project.</description>
    <language>en-us</language>
    <lastBuildDate>Sun, 05 Apr 2026 00:16:27 GMT</lastBuildDate>
    <atom:link href="https://vllm.ai/blog/rss.xml" rel="self" type="application/rss+xml"/>
    <item>
      <title>Announcing Gemma 4 on vLLM: Byte for byte, the most capable open models</title>
      <link>https://vllm.ai/blog/gemma4</link>
      <guid isPermaLink="true">https://vllm.ai/blog/gemma4</guid>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
      <description>With the debut of Gemma 4, vLLM introduces immediate support for Google&apos;s most sophisticated open model lineup, spanning multiple hardware backends, with first-ever Day 0 support on Google TPUs,...</description>
      <category>model-support</category>
      <dc:creator>Google Team</dc:creator>
    </item>
    <item>
      <title>Extracting hidden states from vLLM</title>
      <link>https://vllm.ai/blog/extract-hidden-states</link>
      <guid isPermaLink="true">https://vllm.ai/blog/extract-hidden-states</guid>
      <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
      <description>PR #33736 (included in vllm&gt;=v0.18.0) introduced a new hidden states extraction system to vLLM. This blog post explores the motivation, design, usage, and future direction of this feature, and...</description>
      <category>speculative-decoding</category>
      <dc:creator>Fynn Schmitt-Ulms</dc:creator>
    </item>
    <item>
      <title>Model Runner V2: A Modular and Faster Core for vLLM</title>
      <link>https://vllm.ai/blog/mrv2</link>
      <guid isPermaLink="true">https://vllm.ai/blog/mrv2</guid>
      <pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate>
      <description>We are excited to announce Model Runner V2 (MRV2), a ground-up re-implementation of the vLLM model runner. MRV2 delivers a cleaner, more modular, and more efficient execution core—with no API...</description>
      <category>performance</category>
      <category>engineering</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM</title>
      <link>https://vllm.ai/blog/p-eagle</link>
      <guid isPermaLink="true">https://vllm.ai/blog/p-eagle</guid>
      <pubDate>Fri, 13 Mar 2026 00:00:00 GMT</pubDate>
      <description>EAGLE is the state-of-the-art method for speculative decoding in large language model (LLM) inference, but its autoregressive drafting creates a hidden bottleneck: the more tokens that you...</description>
      <category>performance</category>
      <category>speculative-decoding</category>
      <dc:creator>Amazon and NVIDIA Team</dc:creator>
    </item>
    <item>
      <title>Run Highly Efficient and Accurate Multi-Agent AI with NVIDIA Nemotron 3 Super Using vLLM</title>
      <link>https://vllm.ai/blog/nemotron-3-super</link>
      <guid isPermaLink="true">https://vllm.ai/blog/nemotron-3-super</guid>
      <pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate>
      <description>We are excited to support the newly released NVIDIA Nemotron 3 Super model on vLLM.</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain</title>
      <link>https://vllm.ai/blog/v0.2-vllm-sr-athena-release</link>
      <guid isPermaLink="true">https://vllm.ai/blog/v0.2-vllm-sr-athena-release</guid>
      <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
      <description>Since v0.1 Iris, vLLM Semantic Router has made a large jump. In one release cycle, the project rebuilt its model stack, expanded routing into safety, semantic caching, memory, retrieval, and...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>vLLM Triton Attention Backend Deep Dive</title>
      <link>https://vllm.ai/blog/vllm-triton-backend-deep-dive</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-triton-backend-deep-dive</guid>
      <pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate>
      <description>This article is adapted from a Red Hat hosted vLLM Office Hours session with Burkhard Ringlein from IBM Research, featuring a deep technical walkthrough of the vLLM Triton attention backend....</description>
      <category>performance</category>
      <category>triton</category>
      <category>attention</category>
      <dc:creator>vLLM Team at IBM Research</dc:creator>
    </item>
    <item>
      <title>Beyond Porting: How vLLM Orchestrates High-Performance Inference on AMD ROCm</title>
      <link>https://vllm.ai/blog/rocm-attention-backend</link>
      <guid isPermaLink="true">https://vllm.ai/blog/rocm-attention-backend</guid>
      <pubDate>Fri, 27 Feb 2026 00:00:00 GMT</pubDate>
      <description>For a long time, enabling AMD support meant &quot;porting&quot;, i.e., just making the code run. That era is over.</description>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>AMD and Embedded LLM</dc:creator>
    </item>
    <item>
      <title>Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock</title>
      <link>https://vllm.ai/blog/multi-lora</link>
      <guid isPermaLink="true">https://vllm.ai/blog/multi-lora</guid>
      <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
      <description>Organizations and individuals running multiple custom AI models, especially recent Mixture of Experts (MoE) model families, can face the challenge of paying for idle GPU capacity when the...</description>
      <category>performance</category>
      <dc:creator>Danielle Maddix Robinson, Florian Saupe, George Novack, Haipeng Li, Mani Kumar Adari, Xiang Song, Yu Gong (AWS AI Team)</dc:creator>
    </item>
    <item>
      <title>DeepSeek-V3.2 on GB300: Performance Breakthrough</title>
      <link>https://vllm.ai/blog/gb300-deepseek</link>
      <guid isPermaLink="true">https://vllm.ai/blog/gb300-deepseek</guid>
      <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
      <description>DeepSeek-V3.2 (NVFP4 + TP2) has been run successfully and smoothly on GB300 (SM103 - Blackwell Ultra). Leveraging FP4 quantization, it achieves a single-GPU throughput of 7360 TGS (tokens / GPU /...</description>
      <category>hardware</category>
      <category>quantization</category>
      <category>performance</category>
      <dc:creator>The DaoCloud and vLLM team</dc:creator>
    </item>
    <item>
      <title>Driving vLLM WideEP and Large-Scale Serving Toward Maturity on Blackwell (Part I)</title>
      <link>https://vllm.ai/blog/dsr1-gb200-part1</link>
      <guid isPermaLink="true">https://vllm.ai/blog/dsr1-gb200-part1</guid>
      <pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate>
      <description>Building on our previous work achieving 2.2k tok/s/H200 decode throughput with wide-EP, the vLLM team has continued performance optimization efforts targeting NVIDIA&apos;s GB200 platform. This blog...</description>
      <category>large-scale-serving</category>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>Meta and NVIDIA Team</dc:creator>
    </item>
    <item>
      <title>GPT-OSS Performance Optimizations on NVIDIA Blackwell: Pushing the Pareto Frontier</title>
      <link>https://vllm.ai/blog/gpt-oss-optimizations</link>
      <guid isPermaLink="true">https://vllm.ai/blog/gpt-oss-optimizations</guid>
      <pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate>
      <description>TL;DR: In collaboration with the open-source community, vLLM + NVIDIA has achieved significant performance milestones on the gpt-oss-120b model running on NVIDIA&apos;s Blackwell GPUs. Through deep...</description>
      <category>performance</category>
      <category>hardware</category>
      <dc:creator>The vLLM and NVIDIA team</dc:creator>
    </item>
    <item>
      <title>Streaming Requests &amp; Realtime API in vLLM</title>
      <link>https://vllm.ai/blog/streaming-realtime</link>
      <guid isPermaLink="true">https://vllm.ai/blog/streaming-realtime</guid>
      <pubDate>Sat, 31 Jan 2026 00:00:00 GMT</pubDate>
      <description>Large language model inference has traditionally operated on a simple premise: the user submits a complete prompt (request), the model processes it, and returns a response (either streaming or at...</description>
      <category>multimodal</category>
      <dc:creator>Meta, Mistral AI, and the vLLM Team</dc:creator>
    </item>
    <item>
      <title>Building Mixture-of-Models on AMD GPUs with vLLM-SR</title>
      <link>https://vllm.ai/blog/mom-on-amd-gpu</link>
      <guid isPermaLink="true">https://vllm.ai/blog/mom-on-amd-gpu</guid>
      <pubDate>Fri, 23 Jan 2026 00:00:00 GMT</pubDate>
      <description>We are working on building the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems.</description>
      <category>hardware</category>
      <category>ecosystem</category>
      <dc:creator>The AMD and vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Inside vLLM’s New KV Offloading Connector: Smarter Memory Transfer for Maximizing Inference Throughput</title>
      <link>https://vllm.ai/blog/kv-offloading-connector</link>
      <guid isPermaLink="true">https://vllm.ai/blog/kv-offloading-connector</guid>
      <pubDate>Thu, 08 Jan 2026 00:00:00 GMT</pubDate>
      <description>In this post, we will describe the new KV cache offloading feature that was introduced in vLLM 0.11.0. We will focus on offloading to CPU memory (DRAM) and its benefits to improving overall...</description>
      <category>performance</category>
      <dc:creator>Or Ozeri, Danny Harnik (vLLM Team at IBM Research)</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router v0.1 Iris: The First Major Release</title>
      <link>https://vllm.ai/blog/vllm-sr-iris</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-sr-iris</guid>
      <pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate>
      <description>vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Introducing vLLM Playground: A Modern Web Interface for Managing and Interacting with vLLM Servers</title>
      <link>https://vllm.ai/blog/introducing-vllm-playground</link>
      <guid isPermaLink="true">https://vllm.ai/blog/introducing-vllm-playground</guid>
      <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
      <description>As a passionate vLLM community member who wants to see vLLM thrive and reach even more developers, I&apos;m excited to announce vLLM Playground – a modern, feature-rich web interface for managing and...</description>
      <category>frontend</category>
      <category>ecosystem</category>
      <dc:creator>micytao</dc:creator>
    </item>
    <item>
      <title>Announcing vllm.ai Website and Some Community Updates</title>
      <link>https://vllm.ai/blog/vllm-ai-website</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-ai-website</guid>
      <pubDate>Sat, 27 Dec 2025 00:00:00 GMT</pubDate>
      <description>For a long time, vllm.ai simply redirected to the vLLM GitHub page. Thanks to our community, we now have a brand-new vllm.ai website, drawing inspiration from the PyTorch website.</description>
      <category>community</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>vLLM-Omni Diffusion Cache Acceleration</title>
      <link>https://vllm.ai/blog/vllm-omni-diffusion-cache-acceleration</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-omni-diffusion-cache-acceleration</guid>
      <pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate>
      <description>We are thrilled to announce a major performance update for vLLM-Omni.</description>
      <category>multimodal</category>
      <category>performance</category>
      <category>ecosystem</category>
      <dc:creator>vLLM-Omni Team</dc:creator>
    </item>
    <item>
      <title>vLLM Large Scale Serving: DeepSeek @ 2.2k tok/s/H200 with Wide-EP</title>
      <link>https://vllm.ai/blog/large-scale-serving</link>
      <guid isPermaLink="true">https://vllm.ai/blog/large-scale-serving</guid>
      <pubDate>Wed, 17 Dec 2025 00:00:00 GMT</pubDate>
      <description>In v0.11.0, the last code from vLLM V0 engine was removed, marking the complete migration to the improved V1 engine architecture. This achievement would not have been possible without vLLM’s...</description>
      <category>large-scale-serving</category>
      <category>performance</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>AMD × vLLM Semantic Router: Building the System Intelligence Together</title>
      <link>https://vllm.ai/blog/vllm-sr-amd</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-sr-amd</guid>
      <pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate>
      <description>Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in...</description>
      <category>hardware</category>
      <category>ecosystem</category>
      <dc:creator>The AMD and vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Run Highly Efficient and Accurate AI Agents with NVIDIA Nemotron 3 Nano on vLLM</title>
      <link>https://vllm.ai/blog/run-nvidia-nemotron-3-nano</link>
      <guid isPermaLink="true">https://vllm.ai/blog/run-nvidia-nemotron-3-nano</guid>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <description>Jan 28th Update: NVIDIA just released their Nemotron 3 Nano model in NVFP4 precision. This model is supported by vLLM out of the box and it uses a new method called Quantization-Aware Distillation...</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>Encoder Disaggregation for Scalable Multimodal Model Serving</title>
      <link>https://vllm.ai/blog/vllm-epd</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-epd</guid>
      <pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate>
      <description>Modern Large Multimodal Models (LMMs) introduce a unique serving-time bottleneck: before any text generation can begin, all images must be processed by a visual encoder (e.g., ViT). This encoder...</description>
      <category>multimodal</category>
      <category>large-scale-serving</category>
      <dc:creator>Multimodality Workstream @ vLLM</dc:creator>
    </item>
    <item>
      <title>Token-Level Truth: Real-Time Hallucination Detection for Production LLMs</title>
      <link>https://vllm.ai/blog/halugate</link>
      <guid isPermaLink="true">https://vllm.ai/blog/halugate</guid>
      <pubDate>Sun, 14 Dec 2025 00:00:00 GMT</pubDate>
      <description>Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Diving into speculative decoding training support for vLLM with Speculators v0.3.0</title>
      <link>https://vllm.ai/blog/speculators-v030</link>
      <guid isPermaLink="true">https://vllm.ai/blog/speculators-v030</guid>
      <pubDate>Sat, 13 Dec 2025 00:00:00 GMT</pubDate>
      <description>Speculative decoding serves as an optimization to improve inference performance; however, training a unique draft model for each LLM can be difficult and time-consuming, while production-ready...</description>
      <category>speculative-decoding</category>
      <category>ecosystem</category>
      <dc:creator>Fynn Schmitt-Ulms, Helen Zhao, Rahul Tuli and Dipika Sikka (Red Hat AI Model Optimization Team)</dc:creator>
    </item>
    <item>
      <title>vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving</title>
      <link>https://vllm.ai/blog/vllm-router-release</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-router-release</guid>
      <pubDate>Sat, 13 Dec 2025 00:00:00 GMT</pubDate>
      <description>Efficiently managing request distribution across a fleet of model replicas is a critical requirement for large-scale, production vLLM deployments. Standard load balancers often fall short as they...</description>
      <category>large-scale-serving</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>Advancing Low‑Bit Quantization for LLMs: AutoRound x LLM Compressor</title>
      <link>https://vllm.ai/blog/intel-autoround-llmc</link>
      <guid isPermaLink="true">https://vllm.ai/blog/intel-autoround-llmc</guid>
      <pubDate>Tue, 09 Dec 2025 00:00:00 GMT</pubDate>
      <description>Achieve faster, more efficient LLM serving without sacrificing accuracy!</description>
      <category>quantization</category>
      <category>hardware</category>
      <category>ecosystem</category>
      <dc:creator>Intel Neural Compressor Team, Red Hat AI Model Optimization Team</dc:creator>
    </item>
    <item>
      <title>Tracing Hanging and Complicated GPU Kernels Down To The Source Code</title>
      <link>https://vllm.ai/blog/improved-cuda-debugging</link>
      <guid isPermaLink="true">https://vllm.ai/blog/improved-cuda-debugging</guid>
      <pubDate>Wed, 03 Dec 2025 00:00:00 GMT</pubDate>
      <description>Several months ago, we published a blog post about CUDA Core Dump: An Effective Tool to Debug Memory Access Issues and Beyond, introducing a powerful technique for debugging illegal memory access...</description>
      <category>developer</category>
      <dc:creator>Kaichao You (vLLM)</dc:creator>
    </item>
    <item>
      <title>Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving</title>
      <link>https://vllm.ai/blog/vllm-omni</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-omni</guid>
      <pubDate>Sun, 30 Nov 2025 00:00:00 GMT</pubDate>
      <description>We are excited to announce the official release of vLLM-Omni, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.</description>
      <category>multimodal</category>
      <category>ecosystem</category>
      <dc:creator>vLLM-Omni Team</dc:creator>
    </item>
    <item>
      <title>Streamlined multi-node serving with Ray symmetric-run</title>
      <link>https://vllm.ai/blog/ray-symmetric-run</link>
      <guid isPermaLink="true">https://vllm.ai/blog/ray-symmetric-run</guid>
      <pubDate>Sat, 22 Nov 2025 00:00:00 GMT</pubDate>
      <description>Ray now has a new command: ray symmetric-run. This command makes it possible to launch the same entrypoint command on every node in a Ray cluster, simplifying the workflow to spawn vLLM servers...</description>
      <category>large-scale-serving</category>
      <dc:creator>Richard Liaw (Anyscale/Ray), Kaichao You (vLLM)</dc:creator>
    </item>
    <item>
      <title>Building Clean, Maintainable vLLM Modifications Using the Plugin System</title>
      <link>https://vllm.ai/blog/vllm-plugin-system</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-plugin-system</guid>
      <pubDate>Thu, 20 Nov 2025 00:00:00 GMT</pubDate>
      <description>Source: https://github.com/vllm-project/vllm-ascend</description>
      <category>developer</category>
      <dc:creator>Dhruvil Bhatt (AWS SageMaker)</dc:creator>
    </item>
    <item>
      <title>Docker Model Runner Integrates vLLM for High-Throughput Inferencing</title>
      <link>https://vllm.ai/blog/docker-model-runner-vllm</link>
      <guid isPermaLink="true">https://vllm.ai/blog/docker-model-runner-vllm</guid>
      <pubDate>Wed, 19 Nov 2025 00:00:00 GMT</pubDate>
      <description>Today, we&apos;re excited to announce that Docker Model Runner now integrates the vLLM inference engine and safetensors models, unlocking high-throughput AI inference with the same Docker tooling you...</description>
      <dc:creator>Docker Team</dc:creator>
    </item>
    <item>
      <title>Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale</title>
      <link>https://vllm.ai/blog/signal-decision</link>
      <guid isPermaLink="true">https://vllm.ai/blog/signal-decision</guid>
      <pubDate>Wed, 19 Nov 2025 00:00:00 GMT</pubDate>
      <description>The earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach where user queries are classified into one of 14 MMLU domain categories, and then...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Shared Memory IPC Caching: Accelerating Data Transfer in LLM Inference Systems</title>
      <link>https://vllm.ai/blog/shm-ipc-cache</link>
      <guid isPermaLink="true">https://vllm.ai/blog/shm-ipc-cache</guid>
      <pubDate>Thu, 13 Nov 2025 00:00:00 GMT</pubDate>
      <description>Introducing Shared Memory IPC Caching — a high-performance caching mechanism contributed by Cohere to the vLLM project. By bypassing redundant inter-process communication and keeping large...</description>
      <category>performance</category>
      <category>multimodal</category>
      <dc:creator>Donglu Wang (Cohere)</dc:creator>
    </item>
    <item>
      <title>Fast and Affordable LLM Serving on Intel Arc Pro B-Series GPUs with vLLM</title>
      <link>https://vllm.ai/blog/intel-arc-pro-b</link>
      <guid isPermaLink="true">https://vllm.ai/blog/intel-arc-pro-b</guid>
      <pubDate>Tue, 11 Nov 2025 00:00:00 GMT</pubDate>
      <description>Intel® Arc™ Pro B-Series GPUs deliver powerful AI capabilities with a focus on accessibility and exceptional price-to-performance ratios. Their large memory capacity and scalability...</description>
      <category>hardware</category>
      <dc:creator>Intel vLLM Team</dc:creator>
    </item>
    <item>
      <title>No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan</title>
      <link>https://vllm.ai/blog/bitwise-consistent-train-inference</link>
      <guid isPermaLink="true">https://vllm.ai/blog/bitwise-consistent-train-inference</guid>
      <pubDate>Mon, 10 Nov 2025 00:00:00 GMT</pubDate>
      <description>We demonstrate an open-source bitwise consistent on-policy RL run with TorchTitan as the training engine and vLLM as the inference engine. Built on top of vLLM&apos;s recent work on batch-invariant...</description>
      <category>performance</category>
      <dc:creator>vLLM and TorchTitan Teams</dc:creator>
    </item>
    <item>
      <title>Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM</title>
      <link>https://vllm.ai/blog/run-multimodal-reasoning-agents-nvidia-nemotron</link>
      <guid isPermaLink="true">https://vllm.ai/blog/run-multimodal-reasoning-agents-nvidia-nemotron</guid>
      <pubDate>Fri, 31 Oct 2025 00:00:00 GMT</pubDate>
      <description>We are excited to release NVIDIA Nemotron Nano 2 VL, supported by vLLM. This open vision language model (VLM) is built for video understanding and document intelligence.</description>
      <category>model-support</category>
      <category>multimodal</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2&apos;s Tool-Calling on vLLM</title>
      <link>https://vllm.ai/blog/Kimi-K2-Accuracy</link>
      <guid isPermaLink="true">https://vllm.ai/blog/Kimi-K2-Accuracy</guid>
      <pubDate>Tue, 28 Oct 2025 00:00:00 GMT</pubDate>
      <description>TL;DR: For best compatibility with vLLM, use Kimi K2 models whose chat templates were updated after commit 94a4053eb8863059dd8afc00937f054e1365abbd (Kimi-K2-0905) or commit...</description>
      <category>model-support</category>
      <category>developer</category>
      <dc:creator>Linian Wang (Peking University)</dc:creator>
    </item>
    <item>
      <title>From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA</title>
      <link>https://vllm.ai/blog/semantic-router-modular</link>
      <guid isPermaLink="true">https://vllm.ai/blog/semantic-router-modular</guid>
      <pubDate>Mon, 27 Oct 2025 00:00:00 GMT</pubDate>
      <description>Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number...</description>
      <category>ecosystem</category>
      <dc:creator>Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)</dc:creator>
    </item>
    <item>
      <title>Zero-Reload Model Switching with vLLM Sleep Mode</title>
      <link>https://vllm.ai/blog/sleep-mode</link>
      <guid isPermaLink="true">https://vllm.ai/blog/sleep-mode</guid>
      <pubDate>Sun, 26 Oct 2025 00:00:00 GMT</pubDate>
      <description>The multi-model serving problem: You have two LLMs that each fit on your GPU, but not both at once. Traditional solutions force a bad tradeoff:</description>
      <category>performance</category>
      <dc:creator>Embedded LLM</dc:creator>
    </item>
    <item>
      <title>Now Serving NVIDIA Nemotron with vLLM</title>
      <link>https://vllm.ai/blog/now_serving_nvidia_nemotron_with_vllm</link>
      <guid isPermaLink="true">https://vllm.ai/blog/now_serving_nvidia_nemotron_with_vllm</guid>
      <pubDate>Thu, 23 Oct 2025 00:00:00 GMT</pubDate>
      <description>Agentic AI systems, capable of reasoning, planning, and taking autonomous actions, are powering the next leap in developer applications. To build these systems, developers need tools that are...</description>
      <category>model-support</category>
      <dc:creator>NVIDIA Nemotron Team</dc:creator>
    </item>
    <item>
      <title>No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL</title>
      <link>https://vllm.ai/blog/agent-lightning</link>
      <guid isPermaLink="true">https://vllm.ai/blog/agent-lightning</guid>
      <pubDate>Wed, 22 Oct 2025 00:00:00 GMT</pubDate>
      <description>TL;DR: Agents often call LLMs via OpenAI‑compatible endpoints, which previously returned only string-based inputs and outputs. In agent RL, this can lead to inconsistencies between training and...</description>
      <dc:creator>The Agent Lightning (AGL) Team</dc:creator>
    </item>
    <item>
      <title>vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU</title>
      <link>https://vllm.ai/blog/vllm-tpu</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-tpu</guid>
      <pubDate>Thu, 16 Oct 2025 00:00:00 GMT</pubDate>
      <description>vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin unifying JAX and PyTorch under a single lowering path. It is not only faster than the previous generation...</description>
      <category>hardware</category>
      <dc:creator>Google Team</dc:creator>
    </item>
    <item>
      <title>SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference</title>
      <link>https://vllm.ai/blog/blackwell-inferencemax</link>
      <guid isPermaLink="true">https://vllm.ai/blog/blackwell-inferencemax</guid>
      <pubDate>Thu, 09 Oct 2025 00:00:00 GMT</pubDate>
      <description>Over the past several months, we’ve been collaborating closely with NVIDIA to unlock the full potential of their latest NVIDIA Blackwell GPU architecture (B200/GB200) for large language model...</description>
      <category>hardware</category>
      <category>performance</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action</title>
      <link>https://vllm.ai/blog/deepseek-v3-2</link>
      <guid isPermaLink="true">https://vllm.ai/blog/deepseek-v3-2</guid>
      <pubDate>Mon, 29 Sep 2025 00:00:00 GMT</pubDate>
      <description>We are excited to announce Day 0 support for DeepSeek-V3.2-Exp, featuring DeepSeek Sparse Attention (DSA) (paper) designed for long context tasks. In this post, we showcase how to use this model...</description>
      <category>model-support</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>The First vLLM Meetup in Korea</title>
      <link>https://vllm.ai/blog/vllm-meetup</link>
      <guid isPermaLink="true">https://vllm.ai/blog/vllm-meetup</guid>
      <pubDate>Tue, 16 Sep 2025 00:00:00 GMT</pubDate>
      <description>The first vLLM meetup in Korea was held on August 19, 2025, in Seoul, hosted by Rebellions and Red Hat with support from PyTorch Korea User Group and SqueezeBits.</description>
      <category>community</category>
      <dc:creator>vLLM Team</dc:creator>
    </item>
    <item>
      <title>vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency</title>
      <link>https://vllm.ai/blog/qwen3-next</link>
      <guid isPermaLink="true">https://vllm.ai/blog/qwen3-next</guid>
      <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
      <description>We’re excited to announce that vLLM now supports Qwen3-Next, the latest generation of foundation models from the Qwen team. Qwen3-Next introduces a hybrid architecture with extreme efficiency for...</description>
      <category>model-support</category>
      <dc:creator>The vLLM Team</dc:creator>
    </item>
    <item>
      <title>vLLM Semantic Router: Next Phase in LLM inference</title>
      <link>https://vllm.ai/blog/semantic-router</link>
      <guid isPermaLink="true">https://vllm.ai/blog/semantic-router</guid>
      <pubDate>Thu, 11 Sep 2025 00:00:00 GMT</pubDate>
      <description>Over the past year, hybrid reasoning and automatic routing have increasingly defined progress in large-model infrastructure—shifting the debate from raw scale to per-token efficiency, latency...</description>
      <category>ecosystem</category>
      <dc:creator>vLLM Semantic Router Team</dc:creator>
    </item>
    <item>
      <title>Inside vLLM: Anatomy of a High-Throughput LLM Inference System</title>
      <link>https://vllm.ai/blog/anatomy-of-vllm</link>
      <guid isPermaLink="true">https://vllm.ai/blog/anatomy-of-vllm</guid>
      <pubDate>Fri, 05 Sep 2025 00:00:00 GMT</pubDate>
      <description>In this post, I&apos;ll gradually introduce all of the core system components and advanced features that make up a modern high-throughput LLM inference system. In particular I&apos;ll be doing a breakdown...</description>
      <category>large-scale-serving</category>
      <category>speculative-decoding</category>
      <dc:creator>Aleksa Gordic</dc:creator>
    </item>
    <item>
      <title>Serving Geospatial, Vision, and Beyond: Enabling Multimodal Output Processing in vLLM</title>
      <link>https://vllm.ai/blog/beyond-text-generation</link>
      <guid isPermaLink="true">https://vllm.ai/blog/beyond-text-generation</guid>
      <pubDate>Fri, 05 Sep 2025 00:00:00 GMT</pubDate>
      <description>Until recently, generative AI infrastructure has been tightly coupled with autoregressive text generation models that produce output token-by-token, typically in the form of natural language. vLLM...</description>
      <category>multimodal</category>
      <dc:creator>Christian Pinto (IBM Research Europe - Dublin), Michele Gazzetti (IBM Research Europe - Dublin), Michael Johnston (IBM Research Europe - Dublin), Maximilien Philippe Marie de Bayser (IBM Research - Brazil)</dc:creator>
    </item>
  </channel>
</rss>