
How we optimized vLLM for DeepSeek-R1

March 19, 2025
Michael Goin, Robert Shaw, Nick Hill, Tyler Smith, Lucas Wilkinson
Related topics:
Artificial intelligence, Open source
Related products:
Red Hat AI

    DeepSeek and vLLM optimizations have been a top priority for our team and the vLLM community as a whole, and we are excited to share a deep dive into our work. In this article, we will cover the key inference improvements we have made, detail the integration of DeepSeek’s latest advancements into vLLM, and discuss how we are scaling DeepSeek-R1 for real-world deployment. Additionally, we will review the various open source contributions from DeepSeek and outline our roadmap for integrating them into vLLM.

    Introduction to vLLM

    vLLM is an open source inference server designed for efficient model serving, providing a streamlined, high-performance infrastructure for large language models (LLMs). It is licensed under Apache 2.0 and can be installed via pip or deployed as a Docker image, making it easily accessible. It supports a variety of hardware platforms, including most accelerators and CPUs, ensuring wide compatibility across different infrastructures. Check out the extensive list here.
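
    For readers new to vLLM, here is a minimal sketch of its offline inference API. It assumes vLLM has been installed with pip install vllm; the model name below is only an illustrative placeholder, and any Hugging Face model ID that fits your hardware works the same way.

        # Minimal offline inference sketch with vLLM (illustrative model name).
        from vllm import LLM, SamplingParams

        llm = LLM(model="facebook/opt-125m")             # load weights and start the engine
        params = SamplingParams(temperature=0.8, max_tokens=64)

        outputs = llm.generate(["What is vLLM?"], params)
        for out in outputs:
            print(out.outputs[0].text)                   # generated completion text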

    Neural Magic, now part of Red Hat, is a top commercial contributor to vLLM, working extensively on model and systems optimizations to improve vLLM performance at scale. The framework supports multimodal models, embeddings, and reward modeling, and is increasingly used in reinforcement learning with human feedback (RLHF) workflows. With features such as advanced scheduling, chunked prefill, Multi-LoRA batching, and structured outputs, vLLM is optimized for both inference acceleration and enterprise-scale deployment.

    Learn more: Meet vLLM: For faster, more efficient LLM inference and serving

    DeepSeek-R1: A complex model

    DeepSeek-R1 has been making headlines for its exceptional reasoning capabilities and its novel architectural advancements. It introduces several technical challenges due to its complexity and scale.

    One of the defining characteristics of DeepSeek-R1 is its sheer size. With 256 experts in its MLP layers, 671 billion parameters, and roughly 720 GB of weights, it dwarfs previous mixture-of-experts models such as Mixtral, which had only 8 experts. The model is so large that it cannot fit on even a single 8x H100 node, requiring innovative techniques for inference and deployment. Additionally, DeepSeek-R1 is one of the first foundation models trained using FP8 quantization, a novel approach that requires custom support for inference.
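
    A quick back-of-the-envelope calculation, using the figures above and assuming 80 GB of memory per H100, shows why a single node is not enough:

        # Rough sizing for DeepSeek-R1 using the numbers quoted above (assumptions noted).
        params_billion = 671          # total parameters, in billions
        bytes_per_param = 1           # FP8 weights: one byte per parameter
        weights_gb = params_billion * bytes_per_param   # ~671 GB of weights alone

        h100_gb = 80                  # assumed memory per H100 (80 GB variant)
        node_gb = 8 * h100_gb         # 640 GB across an 8x H100 node

        print(f"weights: ~{weights_gb} GB vs. node capacity: {node_gb} GB")
        # Even before KV cache and activations, the ~671-720 GB of weights exceeds
        # the 640 GB available on one node, so serving must span more GPUs or nodes.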

    Prior to DeepSeek-R1, our model and system optimization efforts focused primarily on Llama-style models. The introduction of DeepSeek-R1 has required significant modifications to accommodate its new architectural features. Over the past several weeks, we have worked on performance optimizations to enhance efficiency and reduce computational overhead. While we have made substantial progress, further work is required to refine and finalize production-ready implementations.

    An insightful chart, shared by Unsloth, illustrates the rapid pace of development in this space (Figure 1). This chart tracks the performance of various LLM serving frameworks, including those utilizing vLLM, and demonstrates how tokens per second have grown rapidly. The progress has been driven by our integration of Multi-Token Prediction (MTP), MLA, and torch.compile, among other advancements.

    Figure 1: Performance gains in tokens per second across different LLM serving frameworks, highlighting optimizations in vLLM. Source: 2025-02-27 - vLLM Office Hours - DeepSeek and vLLM.

    Open Infra Week contributions

    DeepSeek’s Open Infra Week in February 2025 introduced a series of inference kernel advancements aimed at accelerating model execution. Our team has focused on integrating these optimizations into vLLM and improving their performance.

    Key contributions from Open Infra Week include:

    • FlashMLA (Multi-Head Latent Attention): A kernel for MLA that speeds up batched decoding.
    • DPP (Dynamic Partitioning for Parallelism): A new method to balance computational loads across distributed environments.
    • Speculative decoding enhancements: Techniques that boost inference speed while maintaining accuracy.

    We continue to collaborate with the open source community through GitHub and Slack discussions, refining and integrating these optimizations into vLLM.

    MLA, multi-token prediction, and parallelism optimizations

    To fully optimize vLLM for DeepSeek-R1, we focused on three major areas:

    • Multi-Head Latent Attention (MLA)
    • Multi-Token Prediction (MTP)
    • Parallelism strategies

    These optimizations ensure that vLLM can efficiently handle the computational demands of DeepSeek-R1.

    Multi-Head Latent Attention: Reducing KV cache bottlenecks

    MLA dramatically reduces KV cache size by projecting the key-value heads into a compressed latent space. This optimization significantly decreases memory usage and bandwidth pressure, increasing the maximum token capacity from 67K to 650K and allowing larger batches and higher throughput.
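
    To make the effect concrete, the sketch below compares the per-token KV cache footprint of a standard multi-head attention (MHA) layout with an MLA-style compressed latent. The dimensions are illustrative assumptions, not DeepSeek-R1’s exact configuration, so treat the resulting ratio as an order-of-magnitude estimate.

        # Per-token KV cache: standard MHA vs. MLA-style compressed latent.
        # All dimensions below are illustrative assumptions, not exact model config.
        num_layers   = 61        # transformer layers (assumed)
        num_kv_heads = 128       # KV heads in the MHA baseline (assumed)
        head_dim     = 128       # per-head dimension (assumed)
        latent_dim   = 576       # compressed latent per token per layer (assumed)
        bytes_per_el = 2         # FP16/BF16 cache entries

        mha_bytes = num_layers * num_kv_heads * head_dim * 2 * bytes_per_el  # K and V
        mla_bytes = num_layers * latent_dim * bytes_per_el                   # one latent

        print(f"MHA: {mha_bytes / 2**20:.1f} MiB/token, "
              f"MLA: {mla_bytes / 2**20:.3f} MiB/token, "
              f"~{mha_bytes / mla_bytes:.0f}x smaller")
        # With a fixed KV cache budget, a smaller per-token footprint means
        # proportionally more tokens -- and therefore larger batches -- in flight.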

    MLA: Impact on performance

    The reduced KV cache size enables significantly more batching, and therefore higher throughput, at inference time, as shown in Figure 2.

    Figure 2: Comparison of KV cache size reduction using MLA in vLLM, showing increased token capacity. Source: 2025-02-27 - vLLM Office Hours - DeepSeek and vLLM.

    While MLA provides substantial benefits, its implementation presents challenges. Unlike traditional multi-head attention (MHA), MLA involves differing Q, K, and V head dimensions, which many existing kernels do not support. We are actively working on integrating kernel-level optimizations to fully exploit MLA’s advantages.

    Multi-Token Prediction: Enhancing reasoning model performance

    DeepSeek-R1’s reasoning tasks require generating long sequences, making inference efficiency critical. MTP enables faster processing by predicting multiple tokens per step rather than one at a time. This leads to significant improvements in inference speed, particularly for workloads that involve extended decoding phases.
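
    Conceptually, MTP works like self-speculative decoding: the model proposes several future tokens cheaply and then verifies them in a single forward pass, keeping the longest agreeing prefix. The sketch below shows only that draft-then-verify loop; it is not vLLM’s implementation, and draft_next_tokens and verify_tokens are hypothetical stand-ins for the MTP head and the full model.

        # Conceptual draft-then-verify loop behind multi-token prediction.
        # draft_next_tokens() and verify_tokens() are hypothetical placeholders,
        # not real vLLM APIs; the point is the accept-longest-prefix logic.
        def generate_with_mtp(prompt_ids, draft_next_tokens, verify_tokens,
                              max_len=256, k=2):
            tokens = list(prompt_ids)
            while len(tokens) < max_len:
                draft = draft_next_tokens(tokens, k)    # k cheap speculative tokens
                target = verify_tokens(tokens, draft)   # full model verifies the k drafts
                                                        # and yields one extra token
                accepted = 0
                for d, t in zip(draft, target):
                    if d != t:
                        break
                    accepted += 1                       # draft agrees with the model
                # Keep the accepted prefix plus the verifier's own next token,
                # so each iteration emits at least one token.
                tokens.extend(draft[:accepted])
                tokens.append(target[accepted])
            return tokens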

    The team at Meta implemented MTP in vLLM over the past couple of weeks. Tests show that MTP provides up to a 20% speed improvement in low-QPS scenarios (see Figure 3).

    Figure 3: End-to-end inference speed improvements using Multi-Token Prediction (MTP). Source: 2025-02-27 - vLLM Office Hours - DeepSeek and vLLM.

    Parallelism: Scaling DeepSeek-R1 efficiently

    DeepSeek-R1’s architecture requires parallelism strategies beyond traditional tensor parallelism. To improve performance, we are integrating Expert Parallelism (EP) and Data Parallelism (DP):

    • Expert Parallelism (EP): Assigns specific experts to dedicated GPUs, ensuring efficient utilization and reducing redundancy.
    • Data Parallelism (DP): Distributes batched sequences between GPUs for the attention layers, avoiding KV cache duplication to improve memory efficiency.

    These techniques allow us to effectively distribute computational loads, leading to more scalable inference. Check out the Office Hours recording on Distributed Inference with vLLM.
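
    As a toy illustration of the expert parallelism idea (not vLLM’s implementation), the sketch below assigns a contiguous slice of experts to each GPU rank and groups tokens by the rank that owns the expert their gate selected. The gate, sizes, and layout are illustrative assumptions.

        # Toy sketch of expert-parallel routing: each rank owns a slice of experts,
        # and tokens are grouped by the rank hosting their selected expert.
        NUM_EXPERTS = 256
        NUM_RANKS = 8                                  # e.g., 8 GPUs
        EXPERTS_PER_RANK = NUM_EXPERTS // NUM_RANKS    # 32 experts per rank

        def owning_rank(expert_id: int) -> int:
            return expert_id // EXPERTS_PER_RANK

        def route(token_ids, gate):
            """Group token indices by the rank that owns their selected expert."""
            per_rank = {r: [] for r in range(NUM_RANKS)}
            for i, tok in enumerate(token_ids):
                per_rank[owning_rank(gate(tok))].append(i)
            return per_rank

        # Example with a made-up gate; real gates are learned routers, not hashes.
        buckets = route(range(16), gate=lambda tok: (tok * 37) % NUM_EXPERTS)
        print(buckets)

    Data parallelism, by contrast, replicates the attention layers and splits the batched sequences between replicas, so each replica holds only the KV cache for its own sequences.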

    Future roadmap and next steps

    Looking forward, our primary goals are to finalize key optimizations and continue refining vLLM’s performance for DeepSeek-R1. Next steps include:

    • Finalizing V1 MLA and FlashMLA support to fully integrate DeepSeek’s attention optimizations.
    • Enhancing speculative decoding techniques to further reduce inference latency.
    • Optimizing multi-node parallelism strategies to better handle DeepSeek-R1’s immense parameter count.
    • Benchmarking additional DeepSeek inference kernels to validate efficiency improvements in real-world use cases.

    As part of our continued development, we will be conducting large-scale benchmarking experiments and collaborating with other teams to fine-tune vLLM’s infrastructure. By focusing on both algorithmic improvements and practical deployment strategies, we aim to make vLLM the most robust and scalable inference framework for large-scale models.

    Conclusion

    Our work optimizing vLLM for DeepSeek-R1 has been a significant effort, made possible through collaboration with Neural Magic (Red Hat), DeepSeek, Berkeley, Meta, and the broader open source community. Over the past few weeks, we have made substantial progress in improving inference performance through MLA, MTP, and advanced parallelism techniques.

    With optimizations that reduce KV cache size, enhance multi-token prediction, and introduce scalable parallelism strategies, vLLM is well-positioned to serve next-generation AI models efficiently. Looking ahead, our continued focus will be on refining and finalizing these improvements, ensuring vLLM remains the premier open source solution for deploying large-scale models.

    We are excited about the future and look forward to sharing further updates as we push the boundaries of inference performance for DeepSeek-R1!

    Last updated: March 31, 2025
