Skip to main content

A toolset for compressing, deploying and serving LLM

Project description


Latest News 🎉

2026
  • [2026/04] PyPI has expanded the storage quota for LMDeploy and wheel uploads have resumed. v0.12.3 is now available on PyPI, so you can install it directly via pip install lmdeploy.
  • [2026/02] Support Qwen3.5
  • [2026/02] Support vllm-project/llm-compressor 4bit symmetric/asymmetric quantization. Refer here for detailed guide
2025
  • [2025/09] TurboMind supports MXFP4 on NVIDIA GPUs starting from V100, achieving 1.5x the performmance of vLLM on H800 for openai gpt-oss models!
  • [2025/06] Comprehensive inference optimization for FP8 MoE Models
  • [2025/06] DeepSeek PD Disaggregation deployment is now supported through integration with DLSlime and Mooncake. Huge thanks to both teams!
  • [2025/04] Enhance DeepSeek inference performance by integration deepseek-ai techniques: FlashMLA, DeepGemm, DeepEP, MicroBatch and eplb
  • [2025/01] Support DeepSeek V3 and R1
2024
  • [2024/11] Support Mono-InternVL with PyTorch engine
  • [2024/10] PyTorchEngine supports graph mode on ascend platform, doubling the inference speed
  • [2024/09] LMDeploy PyTorchEngine adds support for Huawei Ascend. See supported models here
  • [2024/09] LMDeploy PyTorchEngine achieves 1.3x faster on Llama3-8B inference by introducing CUDA graph
  • [2024/08] LMDeploy is integrated into modelscope/swift as the default accelerator for VLMs inference
  • [2024/07] Support Llama3.1 8B, 70B and its TOOLS CALLING
  • [2024/07] Support InternVL2 full-series models, InternLM-XComposer2.5 and function call of InternLM2.5
  • [2024/06] PyTorch engine support DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
  • [2024/05] Balance vision model when deploying VLMs with multiple GPUs
  • [2024/05] Support 4-bits weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVa, InternLMXComposer2
  • [2024/04] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, InternLMXComposer2.
  • [2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer here for detailed guide
  • [2024/04] TurboMind latest upgrade boosts GQA, rocketing the internlm2-20b model inference to 16+ RPS, about 1.8x faster than vLLM.
  • [2024/04] Support Qwen1.5-MOE and dbrx.
  • [2024/03] Support DeepSeek-VL offline inference pipeline and serving.
  • [2024/03] Support VLM offline inference pipeline and serving.
  • [2024/02] Support Qwen 1.5, Gemma, Mistral, Mixtral, Deepseek-MOE and so on.
  • [2024/01] OpenAOE seamless integration with LMDeploy Serving Service.
  • [2024/01] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to here
  • [2024/01] Support PyTorch inference engine, developed entirely in Python, helping to lower the barriers for developers and enable rapid experimentation with new features and technologies.
2023
  • [2023/12] Turbomind supports multimodal input.
  • [2023/11] Turbomind supports loading hf model directly. Click here for details.
  • [2023/11] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features like persistent batch(a.k.a. continuous batching), blocked KV cache, dynamic split&fuse, tensor parallelism, high-performance CUDA kernels and so on.

  • Effective Quantization: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.

  • Effortless Distribution Server: Leveraging the request distribution service, LMDeploy facilitates an easy and efficient deployment of multi-model services across multiple machines and cards.

  • Excellent Compatibility: LMDeploy supports KV Cache Quant, AWQ and Automatic Prefix Caching to be used simultaneously.

Performance

v0 1 0-benchmark

Supported Models

LLMs VLMs
  • Llama (7B - 65B)
  • Llama2 (7B - 70B)
  • Llama3 (8B, 70B)
  • Llama3.1 (8B, 70B)
  • Llama3.2 (1B, 3B)
  • InternLM (7B - 20B)
  • InternLM2 (7B - 20B)
  • InternLM3 (8B)
  • InternLM2.5 (7B)
  • Qwen (1.8B - 72B)
  • Qwen1.5 (0.5B - 110B)
  • Qwen1.5 - MoE (0.5B - 72B)
  • Qwen2 (0.5B - 72B)
  • Qwen2-MoE (57BA14B)
  • Qwen2.5 (0.5B - 32B)
  • Qwen3, Qwen3-MoE
  • Qwen3-Next(80B)
  • Baichuan (7B)
  • Baichuan2 (7B-13B)
  • Code Llama (7B - 34B)
  • ChatGLM2 (6B)
  • GLM-4 (9B)
  • GLM-4-0414 (9B, 32B)
  • CodeGeeX4 (9B)
  • YI (6B-34B)
  • Mistral (7B)
  • DeepSeek-MoE (16B)
  • DeepSeek-V2 (16B, 236B)
  • DeepSeek-V2.5 (236B)
  • DeepSeek-V3 (685B)
  • DeepSeek-V3.2 (685B)
  • Mixtral (8x7B, 8x22B)
  • Gemma (2B - 7B)
  • StarCoder2 (3B - 15B)
  • Phi-3-mini (3.8B)
  • Phi-3.5-mini (3.8B)
  • Phi-3.5-MoE (16x3.8B)
  • Phi-4-mini (3.8B)
  • MiniCPM3 (4B)
  • SDAR (1.7B-30B)
  • gpt-oss (20B, 120B)
  • GLM-4.7-Flash (30B)
  • GLM-5 (754B)
  • LLaVA(1.5,1.6) (7B-34B)
  • InternLM-XComposer2 (7B, 4khd-7B)
  • InternLM-XComposer2.5 (7B)
  • Qwen-VL (7B)
  • Qwen2-VL (2B, 7B, 72B)
  • Qwen2.5-VL (3B, 7B, 72B)
  • Qwen3-VL (2B - 235B)
  • Qwen3.5 (0.8B - 397B)
  • Qwen3-Omni (30B-A3B)
  • DeepSeek-VL (7B)
  • DeepSeek-VL2 (3B, 16B, 27B)
  • InternVL-Chat (v1.1-v1.5)
  • InternVL2 (1B-76B)
  • InternVL2.5(MPO) (1B-78B)
  • InternVL3 (1B-78B)
  • InternVL3.5 (1B-241BA28B)
  • Intern-S1 (241B)
  • Intern-S1-mini (8.3B)
  • Intern-S1-Pro (1TB)
  • Intern-S2-Preview (35B-A3B)
  • Mono-InternVL (2B)
  • ChemVLM (8B-26B)
  • CogVLM-Chat (17B)
  • CogVLM2-Chat (19B)
  • MiniCPM-Llama3-V-2_5
  • MiniCPM-V-2_6
  • Phi-3-vision (4.2B)
  • Phi-3.5-vision (4.2B)
  • GLM-4V (9B)
  • GLM-4.1V-Thinking (9B)
  • Llama3.2-vision (11B, 90B)
  • Molmo (7B-D,72B)
  • Gemma3 (1B - 27B)
  • Llama4 (Scout, Maverick)

LMDeploy has developed two inference engines - TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to decrease the barriers for developers.

They differ in the types of supported models and the inference data type. Please refer to this table for each engine's capability and choose the proper one that best fits your actual needs.

Quick Start Open In Colab

Installation

It is recommended installing lmdeploy using pip in a conda environment (python 3.10 - 3.13):

conda create -n lmdeploy python=3.12 -y
conda activate lmdeploy
pip install lmdeploy

Starting from v0.13.0, the default prebuilt wheels published on PyPI are built against CUDA 12.8, so pip install lmdeploy is sufficient for typical setups including GeForce RTX 50 series.

Offline Batch Inference

import lmdeploy
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)

[!NOTE] By default, LMDeploy downloads model from HuggingFace. If you would like to use models from ModelScope, please install ModelScope by pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

If you would like to use models from openMind Hub, please install openMind Hub by pip install openmind_hub and set the environment variable:

export LMDEPLOY_USE_OPENMIND_HUB=True

For more information about inference pipeline, please refer to here.

Tutorials

Please review getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials:

Third-party projects

  • Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: LMDeploy-Jetson

  • Example project for deploying LLMs using LMDeploy and BentoML: BentoLMDeploy

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}
@article{zhang2025efficient,
  title={Efficient Mixed-Precision Large Language Model Inference with TurboMind},
  author={Zhang, Li and Jiang, Youhe and He, Guoliang and Chen, Xin and Lv, Han and Yao, Qian and Fu, Fangcheng and Chen, Kai},
  journal={arXiv preprint arXiv:2508.15601},
  year={2025}
}

License

This project is released under the Apache 2.0 license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

lmdeploy-0.14.0-cp313-cp313-win_amd64.whl (54.9 MB view details)

Uploaded CPython 3.13Windows x86-64

lmdeploy-0.14.0-cp313-cp313-manylinux2014_x86_64.whl (90.8 MB view details)

Uploaded CPython 3.13

lmdeploy-0.14.0-cp312-cp312-win_amd64.whl (54.9 MB view details)

Uploaded CPython 3.12Windows x86-64

lmdeploy-0.14.0-cp312-cp312-manylinux2014_x86_64.whl (90.8 MB view details)

Uploaded CPython 3.12

lmdeploy-0.14.0-cp311-cp311-win_amd64.whl (54.9 MB view details)

Uploaded CPython 3.11Windows x86-64

lmdeploy-0.14.0-cp311-cp311-manylinux2014_x86_64.whl (90.8 MB view details)

Uploaded CPython 3.11

lmdeploy-0.14.0-cp310-cp310-win_amd64.whl (54.9 MB view details)

Uploaded CPython 3.10Windows x86-64

lmdeploy-0.14.0-cp310-cp310-manylinux2014_x86_64.whl (90.8 MB view details)

Uploaded CPython 3.10

File details

Details for the file lmdeploy-0.14.0-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.14.0-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 54.9 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for lmdeploy-0.14.0-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 73b84f06652dc0d2a592b2cb209baf039d2a8140b39f75fedd96aa085bcc543b
MD5 97b37f3d3f2193f6d25db34b9a10ef4f
BLAKE2b-256 6888d3f3b13926b1bf48d586689b93fa37e63ff90a96fcc246e35fa59a8c5089

See more details on using hashes here.

File details

Details for the file lmdeploy-0.14.0-cp313-cp313-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.14.0-cp313-cp313-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 174631ff81ec70999a290cdbf50dbb6c3b14bf16d4a047e6bae1c472dddd424d
MD5 9e8d1635f01ca61891819bf7cdf43438
BLAKE2b-256 23500084960da4815c500812f571c6fc1d05f52a32e3808d0767eb960fa6f8a7

See more details on using hashes here.

File details

Details for the file lmdeploy-0.14.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.14.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 54.9 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for lmdeploy-0.14.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 e453774bd43dddcb1583f9f108cbd2c96a1036b35f81b80a168e7979ba9f18c2
MD5 ca5d4591440c186b30a5522be2ffd3d0
BLAKE2b-256 29f6c05b28e2733ce02bf8a58c3618b8f80328c953da3d8858dbc76ce1f3a3ff

See more details on using hashes here.

File details

Details for the file lmdeploy-0.14.0-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.14.0-cp312-cp312-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 550fd29426cb4ea1585252dfb8fe9acf2d5bcda055de53ef100a9e0ec956f366
MD5 4568436e7677e56df47ab9a0957db950
BLAKE2b-256 3e98136c5124c2b6e07f3e43244cd6cb7b9f96a9c16cbf7f62ace8b71c133f8e

See more details on using hashes here.

File details

Details for the file lmdeploy-0.14.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.14.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 54.9 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for lmdeploy-0.14.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 948ca377d4fde044f74315e1bbb045dfbd9e9934029ddddf00959b0d773bb06d
MD5 62f9e99e25f0e6e4eb741693ce8e25cc
BLAKE2b-256 bc9f37a82ba7b0b0a7aac6166c892dd7d5bafd5cd36abb8abb850c28fc4b98da

See more details on using hashes here.

File details

Details for the file lmdeploy-0.14.0-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.14.0-cp311-cp311-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 4983231306733c7a6928eab0229c8c3898e0c28f9614a5955b3b82766bd71b30
MD5 25b20f2e7f219ab2722fa70896568972
BLAKE2b-256 87bb8705ad70344ec9b68c8f42a3a80e69a8c6cf3b2208458543f76782b4cecc

See more details on using hashes here.

File details

Details for the file lmdeploy-0.14.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.14.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 54.9 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for lmdeploy-0.14.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 67f49e93b8f05aeee0cbf61ffc0c2adfc8b644c98d82e973789bcbc396c9261f
MD5 29513a37f3ca6e95448625415e4d0f93
BLAKE2b-256 8c0b3c3723a87c125f30ac7b17b022ad3f15de404e2e3cdd46fc97f783546d38

See more details on using hashes here.

File details

Details for the file lmdeploy-0.14.0-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.14.0-cp310-cp310-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b7278e0f1778a5e5c203d210b3958176d9ad73d0b9d6d63b47e96e0182a0fb6a
MD5 69a302dbf81be8f5528a8a1be315e6cb
BLAKE2b-256 754d9e9fc7dad48562b88416478079efd11c872e8d394a565bcb252f306afa8a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page