
A toolset for compressing, deploying and serving LLMs

Project description


Latest News 🎉

2025
  • [2025/09] TurboMind supports MXFP4 on NVIDIA GPUs starting from V100, achieving 1.5x the performance of vLLM on H800 for OpenAI gpt-oss models!
  • [2025/06] Comprehensive inference optimization for FP8 MoE Models
  • [2025/06] DeepSeek PD Disaggregation deployment is now supported through integration with DLSlime and Mooncake. Huge thanks to both teams!
  • [2025/04] Enhance DeepSeek inference performance by integrating DeepSeek-AI techniques: FlashMLA, DeepGEMM, DeepEP, MicroBatch and EPLB
  • [2025/01] Support DeepSeek V3 and R1
2024
  • [2024/11] Support Mono-InternVL with PyTorch engine
  • [2024/10] PyTorchEngine supports graph mode on the Ascend platform, doubling the inference speed
  • [2024/09] LMDeploy PyTorchEngine adds support for Huawei Ascend. See supported models here
  • [2024/09] LMDeploy PyTorchEngine achieves 1.3x faster on Llama3-8B inference by introducing CUDA graph
  • [2024/08] LMDeploy is integrated into modelscope/swift as the default accelerator for VLM inference
  • [2024/07] Support Llama3.1 8B and 70B, as well as their tool calling
  • [2024/07] Support InternVL2 full-series models, InternLM-XComposer2.5 and function call of InternLM2.5
  • [2024/06] PyTorch engine supports DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL and LLaVA-Next
  • [2024/05] Balance the vision model when deploying VLMs across multiple GPUs
  • [2024/05] Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA and InternLM-XComposer2
  • [2024/04] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini and InternLM-XComposer2.
  • [2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer here for detailed guide
  • [2024/04] TurboMind's latest upgrade boosts GQA, rocketing internlm2-20b inference to 16+ RPS, about 1.8x faster than vLLM.
  • [2024/04] Support Qwen1.5-MoE and DBRX.
  • [2024/03] Support DeepSeek-VL offline inference pipeline and serving.
  • [2024/03] Support VLM offline inference pipeline and serving.
  • [2024/02] Support Qwen1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE and so on.
  • [2024/01] OpenAOE is seamlessly integrated with the LMDeploy serving service.
  • [2024/01] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer here
  • [2024/01] Support PyTorch inference engine, developed entirely in Python, helping to lower the barriers for developers and enable rapid experimentation with new features and technologies.
2023
  • [2023/12] TurboMind supports multimodal input.
  • [2023/11] TurboMind supports loading HF models directly. Click here for details.
  • [2023/11] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports FlashAttention-2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features like persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split&fuse, tensor parallelism, high-performance CUDA kernels and so on.

  • Effective Quantization: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation. A minimal CLI sketch of quantizing and serving a model follows this list.

  • Effortless Distribution Server: Leveraging the request distribution service, LMDeploy facilitates easy and efficient deployment of multi-model services across multiple machines and cards.

  • Excellent Compatibility: LMDeploy supports using KV cache quantization, AWQ and automatic prefix caching simultaneously.
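
As an illustration of the quantization and serving features above, the snippet below quantizes a model with AWQ and serves it through the lmdeploy CLI. It is a minimal sketch, not the full workflow: the model id internlm/internlm3-8b-instruct, the work directory and the port are placeholders, and the flags should be double-checked against lmdeploy lite auto_awq --help and lmdeploy serve api_server --help for your installed version.

# Quantize the model weights to 4-bit with AWQ; the result is written to ./internlm3-8b-awq
lmdeploy lite auto_awq internlm/internlm3-8b-instruct --work-dir ./internlm3-8b-awq

# Serve the quantized model behind an OpenAI-compatible API server on port 23333
lmdeploy serve api_server ./internlm3-8b-awq --model-format awq --server-port 23333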

Performance

(Figure: v0.1.0 benchmark results)

Supported Models

LLMs
  • Llama (7B - 65B)
  • Llama2 (7B - 70B)
  • Llama3 (8B, 70B)
  • Llama3.1 (8B, 70B)
  • Llama3.2 (1B, 3B)
  • InternLM (7B - 20B)
  • InternLM2 (7B - 20B)
  • InternLM3 (8B)
  • InternLM2.5 (7B)
  • Qwen (1.8B - 72B)
  • Qwen1.5 (0.5B - 110B)
  • Qwen1.5-MoE (0.5B - 72B)
  • Qwen2 (0.5B - 72B)
  • Qwen2-MoE (57BA14B)
  • Qwen2.5 (0.5B - 32B)
  • Qwen3, Qwen3-MoE
  • Qwen3-Next (80B)
  • Baichuan (7B)
  • Baichuan2 (7B-13B)
  • Code Llama (7B - 34B)
  • ChatGLM2 (6B)
  • GLM-4 (9B)
  • GLM-4-0414 (9B, 32B)
  • CodeGeeX4 (9B)
  • Yi (6B - 34B)
  • Mistral (7B)
  • DeepSeek-MoE (16B)
  • DeepSeek-V2 (16B, 236B)
  • DeepSeek-V2.5 (236B)
  • DeepSeek-V3 (685B)
  • DeepSeek-V3.2 (685B)
  • Mixtral (8x7B, 8x22B)
  • Gemma (2B - 7B)
  • StarCoder2 (3B - 15B)
  • Phi-3-mini (3.8B)
  • Phi-3.5-mini (3.8B)
  • Phi-3.5-MoE (16x3.8B)
  • Phi-4-mini (3.8B)
  • MiniCPM3 (4B)
  • SDAR (1.7B-30B)
  • gpt-oss (20B, 120B)

VLMs
  • LLaVA (1.5, 1.6) (7B - 34B)
  • InternLM-XComposer2 (7B, 4khd-7B)
  • InternLM-XComposer2.5 (7B)
  • Qwen-VL (7B)
  • Qwen2-VL (2B, 7B, 72B)
  • Qwen2.5-VL (3B, 7B, 72B)
  • Qwen3-VL (2B - 235B)
  • DeepSeek-VL (7B)
  • DeepSeek-VL2 (3B, 16B, 27B)
  • InternVL-Chat (v1.1-v1.5)
  • InternVL2 (1B-76B)
  • InternVL2.5(MPO) (1B-78B)
  • InternVL3 (1B-78B)
  • InternVL3.5 (1B-241BA28B)
  • Intern-S1 (241B)
  • Intern-S1-mini (8.3B)
  • Intern-S1-Pro (1TB)
  • Mono-InternVL (2B)
  • ChemVLM (8B-26B)
  • CogVLM-Chat (17B)
  • CogVLM2-Chat (19B)
  • MiniCPM-Llama3-V-2_5
  • MiniCPM-V-2_6
  • Phi-3-vision (4.2B)
  • Phi-3.5-vision (4.2B)
  • GLM-4V (9B)
  • GLM-4.1V-Thinking (9B)
  • Llama3.2-vision (11B, 90B)
  • Molmo (7B-D, 72B)
  • Gemma3 (1B - 27B)
  • Llama4 (Scout, Maverick)

LMDeploy has developed two inference engines - TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.

They differ in the types of supported models and the inference data types. Please refer to this table for each engine's capabilities and choose the one that best fits your actual needs.
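
The engine is selected when creating a pipeline or a server, via a backend config. Below is a minimal sketch assuming the internlm/internlm3-8b-instruct model; TurbomindEngineConfig and PytorchEngineConfig select the respective engines, and the fields shown (tp, session_len) are just common options rather than a recommended configuration.

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind engine: the performance-focused path, here with 2-way tensor parallelism
pipe = pipeline("internlm/internlm3-8b-instruct",
                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))

# PyTorch engine: the pure-Python path, useful for models that only it supports
# pipe = pipeline("internlm/internlm3-8b-instruct",
#                 backend_config=PytorchEngineConfig(tp=1))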

Quick Start

Installation

It is recommended to install lmdeploy using pip in a conda environment (Python 3.10 - 3.13):

conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy

Since v0.3.0, the default prebuilt package is compiled with CUDA 12.

For the GeForce RTX 50 series, please install the LMDeploy prebuilt package compiled with CUDA 12.8:

export LMDEPLOY_VERSION=0.12.0
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu128-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu128

For more information on installing on CUDA 11+ platforms, or for instructions on building from source, please refer to the installation guide.

Offline Batch Inference

import lmdeploy
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
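
Sampling behaviour can be adjusted by passing a generation config to the pipeline call. This is a minimal sketch building on the example above; GenerationConfig is part of the lmdeploy API, and the particular values (max_new_tokens, temperature, top_p) are arbitrary illustrations rather than recommended defaults.

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm3-8b-instruct")

# cap the output length and use nucleus sampling for both prompts
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.8, top_p=0.95)
responses = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
for r in responses:
    print(r.text)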

[!NOTE] By default, LMDeploy downloads models from the HuggingFace Hub. If you would like to use models from ModelScope, please install ModelScope via pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

If you would like to use models from openMind Hub, please install openMind Hub via pip install openmind_hub and set the environment variable:

export LMDEPLOY_USE_OPENMIND_HUB=True

For more information about the inference pipeline, please refer here.

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials.

Third-party projects

  • Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: LMDeploy-Jetson

  • Example project for deploying LLMs using LMDeploy and BentoML: BentoLMDeploy

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guidelines.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}
@article{zhang2025efficient,
  title={Efficient Mixed-Precision Large Language Model Inference with TurboMind},
  author={Zhang, Li and Jiang, Youhe and He, Guoliang and Chen, Xin and Lv, Han and Yao, Qian and Fu, Fangcheng and Chen, Kai},
  journal={arXiv preprint arXiv:2508.15601},
  year={2025}
}

License

This project is released under the Apache 2.0 license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

  • lmdeploy-0.12.0-cp313-cp313-win_amd64.whl (28.2 MB) - CPython 3.13, Windows x86-64
  • lmdeploy-0.12.0-cp313-cp313-manylinux2014_x86_64.whl (90.6 MB) - CPython 3.13
  • lmdeploy-0.12.0-cp312-cp312-win_amd64.whl (28.2 MB) - CPython 3.12, Windows x86-64
  • lmdeploy-0.12.0-cp312-cp312-manylinux2014_x86_64.whl (90.6 MB) - CPython 3.12
  • lmdeploy-0.12.0-cp311-cp311-win_amd64.whl (28.2 MB) - CPython 3.11, Windows x86-64
  • lmdeploy-0.12.0-cp311-cp311-manylinux2014_x86_64.whl (90.6 MB) - CPython 3.11
  • lmdeploy-0.12.0-cp310-cp310-win_amd64.whl (28.2 MB) - CPython 3.10, Windows x86-64
  • lmdeploy-0.12.0-cp310-cp310-manylinux2014_x86_64.whl (90.6 MB) - CPython 3.10

File details

Details for the file lmdeploy-0.12.0-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.12.0-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 28.2 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for lmdeploy-0.12.0-cp313-cp313-win_amd64.whl
  • SHA256: 2b7b78498de23265f4ca763e65f83a5c88388db7bd08602d1489b088ac7f6a4e
  • MD5: 18e7ea3285276d5b4aad34b07f3c616b
  • BLAKE2b-256: 097d82f191e2f3a3517b45af956d152d7a0a03a9645b19a453240115e9504e1b

See more details on using hashes here.

File details

Details for the file lmdeploy-0.12.0-cp313-cp313-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.12.0-cp313-cp313-manylinux2014_x86_64.whl
  • SHA256: a552f110b7a89b5f8670976ed49dafd83d7c84a526cabead418eaac64e2a498e
  • MD5: 923b6a6abf77cb2eb94b90cd4367dca1
  • BLAKE2b-256: 17b5d3309ef9a08a77e53e1985a48795fc10af9bb7c90daabe133f0921e3be51

See more details on using hashes here.

File details

Details for the file lmdeploy-0.12.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.12.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 28.2 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for lmdeploy-0.12.0-cp312-cp312-win_amd64.whl
  • SHA256: 1cf398b18d66d77272f93499796e34a23549f6e1c40b162f8d38dc22f935fc0a
  • MD5: e52b2aad492b077a7354507f68160148
  • BLAKE2b-256: 1fd8a3b8ba8111b87de33cef843de76bd3812746159bc67768a0cc7f02d10df2

See more details on using hashes here.

File details

Details for the file lmdeploy-0.12.0-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.12.0-cp312-cp312-manylinux2014_x86_64.whl
  • SHA256: 85c79706f8aa9bb3a95f72d9bbdcc25e25476e43f6515595ae6a29324cd0f477
  • MD5: 2ebc380c4466ca801b256558ca0fe667
  • BLAKE2b-256: 2dd437a53f5fcfe99cd9b3e2932b08a8a5acbb1a1049b886c342cd1babbee425

See more details on using hashes here.

File details

Details for the file lmdeploy-0.12.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.12.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 28.2 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for lmdeploy-0.12.0-cp311-cp311-win_amd64.whl
  • SHA256: ded6b1cabd5d908c9b42076537ff67f9a2a2a18f9adf0c0782d4ea21525b80d1
  • MD5: 8df22004eaa7eebf4f55a1bf8ece970b
  • BLAKE2b-256: 4c71273d9eb86dabee2c9a04dcd06234be6e8e83e56e5b54a2cdf33e4539ee3e

See more details on using hashes here.

File details

Details for the file lmdeploy-0.12.0-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.12.0-cp311-cp311-manylinux2014_x86_64.whl
  • SHA256: 22dc5c00cd14b374f9f3ebae45a24e142b4a55926a681d9d01eababa2807b10a
  • MD5: 90653778e6cc88616a1c4ff45da97b7a
  • BLAKE2b-256: d85bc23a3f0537a4a59667ed087d57a89b929b0d288ba5146469d523ca1462e9

See more details on using hashes here.

File details

Details for the file lmdeploy-0.12.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.12.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 28.2 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for lmdeploy-0.12.0-cp310-cp310-win_amd64.whl
  • SHA256: 2db7f6229b402616c7c8a15871af0f0f660652e667eeb07ab3ce17a544261625
  • MD5: d632407aa49a051bc5afbf8b9b6f8131
  • BLAKE2b-256: a6ccafe012bd550154bbe343dedc58f445ff19f073ea1b578cceaf4fccd60b2f

See more details on using hashes here.

File details

Details for the file lmdeploy-0.12.0-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.12.0-cp310-cp310-manylinux2014_x86_64.whl
  • SHA256: 8f150939409d3bb05fc1ece565e14b50b31434a5f67f29ac9b8988c6e711baf3
  • MD5: c6bfc52b52594a9c2b975620d07fef11
  • BLAKE2b-256: 3acaedf3ab054c96083389b84149b375f21e19d31043d6085dea9de2d45a503f

See more details on using hashes here.
