
A toolset for compressing, deploying, and serving LLMs

Project description


Latest News 🎉

2025
  • [2025/09] TurboMind supports MXFP4 on NVIDIA GPUs starting from V100, achieving 1.5x the performance of vLLM on H800 for OpenAI gpt-oss models!
  • [2025/06] Comprehensive inference optimization for FP8 MoE Models
  • [2025/06] DeepSeek PD Disaggregation deployment is now supported through integration with DLSlime and Mooncake. Huge thanks to both teams!
  • [2025/04] Enhance DeepSeek inference performance by integrating deepseek-ai techniques: FlashMLA, DeepGEMM, DeepEP, MicroBatch, and EPLB
  • [2025/01] Support DeepSeek V3 and R1
2024
  • [2024/11] Support Mono-InternVL with PyTorch engine
  • [2024/10] PyTorchEngine supports graph mode on the Ascend platform, doubling the inference speed
  • [2024/09] LMDeploy PyTorchEngine adds support for Huawei Ascend. See supported models here
  • [2024/09] LMDeploy PyTorchEngine achieves 1.3x faster on Llama3-8B inference by introducing CUDA graph
  • [2024/08] LMDeploy is integrated into modelscope/swift as the default accelerator for VLMs inference
  • [2024/07] Support Llama3.1 8B and 70B, along with their tool calling
  • [2024/07] Support InternVL2 full-series models, InternLM-XComposer2.5 and function call of InternLM2.5
  • [2024/06] PyTorch engine supports DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, and LLaVA-Next
  • [2024/05] Balance the vision model workload when deploying VLMs across multiple GPUs
  • [2024/05] Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA, and InternLM-XComposer2
  • [2024/04] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, Mini-Gemini, and InternLM-XComposer2.
  • [2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer here for detailed guide
  • [2024/04] TurboMind's latest upgrade boosts GQA, rocketing internlm2-20b inference to 16+ RPS, about 1.8x faster than vLLM.
  • [2024/04] Support Qwen1.5-MoE and DBRX.
  • [2024/03] Support DeepSeek-VL offline inference pipeline and serving.
  • [2024/03] Support VLM offline inference pipeline and serving.
  • [2024/02] Support Qwen 1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE, and more.
  • [2024/01] OpenAOE is seamlessly integrated with the LMDeploy serving service.
  • [2024/01] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer here
  • [2024/01] Support the PyTorch inference engine, developed entirely in Python, which lowers the barrier for developers and enables rapid experimentation with new features and technologies.
2023
  • [2023/12] TurboMind supports multimodal input.
  • [2023/11] TurboMind supports loading HF models directly. Click here for details.
  • [2023/11] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and Python specialist. Click here for the deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports FlashAttention-2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.

  • Effective Quantization: LMDeploy supports weight-only and KV cache quantization, and its 4-bit inference performance is 2.4x higher than FP16. Quantization quality has been confirmed via OpenCompass evaluation.

  • Effortless Distributed Serving: leveraging the request distribution service, LMDeploy makes it easy and efficient to deploy multi-model services across multiple machines and GPUs.

  • Excellent Compatibility: LMDeploy allows KV cache quantization, AWQ, and automatic prefix caching to be used simultaneously; a minimal sketch follows this list.
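
As an illustration of combining these features, below is a minimal sketch based on the pipeline API from the Quick Start: it enables online int8 KV cache quantization together with automatic prefix caching via TurbomindEngineConfig (the model name is only an example).

from lmdeploy import pipeline, TurbomindEngineConfig

# Sketch: int8 online KV cache quantization plus automatic prefix caching.
# quant_policy=4 would select int4 KV cache instead.
engine_config = TurbomindEngineConfig(
    quant_policy=8,
    enable_prefix_caching=True,
)

with pipeline("internlm/internlm3-8b-instruct", backend_config=engine_config) as pipe:
    print(pipe(["Hi, pls intro yourself"]))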

Performance

[Figure: v0.1.0 benchmark results]

Supported Models

LLMs
  • Llama (7B - 65B)
  • Llama2 (7B - 70B)
  • Llama3 (8B, 70B)
  • Llama3.1 (8B, 70B)
  • Llama3.2 (1B, 3B)
  • InternLM (7B - 20B)
  • InternLM2 (7B - 20B)
  • InternLM3 (8B)
  • InternLM2.5 (7B)
  • Qwen (1.8B - 72B)
  • Qwen1.5 (0.5B - 110B)
  • Qwen1.5-MoE (0.5B - 72B)
  • Qwen2 (0.5B - 72B)
  • Qwen2-MoE (57B-A14B)
  • Qwen2.5 (0.5B - 32B)
  • Qwen3, Qwen3-MoE
  • Baichuan (7B)
  • Baichuan2 (7B-13B)
  • Code Llama (7B - 34B)
  • ChatGLM2 (6B)
  • GLM-4 (9B)
  • GLM-4-0414 (9B, 32B)
  • CodeGeeX4 (9B)
  • Yi (6B-34B)
  • Mistral (7B)
  • DeepSeek-MoE (16B)
  • DeepSeek-V2 (16B, 236B)
  • DeepSeek-V2.5 (236B)
  • Mixtral (8x7B, 8x22B)
  • Gemma (2B - 7B)
  • StarCoder2 (3B - 15B)
  • Phi-3-mini (3.8B)
  • Phi-3.5-mini (3.8B)
  • Phi-3.5-MoE (16x3.8B)
  • Phi-4-mini (3.8B)
  • MiniCPM3 (4B)
  • SDAR (1.7B-30B)
  • gpt-oss (20B, 120B)
VLMs
  • LLaVA (1.5, 1.6) (7B-34B)
  • InternLM-XComposer2 (7B, 4khd-7B)
  • InternLM-XComposer2.5 (7B)
  • Qwen-VL (7B)
  • Qwen2-VL (2B, 7B, 72B)
  • Qwen2.5-VL (3B, 7B, 72B)
  • DeepSeek-VL (7B)
  • DeepSeek-VL2 (3B, 16B, 27B)
  • InternVL-Chat (v1.1-v1.5)
  • InternVL2 (1B-76B)
  • InternVL2.5 (MPO) (1B-78B)
  • InternVL3 (1B-78B)
  • InternVL3.5 (1B-241B-A28B)
  • Intern-S1 (241B)
  • Intern-S1-mini (8.3B)
  • Mono-InternVL (2B)
  • ChemVLM (8B-26B)
  • CogVLM-Chat (17B)
  • CogVLM2-Chat (19B)
  • MiniCPM-Llama3-V-2_5
  • MiniCPM-V-2_6
  • Phi-3-vision (4.2B)
  • Phi-3.5-vision (4.2B)
  • GLM-4V (9B)
  • GLM-4.1V-Thinking (9B)
  • Llama3.2-vision (11B, 90B)
  • Molmo (7B-D, 72B)
  • Gemma3 (1B - 27B)
  • Llama4 (Scout, Maverick)

LMDeploy has developed two inference engines, TurboMind and PyTorch, each with a different focus. The former strives for ultimate inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.

They differ in the models and inference data types they support. Please refer to this table for each engine's capabilities and choose the one that best fits your needs.
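
As a minimal sketch of choosing an engine explicitly (the model name is only an example), passing a PytorchEngineConfig to the pipeline selects the PyTorch engine; omitting backend_config lets LMDeploy pick TurboMind when the model supports it:

from lmdeploy import pipeline, PytorchEngineConfig

# Sketch: force the PyTorch engine instead of the default TurboMind engine
pipe = pipeline("internlm/internlm3-8b-instruct",
                backend_config=PytorchEngineConfig())
print(pipe(["Shanghai is"]))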

Quick Start

Installation

It is recommended to install lmdeploy with pip in a conda environment (Python 3.9 - 3.13):

conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy

Since v0.3.0, the default prebuilt package has been compiled with CUDA 12.

For the GeForce RTX 50 series, please install the LMDeploy prebuilt package compiled with CUDA 12.8:

export LMDEPLOY_VERSION=0.10.2
export PYTHON_VERSION=310
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu128-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu128

For more information on installing on a CUDA 11+ platform, or for instructions on building from source, please refer to the installation guide.

Offline Batch Inference

import lmdeploy

# Create an inference pipeline and run batched inference on two prompts
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)

[!NOTE] By default, LMDeploy downloads models from the HuggingFace Hub. If you would like to use models from ModelScope, please install ModelScope with pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

If you would like to use models from the openMind Hub, please install it with pip install openmind_hub and set the environment variable:

export LMDEPLOY_USE_OPENMIND_HUB=True

For more information about the inference pipeline, please refer here.
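
Beyond offline inference, the same model can be served through an OpenAI-compatible HTTP API with the api_server command. A minimal sketch (the model name and port are only examples):

# Launch an OpenAI-compatible server on port 23333
lmdeploy serve api_server internlm/internlm3-8b-instruct --server-port 23333

Any OpenAI-style client pointed at http://0.0.0.0:23333/v1 can then query the served model.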

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials.

Third-party projects

  • Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: LMDeploy-Jetson

  • Example project for deploying LLMs using LMDeploy and BentoML: BentoLMDeploy

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}
@article{zhang2025efficient,
  title={Efficient Mixed-Precision Large Language Model Inference with TurboMind},
  author={Zhang, Li and Jiang, Youhe and He, Guoliang and Chen, Xin and Lv, Han and Yao, Qian and Fu, Fangcheng and Chen, Kai},
  journal={arXiv preprint arXiv:2508.15601},
  year={2025}
}

License

This project is released under the Apache 2.0 license.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions


lazyllm_lmdeploy-0.10.2.dev0-cp313-cp313-win_amd64.whl (28.6 MB)

Uploaded: CPython 3.13, Windows x86-64

lazyllm_lmdeploy-0.10.2.dev0-cp312-cp312-win_amd64.whl (28.6 MB)

Uploaded: CPython 3.12, Windows x86-64

lazyllm_lmdeploy-0.10.2.dev0-cp311-cp311-win_amd64.whl (28.6 MB)

Uploaded: CPython 3.11, Windows x86-64

lazyllm_lmdeploy-0.10.2.dev0-cp310-cp310-win_amd64.whl (28.6 MB)

Uploaded: CPython 3.10, Windows x86-64

lazyllm_lmdeploy-0.10.2.dev0-cp39-cp39-win_amd64.whl (28.6 MB)

Uploaded: CPython 3.9, Windows x86-64

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp313-cp313-win_amd64.whl.

File hashes

SHA256: 6d954c213a92a2e775a3734a9d4fe0fb77a1cb0c8eb798f1a10aee69ddb9bd32
MD5: 6f8a4bfda1086626e4c464fdcc59a122
BLAKE2b-256: abf83958f3ab1c19ee6a08de326ddb0a59f2c84eab86119405882e1fb02cc481

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp313-cp313-manylinux2014_x86_64.whl.

File hashes

SHA256: 8722fdf9755f79a78dcb7b5d2cdd89d8f21f837a0fa7ef7a2ba271c5ad3040e1
MD5: 441427c5819a285ed7a2fe9e9d2eb76c
BLAKE2b-256: 24a52700ee875a06ea0b8aa6a0cac9c0040c91912ae841c427c6fc04e2bc73ed

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp312-cp312-win_amd64.whl.

File hashes

SHA256: e71d582c027e9509f146dcdeb3f7123144f95f3fbce8e9b94b869e8cf983590d
MD5: 8a3ec04542a2814c71d249496d77ed8d
BLAKE2b-256: 877b85b8bb9cad7c2eef2280bcdf176c4e3b8446cd675daaf3e0136a9fe8176d

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp312-cp312-manylinux2014_x86_64.whl.

File hashes

SHA256: 76b79cba2f116fa9e39c29460120b1c4b138dba10134ab8096e41680df4b1e6c
MD5: 856d51bb4911584fb0eb485c4890426d
BLAKE2b-256: 402fe1878d024a92e005fcf66d712f0a901ccb6d9fa5526fbf413e863e5fed66

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp311-cp311-win_amd64.whl.

File hashes

SHA256: 80bf0d0ba476d64d6df91c800258323fb0ce650d117f117440808a511f19e162
MD5: a19303b2fd91ef92bbda7e51130f0cd7
BLAKE2b-256: 82327e4e528ce6f4f2f0a16b0799c4449f70fd8275d2b8f10e07888ee87ad37d

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp311-cp311-manylinux2014_x86_64.whl.

File hashes

SHA256: 89694cea584ff1d79d176da3de2c3a38d7c8cbe220842b191731d16f9afb9460
MD5: 9854a64973264e3e01312b607bd2e24d
BLAKE2b-256: edb933709e6a0675ce9fd83b2d82668c70e735c852bcf44e6c8f26d8b69f8c17

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp310-cp310-win_amd64.whl.

File hashes

SHA256: 671fa22444356737a5b94cd79f9f2ff75284102973b8ce14c3a90cf51f684e91
MD5: 948c2e1cd3de7287b3945f430250f740
BLAKE2b-256: 08e9338ef4ef9f1b9fa329148298a0134eece63c864be57ecf13777d94462777

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp310-cp310-manylinux2014_x86_64.whl.

File hashes

SHA256: 64fcf0b2a4bc412e57eee99a14a93652dacf7001801928ce9bad08ecbe9825be
MD5: 133726c447da4c48d9c6de6901359563
BLAKE2b-256: 3a7e9903e92035d0d2c080590b002def0f44b168c514a1c6011c99de90356c6c

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp39-cp39-win_amd64.whl.

File hashes

SHA256: 76860ee9abcac0d87a5865339cbd2fd0b60498628367b0e6af740f8bd1d0f21c
MD5: f91a9ffb623cb56a1f716974c83d44fa
BLAKE2b-256: 385b337188bce0117c52918c9bdca0da6e5acfa6d0374efa4ea934c762f60cf7

File details

Details for the file lazyllm_lmdeploy-0.10.2.dev0-cp39-cp39-manylinux2014_x86_64.whl.

File hashes

SHA256: 983bc24bf1d0e58fccb3ac6414df1e06c0b967c58d9a4f89dae68a3aa2468477
MD5: 855bba1cab5d9c2b0e47d7a274425334
BLAKE2b-256: 920b871703cd0be9f7e34148b1d6782bc6eea69934efd444a0b337ede20e6df3
