
A toolset for compressing, deploying, and serving LLMs

Project description


English | 简体中文

👋 Join us on Twitter, Discord, and WeChat


Latest News 🎉

2024
  • [2024/01] OpenAOE is now seamlessly integrated with the LMDeploy serving service.
  • [2024/01] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to here.
  • [2024/01] Support for the PyTorch inference engine, developed entirely in Python, which lowers the barrier for developers and enables rapid experimentation with new features and technologies.
2023
  • [2023/12] TurboMind supports multimodal input. Gradio Demo
  • [2023/11] TurboMind supports loading HF models directly. Click here for details.
  • [2023/11] Major TurboMind upgrades, including: Paged Attention, faster attention kernels without sequence-length limits, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and Python specialist. Click here for the deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.

  • Effective Quantization: LMDeploy supports weight-only and k/v quantization, and its 4-bit inference performance is 2.4x higher than FP16 (see the sketch after this list). The quantization quality has been confirmed via OpenCompass evaluation.

  • Effortless Distribution Server: Leveraging the request distribution service, LMDeploy facilitates easy and efficient deployment of multi-model services across multiple machines and GPU cards.

  • Interactive Inference Mode: By caching the k/v of attention during multi-round dialogue, the engine remembers dialogue history and thus avoids reprocessing historical sessions.
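As a concrete illustration of the 4-bit quantization path mentioned above, here is a minimal sketch of running AWQ inference through the pipeline API. The TurbomindEngineConfig option, the backend_config parameter, and the lmdeploy/llama2-chat-7b-w4 model id are assumptions based on this release; consult the documentation for the authoritative names.

from lmdeploy import pipeline, TurbomindEngineConfig

# Minimal sketch: 4-bit AWQ inference. The model id below is illustrative;
# LMDeploy publishes ready-to-use 4-bit models on the HuggingFace Hub.
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline('lmdeploy/llama2-chat-7b-w4', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))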

Performance

(Figure: v0.1.0 benchmark results)

For detailed inference benchmarks on more devices and under more settings, please refer to the following links:

  • A100
  • V100
  • 4090
  • 3090
  • 2080

Supported Models

Model Size
Llama 7B - 65B
Llama2 7B - 70B
InternLM 7B - 20B
InternLM2 7B - 20B
InternLM-XComposer 7B
QWen 7B - 72B
QWen-VL 7B
Baichuan 7B - 13B
Baichuan2 7B - 13B
Code Llama 7B - 34B
ChatGLM2 6B
Falcon 7B - 180B
YI 6B - 34B

LMDeploy has developed two inference engines - TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to decrease the barriers for developers.

They differ in the types of supported models and the inference data types. Please refer to this table for each engine's capabilities and choose the one that best fits your needs.
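As a rough sketch of choosing between the two engines with the pipeline API (the config class names and the backend_config parameter follow this release's Python API, but treat the exact fields as assumptions and consult the table above):

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind backend: optimized CUDA kernels, blocked KV cache, tensor parallelism.
tm_pipe = pipeline('internlm/internlm-chat-7b',
                   backend_config=TurbomindEngineConfig(tp=1))

# PyTorch backend: pure-Python engine, easier to read, modify, and extend.
pt_pipe = pipeline('internlm/internlm-chat-7b',
                   backend_config=PytorchEngineConfig(tp=1))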

Quick Start

Installation

Install lmdeploy with pip (Python 3.8+) or from source:

pip install lmdeploy

The default prebuilt package is built with CUDA 11.8. If you require CUDA 12+, you can install lmdeploy with:

export LMDEPLOY_VERSION=0.2.0
export PYTHON_VERSION=38
# Install the wheel matching your Python version directly from the GitHub release assets
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl

Offline Batch Inference

import lmdeploy

# Build an inference pipeline from a HuggingFace model id; the model is
# downloaded on first use.
pipe = lmdeploy.pipeline("internlm/internlm-chat-7b")
# Multiple prompts are batched through the engine in a single call.
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
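Sampling behavior can be tuned per call with a generation config. A minimal sketch, assuming the GenerationConfig class and the gen_config keyword from this release's pipeline API:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm-chat-7b')
# Cap the response length and soften the sampling distribution.
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
response = pipe(['Hi, pls intro yourself'], gen_config=gen_config)
print(response)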

[!NOTE] By default, LMDeploy downloads models from HuggingFace. If you would like to use models from ModelScope, please install ModelScope with pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True
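Equivalently, the variable can be set from Python, as long as it is set before the model is loaded. A small sketch; the ModelScope model id below is illustrative:

import os

# Must be set before LMDeploy resolves the model source.
os.environ['LMDEPLOY_USE_MODELSCOPE'] = 'True'

import lmdeploy
pipe = lmdeploy.pipeline('Shanghai_AI_Laboratory/internlm-chat-7b')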

For more information about the inference pipeline, please refer to here.

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials.

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guidelines.

Acknowledgement

License

This project is released under the Apache 2.0 license.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

lmdeploy-0.2.4-cp311-cp311-win_amd64.whl (64.0 MB)

Uploaded: CPython 3.11, Windows x86-64

lmdeploy-0.2.4-cp311-cp311-manylinux2014_x86_64.whl (94.9 MB)

Uploaded: CPython 3.11

lmdeploy-0.2.4-cp310-cp310-win_amd64.whl (64.0 MB)

Uploaded: CPython 3.10, Windows x86-64

lmdeploy-0.2.4-cp310-cp310-manylinux2014_x86_64.whl (94.9 MB)

Uploaded: CPython 3.10

lmdeploy-0.2.4-cp39-cp39-win_amd64.whl (64.0 MB)

Uploaded: CPython 3.9, Windows x86-64

lmdeploy-0.2.4-cp39-cp39-manylinux2014_x86_64.whl (94.9 MB)

Uploaded: CPython 3.9

lmdeploy-0.2.4-cp38-cp38-win_amd64.whl (64.0 MB)

Uploaded: CPython 3.8, Windows x86-64

lmdeploy-0.2.4-cp38-cp38-manylinux2014_x86_64.whl (94.9 MB)

Uploaded: CPython 3.8

File details

Details for the file lmdeploy-0.2.4-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 64.0 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.2.4-cp311-cp311-win_amd64.whl

  SHA256:      fc77779ede40b94a60208c640ccaaf338e6c62915326790260cbc7f7399ada8d
  MD5:         6b518db2ecbf194965e475121dfd7b5b
  BLAKE2b-256: 3c270a189f658f5ec0e031834a90808d3477e534b73043dde7f3cfd126446a91

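If you want to check a downloaded wheel against the digests listed on this page, a short self-contained verification sketch (adjust the path to wherever the wheel was saved):

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file in chunks so large wheels need not fit in memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

expected = 'fc77779ede40b94a60208c640ccaaf338e6c62915326790260cbc7f7399ada8d'
assert sha256_of('lmdeploy-0.2.4-cp311-cp311-win_amd64.whl') == expected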

File details

Details for the file lmdeploy-0.2.4-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp311-cp311-manylinux2014_x86_64.whl
  • Size: 94.9 MB
  • Tags: CPython 3.11

File hashes

Hashes for lmdeploy-0.2.4-cp311-cp311-manylinux2014_x86_64.whl

  SHA256:      d3f5d8a932f7b8e18d28c2c2856ab9fb9d37b1dad1751c400807b1d2e37c1d9d
  MD5:         1e830d0e7e9d00da60d9c41f73041aa7
  BLAKE2b-256: 3764d38d4c623f049125d2cb61bc44f56dd7c3ef218d9bca66de5d3404d2b267


File details

Details for the file lmdeploy-0.2.4-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 64.0 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.2.4-cp310-cp310-win_amd64.whl

  SHA256:      9c3f0a091e5c7d0856bc1a08e1238bfed0658451df4a263f947882d3a5e5de2a
  MD5:         87cf4cd3724592bbcab7c54d4a08c8d9
  BLAKE2b-256: 58470a0c384462d9bef2309ca7e1f35231305b695ad6a7124b380c4158124087


File details

Details for the file lmdeploy-0.2.4-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp310-cp310-manylinux2014_x86_64.whl
  • Size: 94.9 MB
  • Tags: CPython 3.10

File hashes

Hashes for lmdeploy-0.2.4-cp310-cp310-manylinux2014_x86_64.whl

  SHA256:      7d84e6ac082a5473784cb1dd5453943d8ee28d8b1527d8660d2a7f0272e7d0af
  MD5:         ae5f85caa8f85e83e2562e5d4bbc9f04
  BLAKE2b-256: 1a2fd36ba558b9a546446ca0ccdb4abb3189fbfb3a0ef5dfbce1a4a572a18d57


File details

Details for the file lmdeploy-0.2.4-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 64.0 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.2.4-cp39-cp39-win_amd64.whl

  SHA256:      4b0de7c8ac52e83520beab58a2a46793d8310e122547a2f3327fdd7eeb247188
  MD5:         376467bf587697e73eb71ff4173f177b
  BLAKE2b-256: d7ce62d566dadc45555a234698151b814771317b699ff4837b4fb3b71aeb36a4


File details

Details for the file lmdeploy-0.2.4-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp39-cp39-manylinux2014_x86_64.whl
  • Size: 94.9 MB
  • Tags: CPython 3.9

File hashes

Hashes for lmdeploy-0.2.4-cp39-cp39-manylinux2014_x86_64.whl

  SHA256:      be9e019de923167ec6ef4fde87c8b1da2ad2fcb2a6059e04b8ae64d04f069232
  MD5:         ea52bd6b0875e1258af5795ce9deac45
  BLAKE2b-256: 54ceb7dd12a5d2fdc254e6e0243be5786c672049284d945093c0dfa144e798d8


File details

Details for the file lmdeploy-0.2.4-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 64.0 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.2.4-cp38-cp38-win_amd64.whl

  SHA256:      bd93358c6ff2823ec399ea1f60c1387b86daa90ee9942103c3731a19edddc73e
  MD5:         21b91bde96872ddf145279b725cf4f1a
  BLAKE2b-256: 67453c3d15da41dee01b01a2d1e46d9715c691d0f7dd164cebb32a3abc72ffc0


File details

Details for the file lmdeploy-0.2.4-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp38-cp38-manylinux2014_x86_64.whl
  • Size: 94.9 MB
  • Tags: CPython 3.8

File hashes

Hashes for lmdeploy-0.2.4-cp38-cp38-manylinux2014_x86_64.whl

  SHA256:      8c9e26b7881f45fa00ffe9f5cbd35741f6e024191b44059957fae9c78f50ed80
  MD5:         6c1328e701b09567cec0b4afbd55aa88
  BLAKE2b-256: e514940fe359dc581f41223fb68bfb4d911dcaf82a826fe16429e0a9bd6fbe75

