
A toolset for compressing, deploying and serving LLMs


English | 简体中文

👋 join us on Twitter, Discord and WeChat


News 🎉

  • [2023/11] TurboMind supports loading HF models directly. Click here for details.
  • [2023/11] Major TurboMind upgrades, including: Paged Attention, faster attention kernels with no sequence-length limit, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference Engine (TurboMind): Based on FasterTransformer, we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variants on NVIDIA GPUs.

  • Interactive Inference Mode: By caching the attention k/v during multi-round dialogues, the engine remembers the dialogue history and avoids re-processing historical sessions.

  • Multi-GPU Model Deployment and Quantization: We provide comprehensive model deployment and quantization support, validated on models of different scales.

  • Persistent Batch Inference: Further optimization of model execution efficiency.
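
The persistent-batch idea (often called continuous batching) can be sketched as a toy scheduler that refills free batch slots from a request queue as soon as running requests finish, so the GPU batch never drains between requests. The class and method names below are illustrative only, not LMDeploy's API.

```python
from collections import deque

class PersistentBatchScheduler:
    """Toy continuous-batching loop: free slots are refilled from a
    pending queue as running requests finish (illustrative only)."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.pending = deque()   # requests waiting for a slot
        self.running = []        # [request_id, remaining_decode_steps]

    def submit(self, request_id, num_steps):
        self.pending.append([request_id, num_steps])

    def step(self):
        # Admit new requests into any free slots before decoding.
        while self.pending and len(self.running) < self.max_batch_size:
            self.running.append(self.pending.popleft())
        finished = []
        for req in self.running:
            req[1] -= 1          # decode one token for every active request
            if req[1] == 0:
                finished.append(req[0])
        self.running = [r for r in self.running if r[1] > 0]
        return finished
```

Note how a short request ("a" below) exits the batch and its slot is immediately reused, instead of waiting for the longest request in a static batch to finish.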


Supported Models

LMDeploy has two inference backends, PyTorch and TurboMind. You can run lmdeploy list to check the supported model names.

TurboMind

Note
W4A16 inference requires Nvidia GPU with Ampere architecture or above.

| Models       | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| ------------ | --------------- | ---- | ------- | ----- | ---- |
| Llama        | Yes             | Yes  | Yes     | Yes   | No   |
| Llama2       | Yes             | Yes  | Yes     | Yes   | No   |
| SOLAR        | Yes             | Yes  | Yes     | Yes   | No   |
| InternLM-7B  | Yes             | Yes  | Yes     | Yes   | No   |
| InternLM-20B | Yes             | Yes  | Yes     | Yes   | No   |
| QWen-7B      | Yes             | Yes  | Yes     | Yes   | No   |
| QWen-14B     | Yes             | Yes  | Yes     | Yes   | No   |
| Baichuan-7B  | Yes             | Yes  | Yes     | Yes   | No   |
| Baichuan2-7B | Yes             | Yes  | Yes     | Yes   | No   |
| Code Llama   | Yes             | Yes  | No      | No    | No   |

PyTorch

| Models      | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| ----------- | --------------- | ---- | ------- | ----- | ---- |
| Llama       | Yes             | Yes  | No      | No    | No   |
| Llama2      | Yes             | Yes  | No      | No    | No   |
| InternLM-7B | Yes             | Yes  | No      | No    | No   |

Performance

Case I: output token throughput with fixed input and output token numbers (1, 2048)

Case II: request throughput with real conversation data

Test setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and up to 2.3x that of huggingface transformers. The request throughput of TurboMind is 30% higher than vLLM's.


Quick Start

Installation

Install lmdeploy with pip (Python 3.8+) or from source:

pip install lmdeploy

Note
pip install lmdeploy only installs the packages required at runtime. To run code from modules like lmdeploy.lite and lmdeploy.serve, install the corresponding extras. For instance, pip install lmdeploy[lite] installs the extra dependencies for the lmdeploy.lite module.

  • all: Install lmdeploy with all dependencies in requirements.txt
  • lite: Install lmdeploy with extra dependencies in requirements/lite.txt
  • serve: Install lmdeploy with dependencies in requirements/serve.txt

Deploy InternLM

To use the TurboMind inference engine, you must first convert the model into the TurboMind format. Currently, we support both online and offline conversion. With online conversion, TurboMind loads the Huggingface model directly; with offline conversion, you save the converted model first and then use it.

The following uses internlm/internlm-chat-7b as an example to show how to use TurboMind with online conversion. Refer to load_hf.md for other methods.

Inference by TurboMind

lmdeploy chat turbomind internlm/internlm-chat-7b --model-name internlm-chat-7b

Note
The internlm/internlm-chat-7b model will be downloaded to the .cache folder. You can also use a local path here.

Note
When inferring with FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory on TurboMind.
It is recommended to use NVIDIA cards such as the 3090, V100, or A100. Disabling GPU ECC can free up about 10% of memory: run sudo nvidia-smi --ecc-config=0 and reboot the system.
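
The memory figure is dominated by the FP16 weights at 2 bytes per parameter; a back-of-the-envelope check (assuming roughly 7B parameters, with the KV cache and workspace accounting for the remainder):

```python
def fp16_weight_gib(num_params_billion):
    """Approximate GiB occupied by FP16 weights: 2 bytes per parameter."""
    return num_params_billion * 1e9 * 2 / 1024**3

# ~13 GiB of weights for a 7B model; the KV cache and workspace
# make up the rest of the ~15.7 GB total quoted above.
print(round(fp16_weight_gib(7), 1))
```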

Note
Tensor parallelism is available for inference on multiple GPUs. Add --tp=<num_gpu> to the chat command to enable runtime TP.
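
The idea behind --tp can be illustrated with a toy column-parallel matmul: each rank holds a column shard of a weight matrix, computes a partial output, and concatenating the partials (an all-gather in practice) reproduces the full result. The helper names are hypothetical, not LMDeploy internals.

```python
def split_columns(matrix, tp):
    """Partition a weight matrix column-wise across tp ranks (toy version)."""
    cols = len(matrix[0])
    shard = cols // tp
    return [[row[r * shard:(r + 1) * shard] for row in matrix]
            for r in range(tp)]

def column_parallel_matmul(x, shards):
    """Each 'rank' multiplies x against its shard; concatenated partial
    outputs equal the full x @ W (an all-gather in a real engine)."""
    out = []
    for shard in shards:
        out.extend(sum(xi * shard[i][j] for i, xi in enumerate(x))
                   for j in range(len(shard[0])))
    return out
```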

Serving with gradio

# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio internlm/internlm-chat-7b --model-name internlm-chat-7b

Serving with Restful API

Launch inference server by:

# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server internlm/internlm-chat-7b --model-name internlm-chat-7b --instance_num 32 --tp 1

Then you can communicate with it from the command line,

# api_server_url is the URL printed by api_server, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url

or webui,

# api_server_url is the URL printed by api_server, e.g. http://localhost:23333
# server_name and server_port here are for the gradio UI
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}

Refer to restful_api.md for more details.
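
A client can also be written directly against the HTTP interface. The endpoint path and JSON field names below are assumptions for illustration only; consult restful_api.md for the actual schema of your lmdeploy version.

```python
import json
import urllib.request

API_SERVER_URL = "http://localhost:23333"  # the URL printed by api_server

def build_payload(prompt, session_id=1, request_output_len=64):
    """Assemble the request body; field names are illustrative --
    check restful_api.md for the exact schema."""
    return {
        "prompt": prompt,
        "session_id": session_id,
        "request_output_len": request_output_len,
    }

def chat(prompt):
    # "/generate" is an assumed endpoint path, not a documented one here.
    req = urllib.request.Request(
        API_SERVER_URL + "/generate",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```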

Inference with PyTorch

For detailed instructions on inference with PyTorch models, see here.

Single GPU

lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0

Tensor Parallel with DeepSpeed

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0

You need to install deepspeed first to use this feature.

pip install deepspeed

Quantization

Weight INT4 Quantization

LMDeploy uses the AWQ algorithm for model weight quantization.

Click here to view the test results for weight int4 usage.
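
In spirit, W4A16 stores weights as 4-bit integers with one floating-point scale per group and dequantizes on the fly during the FP16 matmul. Below is a much-simplified symmetric group quantizer; real AWQ additionally rescales salient channels using activation statistics, which this sketch omits.

```python
def quant_int4_group(weights, group_size=128):
    """Symmetric 4-bit group quantization (simplified; AWQ also applies
    activation-aware per-channel scaling, omitted here)."""
    qweights, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # int4 range is -8..7; map the max magnitude to +/-7.
        scale = (max(abs(w) for w in group) / 7) or 1.0
        qweights.extend(max(-8, min(7, round(w / scale))) for w in group)
        scales.append(scale)
    return qweights, scales

def dequant_int4_group(qweights, scales, group_size=128):
    """Recover approximate FP weights from int4 codes and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]
```

Each group of 128 weights thus costs 4 bits per weight plus one shared scale, roughly a 4x size reduction versus FP16.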

KV Cache INT8 Quantization

Click here to view the usage method, implementation formula, and test results for kv int8.
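
KV INT8 applies the same idea to the cached keys and values. A minimal symmetric sketch over one cache slice follows; real kernels quantize per head/channel and fuse the conversion into the attention kernels, so take this as an illustration of the arithmetic only.

```python
def quant_kv_int8(values):
    """Symmetric INT8 quantization of one KV-cache slice (illustrative;
    production kernels quantize per head/channel)."""
    scale = (max(abs(v) for v in values) / 127) or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequant_kv_int8(q, scale):
    """Recover approximate FP values before the attention matmul."""
    return [qi * scale for qi in q]
```

Halving each cached element from 2 bytes to 1 roughly doubles the number of tokens the KV cache can hold in the same memory budget.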

Warning
Runtime tensor parallelism for quantized models is not available. Please set --tp at deployment time to enable static TP.

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

License

This project is released under the Apache 2.0 license.
