
A toolset for compressing, deploying and serving LLMs


English | 简体中文

👋 join us on Twitter, Discord and WeChat


News 🎉

  • [2023/11] TurboMind supports loading HF models directly. Click here for details.
  • [2023/11] Major TurboMind upgrades, including: Paged Attention, faster attention kernels with no sequence-length limit, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference Engine (TurboMind): Based on FasterTransformer, we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variants on NVIDIA GPUs.

  • Interactive Inference Mode: By caching the attention k/v during multi-round dialogues, the engine remembers the dialogue history and avoids re-processing historical sessions.

  • Multi-GPU Model Deployment and Quantization: We provide comprehensive model deployment and quantization support, validated on models of different scales.

  • Persistent Batch Inference: Further optimization of model execution efficiency.
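
The persistent-batch idea (often called continuous batching) can be sketched as a toy scheduler that refills free batch slots from a request queue as soon as running requests finish, so the GPU batch never drains between requests. The class and method names below are illustrative only, not LMDeploy's API.

```python
from collections import deque

class PersistentBatchScheduler:
    """Toy continuous-batching loop: free slots are refilled from a
    pending queue as running requests finish (illustrative only)."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.pending = deque()   # requests waiting for a slot
        self.running = []        # [request_id, remaining_decode_steps]

    def submit(self, request_id, num_steps):
        self.pending.append([request_id, num_steps])

    def step(self):
        # Admit new requests into any free slots before decoding.
        while self.pending and len(self.running) < self.max_batch_size:
            self.running.append(self.pending.popleft())
        finished = []
        for req in self.running:
            req[1] -= 1          # decode one token for every active request
            if req[1] == 0:
                finished.append(req[0])
        self.running = [r for r in self.running if r[1] > 0]
        return finished
```

Note how a short request ("a" below) exits the batch and its slot is immediately reused, instead of waiting for the longest request in a static batch to finish.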


Supported Models

LMDeploy has two inference backends, PyTorch and TurboMind. You can run lmdeploy list to check the supported model names.

TurboMind

Note
W4A16 inference requires Nvidia GPU with Ampere architecture or above.

| Models       | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| ------------ | --------------- | ---- | ------- | ----- | ---- |
| Llama        | Yes             | Yes  | Yes     | Yes   | No   |
| Llama2       | Yes             | Yes  | Yes     | Yes   | No   |
| SOLAR        | Yes             | Yes  | Yes     | Yes   | No   |
| InternLM-7B  | Yes             | Yes  | Yes     | Yes   | No   |
| InternLM-20B | Yes             | Yes  | Yes     | Yes   | No   |
| QWen-7B      | Yes             | Yes  | Yes     | Yes   | No   |
| QWen-14B     | Yes             | Yes  | Yes     | Yes   | No   |
| Baichuan-7B  | Yes             | Yes  | Yes     | Yes   | No   |
| Baichuan2-7B | Yes             | Yes  | Yes     | Yes   | No   |
| Code Llama   | Yes             | Yes  | No      | No    | No   |

PyTorch

| Models      | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| ----------- | --------------- | ---- | ------- | ----- | ---- |
| Llama       | Yes             | Yes  | No      | No    | No   |
| Llama2      | Yes             | Yes  | No      | No    | No   |
| InternLM-7B | Yes             | Yes  | No      | No    | No   |

Performance

Case I: output token throughput with fixed input and output token numbers (1, 2048)

Case II: request throughput with real conversation data

Test setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and up to 2.3x that of huggingface transformers. The request throughput of TurboMind is 30% higher than vLLM's.


Quick Start

Installation

Install lmdeploy with pip (Python 3.8+) or from source:

pip install lmdeploy

Note
pip install lmdeploy only installs the packages required at runtime. To run code from modules like lmdeploy.lite and lmdeploy.serve, install the corresponding extras. For instance, pip install lmdeploy[lite] installs the extra dependencies for the lmdeploy.lite module.

  • all: Install lmdeploy with all dependencies in requirements.txt
  • lite: Install lmdeploy with extra dependencies in requirements/lite.txt
  • serve: Install lmdeploy with dependencies in requirements/serve.txt

Deploy InternLM

To use the TurboMind inference engine, you must first convert the model into the TurboMind format. Currently, we support both online and offline conversion. With online conversion, TurboMind loads the Huggingface model directly; with offline conversion, you save the converted model first and then use it.

The following uses internlm/internlm-chat-7b as an example to show how to use TurboMind with online conversion. Refer to load_hf.md for other methods.

Inference by TurboMind

lmdeploy chat turbomind internlm/internlm-chat-7b --model-name internlm-chat-7b

Note
The internlm/internlm-chat-7b model will be downloaded to the .cache folder. You can also use a local path here.

Note
When inferring with FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory on TurboMind.
It is recommended to use NVIDIA cards such as the 3090, V100, or A100. Disabling GPU ECC can free up about 10% of memory: run sudo nvidia-smi --ecc-config=0 and reboot the system.
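
The memory figure is dominated by the FP16 weights at 2 bytes per parameter; a back-of-the-envelope check (assuming roughly 7B parameters, with the KV cache and workspace accounting for the remainder):

```python
def fp16_weight_gib(num_params_billion):
    """Approximate GiB occupied by FP16 weights: 2 bytes per parameter."""
    return num_params_billion * 1e9 * 2 / 1024**3

# ~13 GiB of weights for a 7B model; the KV cache and workspace
# make up the rest of the ~15.7 GB total quoted above.
print(round(fp16_weight_gib(7), 1))
```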

Note
Tensor parallelism is available for inference on multiple GPUs. Add --tp=<num_gpu> to the chat command to enable runtime TP.
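
The idea behind --tp can be illustrated with a toy column-parallel matmul: each rank holds a column shard of a weight matrix, computes a partial output, and concatenating the partials (an all-gather in practice) reproduces the full result. The helper names are hypothetical, not LMDeploy internals.

```python
def split_columns(matrix, tp):
    """Partition a weight matrix column-wise across tp ranks (toy version)."""
    cols = len(matrix[0])
    shard = cols // tp
    return [[row[r * shard:(r + 1) * shard] for row in matrix]
            for r in range(tp)]

def column_parallel_matmul(x, shards):
    """Each 'rank' multiplies x against its shard; concatenated partial
    outputs equal the full x @ W (an all-gather in a real engine)."""
    out = []
    for shard in shards:
        out.extend(sum(xi * shard[i][j] for i, xi in enumerate(x))
                   for j in range(len(shard[0])))
    return out
```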

Serving with gradio

# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio internlm/internlm-chat-7b --model-name internlm-chat-7b

Serving with Restful API

Launch inference server by:

# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server internlm/internlm-chat-7b --model-name internlm-chat-7b --instance_num 32 --tp 1

Then you can communicate with it from the command line,

# api_server_url is the URL printed by api_server, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url

or webui,

# api_server_url is the URL printed by api_server, e.g. http://localhost:23333
# server_name and server_port here are for the gradio UI
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}

Refer to restful_api.md for more details.
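
A client can also be written directly against the HTTP interface. The endpoint path and JSON field names below are assumptions for illustration only; consult restful_api.md for the actual schema of your lmdeploy version.

```python
import json
import urllib.request

API_SERVER_URL = "http://localhost:23333"  # the URL printed by api_server

def build_payload(prompt, session_id=1, request_output_len=64):
    """Assemble the request body; field names are illustrative --
    check restful_api.md for the exact schema."""
    return {
        "prompt": prompt,
        "session_id": session_id,
        "request_output_len": request_output_len,
    }

def chat(prompt):
    # "/generate" is an assumed endpoint path, not a documented one here.
    req = urllib.request.Request(
        API_SERVER_URL + "/generate",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```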

Inference with PyTorch

For detailed instructions on inference with PyTorch models, see here.

Single GPU

lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0

Tensor Parallel with DeepSpeed

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0

You need to install deepspeed first to use this feature.

pip install deepspeed

Quantization

Weight INT4 Quantization

LMDeploy uses the AWQ algorithm for model weight quantization.

Click here to view the test results for weight int4 usage.
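
In spirit, W4A16 stores weights as 4-bit integers with one floating-point scale per group and dequantizes on the fly during the FP16 matmul. Below is a much-simplified symmetric group quantizer; real AWQ additionally rescales salient channels using activation statistics, which this sketch omits.

```python
def quant_int4_group(weights, group_size=128):
    """Symmetric 4-bit group quantization (simplified; AWQ also applies
    activation-aware per-channel scaling, omitted here)."""
    qweights, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # int4 range is -8..7; map the max magnitude to +/-7.
        scale = (max(abs(w) for w in group) / 7) or 1.0
        qweights.extend(max(-8, min(7, round(w / scale))) for w in group)
        scales.append(scale)
    return qweights, scales

def dequant_int4_group(qweights, scales, group_size=128):
    """Recover approximate FP weights from int4 codes and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]
```

Each group of 128 weights thus costs 4 bits per weight plus one shared scale, roughly a 4x size reduction versus FP16.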

KV Cache INT8 Quantization

Click here to view the usage method, implementation formula, and test results for kv int8.
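
KV INT8 applies the same idea to the cached keys and values. A minimal symmetric sketch over one cache slice follows; real kernels quantize per head/channel and fuse the conversion into the attention kernels, so take this as an illustration of the arithmetic only.

```python
def quant_kv_int8(values):
    """Symmetric INT8 quantization of one KV-cache slice (illustrative;
    production kernels quantize per head/channel)."""
    scale = (max(abs(v) for v in values) / 127) or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequant_kv_int8(q, scale):
    """Recover approximate FP values before the attention matmul."""
    return [qi * scale for qi in q]
```

Halving each cached element from 2 bytes to 1 roughly doubles the number of tokens the KV cache can hold in the same memory budget.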

Warning
Runtime tensor parallelism for quantized models is not available. Please set --tp at deployment time to enable static TP.

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

License

This project is released under the Apache 2.0 license.
