
A toolset for compressing, deploying, and serving LLMs

Project description


English | 简体中文

👋 Join us on Twitter, Discord, and WeChat


News 🎉

  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and Python specialist. Click here for the deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports FlashAttention-2
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling, and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀. Check this guide for details
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models
  • [2023/08] LMDeploy supports 4-bit quantization with the AWQ algorithm
  • [2023/07] TurboMind supports Llama-2 70B with GQA
  • [2023/07] TurboMind supports Llama-2 7B/13B
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM
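
Dynamic NTK-RoPE scaling, mentioned in the news above, extends a model's usable context by stretching the rotary-embedding frequency base. The sketch below is a toy illustration of the commonly cited NTK-aware formula (base' = base * alpha^(dim/(dim-2))), not LMDeploy's actual implementation; the function name and parameters are illustrative.

```python
import math

def rope_inv_freq(dim, base=10000.0, ntk_alpha=1.0):
    """Inverse frequencies for rotary position embeddings (toy sketch).

    NTK-aware scaling stretches the base so low frequencies are
    interpolated while high frequencies stay nearly intact:
        base' = base * alpha ** (dim / (dim - 2))
    With alpha = 1 this reduces to standard RoPE.
    """
    scaled_base = base * ntk_alpha ** (dim / (dim - 2))
    return [scaled_base ** (-2 * i / dim) for i in range(dim // 2)]

std = rope_inv_freq(128)
stretched = rope_inv_freq(128, ntk_alpha=4.0)
# A larger base lowers the smallest frequency, extending the context window.
assert stretched[-1] < std[-1]
```

The "dynamic" variant computes alpha at runtime from the current sequence length instead of fixing it ahead of time.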

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference Engine (TurboMind): Based on FasterTransformer, we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.

  • Interactive Inference Mode: By caching the attention k/v during multi-round dialogues, the engine remembers the dialogue history and avoids reprocessing historical sessions.

  • Multi-GPU Model Deployment and Quantization: We provide comprehensive support for model deployment and quantization, validated at different scales.

  • Persistent Batch Inference: Incoming requests are batched continuously as they arrive, further improving model execution efficiency.
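
The interactive inference mode above can be pictured with a toy session object: per-session state (here, the token history standing in for the cached attention k/v) persists between rounds, so each round only processes the newly added tokens. The class and method names are illustrative, not LMDeploy's API.

```python
class Session:
    """Toy model of interactive inference with a persistent k/v cache."""

    def __init__(self):
        self.cache = []   # stands in for the cached k/v of past tokens
        self.steps = 0    # tokens actually (re)processed by the "model"

    def chat(self, new_tokens):
        # Only the new tokens are processed; cached history is reused.
        self.steps += len(new_tokens)
        self.cache.extend(new_tokens)
        return len(self.cache)

s = Session()
s.chat([1, 2, 3])     # round 1: 3 tokens processed
s.chat([4, 5])        # round 2: 2 tokens processed, history reused
assert s.steps == 5   # without caching, round 2 would reprocess all 5
```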


Supported Models

LMDeploy has two inference backends, PyTorch and TurboMind.

TurboMind

Note
W4A16 inference requires an NVIDIA GPU with the Ampere architecture or above.

| Models       | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :----------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama        | Yes             | Yes  | Yes     | Yes   | No   |
| Llama2       | Yes             | Yes  | Yes     | Yes   | No   |
| InternLM-7B  | Yes             | Yes  | Yes     | Yes   | No   |
| InternLM-20B | Yes             | Yes  | Yes     | Yes   | No   |
| QWen-7B      | Yes             | Yes  | Yes     | No    | No   |
| QWen-14B     | Yes             | Yes  | Yes     | No    | No   |
| Baichuan-7B  | Yes             | Yes  | Yes     | Yes   | No   |
| Baichuan2-7B | Yes             | Yes  | No      | No    | No   |
| Code Llama   | Yes             | Yes  | No      | No    | No   |

PyTorch

| Models      | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :---------: | :-------------: | :--: | :-----: | :---: | :--: |
| Llama       | Yes             | Yes  | No      | No    | No   |
| Llama2      | Yes             | Yes  | No      | No    | No   |
| InternLM-7B | Yes             | Yes  | No      | No    | No   |

Performance

Case I: output token throughput with fixed input token and output token number (1, 2048)

Case II: request throughput with real conversation data

Test setting: LLaMA-7B, NVIDIA A100 (80G)

TurboMind's output token throughput exceeds 2000 tokens/s, about 5%-15% higher than DeepSpeed overall, and outperforms Hugging Face Transformers by up to 2.3x. TurboMind's request throughput is 30% higher than vLLM's.


Quick Start

Installation

Install lmdeploy with pip (Python 3.8+) or from source:

pip install lmdeploy

Deploy InternLM

Get InternLM model

# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b-v1_1 /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b

Inference by TurboMind

python -m lmdeploy.turbomind.chat ./workspace

Note
When running inference at FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory with TurboMind.
It is recommended to use NVIDIA cards such as the 3090, V100, or A100. Disabling GPU ECC can free up about 10% of memory; try sudo nvidia-smi --ecc-config=0 and reboot the system.
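
The 15.7 GB figure is roughly what back-of-the-envelope arithmetic predicts; a quick check (assumed, rounded numbers):

```python
# FP16 weights alone for a 7B-parameter model:
params = 7_000_000_000
weights_gib = params * 2 / 2**30   # 2 bytes per FP16 value
print(round(weights_gib, 1))       # prints 13.0 (GiB)
# The remaining ~2-3 GB is k/v cache, activations, and CUDA runtime
# overhead, which grow with batch size and session length.
```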

Note
Tensor parallelism is available for inference on multiple GPUs. Add --tp=<num_gpu> to the chat command to enable runtime TP.

Serving with gradio

python3 -m lmdeploy.serve.gradio.app ./workspace

Serving with Restful API

Launch inference server by:

python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1

Then, you can communicate with it by command line,

# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url

or webui,

# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip server_port --restful_api True

Refer to restful_api.md for more details.
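
The module path (lmdeploy.serve.openai.api_server) suggests an OpenAI-style HTTP interface. Below is a minimal client sketch using only the standard library; the endpoint path, model name, and request fields are assumptions based on that convention, so check restful_api.md for the real schema before relying on them.

```python
import json
import urllib.request

# Assumed OpenAI-style payload; field names are illustrative.
payload = {
    "model": "internlm-chat-7b",
    "messages": [{"role": "user", "content": "Hi"}],
}
req = urllib.request.Request(
    "http://localhost:23333/v1/chat/completions",  # assumed endpoint
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Sending requires a running api_server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```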

Serving with Triton Inference Server

Launch inference server by:

bash workspace/service_docker_up.sh

Then, you can communicate with the inference server by command line,

python3 -m lmdeploy.serve.client {server_ip_address}:33337

or webui,

python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337

For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna, and so on, you can find the guide here.

Inference with PyTorch

For detailed instructions on inference with PyTorch models, see here.

Single GPU

python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
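
The --temperature and --top_p flags control how the next token is drawn from the model's output distribution. The sketch below is a self-contained toy re-implementation of temperature scaling plus nucleus (top-p) sampling, not LMDeploy's code; function and parameter names are illustrative.

```python
import math
import random

def top_p_sample(logits, temperature=0.8, top_p=0.95, rng=None):
    """Toy nucleus sampling: temperature-scale logits, softmax, keep the
    smallest set of tokens whose cumulative probability reaches top_p,
    then draw one token from that renormalized set."""
    rng = rng or random.Random(0)
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the highest-probability tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Draw one token from the renormalized nucleus.
    mass = sum(probs[i] for i in kept)
    r, acc = rng.random() * mass, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

token = top_p_sample([2.0, 1.0, 0.1, -3.0])
assert token in range(4)
```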

Tensor Parallel with DeepSpeed

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0

You need to install deepspeed first to use this feature.

pip install deepspeed

Quantization

Weight INT4 Quantization

LMDeploy uses the AWQ algorithm for model weight quantization.

Click here to view the test results for weight int4 usage.
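
As a rough illustration of what W4A16 storage means, here is a toy group-wise symmetric INT4 quantizer and its dequantizer. AWQ itself additionally rescales salient weight channels using activation statistics before quantizing; that step is omitted, and all names and the group size are illustrative.

```python
def quant_int4_groupwise(weights, group_size=128):
    """Toy group-wise symmetric INT4 quantization (values in [-8, 7]).
    One FP16 scale is stored per group of weights."""
    qs, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        qs.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qs, scales

def dequant(qs, scales, group_size=128):
    """Recover approximate FP values from INT4 codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qs)]

w = [0.05, -0.12, 0.31, -0.44, 0.07, 0.2, -0.3, 0.01]
q, s = quant_int4_groupwise(w, group_size=4)
w2 = dequant(q, s, group_size=4)
err = max(abs(a - b) for a, b in zip(w, w2))
assert err <= max(s) / 2 + 1e-9   # error bounded by half a quantization step
```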

KV Cache INT8 Quantization

Click here to view the usage method, implementation formula, and test results for kv int8.
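
The core of KV cache INT8 quantization is mapping FP16 k/v values onto 8-bit integers with a scale factor. The sketch below shows the simplest symmetric form (scale = max|x| / 127); the real implementation calibrates scales offline and applies them per layer/head, so treat the function below as an illustrative assumption, not LMDeploy's formula.

```python
def kv_int8(values):
    """Toy symmetric INT8 quantization for a slice of the k/v cache."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

original = [0.8, -1.6, 0.05, 0.0]
q, scale = kv_int8(original)
restored = [v * scale for v in q]          # dequantize on the fly
err = max(abs(a - b) for a, b in zip(original, restored))
assert err < scale   # each value is off by less than one quantization step
```

Halving the bytes per cached value roughly doubles the number of concurrent sessions a given amount of GPU memory can hold.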

Warning
Runtime tensor parallelism for quantized models is not available. Please set --tp at deploy time to enable static TP.

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

License

This project is released under the Apache 2.0 license.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

  • lmdeploy-0.0.11-cp311-cp311-win_amd64.whl (55.8 MB). Uploaded: CPython 3.11, Windows x86-64
  • lmdeploy-0.0.11-cp311-cp311-manylinux2014_x86_64.whl (104.7 MB). Uploaded: CPython 3.11
  • lmdeploy-0.0.11-cp310-cp310-win_amd64.whl (55.8 MB). Uploaded: CPython 3.10, Windows x86-64
  • lmdeploy-0.0.11-cp310-cp310-manylinux2014_x86_64.whl (104.7 MB). Uploaded: CPython 3.10
  • lmdeploy-0.0.11-cp39-cp39-win_amd64.whl (55.8 MB). Uploaded: CPython 3.9, Windows x86-64
  • lmdeploy-0.0.11-cp39-cp39-manylinux2014_x86_64.whl (104.6 MB). Uploaded: CPython 3.9
  • lmdeploy-0.0.11-cp38-cp38-win_amd64.whl (55.8 MB). Uploaded: CPython 3.8, Windows x86-64
  • lmdeploy-0.0.11-cp38-cp38-manylinux2014_x86_64.whl (104.7 MB). Uploaded: CPython 3.8

File details

Details for the file lmdeploy-0.0.11-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.0.11-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 55.8 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.0.11-cp311-cp311-win_amd64.whl
| Algorithm   | Hash digest |
| :---------- | :---------- |
| SHA256      | b24a03bf6acf595d3bd1dd8e1cb4446a8b1ec6329e1d93773d7d954667a1e384 |
| MD5         | b8002574a37d873eaa5f5b1441a7d4ac |
| BLAKE2b-256 | 5aa7e68647c15716aab9e05f4c21a13c1394c0bdf31204249198050c16979c7c |

See more details on using hashes here.

File details

Details for the file lmdeploy-0.0.11-cp311-cp311-manylinux2014_x86_64.whl.

File hashes

Hashes for lmdeploy-0.0.11-cp311-cp311-manylinux2014_x86_64.whl
| Algorithm   | Hash digest |
| :---------- | :---------- |
| SHA256      | 41b90435e0c07cbd35849b2c7646e375372a3c5442de6bf086e3ff845d662555 |
| MD5         | 9ec9f19798503477a088483377f22854 |
| BLAKE2b-256 | 81a6dbb170678083a28788513e26146ba3ac40afc5cb8050464d2d03436c13a6 |

See more details on using hashes here.

File details

Details for the file lmdeploy-0.0.11-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.0.11-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 55.8 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.0.11-cp310-cp310-win_amd64.whl
| Algorithm   | Hash digest |
| :---------- | :---------- |
| SHA256      | dbd7c44fe4bda056f505ce4b8ae55bb50c9d2740cb8ee9b4a7eafbf498169f2c |
| MD5         | 57975cbacdad6dec3fc2e709b0a26e11 |
| BLAKE2b-256 | 0154528efad12ad48ae4642b0a94d9816a36df302090e9637d26cdd5cfa08cfe |

See more details on using hashes here.

File details

Details for the file lmdeploy-0.0.11-cp310-cp310-manylinux2014_x86_64.whl.

File hashes

Hashes for lmdeploy-0.0.11-cp310-cp310-manylinux2014_x86_64.whl
| Algorithm   | Hash digest |
| :---------- | :---------- |
| SHA256      | 6869cd71cab06966ed27404a5ed61d5e129673383a333bda39a3cd3d5fa33981 |
| MD5         | 463bdd398eb2c3abae821a2921c47dc1 |
| BLAKE2b-256 | b118a5c55a8c262b4283eed40ee1ddaa3d47b31b0e8429218f414eb370cebd06 |

See more details on using hashes here.

File details

Details for the file lmdeploy-0.0.11-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.0.11-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 55.8 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.0.11-cp39-cp39-win_amd64.whl
| Algorithm   | Hash digest |
| :---------- | :---------- |
| SHA256      | 2a2696b270d8501e8940c6723e013d69ff2b4ab08f8bbfd55155d38aec38f4bc |
| MD5         | 54011c8cbae61c0a7a8d4c2ad7efa990 |
| BLAKE2b-256 | 3f1253c971a6079e1e0a819095c84108c50bc2f4929063d232541e7bc8bed1d9 |

See more details on using hashes here.

File details

Details for the file lmdeploy-0.0.11-cp39-cp39-manylinux2014_x86_64.whl.

File hashes

Hashes for lmdeploy-0.0.11-cp39-cp39-manylinux2014_x86_64.whl
| Algorithm   | Hash digest |
| :---------- | :---------- |
| SHA256      | 333ab90d1df5e5e5fe91add3bffb6ec25a900399096d34fecce5b03b0569206e |
| MD5         | 4cf316c9e8740bfd0c2b46f144662f1b |
| BLAKE2b-256 | 88c402cd9c52dd441a39d42077f310a5797212db82752889aa3f90b3bbf6023e |

See more details on using hashes here.

File details

Details for the file lmdeploy-0.0.11-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.0.11-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 55.8 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.0.11-cp38-cp38-win_amd64.whl
| Algorithm   | Hash digest |
| :---------- | :---------- |
| SHA256      | 998b4cbc66c15161d05f6e0d2c47fdf17103fa119dd3dd2e95e8565de2076dc4 |
| MD5         | b7a582e94fa64b913d1a256cc8fca05b |
| BLAKE2b-256 | 2ee4de288444e7b39e65118684e3270f42ba4f9fa96f355c8a06512dd033248d |

See more details on using hashes here.

File details

Details for the file lmdeploy-0.0.11-cp38-cp38-manylinux2014_x86_64.whl.

File hashes

Hashes for lmdeploy-0.0.11-cp38-cp38-manylinux2014_x86_64.whl
| Algorithm   | Hash digest |
| :---------- | :---------- |
| SHA256      | 60f0ffdaaede7b58e0bc8507a06541aa73dd2573801598ba2020f527ef84b1e7 |
| MD5         | fa54c11c42647f38c82b6c739cf13da3 |
| BLAKE2b-256 | 6e2f169ac267380b65a9eb8357ec97231d3e8ec336ff1a958dd1ea8637298d36 |

See more details on using hashes here.
