
A toolset for compressing, deploying, and serving LLMs

Project description


English | 简体中文

👋 Join us on Twitter, Discord, and WeChat


Latest News 🎉

2024
  • [2024/01] OpenAOE is now seamlessly integrated with the LMDeploy serving service.
  • [2024/01] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to here.
  • [2024/01] Support for the PyTorch inference engine, developed entirely in Python, which lowers the barrier for developers and enables rapid experimentation with new features and technologies.
2023
  • [2023/12] TurboMind supports multimodal input. Gradio Demo
  • [2023/11] TurboMind supports loading HF models directly. Click here for details.
  • [2023/11] Major TurboMind upgrades, including: Paged Attention, faster attention kernels without sequence-length limits, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and Python specialist. Click here for the deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.

  • Effective Quantization: LMDeploy supports weight-only and k/v quantization, and its 4-bit inference performance is 2.4x higher than FP16 (see the sketch after this list). The quantization quality has been confirmed via OpenCompass evaluation.

  • Effortless Distribution Server: Leveraging the request distribution service, LMDeploy facilitates easy and efficient deployment of multi-model services across multiple machines and GPU cards.

  • Interactive Inference Mode: By caching the k/v of attention during multi-round dialogue, the engine remembers dialogue history and thus avoids reprocessing historical sessions.
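As a concrete illustration of the 4-bit quantization path mentioned above, here is a minimal sketch of running AWQ inference through the pipeline API. The TurbomindEngineConfig option, the backend_config parameter, and the lmdeploy/llama2-chat-7b-w4 model id are assumptions based on this release; consult the documentation for the authoritative names.

from lmdeploy import pipeline, TurbomindEngineConfig

# Minimal sketch: 4-bit AWQ inference. The model id below is illustrative;
# LMDeploy publishes ready-to-use 4-bit models on the HuggingFace Hub.
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline('lmdeploy/llama2-chat-7b-w4', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))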

Performance

(Figure: v0.1.0 benchmark results)

For detailed inference benchmarks on more devices and under more settings, please refer to the following links:

  • A100
  • V100
  • 4090
  • 3090
  • 2080

Supported Models

Model Size
Llama 7B - 65B
Llama2 7B - 70B
InternLM 7B - 20B
InternLM2 7B - 20B
InternLM-XComposer 7B
QWen 7B - 72B
QWen-VL 7B
Baichuan 7B - 13B
Baichuan2 7B - 13B
Code Llama 7B - 34B
ChatGLM2 6B
Falcon 7B - 180B
YI 6B - 34B

LMDeploy has developed two inference engines - TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to decrease the barriers for developers.

They differ in the types of supported models and the inference data types. Please refer to this table for each engine's capabilities and choose the one that best fits your needs.
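As a rough sketch of choosing between the two engines with the pipeline API (the config class names and the backend_config parameter follow this release's Python API, but treat the exact fields as assumptions and consult the table above):

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind backend: optimized CUDA kernels, blocked KV cache, tensor parallelism.
tm_pipe = pipeline('internlm/internlm-chat-7b',
                   backend_config=TurbomindEngineConfig(tp=1))

# PyTorch backend: pure-Python engine, easier to read, modify, and extend.
pt_pipe = pipeline('internlm/internlm-chat-7b',
                   backend_config=PytorchEngineConfig(tp=1))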

Quick Start

Installation

Install lmdeploy with pip (Python 3.8+) or from source:

pip install lmdeploy

The default prebuilt package is built with CUDA 11.8. If you require CUDA 12+, you can install lmdeploy with:

export LMDEPLOY_VERSION=0.2.0
export PYTHON_VERSION=38
# Install the wheel matching your Python version directly from the GitHub release assets
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl

Offline Batch Inference

import lmdeploy

# Build an inference pipeline from a HuggingFace model id; the model is
# downloaded on first use.
pipe = lmdeploy.pipeline("internlm/internlm-chat-7b")
# Multiple prompts are batched through the engine in a single call.
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
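Sampling behavior can be tuned per call with a generation config. A minimal sketch, assuming the GenerationConfig class and the gen_config keyword from this release's pipeline API:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm-chat-7b')
# Cap the response length and soften the sampling distribution.
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
response = pipe(['Hi, pls intro yourself'], gen_config=gen_config)
print(response)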

[!NOTE] By default, LMDeploy downloads models from HuggingFace. If you would like to use models from ModelScope, please install ModelScope with pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True
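Equivalently, the variable can be set from Python, as long as it is set before the model is loaded. A small sketch; the ModelScope model id below is illustrative:

import os

# Must be set before LMDeploy resolves the model source.
os.environ['LMDEPLOY_USE_MODELSCOPE'] = 'True'

import lmdeploy
pipe = lmdeploy.pipeline('Shanghai_AI_Laboratory/internlm-chat-7b')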

For more information about the inference pipeline, please refer to here.

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials.

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guidelines.

Acknowledgement

License

This project is released under the Apache 2.0 license.



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

lmdeploy-0.2.4-cp311-cp311-win_amd64.whl (64.0 MB)

Uploaded: CPython 3.11, Windows x86-64

lmdeploy-0.2.4-cp311-cp311-manylinux2014_x86_64.whl (94.9 MB)

Uploaded: CPython 3.11

lmdeploy-0.2.4-cp310-cp310-win_amd64.whl (64.0 MB)

Uploaded: CPython 3.10, Windows x86-64

lmdeploy-0.2.4-cp310-cp310-manylinux2014_x86_64.whl (94.9 MB)

Uploaded: CPython 3.10

lmdeploy-0.2.4-cp39-cp39-win_amd64.whl (64.0 MB)

Uploaded: CPython 3.9, Windows x86-64

lmdeploy-0.2.4-cp39-cp39-manylinux2014_x86_64.whl (94.9 MB)

Uploaded: CPython 3.9

lmdeploy-0.2.4-cp38-cp38-win_amd64.whl (64.0 MB)

Uploaded: CPython 3.8, Windows x86-64

lmdeploy-0.2.4-cp38-cp38-manylinux2014_x86_64.whl (94.9 MB)

Uploaded: CPython 3.8

File details

Details for the file lmdeploy-0.2.4-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 64.0 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.2.4-cp311-cp311-win_amd64.whl

  SHA256:      fc77779ede40b94a60208c640ccaaf338e6c62915326790260cbc7f7399ada8d
  MD5:         6b518db2ecbf194965e475121dfd7b5b
  BLAKE2b-256: 3c270a189f658f5ec0e031834a90808d3477e534b73043dde7f3cfd126446a91

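If you want to check a downloaded wheel against the digests listed on this page, a short self-contained verification sketch (adjust the path to wherever the wheel was saved):

import hashlib

def sha256_of(path, chunk_size=1 << 20):
    # Stream the file in chunks so large wheels need not fit in memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

expected = 'fc77779ede40b94a60208c640ccaaf338e6c62915326790260cbc7f7399ada8d'
assert sha256_of('lmdeploy-0.2.4-cp311-cp311-win_amd64.whl') == expected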

File details

Details for the file lmdeploy-0.2.4-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp311-cp311-manylinux2014_x86_64.whl
  • Size: 94.9 MB
  • Tags: CPython 3.11

File hashes

Hashes for lmdeploy-0.2.4-cp311-cp311-manylinux2014_x86_64.whl

  SHA256:      d3f5d8a932f7b8e18d28c2c2856ab9fb9d37b1dad1751c400807b1d2e37c1d9d
  MD5:         1e830d0e7e9d00da60d9c41f73041aa7
  BLAKE2b-256: 3764d38d4c623f049125d2cb61bc44f56dd7c3ef218d9bca66de5d3404d2b267


File details

Details for the file lmdeploy-0.2.4-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 64.0 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.2.4-cp310-cp310-win_amd64.whl

  SHA256:      9c3f0a091e5c7d0856bc1a08e1238bfed0658451df4a263f947882d3a5e5de2a
  MD5:         87cf4cd3724592bbcab7c54d4a08c8d9
  BLAKE2b-256: 58470a0c384462d9bef2309ca7e1f35231305b695ad6a7124b380c4158124087


File details

Details for the file lmdeploy-0.2.4-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp310-cp310-manylinux2014_x86_64.whl
  • Size: 94.9 MB
  • Tags: CPython 3.10

File hashes

Hashes for lmdeploy-0.2.4-cp310-cp310-manylinux2014_x86_64.whl

  SHA256:      7d84e6ac082a5473784cb1dd5453943d8ee28d8b1527d8660d2a7f0272e7d0af
  MD5:         ae5f85caa8f85e83e2562e5d4bbc9f04
  BLAKE2b-256: 1a2fd36ba558b9a546446ca0ccdb4abb3189fbfb3a0ef5dfbce1a4a572a18d57


File details

Details for the file lmdeploy-0.2.4-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 64.0 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.2.4-cp39-cp39-win_amd64.whl

  SHA256:      4b0de7c8ac52e83520beab58a2a46793d8310e122547a2f3327fdd7eeb247188
  MD5:         376467bf587697e73eb71ff4173f177b
  BLAKE2b-256: d7ce62d566dadc45555a234698151b814771317b699ff4837b4fb3b71aeb36a4


File details

Details for the file lmdeploy-0.2.4-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp39-cp39-manylinux2014_x86_64.whl
  • Size: 94.9 MB
  • Tags: CPython 3.9

File hashes

Hashes for lmdeploy-0.2.4-cp39-cp39-manylinux2014_x86_64.whl

  SHA256:      be9e019de923167ec6ef4fde87c8b1da2ad2fcb2a6059e04b8ae64d04f069232
  MD5:         ea52bd6b0875e1258af5795ce9deac45
  BLAKE2b-256: 54ceb7dd12a5d2fdc254e6e0243be5786c672049284d945093c0dfa144e798d8


File details

Details for the file lmdeploy-0.2.4-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 64.0 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.2.4-cp38-cp38-win_amd64.whl

  SHA256:      bd93358c6ff2823ec399ea1f60c1387b86daa90ee9942103c3731a19edddc73e
  MD5:         21b91bde96872ddf145279b725cf4f1a
  BLAKE2b-256: 67453c3d15da41dee01b01a2d1e46d9715c691d0f7dd164cebb32a3abc72ffc0


File details

Details for the file lmdeploy-0.2.4-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

  • Download URL: lmdeploy-0.2.4-cp38-cp38-manylinux2014_x86_64.whl
  • Size: 94.9 MB
  • Tags: CPython 3.8

File hashes

Hashes for lmdeploy-0.2.4-cp38-cp38-manylinux2014_x86_64.whl

  SHA256:      8c9e26b7881f45fa00ffe9f5cbd35741f6e024191b44059957fae9c78f50ed80
  MD5:         6c1328e701b09567cec0b4afbd55aa88
  BLAKE2b-256: e514940fe359dc581f41223fb68bfb4d911dcaf82a826fe16429e0a9bd6fbe75

