
A toolset for compressing, deploying, and serving LLMs

Project description


Latest News 🎉

2024
  • [2024/11] Support Mono-InternVL with PyTorch engine
  • [2024/10] PyTorchEngine supports graph mode on the Ascend platform, doubling inference speed
  • [2024/09] LMDeploy PyTorchEngine adds support for Huawei Ascend. See supported models here
  • [2024/09] LMDeploy PyTorchEngine achieves 1.3x faster Llama3-8B inference by introducing CUDA graphs
  • [2024/08] LMDeploy is integrated into modelscope/swift as the default accelerator for VLM inference
  • [2024/07] Support Llama3.1 8B and 70B, along with their tool calling
  • [2024/07] Support InternVL2 full-series models, InternLM-XComposer2.5 and function call of InternLM2.5
  • [2024/06] PyTorch engine supports DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, and LLaVA-Next
  • [2024/05] Balance the vision model across GPUs when deploying VLMs on multiple GPUs
  • [2024/05] Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA, and InternLM-XComposer2
  • [2024/04] Support Llama3 and more VLMs, such as InternVL v1.1/v1.2, MiniGemini, and InternLM-XComposer2.
  • [2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer here for the detailed guide
  • [2024/04] The latest TurboMind upgrade optimizes GQA, boosting internlm2-20b inference to 16+ RPS, about 1.8x faster than vLLM.
  • [2024/04] Support Qwen1.5-MoE and Dbrx.
  • [2024/03] Support DeepSeek-VL offline inference pipeline and serving.
  • [2024/03] Support VLM offline inference pipeline and serving.
  • [2024/02] Support Qwen 1.5, Gemma, Mistral, Mixtral, Deepseek-MOE and so on.
  • [2024/01] OpenAOE is seamlessly integrated with the LMDeploy serving service.
  • [2024/01] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer here
  • [2024/01] Support the PyTorch inference engine, developed entirely in Python, helping to lower the barrier for developers and enabling rapid experimentation with new features and technologies.
2023
  • [2023/12] TurboMind supports multimodal input.
  • [2023/11] TurboMind supports loading Hugging Face models directly. Click here for details.
  • [2023/11] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features such as persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.

  • Effective Quantization: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.

  • Effortless Distributed Serving: Leveraging the request distribution service, LMDeploy makes it easy and efficient to deploy multi-model services across multiple machines and GPUs.

  • Interactive Inference Mode: By caching the attention k/v during multi-round dialogue, the engine remembers the dialogue history and avoids repeatedly reprocessing historical sessions.

  • Excellent Compatibility: LMDeploy allows KV cache quantization, AWQ, and automatic prefix caching to be used simultaneously; see the sketch below.
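As a rough, non-authoritative sketch of that compatibility (assuming the TurbomindEngineConfig options of recent releases and using a placeholder AWQ checkpoint id), the three features can be enabled together like this:

from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    model_format='awq',          # 4-bit AWQ weights
    quant_policy=8,              # online int8 KV cache quantization (4 selects int4)
    enable_prefix_caching=True,  # reuse KV blocks shared by common prompt prefixes
)
pipe = pipeline('internlm/internlm2-chat-7b-4bits', backend_config=engine_config)  # placeholder model id
print(pipe(['Hi, pls intro yourself']))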

Performance

(Figure: v0.1.0 benchmark results)

For detailed inference benchmarks on more devices and in more settings, please refer to the following links:

  • A100
  • V100
  • 4090
  • 3090
  • 2080

Supported Models

LLMs
  • Llama (7B - 65B)
  • Llama2 (7B - 70B)
  • Llama3 (8B, 70B)
  • Llama3.1 (8B, 70B)
  • Llama3.2 (1B, 3B)
  • InternLM (7B - 20B)
  • InternLM2 (7B - 20B)
  • InternLM2.5 (7B)
  • Qwen (1.8B - 72B)
  • Qwen1.5 (0.5B - 110B)
  • Qwen1.5-MoE (0.5B - 72B)
  • Qwen2 (0.5B - 72B)
  • Baichuan (7B)
  • Baichuan2 (7B-13B)
  • Code Llama (7B - 34B)
  • ChatGLM2 (6B)
  • GLM4 (9B)
  • CodeGeeX4 (9B)
  • Falcon (7B - 180B)
  • YI (6B-34B)
  • Mistral (7B)
  • DeepSeek-MoE (16B)
  • DeepSeek-V2 (16B, 236B)
  • Mixtral (8x7B, 8x22B)
  • Gemma (2B - 7B)
  • Dbrx (132B)
  • StarCoder2 (3B - 15B)
  • Phi-3-mini (3.8B)
  • Phi-3.5-mini (3.8B)
  • Phi-3.5-MoE (16x3.8B)
  • MiniCPM3 (4B)

VLMs
  • LLaVA (1.5, 1.6) (7B - 34B)
  • InternLM-XComposer2 (7B, 4khd-7B)
  • InternLM-XComposer2.5 (7B)
  • Qwen-VL (7B)
  • Qwen2-VL (2B, 7B, 72B)
  • DeepSeek-VL (7B)
  • InternVL-Chat (v1.1-v1.5)
  • InternVL2 (1B-76B)
  • Mono-InternVL (2B)
  • ChemVLM (8B-26B)
  • MiniGeminiLlama (7B)
  • CogVLM-Chat (17B)
  • CogVLM2-Chat (19B)
  • MiniCPM-Llama3-V-2_5
  • MiniCPM-V-2_6
  • Phi-3-vision (4.2B)
  • Phi-3.5-vision (4.2B)
  • GLM-4V (9B)
  • Llama3.2-vision (11B, 90B)
  • Molmo (7B-D,72B)

LMDeploy provides two inference engines, TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.

They differ in the models they support and the inference data types they handle. Please refer to this table for each engine's capabilities and choose the one that best fits your needs.
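As a minimal sketch of how an engine is selected in code (assuming the backend_config mechanism of recent releases), passing a PytorchEngineConfig requests the PyTorch engine, while a TurbomindEngineConfig (or no config at all, for supported models) selects TurboMind:

from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig

# TurboMind backend, used by default when the model is supported
pipe_turbomind = pipeline('internlm/internlm2-chat-7b',
                          backend_config=TurbomindEngineConfig(tp=1))

# PyTorch engine, e.g. for models or devices only this backend supports
pipe_pytorch = pipeline('internlm/internlm2-chat-7b',
                        backend_config=PytorchEngineConfig(tp=1))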

Quick Start

Installation

It is recommended to install lmdeploy with pip in a conda environment (Python 3.8 - 3.12):

conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy

The default prebuilt package is compiled with CUDA 12 since v0.3.0. For more information on installing on CUDA 11+ platforms, or for instructions on building from source, please refer to the installation guide.
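As an optional sanity check after installation (assuming the package exposes a __version__ attribute, as the published wheels do), the installed version can be printed with:

python -c "import lmdeploy; print(lmdeploy.__version__)"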

Offline Batch Inference

import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

[!NOTE] By default, LMDeploy downloads models from the HuggingFace Hub. If you would like to use models from ModelScope, please install ModelScope with pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

If you would like to use models from openMind Hub, please install openMind Hub with pip install openmind_hub and set the environment variable:

export LMDEPLOY_USE_OPENMIND_HUB=True

For more information about the inference pipeline, please refer here.
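Beyond the defaults used above, decoding can be tuned per request. A minimal sketch, assuming the GenerationConfig interface of recent releases; the parameter values are purely illustrative:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')
gen_config = GenerationConfig(
    max_new_tokens=256,  # cap on generated tokens per request
    temperature=0.8,     # sampling temperature
    top_p=0.95,          # nucleus sampling threshold
)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'], gen_config=gen_config)
print(response)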

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials.

Third-party projects

  • Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: LMDeploy-Jetson

  • Example project for deploying LLMs using LMDeploy and BentoML: BentoLMDeploy

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}

License

This project is released under the Apache 2.0 license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See the tutorial on generating distribution archives.

Built Distributions

lmdeploy-0.6.3-cp312-cp312-win_amd64.whl (44.9 MB)

Uploaded CPython 3.12 Windows x86-64

lmdeploy-0.6.3-cp312-cp312-manylinux2014_x86_64.whl (103.4 MB)

Uploaded CPython 3.12

lmdeploy-0.6.3-cp311-cp311-win_amd64.whl (44.9 MB)

Uploaded CPython 3.11 Windows x86-64

lmdeploy-0.6.3-cp311-cp311-manylinux2014_x86_64.whl (103.4 MB)

Uploaded CPython 3.11

lmdeploy-0.6.3-cp310-cp310-win_amd64.whl (44.9 MB)

Uploaded CPython 3.10 Windows x86-64

lmdeploy-0.6.3-cp310-cp310-manylinux2014_x86_64.whl (103.4 MB)

Uploaded CPython 3.10

lmdeploy-0.6.3-cp39-cp39-win_amd64.whl (44.8 MB)

Uploaded CPython 3.9 Windows x86-64

lmdeploy-0.6.3-cp39-cp39-manylinux2014_x86_64.whl (103.4 MB)

Uploaded CPython 3.9

lmdeploy-0.6.3-cp38-cp38-win_amd64.whl (44.9 MB)

Uploaded CPython 3.8 Windows x86-64

lmdeploy-0.6.3-cp38-cp38-manylinux2014_x86_64.whl (103.4 MB)

Uploaded CPython 3.8

File details

Details for the file lmdeploy-0.6.3-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.3-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 44.9 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.3-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 6ebbded31b328bc9a25c737ca53fb5dc0bab75a031f42487ce39e827f4110de2
MD5 9d9d7867d45bfe475f8507330fc41600
BLAKE2b-256 e91759fddcf4a7b40ef8a09199d03524749b23e7fda29c40e69c48d5fe7cae80

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.3-cp312-cp312-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 51c25b896a02a15bd09ec9f15a76e225177557b4b0c53cd9f874913a068375ca
MD5 8afd446750328309f73e04fdbe1f8ff9
BLAKE2b-256 1da0b9a27bd6c297df4163a021f338786f1ae7e9b6d79a5ccf5ffbf07c20e229

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.3-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 44.9 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.3-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 1b866ab901f4bc16bc65fa4dd344ee95002d147e7554148cf313c35b2e10dfb1
MD5 fd125f74ddf387add82a4907cf86c7c7
BLAKE2b-256 293c2e1d083c0e89c346957037b1f5de21845f2c359bf070362bf04b4b797cdf

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.3-cp311-cp311-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 eeb513dd0209a5b330e7a740129180b15b4cb260c46c06aed69ad0b39ecf9573
MD5 cd381fd254c290f0f6ea958048c77ba1
BLAKE2b-256 03d5aa9eda6a38154dafcd7fad98235a9a13457fe474fc737c6f2afdf384012e

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.3-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 44.9 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.3-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 059c3bb4cebf2ae6d41a3a9bc488690412d416dbf964fd94f6cb5463d602ab72
MD5 5f89e5288bbb05df151e09661ce5f69b
BLAKE2b-256 436754e2374969609d1e5278178c9bedd898e673d372619c77b798c6ce42d539

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.3-cp310-cp310-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 147965c6f69529bd06c47e8976903138b3e25af15559f565400e18f0481527c6
MD5 1b8b11a4f3e49beea1aee8beb655bdcd
BLAKE2b-256 2988d8ddcb725d2cf25f9346a406c8a95b9b49120e16d6b78a28d2bcd182e34a

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.3-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 44.8 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.3-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 15bba2ed922a7b5c680dc974c6ff61f76b181dc54556d8c5196367889f417c56
MD5 3e7676972163b4b6dcb7203e3b98e661
BLAKE2b-256 f0e2f5e0c652519ac6ecc85fcab63b5b04ec1c2e3847769be24f719294d5cb3e

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.3-cp39-cp39-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f576109cb04e8c7a157be0da2c90a7c7fdce0a515b02cbc17fe7826672203354
MD5 55c24a5c7dfeda03032c4bfc80461e44
BLAKE2b-256 695a3633c19c80a5098d7dc69c4e5c458f40a2da54fa03b847b80095be446c27

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.3-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 44.9 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.3-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 3b8999257ae8e6c8ba3b0c3a6b554b1e1e670230eef551837e947cfd26443790
MD5 495b693c7e78b273ec7c285e89fbf0d2
BLAKE2b-256 4fb6f3b8016467ae665c1883510d290d23bd3cc5d3e44c3a0a95858d6879594e

See more details on using hashes here.

File details

Details for the file lmdeploy-0.6.3-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.3-cp38-cp38-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 8e438ec71c4a8ad5145cd148df345ab0ef7a89cb11fda68f3da6a8baa14c6219
MD5 0b7d2aeaccb8a6fbb1c03fbec1220dcc
BLAKE2b-256 07989888075fb4cbb5dfc2648fe52005d9fca85b0460e64f04ee4c82942a4822

See more details on using hashes here.
