
A toolset for compressing, deploying, and serving LLMs

Project description


Latest News 🎉

2024
  • [2024/10] PyTorchEngine supports graph mode on the Ascend platform, doubling inference speed
  • [2024/09] LMDeploy PyTorchEngine adds support for Huawei Ascend. See supported models here
  • [2024/09] LMDeploy PyTorchEngine achieves 1.3x faster on Llama3-8B inference by introducing CUDA graph
  • [2024/08] LMDeploy is integrated into modelscope/swift as the default accelerator for VLMs inference
  • [2024/07] Support Llama3.1 8B and 70B, along with their tool calling
  • [2024/07] Support InternVL2 full-series models, InternLM-XComposer2.5 and function call of InternLM2.5
  • [2024/06] The PyTorch engine supports DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL and LLaVA-Next
  • [2024/05] Balance the vision model across devices when deploying VLMs on multiple GPUs
  • [2024/05] Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA and InternLM-XComposer2
  • [2024/04] Support Llama3 and more VLMs, such as InternVL v1.1/v1.2, MiniGemini and InternLM-XComposer2.
  • [2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer here for detailed guide
  • [2024/04] TurboMind's latest upgrade boosts GQA, pushing internlm2-20b inference to 16+ RPS, about 1.8x faster than vLLM.
  • [2024/04] Support Qwen1.5-MOE and dbrx.
  • [2024/03] Support DeepSeek-VL offline inference pipeline and serving.
  • [2024/03] Support VLM offline inference pipeline and serving.
  • [2024/02] Support Qwen1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE, and more.
  • [2024/01] OpenAOE is seamlessly integrated with the LMDeploy serving service.
  • [2024/01] Support multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to here
  • [2024/01] Support the PyTorch inference engine, developed entirely in Python, which lowers the barrier for developers and enables rapid experimentation with new features and technologies.
2023
  • [2023/12] TurboMind supports multimodal input.
  • [2023/11] TurboMind supports loading HF models directly. Click here for details.
  • [2023/11] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.

  • Effective Quantization: LMDeploy supports weight-only and KV cache quantization, and its 4-bit inference performance is 2.4x that of FP16. The quantization quality has been verified via OpenCompass evaluation.

  • Effortless Distribution Server: Leveraging the request distribution service, LMDeploy makes it easy to deploy multi-model services efficiently across multiple machines and cards.

  • Interactive Inference Mode: By caching the attention KV during multi-round dialogues, the engine remembers dialogue history and avoids re-processing historical sessions.

  • Excellent Compatibility: LMDeploy supports using KV cache quantization, AWQ, and automatic prefix caching simultaneously; a minimal configuration sketch follows this list.
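
The sketch below shows how these features are typically switched on through the TurboMind backend. It is a minimal, hedged example: the field names (tp, quant_policy, cache_max_entry_count, enable_prefix_caching) follow recent LMDeploy releases and should be checked against the documentation of your installed version.

# Minimal sketch (not the definitive API): enable int8 KV cache quantization,
# prefix caching and tensor parallelism via TurbomindEngineConfig.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    tp=2,                        # tensor parallelism across 2 GPUs
    quant_policy=8,              # online int8 KV cache quantization (use 4 for int4)
    cache_max_entry_count=0.8,   # fraction of free GPU memory reserved for the KV cache
    enable_prefix_caching=True,  # reuse KV blocks shared by common prompt prefixes
)

pipe = pipeline("internlm/internlm2-chat-7b", backend_config=backend_config)
print(pipe(["Introduce persistent batching in one sentence."]))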

Performance

(v0.1.0 benchmark figure)

For detailed inference benchmarks on more devices and under more settings, please refer to the following links:

  • A100
  • V100
  • 4090
  • 3090
  • 2080

Supported Models

LLMs
  • Llama (7B - 65B)
  • Llama2 (7B - 70B)
  • Llama3 (8B, 70B)
  • Llama3.1 (8B, 70B)
  • Llama3.2 (1B, 3B)
  • InternLM (7B - 20B)
  • InternLM2 (7B - 20B)
  • InternLM2.5 (7B)
  • Qwen (1.8B - 72B)
  • Qwen1.5 (0.5B - 110B)
  • Qwen1.5-MoE (0.5B - 72B)
  • Qwen2 (0.5B - 72B)
  • Baichuan (7B)
  • Baichuan2 (7B-13B)
  • Code Llama (7B - 34B)
  • ChatGLM2 (6B)
  • GLM4 (9B)
  • CodeGeeX4 (9B)
  • Falcon (7B - 180B)
  • Yi (6B - 34B)
  • Mistral (7B)
  • DeepSeek-MoE (16B)
  • DeepSeek-V2 (16B, 236B)
  • Mixtral (8x7B, 8x22B)
  • Gemma (2B - 7B)
  • Dbrx (132B)
  • StarCoder2 (3B - 15B)
  • Phi-3-mini (3.8B)
  • Phi-3.5-mini (3.8B)
  • Phi-3.5-MoE (16x3.8B)
  • MiniCPM3 (4B)

VLMs
  • LLaVA (1.5, 1.6) (7B - 34B)
  • InternLM-XComposer2 (7B, 4khd-7B)
  • InternLM-XComposer2.5 (7B)
  • Qwen-VL (7B)
  • Qwen2-VL (2B, 7B, 72B)
  • DeepSeek-VL (7B)
  • InternVL-Chat (v1.1-v1.5)
  • InternVL2 (1B-76B)
  • MiniGeminiLlama (7B)
  • CogVLM-Chat (17B)
  • CogVLM2-Chat (19B)
  • MiniCPM-Llama3-V-2_5
  • MiniCPM-V-2_6
  • Phi-3-vision (4.2B)
  • Phi-3.5-vision (4.2B)
  • GLM-4V (9B)
  • Llama3.2-vision (11B, 90B)

LMDeploy has developed two inference engines - TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to decrease the barriers for developers.

They differ in the types of supported models and the inference data types. Please refer to this table for each engine's capabilities and choose the one that best fits your needs; a brief sketch of selecting a backend follows.
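
As a rough illustration of switching engines, the sketch below assumes TurbomindEngineConfig and PytorchEngineConfig are importable from the top-level lmdeploy package, as in recent releases:

# Sketch: selecting an inference engine via backend_config.
# TurboMind is the high-performance CUDA backend; the PyTorch engine is
# pure Python and also targets other devices (e.g. Huawei Ascend).
# In practice you would construct only one of these pipelines.
from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig

pipe_tm = pipeline("internlm/internlm2-chat-7b",
                   backend_config=TurbomindEngineConfig(session_len=8192))

pipe_pt = pipeline("internlm/internlm2-chat-7b",
                   backend_config=PytorchEngineConfig(session_len=8192))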

Quick Start (Open In Colab)

Installation

It is recommended to install lmdeploy with pip in a conda environment (Python 3.8 - 3.12):

conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy

The default prebuilt package has been compiled with CUDA 12 since v0.3.0. For information on installing on a CUDA 11+ platform, or for instructions on building from source, please refer to the installation guide.

Offline Batch Inference

import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

[!NOTE] By default, LMDeploy downloads models from the Hugging Face Hub. If you would like to use models from ModelScope, please install ModelScope via pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

If you would like to use models from openMind Hub, please install openMind Hub via pip install openmind_hub and set the environment variable:

export LMDEPLOY_USE_OPENMIND_HUB=True

For more information about the inference pipeline, please refer to here.
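
Generation behaviour can also be tuned per request. The following is a small sketch assuming the GenerationConfig fields shown here (max_new_tokens, top_p, top_k, temperature) match your installed version; the pipeline guide linked above has the authoritative parameter list.

# Sketch: passing sampling parameters to the pipeline via GenerationConfig.
from lmdeploy import pipeline, GenerationConfig

pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b") if False else pipeline("internlm/internlm2-chat-7b")
gen_config = GenerationConfig(
    max_new_tokens=256,   # cap the response length
    top_p=0.8,            # nucleus sampling threshold
    top_k=40,             # sample only from the 40 most likely tokens
    temperature=0.7,      # soften the output distribution
)
responses = pipe(["Hi, please introduce yourself", "Shanghai is"], gen_config=gen_config)
print(responses)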

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials.

Third-party projects

  • Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: LMDeploy-Jetson

  • Example project for deploying LLMs using LMDeploy and BentoML: BentoLMDeploy

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guidelines.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}

License

This project is released under the Apache 2.0 license.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

lmdeploy-0.6.2.post1-cp312-cp312-win_amd64.whl (39.1 MB) - CPython 3.12, Windows x86-64
lmdeploy-0.6.2.post1-cp311-cp311-win_amd64.whl (39.1 MB) - CPython 3.11, Windows x86-64
lmdeploy-0.6.2.post1-cp310-cp310-win_amd64.whl (39.1 MB) - CPython 3.10, Windows x86-64
lmdeploy-0.6.2.post1-cp39-cp39-win_amd64.whl (39.1 MB) - CPython 3.9, Windows x86-64
lmdeploy-0.6.2.post1-cp38-cp38-win_amd64.whl (39.1 MB) - CPython 3.8, Windows x86-64

File details

SHA256, MD5, and BLAKE2b-256 hashes for each file in this release:

  • lmdeploy-0.6.2.post1-cp312-cp312-win_amd64.whl
    SHA256: ed824a90eaf4a66c10e95bd940b1a4736aef2c7aa25918cdf886a666ae7b957d
    MD5: bdb91ed106af6508277c5bc5e25e9af3
    BLAKE2b-256: e19f363f73eac7564b023a0868af258645bf416130fe39530696000db0e1cca7
  • lmdeploy-0.6.2.post1-cp312-cp312-manylinux2014_x86_64.whl
    SHA256: a4d64578537c593ab2bf92fe21e13b68c3b7a0f251464193a4a852eac745d09e
    MD5: 619908a08f3f4565cf6f76924dfd3d6a
    BLAKE2b-256: 8de28d22978da8951ce391edb0d5f12e6fe73ac09a4201cbb87c1329ee73327c
  • lmdeploy-0.6.2.post1-cp311-cp311-win_amd64.whl
    SHA256: 8dda4bacf5676080e22b409f57241493b7438b8fb0cf5342688b11b812f2c81c
    MD5: 9d6a0e410296d8ee48504e5454dda46a
    BLAKE2b-256: 0f4ad99e620b90bab60bbe36cb1ff22ff4e5e0727af14a8f3962b88f0159b137
  • lmdeploy-0.6.2.post1-cp311-cp311-manylinux2014_x86_64.whl
    SHA256: 67487554d84255e621181aaf66aca308fd5c51b16a9af42acf90c403ceab4e87
    MD5: 8b8182dc8bb384fd154ce7d1764441d6
    BLAKE2b-256: 87eb6b2fe38222be8bde111be6d4ce5cc3514f00c8e129d23405e6c196de4018
  • lmdeploy-0.6.2.post1-cp310-cp310-win_amd64.whl
    SHA256: 8ecc5c57b72ee5d3b7b7a1f81885508194a2eac29d1bbd116261216a169e0a83
    MD5: bef80aa4896a5c2096f5c4f9741f1beb
    BLAKE2b-256: 11520c7bc97b14f107d675b95a02f085368bbb747e516ded723cd0f65895befb
  • lmdeploy-0.6.2.post1-cp310-cp310-manylinux2014_x86_64.whl
    SHA256: 5a00d41d1d8474b486d1d88fc8fb037a3008a36c5f7a022bc430c2fd4c5de594
    MD5: 60055005fded9824ed5c549bee4b03ab
    BLAKE2b-256: 0e2d2e3b7eecc5f3ab8f7a68359f45fa9336ce2a9e927de7d8137ab1dcdeaa2a
  • lmdeploy-0.6.2.post1-cp39-cp39-win_amd64.whl
    SHA256: f842aba5c83dfc780e146ccf303ce02d675ec4c39fa2c97b650918058dac3e7b
    MD5: 65873a6045282ca0c21d019c022975b2
    BLAKE2b-256: 26f95107ac4e62b0333ef5a0a40359959eaffe18f723618ac83f3b0f586b89eb
  • lmdeploy-0.6.2.post1-cp39-cp39-manylinux2014_x86_64.whl
    SHA256: 2ae1390de0c3d0c816a16eab48d69db7802c9a58366b2fe25167f496f1835b0e
    MD5: 83c3128624323d87d97bfbf60282c534
    BLAKE2b-256: bb666bb8c29adb22c3bef721b2efa502330376cd86ff38ee9eecbdf0ff064f60
  • lmdeploy-0.6.2.post1-cp38-cp38-win_amd64.whl
    SHA256: 5d27f5190fc041801b85d26cffbf9dbe1caf1a1e8c9d021032483e4126a4eea7
    MD5: 0948cb9401bc2c1c02385fe871f7ea30
    BLAKE2b-256: 4da8ed51aa9a4de7d26bf547231b42451d4df959cd874d36aa649d53b3beaa71
  • lmdeploy-0.6.2.post1-cp38-cp38-manylinux2014_x86_64.whl
    SHA256: 42d6b6087d131df694dc8c2c8e9e0d22a3199089a1d999f716acbb18a08e8ddb
    MD5: 0cf5fd129624d3bd589974c324894859
    BLAKE2b-256: 42b2f73cfabc7966a6c0097ac7b6efcbb64afb5f667054431985d49568cad1b5
