
A toolset for compressing, deploying, and serving LLMs

Project description


Latest News 🎉

2024
  • [2024/10] PyTorchEngine supports graph mode on the Ascend platform, doubling the inference speed
  • [2024/09] LMDeploy PyTorchEngine adds support for Huawei Ascend. See supported models here
  • [2024/09] LMDeploy PyTorchEngine achieves 1.3x faster Llama3-8B inference by introducing CUDA graphs
  • [2024/08] LMDeploy is integrated into modelscope/swift as the default accelerator for VLM inference
  • [2024/07] Support Llama3.1 8B and 70B, including tool calling
  • [2024/07] Support the full InternVL2 series, InternLM-XComposer2.5, and function calling for InternLM2.5
  • [2024/06] The PyTorch engine supports DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, and LLaVA-Next
  • [2024/05] Balance the vision model across GPUs when deploying VLMs on multiple GPUs
  • [2024/05] Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA, and InternLM-XComposer2
  • [2024/04] Support Llama3 and more VLMs, such as InternVL v1.1/v1.2, MiniGemini, and InternLM-XComposer2.
  • [2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer here for the detailed guide
  • [2024/04] The latest TurboMind upgrade enhances GQA, boosting internlm2-20b inference to 16+ RPS, about 1.8x faster than vLLM.
  • [2024/04] Support Qwen1.5-MOE and dbrx.
  • [2024/03] Support DeepSeek-VL offline inference pipeline and serving.
  • [2024/03] Support VLM offline inference pipeline and serving.
  • [2024/02] Support Qwen1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE, and more.
  • [2024/01] OpenAOE is seamlessly integrated with the LMDeploy serving service.
  • [2024/01] Support multi-model, multi-machine, multi-card inference services. For usage instructions, please refer here
  • [2024/01] Support the PyTorch inference engine, developed entirely in Python, lowering the barrier for developers and enabling rapid experimentation with new features and technologies.
2023
  • [2023/12] TurboMind supports multimodal input.
  • [2023/11] TurboMind supports loading HF models directly. Click here for details.
  • [2023/11] Major TurboMind upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click here for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the HuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.

  • Effective Quantization: LMDeploy supports weight-only and KV cache quantization, and its 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.

  • Effortless Distribution Server: Leveraging the request distribution service, LMDeploy makes it easy and efficient to deploy multi-model services across multiple machines and GPUs.

  • Interactive Inference Mode: By caching the attention KV during multi-turn dialogues, the engine remembers the dialogue history and avoids reprocessing historical sessions.

  • Excellent Compatibility: LMDeploy allows KV cache quantization, AWQ, and Automatic Prefix Caching to be used simultaneously (a minimal configuration sketch follows this list).
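
As a rough illustration of the compatibility point above, the sketch below enables a 4-bit AWQ checkpoint together with online KV cache quantization and prefix caching through the pipeline API. The model repository name and the parameter values are illustrative assumptions and may differ across LMDeploy versions.

from lmdeploy import pipeline, TurbomindEngineConfig

# Illustrative combination: AWQ 4-bit weights + int8 KV cache + prefix caching.
engine_config = TurbomindEngineConfig(
    model_format='awq',          # load AWQ-quantized 4-bit weights
    quant_policy=8,              # online int8 KV cache quantization
    enable_prefix_caching=True,  # reuse KV blocks shared by common prefixes
)
pipe = pipeline('internlm/internlm2-chat-7b-4bits', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))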

Performance

[Figure: v0.1.0 benchmark results]

For detailed inference benchmarks on more devices and with more settings, please refer to the following links:

  • A100
  • V100
  • 4090
  • 3090
  • 2080

Supported Models

LLMs
  • Llama (7B - 65B)
  • Llama2 (7B - 70B)
  • Llama3 (8B, 70B)
  • Llama3.1 (8B, 70B)
  • Llama3.2 (1B, 3B)
  • InternLM (7B - 20B)
  • InternLM2 (7B - 20B)
  • InternLM2.5 (7B)
  • Qwen (1.8B - 72B)
  • Qwen1.5 (0.5B - 110B)
  • Qwen1.5 - MoE (0.5B - 72B)
  • Qwen2 (0.5B - 72B)
  • Baichuan (7B)
  • Baichuan2 (7B-13B)
  • Code Llama (7B - 34B)
  • ChatGLM2 (6B)
  • GLM4 (9B)
  • CodeGeeX4 (9B)
  • Falcon (7B - 180B)
  • YI (6B-34B)
  • Mistral (7B)
  • DeepSeek-MoE (16B)
  • DeepSeek-V2 (16B, 236B)
  • Mixtral (8x7B, 8x22B)
  • Gemma (2B - 7B)
  • Dbrx (132B)
  • StarCoder2 (3B - 15B)
  • Phi-3-mini (3.8B)
  • Phi-3.5-mini (3.8B)
  • Phi-3.5-MoE (16x3.8B)
  • MiniCPM3 (4B)

VLMs
  • LLaVA (1.5, 1.6) (7B - 34B)
  • InternLM-XComposer2 (7B, 4khd-7B)
  • InternLM-XComposer2.5 (7B)
  • Qwen-VL (7B)
  • Qwen2-VL (2B, 7B, 72B)
  • DeepSeek-VL (7B)
  • InternVL-Chat (v1.1-v1.5)
  • InternVL2 (1B-76B)
  • MiniGeminiLlama (7B)
  • CogVLM-Chat (17B)
  • CogVLM2-Chat (19B)
  • MiniCPM-Llama3-V-2_5
  • MiniCPM-V-2_6
  • Phi-3-vision (4.2B)
  • Phi-3.5-vision (4.2B)
  • GLM-4V (9B)
  • Llama3.2-vision (11B, 90B)
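
The VLMs listed above can be run through the same pipeline API as the LLMs. The following is a minimal, hedged sketch; the model repository and image URL are placeholders, not recommendations.

from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Illustrative VLM checkpoint and image; replace with your own.
pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('https://example.com/sample.jpg')
response = pipe(('Describe this image.', image))
print(response.text)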

LMDeploy provides two inference engines, TurboMind and PyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.

They differ in the types of supported models and in inference data types. Please refer to this table for each engine's capabilities and choose the one that best fits your needs.
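
The engine can be selected through the pipeline's backend configuration. The sketch below routes the pipeline to the PyTorch engine; the model name and the tp value are illustrative assumptions.

from lmdeploy import pipeline, PytorchEngineConfig

# Use the PyTorch engine instead of the default TurboMind engine.
pipe = pipeline(
    'internlm/internlm2-chat-7b',
    backend_config=PytorchEngineConfig(tp=1),  # tensor parallelism degree
)
print(pipe(['What is LMDeploy?']))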

Quick Start

Installation

It is recommended to install lmdeploy using pip in a conda environment (Python 3.8 - 3.12):

conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy

The default prebuilt package is compiled with CUDA 12 since v0.3.0. For more information on installing on a CUDA 11+ platform, or for instructions on building from source, please refer to the installation guide.

Offline Batch Inference

import lmdeploy

# Build an inference pipeline from a HuggingFace model and run a batch of prompts
pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

[!NOTE] By default, LMDeploy downloads models from HuggingFace. If you would like to use models from ModelScope, please install ModelScope by pip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

If you would like to use models from openMind Hub, please install openMind Hub by pip install openmind_hub and set the environment variable:

export LMDEPLOY_USE_OPENMIND_HUB=True

For more information about the inference pipeline, please refer here.
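
Where finer control over decoding is needed, the pipeline also accepts a generation config, as sketched below. The sampling values shown are illustrative assumptions, not recommended defaults.

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')

# Illustrative sampling parameters; tune them for your use case.
gen_config = GenerationConfig(
    max_new_tokens=256,
    temperature=0.8,
    top_p=0.95,
)
responses = pipe(['Hi, pls intro yourself', 'Shanghai is'], gen_config=gen_config)
for r in responses:
    print(r.text)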

Tutorials

Please review the getting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our tutorials.

Third-party projects

  • Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: LMDeploy-Jetson

  • Example project for deploying LLMs using LMDeploy and BentoML: BentoLMDeploy

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

Citation

@misc{2023lmdeploy,
    title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
    author={LMDeploy Contributors},
    howpublished = {\url{https://github.com/InternLM/lmdeploy}},
    year={2023}
}

License

This project is released under the Apache 2.0 license.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • lmdeploy-0.6.2-cp312-cp312-win_amd64.whl (39.1 MB): CPython 3.12, Windows x86-64
  • lmdeploy-0.6.2-cp312-cp312-manylinux2014_x86_64.whl (93.2 MB): CPython 3.12, manylinux2014 x86-64
  • lmdeploy-0.6.2-cp311-cp311-win_amd64.whl (39.1 MB): CPython 3.11, Windows x86-64
  • lmdeploy-0.6.2-cp311-cp311-manylinux2014_x86_64.whl (93.1 MB): CPython 3.11, manylinux2014 x86-64
  • lmdeploy-0.6.2-cp310-cp310-win_amd64.whl (39.1 MB): CPython 3.10, Windows x86-64
  • lmdeploy-0.6.2-cp310-cp310-manylinux2014_x86_64.whl (93.1 MB): CPython 3.10, manylinux2014 x86-64
  • lmdeploy-0.6.2-cp39-cp39-win_amd64.whl (39.1 MB): CPython 3.9, Windows x86-64
  • lmdeploy-0.6.2-cp39-cp39-manylinux2014_x86_64.whl (93.1 MB): CPython 3.9, manylinux2014 x86-64
  • lmdeploy-0.6.2-cp38-cp38-win_amd64.whl (39.1 MB): CPython 3.8, Windows x86-64
  • lmdeploy-0.6.2-cp38-cp38-manylinux2014_x86_64.whl (93.1 MB): CPython 3.8, manylinux2014 x86-64

File details

Details for the file lmdeploy-0.6.2-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.2-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 39.1 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.2-cp312-cp312-win_amd64.whl:
  • SHA256: f6fa9a0f2d4141e8d08f1c50f0b57ab417a6c45a6287a990febebaffaacee787
  • MD5: d478a96000b247a1714f7abf7dabef43
  • BLAKE2b-256: 8da530a7999125650aa8a4ea5ce3da6c95b39c347820c6135caf90af918b6fbc

File details

Details for the file lmdeploy-0.6.2-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.2-cp312-cp312-manylinux2014_x86_64.whl:
  • SHA256: 2b399f6c06250a42652df6c9ad8605637b1413386250179fce5cd6b77212a378
  • MD5: e21036148d9c1d179c27dc73dddfe77a
  • BLAKE2b-256: 4aa9984bf9ae78b65497e73d1fea602d01ed74ea4b90ff2ca390e2d4183d483d

File details

Details for the file lmdeploy-0.6.2-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.2-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 39.1 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.2-cp311-cp311-win_amd64.whl:
  • SHA256: 4e1783bf370fbe18db489317ca724b6e7e9f65befdc90f8aef70313087703fb7
  • MD5: 383c410963b0859b953a042aa7d9104f
  • BLAKE2b-256: 4711e4bf5b9e1af87f1fcda2000bb67a43b175ddf6663bd740cfb72099e5283e

File details

Details for the file lmdeploy-0.6.2-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.2-cp311-cp311-manylinux2014_x86_64.whl:
  • SHA256: eb7b26076a1078dfef311970d3cb81d2e8ce02133f4a379ecb35cde48cde3ac2
  • MD5: 905f79fa9fcb4a02a11695872d627d9f
  • BLAKE2b-256: c3b728a4a11747067cff783b0e217a188b1a0613b5a9a27366895eca6aa719f4

File details

Details for the file lmdeploy-0.6.2-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.2-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 39.1 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.2-cp310-cp310-win_amd64.whl:
  • SHA256: 730b6b218b85df2098af41cbb65cf9ae6676befb0ab4e6d0f4767be3ee7e61e3
  • MD5: 7ae107224d0e8486a9a3fef2dab22089
  • BLAKE2b-256: 1e6b305fef102d9a0a690ce21587665dda16a30568cd0ca3a3dca1730548f8e0

File details

Details for the file lmdeploy-0.6.2-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.2-cp310-cp310-manylinux2014_x86_64.whl:
  • SHA256: b2ccb908400cf7b9444543d15a1ea0f3e4e363dd8fc918b2a668c3a38e04fe17
  • MD5: 95feb082f776bddb94ef9216cb281d6a
  • BLAKE2b-256: 662da24044530d504c18e15676a6ba8e1a388d17e33f03dc6a4dc3ce70988b89

File details

Details for the file lmdeploy-0.6.2-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.2-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 39.1 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.2-cp39-cp39-win_amd64.whl:
  • SHA256: c641b0e6f63a3846f933d07604ea4d2ae38508d1693e537ed40fda6a9c120549
  • MD5: beb9d49960b4f3006e69c2b3ecd26004
  • BLAKE2b-256: 1ef4fa60149e74366a0d38541239ab462c3131a4903e4c588bc3f7744fc78251

File details

Details for the file lmdeploy-0.6.2-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.2-cp39-cp39-manylinux2014_x86_64.whl:
  • SHA256: fb11225da46cc4e62b398c57887e619a7fdd666084400338c7f862eaaa183fe0
  • MD5: b1b28f0a10ca0380b7b68b5d0df0fdb1
  • BLAKE2b-256: dcb41cdf0041afbf21ebb0d3a7f18b1ffd4afbe5ebe768961131155b1d0eec6f

File details

Details for the file lmdeploy-0.6.2-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: lmdeploy-0.6.2-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 39.1 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.18

File hashes

Hashes for lmdeploy-0.6.2-cp38-cp38-win_amd64.whl:
  • SHA256: dce4dba7cc7675b53f0bf30652e06bceed7c6b8dd2d5efca5e5fdc61df16da66
  • MD5: 9cd8ecfe60d3a994bd77199da2be8b4f
  • BLAKE2b-256: 7c508e05ada3349b8c8a1ee1d4808f7bb1aad5f706808cfdd651d88cf43a79ba

File details

Details for the file lmdeploy-0.6.2-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for lmdeploy-0.6.2-cp38-cp38-manylinux2014_x86_64.whl:
  • SHA256: ea8ee67597902de667b9c971d9fa668cb91ec801f1c6dc4bf15ac7d4542a2926
  • MD5: f7c1cd493275eaf0a1774e05f4392bd0
  • BLAKE2b-256: 2d0d6bf9805262a879ae35d4acd4227ca288aae43f9d418c270cf7ee1ebea118
