
This package is used for evaluating the quantization of large foundation models in deep learning.

Project description

Large Models Quantization (LMQuant)

LMQuant is an open-source large-model quantization toolbox based on PyTorch. It supports accuracy evaluation with simulated (pseudo) quantization and can dump quantized weights together with their scaling factors for system evaluation. Alongside LMQuant, we have also released QServe for efficient GPU inference of large language models.
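To make "simulated pseudo quantization" concrete, here is a minimal sketch of what fake quantization of a weight matrix looks like: values are rounded onto a low-bit grid and immediately dequantized back to floating point, so accuracy can be measured without low-bit kernels. The function name, group size, and shapes are illustrative assumptions, not the lmquant API.

import torch

def pseudo_quantize(weight: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    # Per-group asymmetric fake quantization of a 2-D weight tensor.
    # Assumes in_features is divisible by group_size.
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** n_bits - 1
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / qmax   # per-group scaling factor
    zero = (-w_min / scale).round()                  # per-group zero point
    q = (w / scale + zero).round().clamp(0, qmax)    # quantize onto the low-bit grid
    w_dq = (q - zero) * scale                        # dequantize back to floating point
    return w_dq.reshape(out_features, in_features)

# Usage: swap a layer's weight for its fake-quantized version, then rerun evaluation.
# layer.weight.data = pseudo_quantize(layer.weight.data)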

The current release supports:

  • SmoothQuant, AWQ, GPTQ-R, and QoQ quantization for large language models

News

  • [2024/05] 🔥 We released QoQ LLM quantization code.


Installation

  1. Clone this repository and navigate to the lmquant folder:

git clone https://github.com/mit-han-lab/lmquant
cd lmquant

  2. Install the package:

conda env create -f environment.yml
poetry install

Highlights

QoQ: W4A8KV4 Quantization for Efficient LLM Serving

[Paper][Code]

We introduce the W4A8KV4 (4-bit weights, 8-bit activations, and 4-bit KV cache) quantization algorithm QoQ and the inference system QServe to accelerate LLM serving on GPUs. The key insight is that the efficiency of LLM serving on GPUs is mainly limited by operations on low-throughput CUDA cores rather than high-throughput tensor cores. Building on this insight, we propose progressive quantization, which substantially reduces weight-dequantization cost through register-level parallelism and a subtraction-after-multiplication computation order. We also develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV-cache quantization. As a result, we improve the serving throughput of popular LLMs by up to 2.4× on A100 and 3.5× on L40S, surpassing the leading industry solution TensorRT-LLM. Remarkably, our system running on an L40S GPU achieves even higher throughput than TensorRT-LLM on an A100, effectively reducing the dollar cost of LLM serving by 3×.
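As a rough illustration of the progressive (two-level) weight quantization idea described above, the sketch below first quantizes weights per output channel into the 8-bit range and then re-quantizes each group down to 4 bits with integer scales, so that restoring W4 to the 8-bit intermediate at serving time needs only integer multiplies and subtractions. Names, shapes, and dtype choices are assumptions made for readability; this is not the QoQ/QServe implementation.

import torch

def two_level_quantize(w_fp16: torch.Tensor, group_size: int = 128):
    # Level 1: per-output-channel symmetric quantization into the int8 range.
    s_channel = w_fp16.abs().amax(dim=1, keepdim=True) / 127.0   # floating-point per-channel scale
    w_int8 = (w_fp16 / s_channel).round().clamp(-128, 127)
    # Level 2: per-group asymmetric quantization of the int8 values down to 4 bits,
    # using *integer* per-group scales so W4 -> W8 dequantization avoids floating point.
    out_f, in_f = w_int8.shape
    g = w_int8.reshape(out_f, in_f // group_size, group_size)
    g_min = g.amin(dim=-1, keepdim=True)
    g_max = g.amax(dim=-1, keepdim=True)
    s_group = ((g_max - g_min).clamp(min=1) / 15.0).ceil()       # integer per-group scale
    zero = (-g_min / s_group).round().clamp(0, 15)               # integer zero point
    w_uint4 = (g / s_group + zero).round().clamp(0, 15)
    return w_uint4, s_group, zero, s_channel

# At serving time, level 2 is reversed with integer arithmetic only, in a
# subtraction-after-multiplication order: w8 ≈ s_group * w_uint4 - s_group * zero,
# and the floating-point per-channel scale (level 1) is applied once at the end.
# (All tensors stay in floating point here for clarity; real kernels pack them as uint4/int8.)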

QoQ

Support List

Large Language Model Quantization

Models    Sizes        QoQ (W4A8KV8)   AWQ (W4A16)   GPTQ-R (W4A16)   SmoothQuant (W8A8)
Llama3    8B/70B       ✅              ✅             ✅                ✅
Llama2    7B/13B/70B   ✅              ✅             ✅                ✅
Llama     7B/13B/30B   ✅              ✅             ✅                ✅
Mistral   7B           ✅              ✅             ✅                ✅
Mixtral   8x7B         ✅              ✅             ✅                ✅
Yi        34B          ✅              ✅             ✅                ✅

Reference

If you find lmquant useful or relevant to your research, please kindly cite our paper:

@article{lin2024qserve,
  title={QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving},
  author={Lin, Yujun and Tang, Haotian and Yang, Shang and Zhang, Zhekai and Xiao, Guangxuan and Gan, Chuang and Han, Song},
  journal={arXiv preprint arXiv:2405.04532},
  year={2024}
}

Related Projects

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lmquant-0.0.0.tar.gz (127.1 kB)

Uploaded Source

Built Distribution

lmquant-0.0.0-py3-none-any.whl (169.1 kB)

Uploaded Python 3

File details

Details for the file lmquant-0.0.0.tar.gz.

File metadata

  • Download URL: lmquant-0.0.0.tar.gz
  • Upload date:
  • Size: 127.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.11.9 Linux/5.15.0-102-generic

File hashes

Hashes for lmquant-0.0.0.tar.gz:

  • SHA256: 28d8c9480edcf29dabe9b05a022b35b62f719633f3ede86af77a3b195406d51f
  • MD5: 386fb017397c29491093a4ec117c6acf
  • BLAKE2b-256: 5a3afbe119388c5ff7ba0dd8e138a91579d34bf8dc4a10bba4d03b83678c55ee
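As a small usage sketch, the SHA256 digest above can be checked locally after downloading the sdist; the file path used here is an assumption about where the archive was saved.

import hashlib

EXPECTED_SHA256 = "28d8c9480edcf29dabe9b05a022b35b62f719633f3ede86af77a3b195406d51f"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so large archives do not need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256_of("lmquant-0.0.0.tar.gz") == EXPECTED_SHA256, "SHA256 mismatch"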


File details

Details for the file lmquant-0.0.0-py3-none-any.whl.

File metadata

  • Download URL: lmquant-0.0.0-py3-none-any.whl
  • Upload date:
  • Size: 169.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.2 CPython/3.11.9 Linux/5.15.0-102-generic

File hashes

Hashes for lmquant-0.0.0-py3-none-any.whl:

  • SHA256: b9f181bc751ea40f0f98a743952406ff56f890825447d564fc037cff7e41ff30
  • MD5: 0f854dbc5a34076c3e8ff06cd5a4a0f8
  • BLAKE2b-256: b954285d56882438acea2df993e560a661bbd1a77b892532463234e76c638cb2

