This package is used for evaluating the quantization of large foundation models in deep learning.

Project description

Large Models Quantization (LMQuant)

LMQuant is an open-source large-model quantization toolbox based on PyTorch. It supports accuracy evaluation with simulated (pseudo) quantization, illustrated by the sketch after the list below, and dumps quantized weights with scaling factors for system evaluation. Apart from LMQuant, we also released QServe for efficient GPU inference of large language models.

The current release supports:

  • SmoothQuant, AWQ, GPTQ-R, and QoQ quantization for large language models
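To make "simulated pseudo quantization" concrete, here is a minimal sketch of symmetric per-group 4-bit weight quantize-dequantize in PyTorch. The function name, group size, and shapes are illustrative assumptions, not LMQuant's actual API:

import torch

def pseudo_quantize(weight: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    # Quantize then immediately dequantize, so the model still runs in floating
    # point but "sees" the rounding error a real low-bit kernel would introduce.
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (n_bits - 1) - 1  # e.g. 7 for symmetric 4-bit
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return (q * scale).reshape(out_features, in_features)

w = torch.randn(4096, 4096)
print((w - pseudo_quantize(w)).abs().mean())  # average quantization error

Replacing each linear layer's weight with its pseudo-quantized counterpart lets accuracy be measured with ordinary fp16/fp32 inference, with no custom kernels required.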

News

  • [2024/05] 🔥 We released QoQ LLM quantization code.

Installation

  1. Clone this repository and navigate to the lmquant folder:
git clone https://github.com/mit-han-lab/lmquant
cd lmquant
  2. Install the package:
conda env create -f environment.yml
poetry install
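After installation, a quick smoke test confirms the package is importable (assuming the conda environment created above is activated):

python -c "import lmquant"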

Highlights

QoQ: W4A8KV4 Quantization for Efficient LLM Serving

[Paper][Code]

We introduce W4A8KV4 (4-bit weights, 8-bit activations, and 4-bit KV cache) quantization algorithm QoQ and inference system QServe to accelerate LLM serving on GPUs. The key insight is that the efficiency of LLM serving on GPUs is mainly influenced by operations on low-throughput CUDA cores rather than high-throughput tensor cores. Building on this insight, we propose progressive quantization, which substantially reduces weight dequantization cost through register-level parallelism and a subtraction-after-multiplication computation order. We also develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. As a result, we improve the serving throughput of popular LLMs by up to 2.4× on A100 and 3.5× on L40S, surpassing the leading industry solution TensorRT-LLM. Remarkably, our system on the L40S GPU achieves even higher throughput than TensorRT-LLM on A100, effectively reducing the dollar cost of LLM serving by 3×.
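As a rough illustration of the progressive quantization idea (not QServe's actual CUDA kernels), the sketch below quantizes weights to an 8-bit intermediate with a per-channel floating-point scale, then to 4 bits with per-group integer scales and zero points, so that INT4 → INT8 dequantization needs only integer arithmetic. All function names, shapes, and the group size are assumptions for illustration:

import torch

def progressive_quantize(w: torch.Tensor, group_size: int = 128):
    # Level 1: symmetric per-output-channel INT8 quantization with a
    # floating-point scale.
    s1 = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q8 = torch.clamp(torch.round(w / s1), -128, 127)
    # Level 2: asymmetric per-group UINT4 quantization of the INT8 codes,
    # using integer scales and zero points so INT4 -> INT8 dequantization
    # is pure integer arithmetic.
    g = q8.reshape(w.shape[0], -1, group_size)
    gmin = g.amin(dim=-1, keepdim=True)
    gmax = g.amax(dim=-1, keepdim=True)
    s2 = torch.clamp(torch.ceil((gmax - gmin) / 15.0), min=1.0)
    z = torch.round(-gmin / s2)                      # integer zero point
    q4 = torch.clamp(torch.round(g / s2) + z, 0, 15)
    return q4, s2, z, s1

def dequantize(q4, s2, z, s1, shape):
    # "Subtraction after multiplication": evaluate q4*s2 first and subtract
    # the precomputed z*s2, instead of computing (q4 - z)*s2 -- the
    # reordering that enables register-level parallelism on CUDA cores.
    q8 = q4 * s2 - z * s2
    return q8.reshape(shape) * s1

w = torch.randn(4096, 4096)
w_hat = dequantize(*progressive_quantize(w), w.shape)
print((w - w_hat).abs().mean())  # end-to-end reconstruction error

SmoothAttention (scaling keys to tame 4-bit KV-cache outliers, in the spirit of SmoothQuant) is orthogonal to this sketch and is not shown here.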

Support List

Large Language Model Quantization

Models    Sizes          QoQ (W4A8KV4)   AWQ (W4A16)   GPTQ-R (W4A16)   SmoothQuant (W8A8)
Llama3    8B/70B         ✅               ✅             ✅                ✅
Llama2    7B/13B/70B     ✅               ✅             ✅                ✅
Llama     7B/13B/30B     ✅               ✅             ✅                ✅
Mistral   7B             ✅               ✅             ✅                ✅
Mixtral   8x7B           ✅               ✅             ✅                ✅
Yi        34B            ✅               ✅             ✅                ✅

Reference

If you find lmquant useful or relevant to your research, please cite our paper:

@article{lin2024qserve,
  title={QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving},
  author={Lin, Yujun and Tang, Haotian and Yang, Shang and Zhang, Zhekai and Xiao, Guangxuan and Gan, Chuang and Han, Song},
  year={2024}
}

Related Projects

  • QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
  • SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  • QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
  • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
