AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
Project description
AutoAWQ
AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, and was created as a packaged and improved version of the original work from MIT.
Latest News 🔥
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
- [2023/08] PyPI package released and AutoModel class available
Install
Requirements:
- A GPU with Compute Capability 8.0 (sm80) or newer; Ampere and later architectures are supported (see the check below this list).
- CUDA Toolkit 11.8 or later.
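If you are unsure whether your GPU meets the requirement, a minimal check with PyTorch is sketched below; it assumes PyTorch is already installed and simply reads the device's compute capability, where (8, 0) corresponds to sm80/Ampere.
import torch

# Read the compute capability of the default CUDA device.
# AutoAWQ requires (8, 0) or newer, i.e. Ampere and later architectures.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    print("Supported" if (major, minor) >= (8, 0) else "Not supported (needs sm80 or newer)")
else:
    print("No CUDA device detected")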
Install:
- Use pip to install autoawq:
pip install autoawq
Using conda
CUDA dependencies can sometimes be hard to manage. It is recommended to use conda with AutoAWQ:
conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
Build from source
Build AutoAWQ from source:
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
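After installing by any of the methods above, a quick sanity check that the package imports correctly is sketched below; it only confirms that the awq module (installed by the autoawq package) can be loaded.
# Minimal post-install check: the package installs as `autoawq`
# but is imported as `awq`.
from awq import AutoAWQForCausalLM
print("AutoAWQ is installed:", AutoAWQForCausalLM.__name__)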
Supported models
The detailed support list:
Models | Sizes |
---|---|
LLaMA-2 | 7B/13B/70B |
LLaMA | 7B/13B/30B/65B |
Vicuna | 7B/13B |
MPT | 7B/30B |
Falcon | 7B/40B |
OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
Bloom | 560m/3B/7B |
GPTJ | 6.7B |
Usage
Below, you will find examples of how to easily quantize a model and run inference.
Quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
# Quantization settings: zero-point (asymmetric) quantization, 128-element groups, 4-bit weights
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Inference
Run inference on a quantized model from the Hugging Face Hub:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
model.generate(...)
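The generate call above is left elided in the original example. A fuller sketch follows, assuming AutoAWQ forwards generation arguments to the underlying Hugging Face generate API and that the quantized model lives on the GPU; the prompt and max_new_tokens value are illustrative.
# Illustrative generation, assuming `model` and `tokenizer` were loaded as above.
prompt = "What is activation-aware weight quantization?"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))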
Benchmarks
Benchmark speeds may vary from server to server, and they also depend on your CPU. If you want to minimize latency, rent a GPU/CPU combination that has high memory bandwidth on both and a high single-core CPU speed.
Model | GPU | FP16 latency (ms) | INT4 latency (ms) | Speedup |
---|---|---|---|---|
LLaMA-2-7B | 4090 | 19.97 | 8.66 | 2.31x |
LLaMA-2-13B | 4090 | OOM | 13.54 | -- |
Vicuna-7B | 4090 | 19.09 | 8.61 | 2.22x |
Vicuna-13B | 4090 | OOM | 12.17 | -- |
MPT-7B | 4090 | 17.09 | 12.58 | 1.36x |
MPT-30B | 4090 | OOM | 23.54 | -- |
Falcon-7B | 4090 | 29.91 | 19.84 | 1.51x |
LLaMA-2-7B | A6000 | 27.14 | 12.44 | 2.18x |
LLaMA-2-13B | A6000 | 47.28 | 20.28 | 2.33x |
Vicuna-7B | A6000 | 26.06 | 12.43 | 2.10x |
Vicuna-13B | A6000 | 44.91 | 17.30 | 2.60x |
MPT-7B | A6000 | 22.79 | 16.87 | 1.35x |
MPT-30B | A6000 | OOM | 31.57 | -- |
Falcon-7B | A6000 | 39.44 | 27.34 | 1.44x |
Detailed benchmark (CPU vs. GPU)
Here is the difference between a fast and slow CPU on MPT-7B:
RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)
RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
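To reproduce tokens/s and ms/token figures like these on your own hardware, a minimal timing sketch is below. It assumes a quantized model and tokenizer are already loaded as in the Inference section and that generate accepts the usual Hugging Face generation arguments; the prompt and token count are illustrative.
import time

# Rough throughput/latency measurement for an already-loaded quantized model.
prompt = "The quick brown fox"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
n_new = 64  # illustrative number of tokens to generate

start = time.perf_counter()
output = model.generate(tokens, max_new_tokens=n_new)
elapsed = time.perf_counter() - start

generated = output.shape[1] - tokens.shape[1]
print(f"{generated / elapsed:.1f} tokens/s ({1000 * elapsed / generated:.2f} ms/token)")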
Reference
If you find AWQ useful or relevant to your research, please cite the paper:
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
Download files
Built Distributions
Hashes for autoawq-0.0.2-cp311-cp311-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | acc445f15f1ab24db58ccfe211dd70467768cca7b49953630cc291e2921bcbe0 |
MD5 | 9aae639e72edab48d8c4501a22e6acf4 |
BLAKE2b-256 | 88158d1fd538c040923c34c6f44d1ea0da98109ac942991f65bed09e5432da45 |
Hashes for autoawq-0.0.2-cp311-cp311-manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 990d25d4ef139984aa3504962959eaddc429988c42bb91b769af1411fb96bf3c |
MD5 | 2e8dd11c779223e2f3bf35f2747518f8 |
BLAKE2b-256 | 2fe4fb9a8b3db9246849ca5f4f16b92f57def52ab70c719f76e454d7955d953c |
Hashes for autoawq-0.0.2-cp310-cp310-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 418bc2df210d84cb9f055890e26c8d9d67b961fdbf1efef1f79f0727b7832f6a |
MD5 | fb40a2de82d9528e576becc2ffab381b |
BLAKE2b-256 | 4e6ed761f6c7267106b19d2b5db4a345bba170d67ab04c502b0c1d914e6062f8 |
Hashes for autoawq-0.0.2-cp310-cp310-manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 814099f41aab7ae87d8e9cefbb109b692ed01854e76d34e19bcaca45c40c258b |
MD5 | 98d8884280ba8290ff307a73348b3dc3 |
BLAKE2b-256 | 9736da85da021983420d9386d817fe28685248405bd222ac6fea76fe898a6b37 |
Hashes for autoawq-0.0.2-cp39-cp39-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | 71b71642ec01de50d161b8f694f5e466a0a79f2f4901b595422ba9dafd1b04b2 |
MD5 | db4a5e5c4a340fa931eaaf8de9568f20 |
BLAKE2b-256 | 8f4571ea459f3cea0ae97a2bfffeb38fbec84b03e3dc51f1d976f4b89706bbe2 |
Hashes for autoawq-0.0.2-cp39-cp39-manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | ec8ea5bd2170bb9ab56b33cc035334eea6278d4b06781eaceee3a6f480e754f6 |
MD5 | 31efb657d863cde7e9b5b28de82c482e |
BLAKE2b-256 | fbebedd71083eee76ac8c65dbef35992d09ffcd4d0cec3ec32d93ed493c7229c |
Hashes for autoawq-0.0.2-cp38-cp38-win_amd64.whl
Algorithm | Hash digest |
---|---|
SHA256 | e877a59d6f59e27112db61b682dfd0a042ae862640879c43affa9afc5f8d4f53 |
MD5 | fce13db0f9d3e67e68fe5617edfbf1c1 |
BLAKE2b-256 | b0ecc67c118b408ad61be53347fed4983d5e001e4cab21c4eff0de9ff83ad3de |
Hashes for autoawq-0.0.2-cp38-cp38-manylinux2014_x86_64.whl
Algorithm | Hash digest |
---|---|
SHA256 | e1d8740f4da789e3bf1f8960d692f1281a7604eb0c50bccabfc65c2102909cad |
MD5 | f35f4082a02f80384e878c06f1b5d74e |
BLAKE2b-256 | 137f877466164b17a354553b9cb50803489c62bb90b31ef5ea93d54a7d28a5d2 |
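If you want to verify a downloaded wheel against the SHA256 digests above, a minimal sketch using Python's standard hashlib module is below; the file name shown is one of the wheels listed above, and the expected value is its published SHA256 digest.
import hashlib

# Compute the SHA256 digest of a downloaded wheel and compare it
# against the value published above. The file name is illustrative.
wheel = "autoawq-0.0.2-cp310-cp310-manylinux2014_x86_64.whl"
expected = "814099f41aab7ae87d8e9cefbb109b692ed01854e76d34e19bcaca45c40c258b"

with open(wheel, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == expected else f"Mismatch: {digest}")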