AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
AutoAWQ
AutoAWQ is an easy-to-use package for 4-bit quantized models. It speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. It builds on and improves the original AWQ work from MIT.
Latest News 🔥
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
- [2023/08] PyPi package released and AutoModel class available
Install
Requirements:
- Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
- CUDA Toolkit 11.8 and later.
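The two requirements above can be checked before installing. The following is a minimal sketch, not part of AutoAWQ: `meets_requirements` is a hypothetical helper, and the live check assumes PyTorch is already installed. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the toolkit installed on the system.

```python
def meets_requirements(compute_capability, cuda_version):
    """Return True if a GPU and CUDA version satisfy the stated minimums.

    compute_capability: (major, minor) tuple, e.g. (8, 6)
    cuda_version: version string such as "11.8" or "12.1"
    """
    min_capability = (8, 0)  # sm80, i.e. Ampere
    min_cuda = (11, 8)
    cuda = tuple(int(part) for part in cuda_version.split(".")[:2])
    return tuple(compute_capability) >= min_capability and cuda >= min_cuda


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available() and torch.version.cuda:
        # Capability of GPU 0 and the CUDA version of the PyTorch build.
        capability = torch.cuda.get_device_capability(0)
        ok = meets_requirements(capability, torch.version.cuda)
        print(f"capability={capability}, cuda={torch.version.cuda}, ok={ok}")
    else:
        print("No CUDA-enabled PyTorch build or GPU detected.")
```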
Install:
- Use pip to install AutoAWQ:
pip install autoawq
Using conda
CUDA dependencies can be hard to manage sometimes. It is recommended to use conda with AutoAWQ:
conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
Build source
Build AutoAWQ from scratch
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
Supported models
The detailed support list:
| Models | Sizes |
|---|---|
| LLaMA-2 | 7B/13B/70B |
| LLaMA | 7B/13B/30B/65B |
| Vicuna | 7B/13B |
| MPT | 7B/30B |
| Falcon | 7B/40B |
| OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom | 560m/3B/7B |
| GPTJ | 6.7B |
Usage
Below, you will find examples of how to easily quantize a model and run inference.
Quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
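To make the `quant_config` keys more concrete, here is a simplified pure-Python sketch of generic group-wise quantization. It assumes the usual meaning of the three keys (`w_bit` as the integer width, `q_group_size` as the number of weights sharing one scale, `zero_point` as asymmetric quantization). It is an illustration only: the function names are hypothetical, and it reproduces neither AutoAWQ's kernels nor the activation-aware scaling that distinguishes AWQ.

```python
def quantize_group(weights, w_bit=4, zero_point=True):
    """Quantize one group of floats to w_bit integer codes.

    Returns (codes, scale, zero). With zero_point=True the codes lie in
    [0, 2**w_bit - 1]; otherwise in [-2**(w_bit-1), 2**(w_bit-1) - 1].
    """
    if zero_point:
        q_min, q_max = 0, 2 ** w_bit - 1
        w_min, w_max = min(weights), max(weights)
        scale = max(w_max - w_min, 1e-5) / q_max
        zero = round(-w_min / scale)
    else:
        q_min, q_max = -(2 ** (w_bit - 1)), 2 ** (w_bit - 1) - 1
        scale = max(max(abs(w) for w in weights), 1e-5) / q_max
        zero = 0
    # Round to the nearest code and clamp to the representable range.
    codes = [min(max(round(w / scale) + zero, q_min), q_max) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, q_group_size=128, w_bit=4, zero_point=True):
    """Split a row of weights into groups and quantize each independently."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit, zero_point)
        for start in range(0, len(row), q_group_size)
    ]


if __name__ == "__main__":
    row = [((i * 37) % 101) / 50.0 - 1.0 for i in range(256)]
    for codes, scale, zero in quantize_row(row):
        print(f"{len(codes)} codes, scale={scale:.4f}, zero={zero}")
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero points.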
Inference
Run inference on a quantized model from Hugging Face:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
model.generate(...)
Benchmarks
Benchmark speeds vary from server to server and also depend on your CPU. To minimize latency, rent a GPU/CPU combination with high memory bandwidth on both and high single-core speed on the CPU.
| Model | GPU | FP16 latency (ms) | INT4 latency (ms) | Speedup |
|---|---|---|---|---|
| LLaMA-2-7B | 4090 | 19.97 | 8.66 | 2.31x |
| LLaMA-2-13B | 4090 | OOM | 13.54 | -- |
| Vicuna-7B | 4090 | 19.09 | 8.61 | 2.22x |
| Vicuna-13B | 4090 | OOM | 12.17 | -- |
| MPT-7B | 4090 | 17.09 | 12.58 | 1.36x |
| MPT-30B | 4090 | OOM | 23.54 | -- |
| Falcon-7B | 4090 | 29.91 | 19.84 | 1.51x |
| LLaMA-2-7B | A6000 | 27.14 | 12.44 | 2.18x |
| LLaMA-2-13B | A6000 | 47.28 | 20.28 | 2.33x |
| Vicuna-7B | A6000 | 26.06 | 12.43 | 2.10x |
| Vicuna-13B | A6000 | 44.91 | 17.30 | 2.60x |
| MPT-7B | A6000 | 22.79 | 16.87 | 1.35x |
| MPT-30B | A6000 | OOM | 31.57 | -- |
| Falcon-7B | A6000 | 39.44 | 27.34 | 1.44x |
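The Speedup column is consistent with the ratio of FP16 latency to INT4 latency. The short sketch below recomputes it for a few rows copied from the table, under that assumption; rows where FP16 ran out of memory have no ratio. The `speedup` helper is illustrative and not part of AutoAWQ.

```python
# (model, GPU, FP16 latency in ms or None for OOM, INT4 latency in ms)
ROWS = [
    ("LLaMA-2-7B", "4090", 19.97, 8.66),
    ("LLaMA-2-13B", "4090", None, 13.54),
    ("MPT-7B", "A6000", 22.79, 16.87),
]


def speedup(fp16_ms, int4_ms):
    """Return the FP16/INT4 latency ratio, or None if FP16 was OOM."""
    if fp16_ms is None:
        return None
    return fp16_ms / int4_ms


for model, gpu, fp16_ms, int4_ms in ROWS:
    ratio = speedup(fp16_ms, int4_ms)
    label = "--" if ratio is None else f"{ratio:.2f}x"
    print(f"{model} on {gpu}: {label}")
```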
Detailed benchmark (CPU vs. GPU)
Here is the difference between a fast and slow CPU on MPT-7B:
RTX 4090 + Intel i9 13900K (2 different VMs):
- CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
- CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)
RTX 4090 + AMD EPYC 7-Series (3 different VMs):
- CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
- CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
- CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
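Throughput and per-token latency are reciprocals of each other, so either figure can be derived from the other. A small sketch of the conversion, using hypothetical helper names and two of the latencies listed above:

```python
def to_tokens_per_second(latency_ms):
    """Convert per-token latency in milliseconds to tokens per second."""
    return 1000.0 / latency_ms


def to_latency_ms(tokens_per_s):
    """Convert tokens per second to per-token latency in milliseconds."""
    return 1000.0 / tokens_per_s


fast_ms = 7.46   # RTX 4090 + Intel i9 13900K, first VM
slow_ms = 18.15  # RTX 4090 + AMD EPYC 7-Series, third VM

print(f"fast: {to_tokens_per_second(fast_ms):.1f} tokens/s")
print(f"slow: {to_tokens_per_second(slow_ms):.1f} tokens/s")
print(f"slowdown from the CPU alone: {slow_ms / fast_ms:.2f}x")
```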
Reference
If you find AWQ useful or relevant to your research, you can cite their paper:
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv},
year={2023}
}