AutoAWQ
AutoAWQ is an easy-to-use package for 4-bit quantized models. It speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs and builds on the original work from MIT.
Latest News 🔥
- [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
- [2023/08] PyPI package released and AutoModel class available.
Install
Requirements:
- Compute Capability 8.0 (sm80) or higher; Ampere and later architectures are supported.
- CUDA Toolkit 11.8 or later.
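You can verify that your GPU meets the compute-capability requirement with a quick PyTorch check (a minimal sketch; assumes a CUDA-enabled PyTorch install):

import torch

# Compute capability must be at least 8.0 (sm80), i.e. Ampere or newer.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
assert (major, minor) >= (8, 0), "AutoAWQ requires compute capability 8.0+"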
Install:
Use pip to install AutoAWQ:
pip install autoawq
Using conda
CUDA dependencies can be hard to manage. We recommend using conda with AutoAWQ:
conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
Build from source
Building can take around 10 minutes, so you may want to download your model while AutoAWQ installs.
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
Supported models
The detailed support list:
Models | Sizes |
---|---|
LLaMA-2 | 7B/13B/70B |
LLaMA | 7B/13B/30B/65B |
Vicuna | 7B/13B |
MPT | 7B/30B |
Falcon | 7B/40B |
OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
Bloom | 560m/3B/7B |
GPTJ | 6.7B |
Usage
Under examples, you can find scripts showing how to quantize, run inference, and benchmark AutoAWQ models.
INT4 GEMM vs INT4 GEMV vs FP16
AWQ comes in two kernel versions: GEMM and GEMV. Both names refer to how the underlying matrix multiplication is executed. We suggest the following (see the sketch after this list for selecting a version):
- GEMV (quantized): Best for small context and batch size 1; highest tokens/s.
- GEMM (quantized): Best for larger context and batch sizes up to 8; faster than GEMV at batch size > 1, slower at batch size 1.
- FP16 (non-quantized): Best for batch sizes of 8 or larger; highest throughput. We recommend TGI or vLLM.
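If you want to target a specific kernel at quantization time, the quant_config takes a version key. A minimal sketch, assuming the "version" key behaves as in the AutoAWQ release current at the time of writing:

# The "version" key selects the quantized kernel ("GEMM" or "GEMV").
# Assumption: this key is supported by your installed AutoAWQ release.
quant_config_gemv = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV" }
quant_config_gemm = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }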
Examples
Quantization
Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
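For reference on the quant_config above: w_bit is the weight bit-width (4-bit here), q_group_size is the number of weights that share one quantization scale (128 is the common AWQ default), and zero_point enables asymmetric quantization with a per-group zero point.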
Inference
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"
# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: {prompt}
ASSISTANT:"""
tokens = tokenizer(
prompt_template.format(prompt="How are you today?"),
return_tensors='pt'
).input_ids.cuda()
# Generate output
generation_output = model.generate(
tokens,
streamer=streamer,
max_new_tokens=512
)
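The streamer prints tokens as they are generated, so nothing more is needed for interactive output. generation_output holds the generated token ids (following the transformers convention), which you can also decode manually:

# Decode the full generated sequence back to text (includes the prompt tokens).
output_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(output_text)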
AutoAWQForCausalLM.from_quantized
- quant_path: Path to the folder containing the model files.
- quant_filename: The filename of the model weights or the index.json file.
- max_new_tokens: The max sequence length, used to allocate the kv-cache for fused models.
- fuse_layers: Whether or not to use fused layers.
- batch_size: The batch size to initialize the AWQ model with.
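Putting these parameters together, a call might look like this (a sketch using the model and filename from the inference example above):

model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq",  # quant_path
    "awq_model_w4_g128.pt",             # quant_filename
    fuse_layers=True,
    max_new_tokens=512,
    batch_size=1,
)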
Benchmarks
Vicuna 7B (LLaMA-2)
- Note: Blazing fast generation, slow context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMV
- Command:
python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 231.393 | 153.632 | 4.66 GB (19.68%) |
1 | 64 | 64 | 233.909 | 154.475 | 4.66 GB (19.68%) |
1 | 128 | 128 | 233.145 | 152.133 | 4.66 GB (19.68%) |
1 | 256 | 256 | 228.562 | 147.692 | 4.67 GB (19.72%) |
1 | 512 | 512 | 228.914 | 139.179 | 4.80 GB (20.26%) |
1 | 1024 | 1024 | 227.393 | 125.058 | 5.56 GB (23.48%) |
1 | 2048 | 2048 | 225.736 | 123.228 | 8.08 GB (34.09%) |
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMM
- Command:
python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 521.444 | 126.51 | 4.55 GB (19.21%) |
1 | 64 | 64 | 2618.88 | 125.428 | 4.57 GB (19.31%) |
1 | 128 | 128 | 2808.09 | 123.865 | 4.61 GB (19.44%) |
1 | 256 | 256 | 2807.46 | 120.779 | 4.67 GB (19.72%) |
1 | 512 | 512 | 2769.9 | 115.08 | 4.80 GB (20.26%) |
1 | 1024 | 1024 | 2640.95 | 105.493 | 5.56 GB (23.48%) |
1 | 2048 | 2048 | 2341.36 | 104.188 | 8.08 GB (34.09%) |
MPT 7B
- Note: Blazing fast generation, slow context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMV
- Command:
python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 187.332 | 136.765 | 3.65 GB (15.42%) |
1 | 64 | 64 | 241.026 | 136.476 | 3.67 GB (15.48%) |
1 | 128 | 128 | 239.44 | 137.599 | 3.70 GB (15.61%) |
1 | 256 | 256 | 233.184 | 137.02 | 3.76 GB (15.88%) |
1 | 512 | 512 | 233.082 | 135.633 | 3.89 GB (16.41%) |
1 | 1024 | 1024 | 231.504 | 122.197 | 4.40 GB (18.57%) |
1 | 2048 | 2048 | 228.307 | 121.468 | 5.92 GB (24.98%) |
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMM
- Command:
python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 557.714 | 118.567 | 3.65 GB (15.42%) |
1 | 64 | 64 | 2752.9 | 120.772 | 3.67 GB (15.48%) |
1 | 128 | 128 | 2982.67 | 119.52 | 3.70 GB (15.61%) |
1 | 256 | 256 | 3009.16 | 116.911 | 3.76 GB (15.88%) |
1 | 512 | 512 | 2901.91 | 111.607 | 3.95 GB (16.68%) |
1 | 1024 | 1024 | 2718.68 | 102.623 | 4.40 GB (18.57%) |
1 | 2048 | 2048 | 2363.61 | 101.368 | 5.92 GB (24.98%) |
Falcon 7B
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMM
- Command:
python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 466.826 | 95.1413 | 4.47 GB (18.88%) |
1 | 64 | 64 | 1920.61 | 94.5963 | 4.48 GB (18.92%) |
1 | 128 | 128 | 2406.1 | 94.793 | 4.48 GB (18.92%) |
1 | 256 | 256 | 2521.08 | 94.1144 | 4.48 GB (18.92%) |
1 | 512 | 512 | 2478.28 | 93.4123 | 4.48 GB (18.92%) |
1 | 1024 | 1024 | 2256.22 | 94.0237 | 4.69 GB (19.78%) |
1 | 2048 | 2048 | 1831.71 | 94.2032 | 6.83 GB (28.83%) |
Reference
If you find AWQ useful or relevant to your research, please cite the paper:
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv},
year={2023}
}