AutoAWQ
| Roadmap | Examples | Issues: Help Wanted |
AutoAWQ is an easy-to-use package for 4-bit quantized models. Compared to FP16, it speeds up models by about 2x while reducing memory requirements by 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, and was created as an improved fork of the original work from MIT.
Latest News 🔥
- [2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
- [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
- [2023/08] PyPI package released and AutoModel class available
Install
Requirements:
- Compute Capability 7.5 (sm75) or higher. Turing and later architectures are supported; see the check below.
- CUDA Toolkit 11.8 or later.
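You can verify your GPU's compute capability with PyTorch before installing (a quick check; requires a working torch install with CUDA):

```python
import torch

# Prints a (major, minor) tuple, e.g. (8, 6) for an RTX 3090.
# AutoAWQ requires (7, 5) or higher.
print(torch.cuda.get_device_capability())
```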
Install:
```
pip install autoawq
```
Using conda
CUDA dependencies can sometimes be hard to manage. We recommend using conda with AutoAWQ:
```
conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
```
Build source
Build AutoAWQ from scratch
Building can take around 10 minutes, so download your model while you install AutoAWQ.
```
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```
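Whichever install route you take, a quick import confirms the package is usable (assuming the steps above completed without errors):

```
python -c "from awq import AutoAWQForCausalLM; print('AutoAWQ is installed')"
```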
Supported models
The detailed support list:
| Models  | Sizes                       |
|---------|-----------------------------|
| LLaMA-2 | 7B/13B/70B                  |
| LLaMA   | 7B/13B/30B/65B              |
| Vicuna  | 7B/13B                      |
| MPT     | 7B/30B                      |
| Falcon  | 7B/40B                      |
| OPT     | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom   | 560m/3B/7B                  |
| GPTJ    | 6.7B                        |
| Aquila  | 7B                          |
| Aquila2 | 7B/34B                      |
Usage
Under examples, you can find scripts that show how to quantize, run inference, and benchmark AutoAWQ models.
INT4 GEMM vs INT4 GEMV vs FP16
There are two versions of the AWQ kernels: GEMM and GEMV. Both names refer to how the underlying matrix multiplication runs. We suggest the following (a selection sketch follows the list):
- GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
- GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
- FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
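The kernel is baked in at quantization time. In recent AutoAWQ releases this is selected through a `version` key in `quant_config`; treat the key name as an assumption and check your installed release if it is rejected:

```python
# Sketch: choosing the kernel when quantizing.
# NOTE: the "version" key is an assumption based on recent AutoAWQ releases;
# older releases may infer the kernel from the checkpoint instead.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # or "GEMV" for batch size 1 / short context
}
```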
Examples
Quantization
Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'

# 4-bit weights ("w_bit"), zero-point quantization, one scale per group of 128 weights
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
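After saving, quant_path holds the tokenizer files plus the quantized weights. The Inference example below loads a weights file named awq_model_w4_g128.pt, whose suffix mirrors the quant_config above (w_bit=4, q_group_size=128); if your save_quantized call produces a different filename, pass that name when loading instead.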
Inference
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"),
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
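TextStreamer prints tokens to stdout as they arrive. If you also want the completed text as a string, decode the returned IDs with the standard Hugging Face tokenizer API (a small follow-up sketch):

```python
# generation_output contains the full sequence (prompt + completion) token IDs
output_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(output_text)
```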
AutoAWQForCausalLM.from_quantized
The main arguments (see the example call after the list):
- `quant_path`: Path to the folder containing the model files.
- `quant_filename`: The filename of the model weights or the `index.json` file.
- `max_new_tokens`: The max sequence length, used to allocate the kv-cache for fused models.
- `fuse_layers`: Whether or not to use fused layers.
- `batch_size`: The batch size to initialize the AWQ model with.
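An illustrative call combining the documented arguments (the values are examples, not requirements; `max_new_tokens` only matters for fused models, as noted above):

```python
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq",  # quant_path
    "awq_model_w4_g128.pt",             # quant_filename
    fuse_layers=True,                   # use fused modules for faster decoding
    max_new_tokens=512,                 # kv-cache allocation for fused models
    batch_size=1,
)
```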
Benchmarks
Vicuna 7B (LLaMA-2)
- Note: Blazing fast generation, slow context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMV
- Command:
```
python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv
```
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 231.393          | 153.632         | 4.66 GB (19.68%) |
| 1          | 64             | 64            | 233.909          | 154.475         | 4.66 GB (19.68%) |
| 1          | 128            | 128           | 233.145          | 152.133         | 4.66 GB (19.68%) |
| 1          | 256            | 256           | 228.562          | 147.692         | 4.67 GB (19.72%) |
| 1          | 512            | 512           | 228.914          | 139.179         | 4.80 GB (20.26%) |
| 1          | 1024           | 1024          | 227.393          | 125.058         | 5.56 GB (23.48%) |
| 1          | 2048           | 2048          | 225.736          | 123.228         | 8.08 GB (34.09%) |
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMM
- Command:
```
python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq
```
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 521.444          | 126.51          | 4.55 GB (19.21%) |
| 1          | 64             | 64            | 2618.88          | 125.428         | 4.57 GB (19.31%) |
| 1          | 128            | 128           | 2808.09          | 123.865         | 4.61 GB (19.44%) |
| 1          | 256            | 256           | 2807.46          | 120.779         | 4.67 GB (19.72%) |
| 1          | 512            | 512           | 2769.9           | 115.08          | 4.80 GB (20.26%) |
| 1          | 1024           | 1024          | 2640.95          | 105.493         | 5.56 GB (23.48%) |
| 1          | 2048           | 2048          | 2341.36          | 104.188         | 8.08 GB (34.09%) |
MPT 7B
- Note: Blazing fast generation, slow context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMV
- Command:

```
python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
```
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 187.332          | 136.765         | 3.65 GB (15.42%) |
| 1          | 64             | 64            | 241.026          | 136.476         | 3.67 GB (15.48%) |
| 1          | 128            | 128           | 239.44           | 137.599         | 3.70 GB (15.61%) |
| 1          | 256            | 256           | 233.184          | 137.02          | 3.76 GB (15.88%) |
| 1          | 512            | 512           | 233.082          | 135.633         | 3.89 GB (16.41%) |
| 1          | 1024           | 1024          | 231.504          | 122.197         | 4.40 GB (18.57%) |
| 1          | 2048           | 2048          | 228.307          | 121.468         | 5.92 GB (24.98%) |
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMM
- Command:
```
python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq
```
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 557.714          | 118.567         | 3.65 GB (15.42%) |
| 1          | 64             | 64            | 2752.9           | 120.772         | 3.67 GB (15.48%) |
| 1          | 128            | 128           | 2982.67          | 119.52          | 3.70 GB (15.61%) |
| 1          | 256            | 256           | 3009.16          | 116.911         | 3.76 GB (15.88%) |
| 1          | 512            | 512           | 2901.91          | 111.607         | 3.95 GB (16.68%) |
| 1          | 1024           | 1024          | 2718.68          | 102.623         | 4.40 GB (18.57%) |
| 1          | 2048           | 2048          | 2363.61          | 101.368         | 5.92 GB (24.98%) |
Falcon 7B
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMM
- Command:

```
python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
```
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 466.826          | 95.1413         | 4.47 GB (18.88%) |
| 1          | 64             | 64            | 1920.61          | 94.5963         | 4.48 GB (18.92%) |
| 1          | 128            | 128           | 2406.1           | 94.793          | 4.48 GB (18.92%) |
| 1          | 256            | 256           | 2521.08          | 94.1144         | 4.48 GB (18.92%) |
| 1          | 512            | 512           | 2478.28          | 93.4123         | 4.48 GB (18.92%) |
| 1          | 1024           | 1024          | 2256.22          | 94.0237         | 4.69 GB (19.78%) |
| 1          | 2048           | 2048          | 1831.71          | 94.2032         | 6.83 GB (28.83%) |
Aquila2 34B
- Note: Fast generation, fast context processing
- GPU: NVIDIA A100-SXM4-40GB
- Version: GEMM
- Command:

```
python examples/benchmark.py --model_path casperhansen/aquilachat2-34b-awq --quant_file pytorch_model.bin.index.json
```
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)     |
|------------|----------------|---------------|------------------|-----------------|-------------------|
| 1          | 32             | 32            | 36.7505          | 23.423          | 18.26 GB (46.12%) |
| 1          | 64             | 64            | 516.544          | 23.3536         | 18.26 GB (46.12%) |
| 1          | 128            | 128           | 643.968          | 23.3803         | 18.26 GB (46.12%) |
| 1          | 256            | 256           | 736.236          | 23.389          | 18.34 GB (46.32%) |
| 1          | 512            | 512           | 829.405          | 23.3889         | 18.54 GB (46.84%) |
| 1          | 1024           | 1024          | 836.023          | 23.3757         | 18.95 GB (47.87%) |
| 1          | 2048           | 2048          | 802.632          | 23.3777         | 20.25 GB (51.15%) |
| 1          | 4096           | 4096          | 722.49           | 23.4252         | 25.38 GB (64.12%) |
Reference
If you find AWQ useful or relevant to your research, you can cite their paper:
```
@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
```