AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
AutoAWQ
| Roadmap | Examples | Issues: Help Wanted |
AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, and builds on and improves the original AWQ implementation from MIT.
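A rough back-of-the-envelope check on the memory claim (illustrative only; the per-group overhead figures below are assumptions, not exact AutoAWQ internals):

```python
# Rough estimate of weight-memory savings from 4-bit quantization vs FP16.
# Assumptions (illustrative): group size 128, with one FP16 scale and one
# packed 4-bit zero point per group of weights.
def weight_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1024**3

n_params = 7e9                      # a 7B-parameter model
group_size = 128
overhead = (16 + 4) / group_size    # scale + zero point, amortized per weight
int4_bits = 4 + overhead            # ~4.16 effective bits per weight

fp16 = weight_gib(n_params, 16)
int4 = weight_gib(n_params, int4_bits)
print(f"FP16 weights: {fp16:.1f} GiB")        # ~13.0 GiB
print(f"INT4 weights: {int4:.1f} GiB")        # ~3.4 GiB
print(f"weight reduction: {fp16 / int4:.1f}x")
```

The weights alone shrink by nearly 4x; end-to-end savings land closer to the quoted 3x because activations and the kv-cache remain in FP16.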
Latest News 🔥
- [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
- [2023/08] PyPI package released and AutoModel class available.
Install
Requirements:
- Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
- CUDA Toolkit 11.8 and later.
Install:
- Use pip to install autoawq:
pip install autoawq
Using conda
CUDA dependencies can be hard to manage. Using conda with AutoAWQ is recommended:
conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
Build source
Build AutoAWQ from scratch
Build time can take 10 minutes. Download your model while you install AutoAWQ.
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
Supported models
The detailed support list:
Models | Sizes |
---|---|
LLaMA-2 | 7B/13B/70B |
LLaMA | 7B/13B/30B/65B |
Vicuna | 7B/13B |
MPT | 7B/30B |
Falcon | 7B/40B |
OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
Bloom | 560m/3B/7B |
GPTJ | 6.7B |
Usage
Under examples, you can find examples of how to quantize, run inference, and benchmark AutoAWQ models.
INT4 GEMM vs INT4 GEMV vs FP16
There are two versions of AWQ: GEMM and GEMV. Both names relate to how matrix multiplication runs under the hood. We suggest the following:
- GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
- GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
- FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
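The rule of thumb above can be captured in a small helper. This is purely illustrative; `pick_backend` is not part of the AutoAWQ API, and real kernel choice also depends on hardware and context length:

```python
# Illustrative helper encoding the backend rule of thumb above; not an
# AutoAWQ API. Returns a suggested kernel/runtime for a given workload.
def pick_backend(batch_size, long_context=False):
    if batch_size >= 8:
        return "fp16"   # large batches: unquantized FP16 (e.g. TGI or vLLM)
    if batch_size == 1 and not long_context:
        return "gemv"   # batch size 1, short context: highest tokens/s
    return "gemm"       # larger context, or 1 < batch size < 8

print(pick_backend(1))    # gemv
print(pick_backend(4))    # gemm
print(pick_backend(16))   # fp16
```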
Examples
Quantization
Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
Inference
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer
quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"
# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: {prompt}
ASSISTANT:"""
tokens = tokenizer(
prompt_template.format(prompt="How are you today?"),
return_tensors='pt'
).input_ids.cuda()
# Generate output
generation_output = model.generate(
tokens,
streamer=streamer,
max_new_tokens=512
)
AutoAWQForCausalLM.from_quantized arguments:
- quant_path: Path to the folder containing the model files.
- quant_filename: The filename of the model weights or the index.json file.
- max_new_tokens: The max sequence length, used to allocate the kv-cache for fused models.
- fuse_layers: Whether or not to use fused layers.
- batch_size: The batch size to initialize the AWQ model with.
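To see why max_new_tokens and batch_size matter for fused models, here is a rough sketch of the FP16 kv-cache allocation they imply. The shapes assume a LLaMA-7B-style model (32 layers, 32 heads, head dim 128); AutoAWQ's exact allocation may differ:

```python
# Rough FP16 kv-cache size implied by batch size and max sequence length.
# Shapes assume a LLaMA-7B-style model; AutoAWQ's exact allocation may differ.
def kv_cache_gib(batch_size, seq_len, n_layers=32, n_heads=32, head_dim=128):
    elements = batch_size * seq_len * n_layers * n_heads * head_dim
    # 2 tensors (K and V) * 2 bytes per FP16 element
    return 2 * 2 * elements / 1024**3

print(f"{kv_cache_gib(1, 512):.2f} GiB")    # 0.25 GiB
print(f"{kv_cache_gib(1, 2048):.2f} GiB")   # 1.00 GiB
```

This quadrupling from 512 to 2048 tokens is consistent with the VRAM growth visible in the benchmark tables below.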
Benchmarks
Vicuna 7B (LLaMa-2)
- Note: Blazing fast generation, slow context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMV
- Command:
python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 231.393 | 153.632 | 4.66 GB (19.68%) |
1 | 64 | 64 | 233.909 | 154.475 | 4.66 GB (19.68%) |
1 | 128 | 128 | 233.145 | 152.133 | 4.66 GB (19.68%) |
1 | 256 | 256 | 228.562 | 147.692 | 4.67 GB (19.72%) |
1 | 512 | 512 | 228.914 | 139.179 | 4.80 GB (20.26%) |
1 | 1024 | 1024 | 227.393 | 125.058 | 5.56 GB (23.48%) |
1 | 2048 | 2048 | 225.736 | 123.228 | 8.08 GB (34.09%) |
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMM
- Command:
python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 521.444 | 126.51 | 4.55 GB (19.21%) |
1 | 64 | 64 | 2618.88 | 125.428 | 4.57 GB (19.31%) |
1 | 128 | 128 | 2808.09 | 123.865 | 4.61 GB (19.44%) |
1 | 256 | 256 | 2807.46 | 120.779 | 4.67 GB (19.72%) |
1 | 512 | 512 | 2769.9 | 115.08 | 4.80 GB (20.26%) |
1 | 1024 | 1024 | 2640.95 | 105.493 | 5.56 GB (23.48%) |
1 | 2048 | 2048 | 2341.36 | 104.188 | 8.08 GB (34.09%) |
MPT 7B
- Note: Blazing fast generation, slow context processing
- GPU: NVIDIA GeForce RTX 3090
- Command:
python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
- Version: GEMV
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 187.332 | 136.765 | 3.65 GB (15.42%) |
1 | 64 | 64 | 241.026 | 136.476 | 3.67 GB (15.48%) |
1 | 128 | 128 | 239.44 | 137.599 | 3.70 GB (15.61%) |
1 | 256 | 256 | 233.184 | 137.02 | 3.76 GB (15.88%) |
1 | 512 | 512 | 233.082 | 135.633 | 3.89 GB (16.41%) |
1 | 1024 | 1024 | 231.504 | 122.197 | 4.40 GB (18.57%) |
1 | 2048 | 2048 | 228.307 | 121.468 | 5.92 GB (24.98%) |
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Version: GEMM
- Command:
python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 557.714 | 118.567 | 3.65 GB (15.42%) |
1 | 64 | 64 | 2752.9 | 120.772 | 3.67 GB (15.48%) |
1 | 128 | 128 | 2982.67 | 119.52 | 3.70 GB (15.61%) |
1 | 256 | 256 | 3009.16 | 116.911 | 3.76 GB (15.88%) |
1 | 512 | 512 | 2901.91 | 111.607 | 3.95 GB (16.68%) |
1 | 1024 | 1024 | 2718.68 | 102.623 | 4.40 GB (18.57%) |
1 | 2048 | 2048 | 2363.61 | 101.368 | 5.92 GB (24.98%) |
Falcon 7B
- Note: Fast generation, fast context processing
- GPU: NVIDIA GeForce RTX 3090
- Command:
python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
- Version: GEMM
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
---|---|---|---|---|---|
1 | 32 | 32 | 466.826 | 95.1413 | 4.47 GB (18.88%) |
1 | 64 | 64 | 1920.61 | 94.5963 | 4.48 GB (18.92%) |
1 | 128 | 128 | 2406.1 | 94.793 | 4.48 GB (18.92%) |
1 | 256 | 256 | 2521.08 | 94.1144 | 4.48 GB (18.92%) |
1 | 512 | 512 | 2478.28 | 93.4123 | 4.48 GB (18.92%) |
1 | 1024 | 1024 | 2256.22 | 94.0237 | 4.69 GB (19.78%) |
1 | 2048 | 2048 | 1831.71 | 94.2032 | 6.83 GB (28.83%) |
Reference
If you find AWQ useful or relevant to your research, you can cite their paper:
@article{lin2023awq,
title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
journal={arXiv},
year={2023}
}