AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

AutoAWQ

AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ builds on and improves the original AWQ work from MIT.

Latest News 🔥

  • [2023/12] Mixtral, LLaVa, QWen, Baichuan model support.
  • [2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
  • [2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
  • [2023/08] PyPI package released and AutoModel class available

Install

Prerequisites

  • Your GPU(s) must have Compute Capability 7.5 or higher, which means Turing and later architectures are supported.
  • Your CUDA version must be CUDA 11.8 or later.
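
The two prerequisites above can be checked before installing. The following is a minimal sketch, assuming PyTorch is already installed; the helper name `meets_autoawq_prereqs` is hypothetical and not part of AutoAWQ. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the toolkit installed on your system.

```python
def meets_autoawq_prereqs(compute_capability, cuda_version):
    """Check a (major, minor) compute capability and an 'X.Y' CUDA version
    against the minimums stated in the prerequisites (7.5 and 11.8)."""
    min_capability = (7, 5)  # Turing
    min_cuda = (11, 8)
    cuda = tuple(int(part) for part in cuda_version.split(".")[:2])
    return tuple(compute_capability) >= min_capability and cuda >= min_cuda


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot query the GPU.")
    else:
        if torch.cuda.is_available() and torch.version.cuda:
            # Capability of the first visible GPU, e.g. (8, 9).
            capability = torch.cuda.get_device_capability(0)
            ok = meets_autoawq_prereqs(capability, torch.version.cuda)
            print(capability, torch.version.cuda, ok)
        else:
            print("No CUDA device is visible to PyTorch.")
```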

Install from PyPI

To install the newest AutoAWQ from PyPI, you need CUDA 12.1 installed.

pip install autoawq

If you cannot use CUDA 12.1, you can still use CUDA 11.8 and install the wheel from the latest release.

pip install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl

Build from source

Building can take 10-20 minutes. Download your model while you install AutoAWQ.

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models | Sizes
LLaMA-2 | 7B/13B/70B
LLaMA | 7B/13B/30B/65B
Mistral | 7B
Vicuna | 7B/13B
MPT | 7B/30B
Falcon | 7B/40B
OPT | 125m/1.3B/2.7B/6.7B/13B/30B
Bloom | 560m/3B/7B
Aquila | 7B
Aquila2 | 7B/34B
Yi | 6B/34B
Qwen | 1.8B/7B/14B/72B
BigCode | 1B/7B/15B
GPT NeoX | 20B
GPT-J | 6B
LLaVa | 7B/13B
Mixtral | 8x7B
Baichuan | 7B/13B

Usage

Under examples, you can find examples of how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names relate to how matrix multiplication runs under the hood. We suggest the following:

  • GEMV (quantized): 20% faster than GEMM, but only at batch size 1 (not good for large contexts).
  • GEMM (quantized): Much faster than FP16 at batch sizes below 8 (good with large contexts).
  • FP16 (non-quantized): Recommended for the highest throughput; use vLLM.
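
The suggestions above can be written down as a small rule of thumb. This is only a sketch of the list, not an AutoAWQ API: the function name `suggest_awq_version` and the `long_context` flag are hypothetical, and what counts as a "large context" is left to you.

```python
def suggest_awq_version(batch_size, long_context=False):
    """Map a workload to "GEMV", "GEMM", or "FP16" following the suggestions above."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # GEMV is the fastest option, but only at batch size 1 with short contexts.
    if batch_size == 1 and not long_context:
        return "GEMV"
    # GEMM handles large contexts and is suggested for batch sizes below 8.
    if batch_size < 8:
        return "GEMM"
    # At larger batch sizes the suggestion is non-quantized FP16 (for example via vLLM).
    return "FP16"
```

For example, `suggest_awq_version(1)` picks GEMV, while `suggest_awq_version(1, long_context=True)` and `suggest_awq_version(4)` both pick GEMM.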

Compute-bound vs Memory-bound

At small batch sizes with small 7B models, generation is memory-bound: throughput is limited by the bandwidth the GPU has for moving the weights through memory, and this is essentially what caps how many tokens per second we can generate. Being memory-bound is what makes quantized models faster, because the weights are 3x smaller and can therefore be moved through memory much faster. This is different from being compute-bound, where most of the generation time is spent on matrix multiplication.
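
A back-of-the-envelope estimate illustrates the memory-bound case. The sketch below assumes that every weight is read once per generated token at batch size 1, and it assumes a memory bandwidth of roughly 1 TB/s (an illustrative figure; check your GPU's specification). The result is an upper bound, since it ignores the KV cache, activations, and kernel overhead.

```python
def decode_tokens_per_second_bound(weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode speed when each token requires reading all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


params = 7e9                 # a 7B-parameter model
fp16_bytes = params * 2      # FP16 stores 2 bytes per weight
w4_bytes = fp16_bytes / 3    # "3x smaller", as stated above
bandwidth = 1.0e12           # assumed: about 1 TB/s of GPU memory bandwidth

fp16_bound = decode_tokens_per_second_bound(fp16_bytes, bandwidth)
w4_bound = decode_tokens_per_second_bound(w4_bytes, bandwidth)

print(f"FP16 bound: {fp16_bound:.1f} tokens/s")
print(f"W4   bound: {w4_bound:.1f} tokens/s")
```

Under these assumptions the quantized bound is exactly 3x the FP16 bound, which is where the speedup at small batch sizes comes from.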

In the scenario of being compute-bound, which happens at higher batch sizes, you will not gain a speed-up using a W4A16 quantized model because the overhead of dequantization will slow down the overall generation. This happens because AWQ quantized models only store the weights in INT4 but perform FP16 operations during inference, so we are essentially converting INT4 -> FP16 during inference.
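
To make the INT4 -> FP16 conversion concrete, here is a minimal pure-Python sketch of plain group-wise 4-bit quantization with a zero point, followed by dequantization. It is an illustration only: it is not AutoAWQ's kernel, it omits the activation-aware scaling that defines AWQ, and the function names are hypothetical. In practice one scale and one zero point would be stored per group of weights.

```python
def quantize_group(weights, w_bit=4):
    """Quantize one group of floats to unsigned w_bit integers with a zero point."""
    q_max = 2 ** w_bit - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        scale = 1.0                             # constant group: avoid division by zero
    zero = min(q_max, max(0, round(-w_min / scale)))
    q = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Convert stored integers back to floats; this is the extra work done at inference."""
    return [(value - zero) * scale for value in q]


group = [-0.8, -0.3, 0.0, 0.25, 0.7]
q, scale, zero = quantize_group(group)
restored = dequantize_group(q, scale, zero)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(q, scale, zero, max_error)
```

The dequantization step runs for every weight on every forward pass, which is the overhead that stops paying off once generation becomes compute-bound.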

Fused modules

Fused modules are a large part of the speedup you get from AutoAWQ. The idea is to combine multiple layers into a single operation, which makes them more efficient. Fused modules are a set of custom modules that work separately from Huggingface models. They are compatible with model.generate() and other Huggingface methods, but activating them comes with some inflexibility in how you can use your model:

  • Fused modules are activated when you use fuse_layers=True.
  • A custom cache is implemented. It preallocates based on batch size and sequence length.
    • You cannot change the sequence length after you have created your model.
    • Reference: AutoAWQForCausalLM.from_quantized(max_new_tokens=seq_len, batch_size=batch_size)
  • The main accelerator in the fused modules comes from FasterTransformer, which is only compatible with Linux.
  • The past_key_values from model.generate() are only dummy values, so they cannot be used after generation.
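
Because the cache is preallocated, the sequence length and batch size have to be decided at load time. The sketch below gathers the keyword arguments named in the list above into one place; the helper names `fused_load_kwargs` and `load_fused_model` are hypothetical, and the exact behavior of these arguments may differ between AutoAWQ versions.

```python
def fused_load_kwargs(seq_len, batch_size):
    """Build the load-time arguments that size the preallocated cache."""
    if seq_len < 1 or batch_size < 1:
        raise ValueError("seq_len and batch_size must be positive")
    return {
        "fuse_layers": True,        # activates the fused modules
        "max_new_tokens": seq_len,  # fixed once the model has been created
        "batch_size": batch_size,
    }


def load_fused_model(quant_path, seq_len, batch_size):
    """Load a quantized model with fused modules (requires autoawq and a CUDA GPU)."""
    # Imported lazily so the helper above can be used without autoawq installed.
    from awq import AutoAWQForCausalLM

    return AutoAWQForCausalLM.from_quantized(
        quant_path, **fused_load_kwargs(seq_len, batch_size)
    )
```

If you later need a longer sequence or a larger batch, reload the model with new values rather than trying to change them on the existing instance.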

Examples

More examples can be found in the examples directory.

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>"""

prompt = "You're standing on the surface of the Earth. "\
        "You walk one mile south, one mile west and one mile north. "\
        "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt_template.format(prompt=prompt), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)

Benchmarks

These benchmarks showcase the speed and memory usage of processing context (prefill) and generating tokens (decoding). The results include speed at various batch sizes and different versions of AWQ kernels. We have aimed to test models fairly using the same benchmarking tool that you can use to reproduce the results. Do note that speed may vary not only between GPUs but also between CPUs. What matters most is a GPU with high memory bandwidth and a CPU with high single core clock speed.

  • Tested with AutoAWQ version 0.1.6
  • GPU: RTX 4090 (AMD Ryzen 9 7950X)
  • Command: python examples/benchmark.py --model_path <hf_model> --batch_size 1
  • 🟢 for GEMV, 🔵 for GEMM, 🔴 for avoid using

Model Name | Size | Version | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
Vicuna | 7B | 🟢GEMV | 1 | 64 | 64 | 639.65 | 198.848 | 4.50 GB (19.05%)
Vicuna | 7B | 🟢GEMV | 1 | 2048 | 2048 | 1123.63 | 133.191 | 6.15 GB (26.02%)
...
Mistral | 7B | 🔵GEMM | 1 | 64 | 64 | 1093.35 | 156.317 | 4.35 GB (18.41%)
Mistral | 7B | 🔵GEMM | 1 | 2048 | 2048 | 3897.02 | 114.355 | 5.55 GB (23.48%)
Mistral | 7B | 🔵GEMM | 8 | 64 | 64 | 4199.18 | 1185.25 | 4.35 GB (18.41%)
Mistral | 7B | 🔵GEMM | 8 | 2048 | 2048 | 3661.46 | 829.754 | 16.82 GB (71.12%)
...
Mistral | 7B | 🟢GEMV | 1 | 64 | 64 | 531.99 | 188.29 | 4.28 GB (18.08%)
Mistral | 7B | 🟢GEMV | 1 | 2048 | 2048 | 903.83 | 130.66 | 5.55 GB (23.48%)
Mistral | 7B | 🔴GEMV | 8 | 64 | 64 | 897.87 | 486.46 | 4.33 GB (18.31%)
Mistral | 7B | 🔴GEMV | 8 | 2048 | 2048 | 884.22 | 411.893 | 16.82 GB (71.12%)
...
TinyLlama | 1B | 🟢GEMV | 1 | 64 | 64 | 1088.63 | 548.993 | 0.86 GB (3.62%)
TinyLlama | 1B | 🟢GEMV | 1 | 2048 | 2048 | 5178.98 | 431.468 | 2.10 GB (8.89%)
...
Llama 2 | 13B | 🔵GEMM | 1 | 64 | 64 | 820.34 | 96.74 | 8.47 GB (35.83%)
Llama 2 | 13B | 🔵GEMM | 1 | 2048 | 2048 | 2279.41 | 73.8213 | 10.28 GB (43.46%)
Llama 2 | 13B | 🔵GEMM | 3 | 64 | 64 | 1593.88 | 286.249 | 8.57 GB (36.24%)
Llama 2 | 13B | 🔵GEMM | 3 | 2048 | 2048 | 2226.7 | 189.573 | 16.90 GB (71.47%)
...
MPT | 7B | 🔵GEMM | 1 | 64 | 64 | 1079.06 | 161.344 | 3.67 GB (15.51%)
MPT | 7B | 🔵GEMM | 1 | 2048 | 2048 | 4069.78 | 114.982 | 5.87 GB (24.82%)
...
Falcon | 7B | 🔵GEMM | 1 | 64 | 64 | 1139.93 | 133.585 | 4.47 GB (18.92%)
Falcon | 7B | 🔵GEMM | 1 | 2048 | 2048 | 2850.97 | 115.73 | 6.83 GB (28.88%)
...
CodeLlama | 34B | 🔵GEMM | 1 | 64 | 64 | 681.74 | 41.01 | 19.05 GB (80.57%)
CodeLlama | 34B | 🔵GEMM | 1 | 2048 | 2048 | 1072.36 | 35.8316 | 20.26 GB (85.68%)
...
DeepSeek | 33B | 🔵GEMM | 1 | 64 | 64 | 1160.18 | 40.29 | 18.92 GB (80.00%)
DeepSeek | 33B | 🔵GEMM | 1 | 2048 | 2048 | 1012.1 | 34.0093 | 19.87 GB (84.02%)
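
As a worked example of reading these results, the snippet below compares the Mistral 7B decode speeds for the rows with prefill and decode length 64. The numbers are copied from the benchmark rows above; the helper name `speed_ratio` is only for illustration.

```python
# Decode tokens/s for Mistral 7B at prefill/decode length 64, keyed by (version, batch size).
mistral_decode = {
    ("GEMM", 1): 156.317,
    ("GEMV", 1): 188.29,
    ("GEMM", 8): 1185.25,
    ("GEMV", 8): 486.46,
}


def speed_ratio(results, numerator, denominator):
    """Return how many times faster the numerator configuration decodes."""
    return results[numerator] / results[denominator]


gemv_over_gemm_bs1 = speed_ratio(mistral_decode, ("GEMV", 1), ("GEMM", 1))
gemm_over_gemv_bs8 = speed_ratio(mistral_decode, ("GEMM", 8), ("GEMV", 8))

print(f"Batch size 1: GEMV is {gemv_over_gemm_bs1:.2f}x GEMM")
print(f"Batch size 8: GEMM is {gemm_over_gemv_bs8:.2f}x GEMV")
```

At batch size 1, GEMV comes out about 20% ahead of GEMM, while at batch size 8 GEMM is more than twice as fast, which is why the GEMV rows at batch size 8 are marked 🔴.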

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
