
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |


AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ builds on and improves the original AWQ work from MIT.

Latest News 🔥

  • [2023/12] Mixtral, LLaVa, QWen, Baichuan model support.
  • [2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
  • [2023/10] Mistral (fused modules), Bigcode, and Turing support, plus a memory bug fix (saves 2GB VRAM).
  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
  • [2023/08] PyPI package released and AutoModel class available.

Install

Prerequisites

  • NVIDIA:
    • Your NVIDIA GPU(s) must have Compute Capability 7.5 or higher; Turing and later architectures are supported.
    • Your CUDA version must be CUDA 11.8 or later.
  • AMD:
    • Your ROCm version must be ROCm 5.6 or later.

Install from PyPI

To install the latest AutoAWQ from PyPI, you need CUDA 12.1 installed.

pip install autoawq

Build from source

For CUDA 11.8, ROCm 5.6, and ROCm 5.7, you can install wheels from the release page:

pip install autoawq@https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.0/autoawq-0.2.0+cu118-cp310-cp310-linux_x86_64.whl

Or from the main branch directly:

pip install autoawq@git+https://github.com/casper-hansen/AutoAWQ.git

Or by cloning the repository and installing from source:

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

All three methods will install the latest and correct kernels for your system from AutoAWQ_Kernels.

If your system is not supported (i.e. not on the release page), you can build the kernels yourself by following the instructions in AutoAWQ_Kernels and then install AutoAWQ from source.

Usage

The examples directory shows how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of the AWQ kernels: GEMM and GEMV. Both names relate to how the underlying matrix multiplication is run. We suggest the following (a short configuration sketch follows this list):

  • GEMV (quantized): 20% faster than GEMM, only batch size 1 (not good for large context).
  • GEMM (quantized): Much faster than FP16 at batch sizes below 8 (good with large contexts).
  • FP16 (non-quantized): Recommended for the highest throughput; use vLLM.
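
The kernel version is selected at quantization time through the version field of the quantization config; everything else matches the Quantization example further down. A minimal sketch, with placeholder paths and the same calls as that example:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Placeholder paths for illustration; any supported model works the same way.
model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq-gemv'

# "version" selects the kernel layout: "GEMM" for batch sizes above 1 and large
# contexts, "GEMV" for batch size 1 decoding.
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV" }

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)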

Compute-bound vs Memory-bound

At small batch sizes with small 7B models, we are memory-bound. Generation speed is limited by how quickly the GPU can move the weights through memory rather than by compute, and that bandwidth determines how many tokens per second we can generate. Being memory-bound is also what makes quantized models faster: the weights are roughly 3x smaller, so they can be read from memory that much faster. This is different from being compute-bound, where most of the generation time is spent doing matrix multiplication.

In the compute-bound scenario, which happens at higher batch sizes, you will not gain a speed-up from a W4A16 quantized model; the overhead of dequantization slows down the overall generation. This is because AWQ quantized models store the weights in INT4 but perform the actual operations in FP16, so the weights must be converted from INT4 to FP16 on the fly during inference.
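
As a back-of-the-envelope illustration, when generation is memory-bound every new token requires reading roughly all of the weights once, so an upper bound on decode speed is memory bandwidth divided by the size of the weights. The sketch below uses assumed, illustrative numbers (a 7B-parameter model and about 1000 GB/s of GPU memory bandwidth), not measured values:

params = 7e9                # assumed 7B-parameter model (illustrative)
bandwidth_bytes_s = 1000e9  # assumed ~1000 GB/s GPU memory bandwidth (illustrative)

bytes_fp16 = params * 2.0   # FP16 stores 2 bytes per weight
bytes_int4 = params * 0.5   # INT4 stores 0.5 bytes per weight (ignoring scales/zeros overhead)

# Rough upper bounds on decode speed when memory-bound:
print(f"FP16 bound: ~{bandwidth_bytes_s / bytes_fp16:.0f} tokens/s")  # ~71 tokens/s
print(f"INT4 bound: ~{bandwidth_bytes_s / bytes_int4:.0f} tokens/s")  # ~286 tokens/s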

Fused modules

Fused modules are a large part of the speedup you get from AutoAWQ. The idea is to combine multiple layers into a single operation, making them more efficient. Fused modules are a set of custom modules that work separately from Huggingface models. They are compatible with model.generate() and other Huggingface methods, but you trade away some flexibility when you activate fused modules (a short loading sketch follows this list):

  • Fused modules are activated when you use fuse_layers=True.
  • A custom cache is implemented. It preallocates based on batch size and sequence length.
    • You cannot change the sequence length after you have created your model.
    • Reference: AutoAWQForCausalLM.from_quantized(max_seq_len=seq_len, batch_size=batch_size)
  • The main accelerator in the fused modules comes from FasterTransformer, which is only compatible with Linux.
  • The past_key_values from model.generate() are only dummy values, so they cannot be used after generation.
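
Putting the list above together, loading with fused modules looks roughly like this; the max_seq_len and batch_size values are illustrative and fix the size of the preallocated cache:

from awq import AutoAWQForCausalLM

# Example quantized checkpoint (the same one used in the Inference example below).
quant_path = "TheBloke/zephyr-7B-beta-AWQ"

# fuse_layers=True activates the fused modules. The custom cache is preallocated
# for batch_size sequences of up to max_seq_len tokens and cannot be resized later.
model = AutoAWQForCausalLM.from_quantized(
    quant_path,
    fuse_layers=True,
    max_seq_len=2048,
    batch_size=1,
)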

Examples

More examples can be found in the examples directory.

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>"""

prompt = "You're standing on the surface of the Earth. "\
        "You walk one mile south, one mile west and one mile north. "\
        "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt_template.format(prompt=prompt), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_seq_len=512
)
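
If you want the final text after streaming completes, the returned ids can be decoded with the tokenizer. A minimal sketch, assuming model.generate() returns token ids like the standard Huggingface generate():

# Assumes the Inference example above has already run and that generation_output
# holds token ids, as with the standard Huggingface generate().
output_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(output_text)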

Benchmarks

These benchmarks showcase the speed and memory usage of processing context (prefill) and generating tokens (decoding). The results include speed at various batch sizes and different versions of the AWQ kernels. We have aimed to test models fairly using the same benchmarking tool, which you can use to reproduce the results. Note that speed may vary not only between GPUs but also between CPUs; what matters most is a GPU with high memory bandwidth and a CPU with a high single-core clock speed.

  • Tested with AutoAWQ version 0.1.6
  • GPU: RTX 4090 (AMD Ryzen 9 7950X)
  • Command: python examples/benchmark.py --model_path <hf_model> --batch_size 1
  • 🟢 marks GEMV, 🔵 marks GEMM, 🔴 marks configurations to avoid

| Model Name | Size | Version | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|---|---|---|
| Vicuna | 7B | 🟢GEMV | 1 | 64 | 64 | 639.65 | 198.848 | 4.50 GB (19.05%) |
| Vicuna | 7B | 🟢GEMV | 1 | 2048 | 2048 | 1123.63 | 133.191 | 6.15 GB (26.02%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mistral | 7B | 🔵GEMM | 1 | 64 | 64 | 1093.35 | 156.317 | 4.35 GB (18.41%) |
| Mistral | 7B | 🔵GEMM | 1 | 2048 | 2048 | 3897.02 | 114.355 | 5.55 GB (23.48%) |
| Mistral | 7B | 🔵GEMM | 8 | 64 | 64 | 4199.18 | 1185.25 | 4.35 GB (18.41%) |
| Mistral | 7B | 🔵GEMM | 8 | 2048 | 2048 | 3661.46 | 829.754 | 16.82 GB (71.12%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mistral | 7B | 🟢GEMV | 1 | 64 | 64 | 531.99 | 188.29 | 4.28 GB (18.08%) |
| Mistral | 7B | 🟢GEMV | 1 | 2048 | 2048 | 903.83 | 130.66 | 5.55 GB (23.48%) |
| Mistral | 7B | 🔴GEMV | 8 | 64 | 64 | 897.87 | 486.46 | 4.33 GB (18.31%) |
| Mistral | 7B | 🔴GEMV | 8 | 2048 | 2048 | 884.22 | 411.893 | 16.82 GB (71.12%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TinyLlama | 1B | 🟢GEMV | 1 | 64 | 64 | 1088.63 | 548.993 | 0.86 GB (3.62%) |
| TinyLlama | 1B | 🟢GEMV | 1 | 2048 | 2048 | 5178.98 | 431.468 | 2.10 GB (8.89%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Llama 2 | 13B | 🔵GEMM | 1 | 64 | 64 | 820.34 | 96.74 | 8.47 GB (35.83%) |
| Llama 2 | 13B | 🔵GEMM | 1 | 2048 | 2048 | 2279.41 | 73.8213 | 10.28 GB (43.46%) |
| Llama 2 | 13B | 🔵GEMM | 3 | 64 | 64 | 1593.88 | 286.249 | 8.57 GB (36.24%) |
| Llama 2 | 13B | 🔵GEMM | 3 | 2048 | 2048 | 2226.7 | 189.573 | 16.90 GB (71.47%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| MPT | 7B | 🔵GEMM | 1 | 64 | 64 | 1079.06 | 161.344 | 3.67 GB (15.51%) |
| MPT | 7B | 🔵GEMM | 1 | 2048 | 2048 | 4069.78 | 114.982 | 5.87 GB (24.82%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Falcon | 7B | 🔵GEMM | 1 | 64 | 64 | 1139.93 | 133.585 | 4.47 GB (18.92%) |
| Falcon | 7B | 🔵GEMM | 1 | 2048 | 2048 | 2850.97 | 115.73 | 6.83 GB (28.88%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CodeLlama | 34B | 🔵GEMM | 1 | 64 | 64 | 681.74 | 41.01 | 19.05 GB (80.57%) |
| CodeLlama | 34B | 🔵GEMM | 1 | 2048 | 2048 | 1072.36 | 35.8316 | 20.26 GB (85.68%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| DeepSeek | 33B | 🔵GEMM | 1 | 64 | 64 | 1160.18 | 40.29 | 18.92 GB (80.00%) |
| DeepSeek | 33B | 🔵GEMM | 1 | 2048 | 2048 | 1012.1 | 34.0093 | 19.87 GB (84.02%) |

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • autoawq-0.2.0-cp311-cp311-win_amd64.whl (79.7 kB): CPython 3.11, Windows x86-64
  • autoawq-0.2.0-cp311-cp311-manylinux2014_x86_64.whl (79.0 kB): CPython 3.11
  • autoawq-0.2.0-cp310-cp310-win_amd64.whl (79.7 kB): CPython 3.10, Windows x86-64
  • autoawq-0.2.0-cp310-cp310-manylinux2014_x86_64.whl (79.0 kB): CPython 3.10
  • autoawq-0.2.0-cp39-cp39-win_amd64.whl (79.7 kB): CPython 3.9, Windows x86-64
  • autoawq-0.2.0-cp39-cp39-manylinux2014_x86_64.whl (79.0 kB): CPython 3.9
  • autoawq-0.2.0-cp38-cp38-win_amd64.whl (79.7 kB): CPython 3.8, Windows x86-64
  • autoawq-0.2.0-cp38-cp38-manylinux2014_x86_64.whl (79.0 kB): CPython 3.8

File details

Details for the file autoawq-0.2.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.2.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 79.7 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for autoawq-0.2.0-cp311-cp311-win_amd64.whl
  • SHA256: 4d6080539bb386a5754cc76b5081b112a93df1ee38f4c2f82e2773e9f098470b
  • MD5: a0c5425a9ee648d1100dee3973f6e6b4
  • BLAKE2b-256: 303bc74b418e8b6ac363940bc769777847602b3f59c9b91d08f9bff231fdd014


File details

Details for the file autoawq-0.2.0-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.2.0-cp311-cp311-manylinux2014_x86_64.whl
  • SHA256: ee68699fec949c4440374b402558400efe83c359e7f85a5a7979608c5eec0da3
  • MD5: 362c4dcd65b7956d730bfabbe986467f
  • BLAKE2b-256: 0bd9797e50f0164a181ed7bc39091203e0b2f2872482e038bf3a172c1d773e27


File details

Details for the file autoawq-0.2.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.2.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 79.7 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for autoawq-0.2.0-cp310-cp310-win_amd64.whl
  • SHA256: 9cfefc8e8c4d92b9b78f2f1bff61d6bb413138d2ab221029587251344d65007c
  • MD5: 3c43a34eb31e39f767d1cc6067a7867c
  • BLAKE2b-256: e1e251a1255b42f094ad9aa5b040edfbed02ee2c55e5847b35486125c2906cd0


File details

Details for the file autoawq-0.2.0-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.2.0-cp310-cp310-manylinux2014_x86_64.whl
  • SHA256: 4c9c4db6fbf23cd625a9cb5b5495777555659dc12aa7e0aba733f20c51f10005
  • MD5: 844e437b8d83a300dada5087bc38eb6a
  • BLAKE2b-256: 215bd7d3970995f71562ec5072e7a1bfdd10b72506adfedef7d9ad7446f242d1


File details

Details for the file autoawq-0.2.0-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.2.0-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 79.7 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for autoawq-0.2.0-cp39-cp39-win_amd64.whl
  • SHA256: 3c5dd45bcf23d8a0de2d79a04baf65fb2208249babeb729274c97df6218d48ae
  • MD5: 117683f3f57cfd69e053fb14dc113caa
  • BLAKE2b-256: 83592d111cfaca1c609a7cd6b078384f744aa66667cda86954effed30f539563


File details

Details for the file autoawq-0.2.0-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.2.0-cp39-cp39-manylinux2014_x86_64.whl
  • SHA256: a40c12fc4ddeabec6f04a2179e720e79563bfe29646ddf9c130bce0bcb51a760
  • MD5: d72820123e6a322b732beae1af833450
  • BLAKE2b-256: 092a76ea21fc5d63cae4838e9fe0c9ec86792d4098f5fe67df745e82863d4504


File details

Details for the file autoawq-0.2.0-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.2.0-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 79.7 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for autoawq-0.2.0-cp38-cp38-win_amd64.whl
  • SHA256: 43651382592e348c8f44bdc6796b9fa6fc5bd398f58908410376f0b7aaa2b3b3
  • MD5: e2ea179b12634f4c773eb0495342fd9e
  • BLAKE2b-256: 52eac497e9e4e1a6c7f3013792c94ea8e58e853873fd23b0ea5b1327dea50641


File details

Details for the file autoawq-0.2.0-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.2.0-cp38-cp38-manylinux2014_x86_64.whl
  • SHA256: 74d2c49780aaa7c7ba0fa4e1f196ac2dc4bdceba27e780115e7dfb32f1ba3c0a
  • MD5: aa64c989674a8f0f2e7ce85cf2d3565c
  • BLAKE2b-256: 69509b4255a0fe07144538392757b470e1a7b79cd3a734b80e9b9c75114c77a9

