AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |

Supported by

RunPod

AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ builds on and improves the original AWQ work from MIT.

Latest News 🔥

  • [2024/06] CPU inference support (x86) - thanks Intel. Cohere and Phi3 support.
  • [2024/04] StableLM and StarCoder2 support.
  • [2024/03] Gemma support.
  • [2024/02] PEFT-compatible training in FP16.
  • [2024/02] AMD ROCm support through ExLlamaV2 kernels.
  • [2024/01] Export to GGUF, ExLlamaV2 kernels, 60% faster context processing.
  • [2023/12] Mixtral, LLaVa, QWen, Baichuan model support.
  • [2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
  • [2023/10] Mistral (fused modules), Bigcode and Turing support, memory bug fix (saves 2GB VRAM).
  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
  • [2023/08] PyPI package released and AutoModel class available.

Install

Prerequisites

  • NVIDIA:
    • Your NVIDIA GPU(s) must have Compute Capability 7.5 or higher. Turing and later architectures are supported.
    • Your CUDA version must be CUDA 11.8 or later.
  • AMD:
    • Your ROCm version must be ROCm 5.6 or later.

Install from PyPI

To install the newest AutoAWQ from PyPI, you need CUDA 12.1 installed.

pip install autoawq
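
Since the PyPI wheels target CUDA 12.1, it can help to confirm which CUDA version your PyTorch build was compiled against before installing. A minimal check, assuming PyTorch is already installed:

# Sanity check (assumes PyTorch is already installed): print the CUDA version
# this torch build was compiled against and whether a usable GPU is visible.
import torch

print(torch.version.cuda)         # e.g. "12.1"
print(torch.cuda.is_available())  # True if the driver and GPU are usable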

Build from source

For CUDA 11.8, ROCm 5.6, and ROCm 5.7, you can install wheels from the release page:

pip install autoawq@https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.0/autoawq-0.2.0+cu118-cp310-cp310-linux_x86_64.whl

Or from the main branch directly:

pip install git+https://github.com/casper-hansen/AutoAWQ.git

Or by cloning the repository and installing from source:

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

All three methods will install the latest and correct kernels for your system from AutoAWQ_Kernels.

If your system is not supported (i.e. not on the release page), you can build the kernels yourself by following the instructions in AutoAWQ_Kernels and then install AutoAWQ from source.

Usage

Under examples, you can find scripts that show how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of the AWQ kernels: GEMM and GEMV. Both names refer to how the underlying matrix multiplication is executed. We suggest the following (a small config sketch follows this list):

  • GEMV (quantized): 20% faster than GEMM, only batch size 1 (not good for large context).
  • GEMM (quantized): Much faster than FP16 at batch sizes below 8 (good with large contexts).
  • FP16 (non-quantized): Recommended for highest throughput; see vLLM.
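
As an illustration, the kernel version is chosen at quantization time through the version field of the quantization config (the same fields used in the quantization example further below); the values here are only a sketch:

# Illustrative only: a GEMV-targeted quantization config. The field names match
# the quantization example later in this document.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMV",  # use "GEMM" for batch sizes above 1 or large contexts
}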

Compute-bound vs Memory-bound

At small batch sizes with small 7B models, we are memory-bound. This means we are bound by the bandwidth our GPU has to push around the weights in memory, and this is essentially what limits how many tokens per second we can generate. Being memory-bound is what makes quantized models faster because your weights are 3x smaller and can therefore be pushed around in memory much faster. This is different from being compute-bound where the main time spent during generation is doing matrix multiplication.
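
As a rough back-of-the-envelope sketch (the ~1000 GB/s bandwidth figure is an assumption, roughly that of an RTX 4090), the memory-bound decode ceiling is simply memory bandwidth divided by the size of the weights:

# Back-of-the-envelope decode ceiling for a memory-bound 7B model.
# Assumption: ~1000 GB/s of GPU memory bandwidth (roughly an RTX 4090).
params = 7e9
bandwidth_gb_s = 1000

fp16_gb = params * 2 / 1e9    # ~14 GB of FP16 weights
int4_gb = params * 0.5 / 1e9  # ~3.5 GB of 4-bit weights

print(bandwidth_gb_s / fp16_gb)  # ~71 tokens/s upper bound in FP16
print(bandwidth_gb_s / int4_gb)  # ~285 tokens/s upper bound with 4-bit weights

Real decode speeds in the benchmarks below land well under these ceilings because of kernel and framework overhead, but the gap between the two numbers is where the quantized speedup comes from.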

In the scenario of being compute-bound, which happens at higher batch sizes, you will not gain a speed-up using a W4A16 quantized model because the overhead of dequantization will slow down the overall generation. This happens because AWQ quantized models only store the weights in INT4 but perform FP16 operations during inference, so we are essentially converting INT4 -> FP16 during inference.

Fused modules

Fused modules are a large part of the speedup you get from AutoAWQ. The idea is to combine multiple layers into a single operation, making them more efficient. Fused modules are a set of custom modules that work separately from Hugging Face models. They are compatible with model.generate() and other Hugging Face methods, but activating fused modules imposes some constraints on how you can use your model (see the sketch after this list):

  • Fused modules are activated when you use fuse_layers=True.
  • A custom cache is implemented. It preallocates based on batch size and sequence length.
    • You cannot change the sequence length after you have created your model.
    • Reference: AutoAWQForCausalLM.from_quantized(max_seq_len=seq_len, batch_size=batch_size)
  • The main accelerator in the fused modules comes from FasterTransformer, which is only compatible with Linux.
  • The past_key_values from model.generate() are only dummy values, so they cannot be used after generation.
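
A minimal sketch of what this looks like in practice, reusing the parameter names from the reference above (the concrete values are illustrative):

# Illustrative sketch: load a quantized model with fused modules enabled.
# max_seq_len and batch_size size the preallocated cache and cannot be
# changed after the model has been created.
from awq import AutoAWQForCausalLM

model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/zephyr-7B-beta-AWQ",
    fuse_layers=True,
    max_seq_len=2048,
    batch_size=1,
)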

Examples

More examples can be found in the examples directory.

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "TheBloke/zephyr-7B-beta-AWQ"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
<|system|>
</s>
<|user|>
{prompt}</s>
<|assistant|>"""

prompt = "You're standing on the surface of the Earth. "\
        "You walk one mile south, one mile west and one mile north. "\
        "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt_template.format(prompt=prompt), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)

Benchmarks

These benchmarks showcase the speed and memory usage of processing context (prefill) and generating tokens (decode). The results include speed at various batch sizes and with different versions of the AWQ kernels. We have aimed to test models fairly, using the same benchmarking tool that you can use to reproduce the results. Note that speed may vary not only between GPUs but also between CPUs; what matters most is a GPU with high memory bandwidth and a CPU with a high single-core clock speed.

  • Tested with AutoAWQ version 0.1.6
  • GPU: RTX 4090 (AMD Ryzen 9 7950X)
  • Command: python examples/benchmark.py --model_path <hf_model> --batch_size 1
  • 🟢 for GEMV, 🔵 for GEMM, 🔴 for avoid using

| Model Name | Size | Version | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|------------|------|---------|------------|----------------|---------------|------------------|-----------------|---------------|
| Vicuna | 7B | 🟢GEMV | 1 | 64 | 64 | 639.65 | 198.848 | 4.50 GB (19.05%) |
| Vicuna | 7B | 🟢GEMV | 1 | 2048 | 2048 | 1123.63 | 133.191 | 6.15 GB (26.02%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mistral | 7B | 🔵GEMM | 1 | 64 | 64 | 1093.35 | 156.317 | 4.35 GB (18.41%) |
| Mistral | 7B | 🔵GEMM | 1 | 2048 | 2048 | 3897.02 | 114.355 | 5.55 GB (23.48%) |
| Mistral | 7B | 🔵GEMM | 8 | 64 | 64 | 4199.18 | 1185.25 | 4.35 GB (18.41%) |
| Mistral | 7B | 🔵GEMM | 8 | 2048 | 2048 | 3661.46 | 829.754 | 16.82 GB (71.12%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Mistral | 7B | 🟢GEMV | 1 | 64 | 64 | 531.99 | 188.29 | 4.28 GB (18.08%) |
| Mistral | 7B | 🟢GEMV | 1 | 2048 | 2048 | 903.83 | 130.66 | 5.55 GB (23.48%) |
| Mistral | 7B | 🔴GEMV | 8 | 64 | 64 | 897.87 | 486.46 | 4.33 GB (18.31%) |
| Mistral | 7B | 🔴GEMV | 8 | 2048 | 2048 | 884.22 | 411.893 | 16.82 GB (71.12%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TinyLlama | 1B | 🟢GEMV | 1 | 64 | 64 | 1088.63 | 548.993 | 0.86 GB (3.62%) |
| TinyLlama | 1B | 🟢GEMV | 1 | 2048 | 2048 | 5178.98 | 431.468 | 2.10 GB (8.89%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Llama 2 | 13B | 🔵GEMM | 1 | 64 | 64 | 820.34 | 96.74 | 8.47 GB (35.83%) |
| Llama 2 | 13B | 🔵GEMM | 1 | 2048 | 2048 | 2279.41 | 73.8213 | 10.28 GB (43.46%) |
| Llama 2 | 13B | 🔵GEMM | 3 | 64 | 64 | 1593.88 | 286.249 | 8.57 GB (36.24%) |
| Llama 2 | 13B | 🔵GEMM | 3 | 2048 | 2048 | 2226.7 | 189.573 | 16.90 GB (71.47%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| MPT | 7B | 🔵GEMM | 1 | 64 | 64 | 1079.06 | 161.344 | 3.67 GB (15.51%) |
| MPT | 7B | 🔵GEMM | 1 | 2048 | 2048 | 4069.78 | 114.982 | 5.87 GB (24.82%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Falcon | 7B | 🔵GEMM | 1 | 64 | 64 | 1139.93 | 133.585 | 4.47 GB (18.92%) |
| Falcon | 7B | 🔵GEMM | 1 | 2048 | 2048 | 2850.97 | 115.73 | 6.83 GB (28.88%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| CodeLlama | 34B | 🔵GEMM | 1 | 64 | 64 | 681.74 | 41.01 | 19.05 GB (80.57%) |
| CodeLlama | 34B | 🔵GEMM | 1 | 2048 | 2048 | 1072.36 | 35.8316 | 20.26 GB (85.68%) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| DeepSeek | 33B | 🔵GEMM | 1 | 64 | 64 | 1160.18 | 40.29 | 18.92 GB (80.00%) |
| DeepSeek | 33B | 🔵GEMM | 1 | 2048 | 2048 | 1012.1 | 34.0093 | 19.87 GB (84.02%) |

Multi-GPU

GPU: 2x NVIDIA GeForce RTX 4090

| Model | Size | Version | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|-------|------|---------|------------|----------------|---------------|------------------|-----------------|---------------|
| Mixtral | 46.7B | 🔵GEMM | 1 | 32 | 32 | 149.742 | 93.406 | 25.28 GB (53.44%) |
| Mixtral | 46.7B | 🔵GEMM | 1 | 64 | 64 | 1489.64 | 93.184 | 25.32 GB (53.53%) |
| Mixtral | 46.7B | 🔵GEMM | 1 | 128 | 128 | 2082.95 | 92.9444 | 25.33 GB (53.55%) |
| Mixtral | 46.7B | 🔵GEMM | 1 | 256 | 256 | 2428.59 | 91.5187 | 25.35 GB (53.59%) |
| Mixtral | 46.7B | 🔵GEMM | 1 | 512 | 512 | 2633.11 | 89.1457 | 25.39 GB (53.67%) |
| Mixtral | 46.7B | 🔵GEMM | 1 | 1024 | 1024 | 2598.95 | 84.6753 | 25.75 GB (54.44%) |
| Mixtral | 46.7B | 🔵GEMM | 1 | 2048 | 2048 | 2446.15 | 77.0516 | 27.98 GB (59.15%) |
| Mixtral | 46.7B | 🔵GEMM | 1 | 4096 | 4096 | 1985.78 | 77.5689 | 34.65 GB (73.26%) |

CPU

  • CPU: INTEL(R) XEON(R) PLATINUM 8592+ with 8-channel 4800MT/s memory.
  • Command: python examples/benchmark.py --model_path <hf_model> --batch_size 1

| Model | Size | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (RAM) |
|-------|------|------------|----------------|---------------|------------------|-----------------|--------------|
| Mixtral | 7B | 1 | 64 | 64 | 389.24 | 16.01 | 5.59 GB (0.02%) |
| Mixtral | 7B | 1 | 2048 | 2048 | 1412 | 17.76 | 6.29 GB (0.03%) |
| Vicuna | 7B | 1 | 64 | 64 | 346 | 18.13 | 8.18 GB (0.03%) |
| Vicuna | 7B | 1 | 2048 | 2048 | 1023.4 | 18.18 | 8.80 GB (0.04%) |
| LLaMA2 | 13B | 1 | 64 | 64 | 160.24 | 9.87 | 14.65 GB (0.06%) |
| LLaMA2 | 13B | 1 | 2048 | 2048 | 592.35 | 9.93 | 16.87 GB (0.07%) |
| Mosaicml | 7B | 1 | 64 | 64 | 433.17 | 18.79 | 4.60 GB (0.02%) |
| Mosaicml | 7B | 1 | 2048 | 2048 | 404.25 | 19.91 | 4.75 GB (0.02%) |
| Falcon | 7B | 1 | 64 | 64 | 303.16 | 14.41 | 5.18 GB (0.02%) |
| Falcon | 7B | 1 | 2048 | 2048 | 634.57 | 15.55 | 5.80 GB (0.02%) |
| CodeLlama | 34B | 1 | 64 | 64 | 153.73 | 4.23 | 29.00 GB (0.12%) |
| CodeLlama | 34B | 1 | 2048 | 2048 | 274.25 | 4.38 | 35.21 GB (0.15%) |
| Deepseek-coder | 33B | 1 | 64 | 64 | 83.08 | 4.07 | 22.16 GB (0.09%) |
| Deepseek-coder | 33B | 1 | 2048 | 2048 | 296.04 | 4.33 | 37.05 GB |

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

| File | Size | Python | Platform |
|------|------|--------|----------|
| autoawq-0.2.6-cp311-cp311-win_amd64.whl | 95.7 kB | CPython 3.11 | Windows x86-64 |
| autoawq-0.2.6-cp311-cp311-manylinux2014_x86_64.whl | 94.9 kB | CPython 3.11 | manylinux2014 x86-64 |
| autoawq-0.2.6-cp310-cp310-win_amd64.whl | 95.7 kB | CPython 3.10 | Windows x86-64 |
| autoawq-0.2.6-cp310-cp310-manylinux2014_x86_64.whl | 94.9 kB | CPython 3.10 | manylinux2014 x86-64 |
| autoawq-0.2.6-cp39-cp39-win_amd64.whl | 95.7 kB | CPython 3.9 | Windows x86-64 |
| autoawq-0.2.6-cp39-cp39-manylinux2014_x86_64.whl | 94.9 kB | CPython 3.9 | manylinux2014 x86-64 |
| autoawq-0.2.6-cp38-cp38-win_amd64.whl | 95.7 kB | CPython 3.8 | Windows x86-64 |
| autoawq-0.2.6-cp38-cp38-manylinux2014_x86_64.whl | 94.9 kB | CPython 3.8 | manylinux2014 x86-64 |

File details

The Windows wheels were uploaded via twine/4.0.2 (CPython 3.10.13), without Trusted Publishing.

| File | SHA256 | MD5 | BLAKE2b-256 |
|------|--------|-----|-------------|
| autoawq-0.2.6-cp311-cp311-win_amd64.whl | 151cb65a5a00e72f061581ad0dae8b9609f3c6e75d347cddc98266586dcdec80 | 523f46b2515c0d129f099f22675a7358 | b6d05007a1b49697506c3a495f82f935db0110b9102e4f46e0da3305ac1dfdde |
| autoawq-0.2.6-cp311-cp311-manylinux2014_x86_64.whl | 151ca1b3065facb2a4bfe1336aa50490a5b0204d6712d2d7ca3988e4c9464982 | 638eac82c5f091630e96aaae2ea5e714 | 0177f11af29e50f5b774fa5bdf1c680b8c1e50d7f5c58ff2f04f43ae74f8eaae |
| autoawq-0.2.6-cp310-cp310-win_amd64.whl | fc2c72ae513b382adc3e3f17be71ae19b8d0dc309b865c8a9454888d55886ba5 | 43de7a368a2b7aa6b639848525f91c3e | eab74c68d279f82d498ea90f1480439a1c948deb48a850ee4cf8e56a9dc6adfd |
| autoawq-0.2.6-cp310-cp310-manylinux2014_x86_64.whl | 0d7f8a4d5d6cdafd0262dcea0e532262da4be9af030991f037a5b6b8150b1a2f | cf40e6d6b0f755657742e6694534f1e6 | 401f4dc434b5a10d49aaaec3cfe6e82dd99aa49c1e3eb0bb2557b4163eb50e37 |
| autoawq-0.2.6-cp39-cp39-win_amd64.whl | 674f2d4e9d3b35958f58fc0047fbc220fbd3e698c3d998dbd62abbf82125facd | 8226082e91407c53a6dbeee01061a197 | 5cba73729a6cdec7ea9cd26251de652f0c17f841cfe277e96b2341b0ebe41076 |
| autoawq-0.2.6-cp39-cp39-manylinux2014_x86_64.whl | c21d0107472d89f8a760c40efadf574f328fe51a9efb2a1fddcecba6c41df56d | 5d54c1d8f3a27ac893cdaf72c08fbc39 | 9597dff352570e359514c5cd83c6fc594281785326115884cb3db258cf16717b |
| autoawq-0.2.6-cp38-cp38-win_amd64.whl | b43009a479649cdfdec3a4f7aa04af177ea5e8bb374db94597c8a55e0e623cbc | f690a8ca44b6af03a236fc52ef69d533 | 1ba399d101b7ac300035fb17a7b95ecbfce23b2dd9880dbfbb1b47d980d6da17 |
| autoawq-0.2.6-cp38-cp38-manylinux2014_x86_64.whl | 03cfbcc9684133ccfc23a2feec5ee936dd11ddee6d80224110e34ab7f41fe79d | 43e47e123ada1212e0f8dfbc19eb644c | c5fbf0b94ea86811a2347db89b2e7c17cc35de325f5acbacadf07aa85219234b |
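
If you want to verify a downloaded wheel against the digests above, a minimal check using only the standard library could look like this (the filename is just an example from the table):

# Illustrative: compute the SHA256 digest of a downloaded wheel and compare
# it with the value published in the table above.
import hashlib

def sha256_of(path):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("autoawq-0.2.6-cp311-cp311-win_amd64.whl"))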
