
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |


AutoAWQ is an easy-to-use package for 4-bit quantized models. It speeds up models by 2x while reducing memory requirements by 3x compared to FP16, and it implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created from, and improves upon, the original AWQ work from MIT.

Latest News 🔥

  • [2023/11] AutoAWQ has been merged into 🤗 transformers. Now includes CUDA 12.1 wheels.
  • [2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
  • [2023/08] PyPI package released and AutoModel class available

Install

Requirements:

  • Compute Capability 7.5 (sm75) or higher: Turing and later architectures are supported (a quick check follows below).
  • CUDA Toolkit 11.8 and later.
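
To verify the GPU requirement, here is a minimal check (assuming PyTorch with CUDA support is already installed):

import torch

# Reports the compute capability of the current GPU; sm75 corresponds to (7, 5).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability {major}.{minor}:",
      "supported" if (major, minor) >= (7, 5) else "not supported by AutoAWQ")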

Install:

  • Use pip to install autoawq:
pip install autoawq

Using conda

CUDA dependencies can sometimes be hard to manage. It is recommended to use conda with AutoAWQ:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq

Build from source

Build AutoAWQ from scratch

Build time can take 10 minutes. Download your model while you install AutoAWQ.

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models    Sizes
LLaMA-2   7B/13B/70B
LLaMA     7B/13B/30B/65B
Mistral   7B
Vicuna    7B/13B
MPT       7B/30B
Falcon    7B/40B
OPT       125m/1.3B/2.7B/6.7B/13B/30B
Bloom     560m/3B/7B/
GPTJ      6.7B
Aquila    7B
Aquila2   7B/34B

Usage

Under examples, you can find scripts showing how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names refer to how the underlying matrix multiplication is executed. We suggest the following (a small configuration sketch follows the list):

  • GEMV (quantized): 20% faster than GEMM for small batch sizes (max batch size 4 / small context).
  • GEMM (quantized): Much faster than FP16 at batch sizes below 8 (good with large contexts).
  • FP16 (non-quantized): Recommended for the highest throughput; use vLLM.
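
The kernel is selected at quantization time through the "version" field of quant_config (see the quantization example further below). A minimal sketch, where the batch-size cutoff of 4 is only a heuristic taken from the guidance above:

def make_quant_config(max_batch_size: int = 1) -> dict:
    # Pick the AWQ kernel based on the expected batch size; GEMV favors small
    # batches/contexts, GEMM favors larger contexts and batches up to 8.
    version = "GEMV" if max_batch_size <= 4 else "GEMM"
    return {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": version}

print(make_quant_config(1))  # {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMV'}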

Compute-bound vs Memory-bound

At small batch sizes with small 7B models, we are memory-bound. Generation speed is limited by the bandwidth the GPU has for moving the weights through memory, and this is essentially what caps how many tokens per second we can generate. Being memory-bound is what makes quantized models faster: the weights are 3x smaller and can therefore be moved through memory much faster. This is different from being compute-bound, where most of the time during generation is spent doing matrix multiplication.

In the compute-bound scenario, which occurs at higher batch sizes, you will not gain a speed-up from a W4A16 quantized model, because the overhead of dequantization slows down the overall generation. This happens because AWQ quantized models store the weights in INT4 but perform FP16 operations at inference time, so every forward pass effectively converts INT4 -> FP16.
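
As a rough illustration of the bandwidth argument, the sketch below estimates an upper bound on decode speed; the GPU bandwidth and per-weight byte sizes are illustrative assumptions, not measurements:

def decode_tokens_per_sec_upper_bound(params_billion, bytes_per_weight, bandwidth_gb_per_s):
    # In the memory-bound regime each generated token streams roughly all of the
    # weights through memory once, so bandwidth / weight size bounds tokens/s.
    weight_gb = params_billion * bytes_per_weight
    return bandwidth_gb_per_s / weight_gb

# Illustrative numbers for a 7B model on a ~900 GB/s GPU:
print(decode_tokens_per_sec_upper_bound(7, 2.0, 900))  # FP16 weights: ~64 tokens/s ceiling
print(decode_tokens_per_sec_upper_bound(7, 0.5, 900))  # INT4 weights: ~257 tokens/s ceiling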

Fused modules

Fused modules are a large part of the speedup you get from AutoAWQ. The idea is to combine multiple layers into a single operation, making them more efficient. Fused modules are a set of custom modules that work separately from Huggingface models. They are compatible with model.generate() and other Huggingface methods, but activating them comes with some restrictions on how you can use your model:

  • Fused modules are activated when you use fuse_layers=True.
  • A custom cache is implemented. It preallocates based on batch size and sequence length.
    • You cannot change the sequence length or batch size after you have created your model.
    • Reference: AutoAWQForCausalLM.from_quantized(max_new_tokens=seq_len, batch_size=batch_size); see the loading sketch after this list.
  • The main accelerator in the fused modules comes from FasterTransformer, which is only compatible with Linux.
  • The past_key_values from model.generate() are only dummy values, so they cannot be used after generation.
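
A minimal loading sketch that sizes the preallocated cache up front; the model path and numbers are illustrative, and the keyword arguments are the ones documented under AutoAWQForCausalLM.from_quantized below:

from awq import AutoAWQForCausalLM

# Illustrative: with fused modules the cache is sized at load time, so choose
# values that cover your longest expected generation.
model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq",  # example model path
    fuse_layers=True,      # activate fused modules
    max_new_tokens=512,    # max sequence length used to allocate the kv-cache
    batch_size=1,          # batch size the cache is preallocated for
)
# The sequence length and batch size above cannot be changed afterwards.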

Examples

More examples can be found in the examples directory.

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)

AutoAWQForCausalLM.from_quantized

  • quant_path: Path to folder containing model files.
  • quant_filename: The filename of the model weights or the index.json file.
  • max_new_tokens: The max sequence length, used to allocate kv-cache for fused models.
  • fuse_layers: Whether or not to use fused layers.
  • batch_size: The batch size to initialize the AWQ model with.
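
Putting the parameters above together, a hedged example of loading a locally quantized folder; the quant_filename value is an assumption and depends on how the weights were saved:

from awq import AutoAWQForCausalLM

# Illustrative call combining the documented parameters; adjust the path and
# filename to match your own quantized output.
model = AutoAWQForCausalLM.from_quantized(
    "vicuna-7b-v1.5-awq",                # quant_path: folder containing model files
    quant_filename="pytorch_model.bin",  # assumed weights filename; may differ
    max_new_tokens=512,                  # max sequence length for the fused kv-cache
    fuse_layers=True,                    # whether to use fused layers
    batch_size=1,                        # batch size to initialize the model with
)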

Benchmarks

  • GPU: RTX 3090
  • Command: python examples/benchmark.py --model_path <hf_model>
Model Name   Version  Batch Size  Prefill Length  Decode Length  Prefill tokens/s  Decode tokens/s  Memory (VRAM)
Vicuna 7B    GEMM     1           64              64             2618.88           125.428          4.57 GB (19.31%)
Vicuna 7B    GEMM     1           128             128            2808.09           123.865          4.61 GB (19.44%)
...
Vicuna 7B    GEMV     1           64              64             233.909           154.475          4.66 GB (19.68%)
Vicuna 7B    GEMV     1           128             128            233.145           152.133          4.66 GB (19.68%)
...
MPT 7B       GEMM     1           64              64             2752.9            120.772          3.67 GB (15.48%)
MPT 7B       GEMM     1           128             128            2982.67           119.52           3.70 GB (15.61%)
...
MPT 7B       GEMV     1           64              64             241.026           136.476          3.67 GB (15.48%)
MPT 7B       GEMV     1           128             128            239.44            137.599          3.70 GB (15.61%)
...
Falcon 7B    GEMM     1           64              64             1920.61           94.5963          4.48 GB (18.92%)
Falcon 7B    GEMM     1           128             128            2406.1            94.793           4.48 GB (18.92%)
...
Aquila2 34B  GEMM     1           64              64             516.544           23.3536          18.26 GB (46.12%)
Aquila2 34B  GEMM     1           128             128            643.968           23.3803          18.26 GB (46.12%)
...

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}



Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

File                                                Size      Uploaded
autoawq-0.1.6-cp311-cp311-win_amd64.whl             245.1 kB  CPython 3.11, Windows x86-64
autoawq-0.1.6-cp311-cp311-manylinux2014_x86_64.whl  20.5 MB   CPython 3.11
autoawq-0.1.6-cp310-cp310-win_amd64.whl             243.5 kB  CPython 3.10, Windows x86-64
autoawq-0.1.6-cp310-cp310-manylinux2014_x86_64.whl  20.5 MB   CPython 3.10
autoawq-0.1.6-cp39-cp39-win_amd64.whl               243.1 kB  CPython 3.9, Windows x86-64
autoawq-0.1.6-cp39-cp39-manylinux2014_x86_64.whl    20.4 MB   CPython 3.9
autoawq-0.1.6-cp38-cp38-win_amd64.whl               243.4 kB  CPython 3.8, Windows x86-64
autoawq-0.1.6-cp38-cp38-manylinux2014_x86_64.whl    20.4 MB   CPython 3.8

File details

Details for the file autoawq-0.1.6-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.6-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 245.1 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.6-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 28d37604ed24004a15362209bff715505c5fd7c95981f9fbc0e8e4d485937cb0
MD5 314f1503ac42996e1d248e001d0e7b23
BLAKE2b-256 e000038d7cd6d13b729d4b62ff13027c32a612dfb5dbc5d33f29f81cb767745f


File details

Details for the file autoawq-0.1.6-cp311-cp311-manylinux2014_x86_64.whl.


File hashes

Hashes for autoawq-0.1.6-cp311-cp311-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 43c4dea266598b34f600b16a2d6b8a6eca0d40b8e0a71d20e694d67b83db37b5
MD5 618c036a632f2080a61ec8685352a8e9
BLAKE2b-256 3c4fe9d1c16d9b4891135ececa737722a3aefc4f8b20bf61ebd6d200144a1506


File details

Details for the file autoawq-0.1.6-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.6-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 243.5 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.6-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 cfd76b680b869017cf828367a0033fd7d4634dceeb372228ab808a98179f2d47
MD5 4f7cdeb42c20f31c2fc4cd67fc41d7fa
BLAKE2b-256 d1c9ff85e54da10ad3c2b7cff46ce2a2fdae87d3f136cdd36c68946762340455


File details

Details for the file autoawq-0.1.6-cp310-cp310-manylinux2014_x86_64.whl.


File hashes

Hashes for autoawq-0.1.6-cp310-cp310-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 74e1c0d2afc18ace2d2651cc34679349411c7eefedb422f12a70a1be0e2957f2
MD5 6e952baf264ca2b25a85e60870541ffc
BLAKE2b-256 6f62f7a90da6fa61a1ea5a463ad97a9999292d51f9af1c341e1c2ad8b87639e2


File details

Details for the file autoawq-0.1.6-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.6-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 243.1 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.6-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 a1c60215c5936f55141bb931cfea9b9202abdf6c25a09b6acc303a5ea706dc12
MD5 6fb2ce6640c3acb5438f4d01a1fb3c05
BLAKE2b-256 87ac717b59a84e6b00a01bd8653f3331341c84a667bddc97c179465769c68eec


File details

Details for the file autoawq-0.1.6-cp39-cp39-manylinux2014_x86_64.whl.


File hashes

Hashes for autoawq-0.1.6-cp39-cp39-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1e6fe32856461d31fcd42fa0c0f1532981dea084aa98b5a4fd51a0c5ac7c25e6
MD5 623fb25df1f80c7134cfa32298e8d9f7
BLAKE2b-256 b72e3da6bd6314e68ce6a6951c69a9be3c55e5038cb4f05f1838eabb7003945a


File details

Details for the file autoawq-0.1.6-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.6-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 243.4 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.6-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 6194cc8eea954cfb936e8b4d53b94128eca161648f0b9a5de86a846ca1b42256
MD5 15a21b7ede62f56d363dcfa5d37bdd60
BLAKE2b-256 e620160ead3bcb53fdd8907652ce7aae0c3f7a8fd439146f4dcddb134d200ca2


File details

Details for the file autoawq-0.1.6-cp38-cp38-manylinux2014_x86_64.whl.


File hashes

Hashes for autoawq-0.1.6-cp38-cp38-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 aa6dda6ce5dc1933fd3b2a03a65943238232a3669466b5b9045d43afa576da85
MD5 3524bffa5925c608c69938e20e2bf914
BLAKE2b-256 4cb819c9755499d5c72a2e70cad8b5f90a4000ee6d42cc66cfae52d84ea7234b

