
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |


AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, building on and extending the original AWQ work from MIT.

Latest News 🔥

  • [2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
  • [2023/08] PyPI package released and AutoModel class available

Install

Requirements:

  • GPU with compute capability 7.5 (sm75) or higher; Turing and later architectures are supported.
  • CUDA Toolkit 11.8 or later.

Install:

  • Use pip to install the autoawq package:
pip install autoawq

Using conda

CUDA dependencies can sometimes be hard to manage. We recommend using conda with AutoAWQ:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
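
Before quantizing or running inference, it can be worth verifying that PyTorch sees a CUDA device that meets the requirements above (a generic sanity check, not an AutoAWQ API):

import torch

# AutoAWQ's kernels require compute capability 7.5 (sm75, Turing) or later.
assert torch.cuda.is_available(), "No CUDA device found"
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (7, 5), f"Compute capability {major}.{minor} is below 7.5"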

Build from source

To build AutoAWQ from source, clone the repository and install it in editable mode. Build time can take around 10 minutes, so consider downloading your model while AutoAWQ installs.

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models  | Sizes
LLaMA-2 | 7B/13B/70B
LLaMA   | 7B/13B/30B/65B
Vicuna  | 7B/13B
MPT     | 7B/30B
Falcon  | 7B/40B
OPT     | 125m/1.3B/2.7B/6.7B/13B/30B
Bloom   | 560m/3B/7B
GPTJ    | 6.7B

Usage

Under examples, you can find scripts showing how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names relate to how matrix multiplication runs under the hood. We suggest the following, with a short loading sketch after the list:

  • GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
  • GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
  • FP16 (non-quantized): Best for batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
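
Both kernel versions are published as separate checkpoints, so picking one is just a matter of which model you load. A minimal sketch, assuming the -gemv path naming used by the benchmark commands later in this README:

from awq import AutoAWQForCausalLM

# GEMV checkpoint: highest tokens/s at batch size 1. The -gemv suffix and
# the weight filename follow the conventions used elsewhere in this README.
model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq-gemv",
    "awq_model_w4_g128.pt",
    fuse_layers=True,
)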

Examples

Quantization

Expect quantization to take 10-15 minutes for smaller 7B models and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
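
A note on quant_config: w_bit sets the weight bit width (4 bits here), q_group_size sets the quantization group size (128), and zero_point enables zero-point quantization. The w4 and g128 in the published weight filename awq_model_w4_g128.pt used below refer to these settings.
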
Inference
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)
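
The streamer prints tokens to stdout as they are generated. If you also want the completed text as a string, you can decode the returned ids afterwards (a small sketch; generate returns token ids as in transformers):

output_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(output_text)
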
AutoAWQForCausalLM.from_quantized
  • quant_path: Path to folder containing model files.
  • quant_filename: The filename of the model weights or the index.json file.
  • max_new_tokens: The max sequence length, used to allocate kv-cache for fused models.
  • fuse_layers: Whether or not to use fused layers.
  • batch_size: The batch size to initialize the AWQ model with.
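
Putting these parameters together, a minimal sketch of an explicit call (the values below are illustrative, not recommendations):

model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq",  # quant_path
    "awq_model_w4_g128.pt",             # quant_filename
    max_new_tokens=512,                 # sizes the kv-cache for fused models
    fuse_layers=True,
    batch_size=1,
)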

Benchmarks

Vicuna 7B (LLaMA-2)

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
1 | 32 | 32 | 231.393 | 153.632 | 4.66 GB (19.68%)
1 | 64 | 64 | 233.909 | 154.475 | 4.66 GB (19.68%)
1 | 128 | 128 | 233.145 | 152.133 | 4.66 GB (19.68%)
1 | 256 | 256 | 228.562 | 147.692 | 4.67 GB (19.72%)
1 | 512 | 512 | 228.914 | 139.179 | 4.80 GB (20.26%)
1 | 1024 | 1024 | 227.393 | 125.058 | 5.56 GB (23.48%)
1 | 2048 | 2048 | 225.736 | 123.228 | 8.08 GB (34.09%)
  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
1 | 32 | 32 | 521.444 | 126.51 | 4.55 GB (19.21%)
1 | 64 | 64 | 2618.88 | 125.428 | 4.57 GB (19.31%)
1 | 128 | 128 | 2808.09 | 123.865 | 4.61 GB (19.44%)
1 | 256 | 256 | 2807.46 | 120.779 | 4.67 GB (19.72%)
1 | 512 | 512 | 2769.9 | 115.08 | 4.80 GB (20.26%)
1 | 1024 | 1024 | 2640.95 | 105.493 | 5.56 GB (23.48%)
1 | 2048 | 2048 | 2341.36 | 104.188 | 8.08 GB (34.09%)

MPT 7B

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
1 | 32 | 32 | 187.332 | 136.765 | 3.65 GB (15.42%)
1 | 64 | 64 | 241.026 | 136.476 | 3.67 GB (15.48%)
1 | 128 | 128 | 239.44 | 137.599 | 3.70 GB (15.61%)
1 | 256 | 256 | 233.184 | 137.02 | 3.76 GB (15.88%)
1 | 512 | 512 | 233.082 | 135.633 | 3.89 GB (16.41%)
1 | 1024 | 1024 | 231.504 | 122.197 | 4.40 GB (18.57%)
1 | 2048 | 2048 | 228.307 | 121.468 | 5.92 GB (24.98%)
  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
1 | 32 | 32 | 557.714 | 118.567 | 3.65 GB (15.42%)
1 | 64 | 64 | 2752.9 | 120.772 | 3.67 GB (15.48%)
1 | 128 | 128 | 2982.67 | 119.52 | 3.70 GB (15.61%)
1 | 256 | 256 | 3009.16 | 116.911 | 3.76 GB (15.88%)
1 | 512 | 512 | 2901.91 | 111.607 | 3.95 GB (16.68%)
1 | 1024 | 1024 | 2718.68 | 102.623 | 4.40 GB (18.57%)
1 | 2048 | 2048 | 2363.61 | 101.368 | 5.92 GB (24.98%)

Falcon 7B

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Command: python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
  • Version: GEMM
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
1 | 32 | 32 | 466.826 | 95.1413 | 4.47 GB (18.88%)
1 | 64 | 64 | 1920.61 | 94.5963 | 4.48 GB (18.92%)
1 | 128 | 128 | 2406.1 | 94.793 | 4.48 GB (18.92%)
1 | 256 | 256 | 2521.08 | 94.1144 | 4.48 GB (18.92%)
1 | 512 | 512 | 2478.28 | 93.4123 | 4.48 GB (18.92%)
1 | 1024 | 1024 | 2256.22 | 94.0237 | 4.69 GB (19.78%)
1 | 2048 | 2048 | 1831.71 | 94.2032 | 6.83 GB (28.83%)

Reference

If you find AWQ useful or relevant to your research, please cite the paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}



Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • autoawq-0.1.4-cp311-cp311-win_amd64.whl (246.0 kB): CPython 3.11, Windows x86-64
  • autoawq-0.1.4-cp311-cp311-manylinux2014_x86_64.whl (20.0 MB): CPython 3.11
  • autoawq-0.1.4-cp310-cp310-win_amd64.whl (244.9 kB): CPython 3.10, Windows x86-64
  • autoawq-0.1.4-cp310-cp310-manylinux2014_x86_64.whl (20.0 MB): CPython 3.10
  • autoawq-0.1.4-cp39-cp39-win_amd64.whl (245.0 kB): CPython 3.9, Windows x86-64
  • autoawq-0.1.4-cp39-cp39-manylinux2014_x86_64.whl (20.0 MB): CPython 3.9
  • autoawq-0.1.4-cp38-cp38-win_amd64.whl (244.3 kB): CPython 3.8, Windows x86-64
  • autoawq-0.1.4-cp38-cp38-manylinux2014_x86_64.whl (20.0 MB): CPython 3.8

File details

Details for the file autoawq-0.1.4-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.4-cp311-cp311-win_amd64.whl
  • Size: 246.0 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Algorithm   | Hash digest
SHA256      | e18018416f799fcfeeba520a769a0fa9225e7cf2ce0fea1f1770be8e824dbbd9
MD5         | 83caca29ee884411878c3d941fa9d77c
BLAKE2b-256 | 52b53525e0b42c39164315891b670a2647dbbccddfe3a9a1b14c194a4b7f9ee1

File details

Details for the file autoawq-0.1.4-cp311-cp311-manylinux2014_x86_64.whl.

File hashes

Algorithm   | Hash digest
SHA256      | 522d209fee771e1413d2d29ead777aba136de865bf3b764347787d0e77363687
MD5         | 7d0ebc0833d0c98d921d2eb2064302a6
BLAKE2b-256 | 869045b47d65d0fbb2c1c8a8f5c2aa9a06d69ffe9c5f9be7df406b8cbf3349d6

File details

Details for the file autoawq-0.1.4-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.4-cp310-cp310-win_amd64.whl
  • Size: 244.9 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Algorithm   | Hash digest
SHA256      | 506064d36e322cdf207e425978db2db59ee259dbcfde8ee604f48b88b4742e72
MD5         | 2efb1a3d020e385de7b0fcc2cbf7c680
BLAKE2b-256 | 49b8867f1e43c7ffe30fc073edcb9fe072416e38a106bba2dc9a53577f6aae4b

File details

Details for the file autoawq-0.1.4-cp310-cp310-manylinux2014_x86_64.whl.

File hashes

Algorithm   | Hash digest
SHA256      | ef2d79487b4a04aa113f02959ed50d34969f777f5801075e20e2235d06e12387
MD5         | 3a20e0c20de5b1a0a40485fa71ba899f
BLAKE2b-256 | 07dc3e44612beb2747b524873959aded2d648d6d6bda9fb15ce25a6ee99edcdf

File details

Details for the file autoawq-0.1.4-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.4-cp39-cp39-win_amd64.whl
  • Size: 245.0 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Algorithm   | Hash digest
SHA256      | c92d2e1e6d45e93b72eefb8a18b6abddc860dac588ae2f57137d6a94b1b90641
MD5         | 415322ec3d95f9f966db44dfe67d7453
BLAKE2b-256 | 60222a84bbc1e7bc227a871a752fd8accfe0af23d0d1af28c831805651770035

File details

Details for the file autoawq-0.1.4-cp39-cp39-manylinux2014_x86_64.whl.

File hashes

Algorithm   | Hash digest
SHA256      | 745e531afce5f90aa2ff68dd6c82aba9117bbd3cff2f431a6b6bc972a069e4ae
MD5         | ec617e0382b5e6021060339188c40635
BLAKE2b-256 | 36aae47975884a5330317596ec2808e094c72f5e31913e54258a146bdbd65e01

File details

Details for the file autoawq-0.1.4-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.4-cp38-cp38-win_amd64.whl
  • Size: 244.3 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Algorithm   | Hash digest
SHA256      | bbc902cc9ac10d350e58da329c298921088537e1ddf933ae13a7a007f02c4494
MD5         | 161c99f799fcdd58b8740ae31089e897
BLAKE2b-256 | a10a8f0aaf5a6d960be26974eade12cb06693299498650b266debe71be815c1d

File details

Details for the file autoawq-0.1.4-cp38-cp38-manylinux2014_x86_64.whl.

File hashes

Algorithm   | Hash digest
SHA256      | 5c7195ba03b07d0945d6e2f0a89d60de05a9c6587ef41dde9d8751fab7b5893d
MD5         | c70254d3d44b11d5991e3c18de0bbf10
BLAKE2b-256 | 76733ea1b735995605cf14a3d01716be1e58d5d305fa9c6f85c439b5c66f6e5c
