AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

AutoAWQ

AutoAWQ is an easy-to-use package for 4-bit quantized models. Compared to FP16, AutoAWQ speeds up models by 2x and reduces memory requirements by 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. It builds on and improves the original AWQ work from MIT.

Latest News 🔥

  • [2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
  • [2023/08] PyPI package released and AutoModel class available

Install

Requirements:

  • Compute Capability 7.5 (sm75) or higher. Turing and later architectures are supported.
  • CUDA Toolkit 11.8 and later.
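
The two requirements above can be checked from Python with a small sketch like the following. The helper name `meets_requirements` is hypothetical, and the CUDA version that PyTorch was built against is used only as a stand-in for the installed CUDA Toolkit version, which is an assumption.

```python
MIN_CAPABILITY = (7, 5)  # sm75, from the requirements above
MIN_CUDA = (11, 8)       # CUDA Toolkit 11.8, from the requirements above


def meets_requirements(capability, cuda_version):
    """Hypothetical helper: compare a (major, minor) compute capability and
    a CUDA version string such as "11.8" against the stated minimums."""
    cuda = tuple(int(part) for part in cuda_version.split(".")[:2])
    return tuple(capability) >= MIN_CAPABILITY and cuda >= MIN_CUDA


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot query the GPU.")
    else:
        if torch.cuda.is_available() and torch.version.cuda:
            # Assumption: torch.version.cuda approximates the toolkit version.
            capability = torch.cuda.get_device_capability()
            print(meets_requirements(capability, torch.version.cuda))
        else:
            print("No CUDA device is visible to PyTorch.")
```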

Install:

  • Use pip to install AutoAWQ:
pip install autoawq

Using conda

CUDA dependencies can be hard to manage, so we recommend using conda with AutoAWQ:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq

Build from source

Build AutoAWQ from scratch

The build can take about 10 minutes, so download your model while AutoAWQ installs.

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

| Models | Sizes |
|---|---|
| LLaMA-2 | 7B/13B/70B |
| LLaMA | 7B/13B/30B/65B |
| Vicuna | 7B/13B |
| MPT | 7B/30B |
| Falcon | 7B/40B |
| OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom | 560m/3B/7B |
| GPTJ | 6B |
| Aquila | 7B |
| Aquila2 | 7B/34B |

Usage

The examples directory contains scripts that show how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names relate to how matrix multiplication runs under the hood. We suggest the following:

  • GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
  • GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
  • FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
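
The suggestions above can be condensed into a small selection helper. This is only a sketch of the guidance, not part of AutoAWQ: the function name `suggest_version` and the `long_context` flag are hypothetical, and because the guidance overlaps at batch size 8, sending that case to FP16 is an assumption.

```python
def suggest_version(batch_size, long_context=False):
    """Hypothetical helper that mirrors the suggestions in the list above.

    long_context is a caller-supplied judgment; the guidance does not define
    a token threshold for "small" versus "larger" context.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if batch_size >= 8:
        # Assumption: the overlapping case (exactly 8) goes to FP16.
        return "FP16"
    if batch_size == 1 and not long_context:
        return "GEMV"
    return "GEMM"


for bs in (1, 4, 16):
    print(bs, suggest_version(bs))
```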

Examples

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
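
To give a feel for what the three `quant_config` fields control, here is a minimal pure-Python sketch of group-wise 4-bit quantization with a zero point. It is plain round-to-nearest quantization, not the activation-aware scaling that AWQ adds on top, and it is not AutoAWQ's implementation. The function names are hypothetical, and the zero point is left unclamped for simplicity.

```python
def quantize_group(weights, w_bit=4):
    """Quantize one group of floats to integers in [0, 2**w_bit - 1]."""
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    # One scale per group; fall back to 1.0 if every weight is identical.
    scale = (w_max - w_min) / q_max or 1.0
    # zero_point=True: shift so that w_min maps to integer 0.
    zero = round(-w_min / scale)
    q = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split a row into groups of q_group_size and quantize each group."""
    groups = [row[i:i + q_group_size] for i in range(0, len(row), q_group_size)]
    return [quantize_group(group, w_bit) for group in groups]


# Tiny demo with a group size of 4 so that the output stays readable.
row = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48, -0.10, 0.21]
packed = quantize_row(row, q_group_size=4, w_bit=4)
for q, scale, zero in packed:
    print(q, round(scale, 4), zero, [round(w, 3) for w in dequantize_group(q, scale, zero)])
```

A smaller `q_group_size` means more scales and zero points to store but a tighter fit per group, which is the trade-off that setting exposes.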

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)

AutoAWQForCausalLM.from_quantized

  • quant_path: Path to folder containing model files.
  • quant_filename: The filename of the model weights or the index.json file.
  • max_new_tokens: The max sequence length, used to allocate kv-cache for fused models.
  • fuse_layers: Whether or not to use fused layers.
  • batch_size: The batch size to initialize the AWQ model with.
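
Because `max_new_tokens` and `batch_size` determine how much kv-cache is allocated for fused models, it can help to estimate that size up front. The sketch below uses the usual formula for standard multi-head attention with an FP16 cache, which is an assumption rather than a description of AutoAWQ's allocator. The function name `kv_cache_bytes` is hypothetical, and the example shape (32 layers, hidden size 4096) is that of a LLaMA-7B-style model.

```python
def kv_cache_bytes(num_layers, hidden_size, max_seq_len, batch_size, bytes_per_value=2):
    """Estimate kv-cache size: one K and one V tensor per layer, each of
    shape batch_size x max_seq_len x hidden_size, at bytes_per_value each."""
    return 2 * num_layers * batch_size * max_seq_len * hidden_size * bytes_per_value


size = kv_cache_bytes(num_layers=32, hidden_size=4096, max_seq_len=2048, batch_size=1)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

The estimate scales linearly with both the sequence length and the batch size, so doubling either one doubles the cache.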

Benchmarks

Vicuna 7B (LLaMa-2)

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 231.393 | 153.632 | 4.66 GB (19.68%) |
| 1 | 64 | 64 | 233.909 | 154.475 | 4.66 GB (19.68%) |
| 1 | 128 | 128 | 233.145 | 152.133 | 4.66 GB (19.68%) |
| 1 | 256 | 256 | 228.562 | 147.692 | 4.67 GB (19.72%) |
| 1 | 512 | 512 | 228.914 | 139.179 | 4.80 GB (20.26%) |
| 1 | 1024 | 1024 | 227.393 | 125.058 | 5.56 GB (23.48%) |
| 1 | 2048 | 2048 | 225.736 | 123.228 | 8.08 GB (34.09%) |

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 521.444 | 126.51 | 4.55 GB (19.21%) |
| 1 | 64 | 64 | 2618.88 | 125.428 | 4.57 GB (19.31%) |
| 1 | 128 | 128 | 2808.09 | 123.865 | 4.61 GB (19.44%) |
| 1 | 256 | 256 | 2807.46 | 120.779 | 4.67 GB (19.72%) |
| 1 | 512 | 512 | 2769.9 | 115.08 | 4.80 GB (20.26%) |
| 1 | 1024 | 1024 | 2640.95 | 105.493 | 5.56 GB (23.48%) |
| 1 | 2048 | 2048 | 2341.36 | 104.188 | 8.08 GB (34.09%) |
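
The throughput figures in the two Vicuna 7B tables above can be turned into a rough end-to-end time per request. The sketch below assumes that prefill and decode times simply add up and that each throughput figure holds for the whole run; the name `total_seconds` is hypothetical.

```python
# (length, prefill tokens/s, decode tokens/s) copied from the Vicuna 7B tables.
GEMV_ROWS = [
    (32, 231.393, 153.632), (64, 233.909, 154.475), (128, 233.145, 152.133),
    (256, 228.562, 147.692), (512, 228.914, 139.179), (1024, 227.393, 125.058),
    (2048, 225.736, 123.228),
]
GEMM_ROWS = [
    (32, 521.444, 126.51), (64, 2618.88, 125.428), (128, 2808.09, 123.865),
    (256, 2807.46, 120.779), (512, 2769.9, 115.08), (1024, 2640.95, 105.493),
    (2048, 2341.36, 104.188),
]


def total_seconds(prefill_len, decode_len, prefill_tps, decode_tps):
    """Estimated time to process the prompt and then generate the output."""
    return prefill_len / prefill_tps + decode_len / decode_tps


estimates = {}
for (length, v_pre, v_dec), (_, m_pre, m_dec) in zip(GEMV_ROWS, GEMM_ROWS):
    estimates[length] = {
        "GEMV": total_seconds(length, length, v_pre, v_dec),
        "GEMM": total_seconds(length, length, m_pre, m_dec),
    }

for length, row in estimates.items():
    print(f"{length:>5} tokens: GEMV {row['GEMV']:.2f} s, GEMM {row['GEMM']:.2f} s")
```

With equal prefill and decode lengths, as benchmarked here, GEMM's much faster prefill outweighs its slower decode in this estimate. Requests with a short prompt and a long generation weight the decode column more heavily.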

MPT 7B

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
  • Version: GEMV

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 187.332 | 136.765 | 3.65 GB (15.42%) |
| 1 | 64 | 64 | 241.026 | 136.476 | 3.67 GB (15.48%) |
| 1 | 128 | 128 | 239.44 | 137.599 | 3.70 GB (15.61%) |
| 1 | 256 | 256 | 233.184 | 137.02 | 3.76 GB (15.88%) |
| 1 | 512 | 512 | 233.082 | 135.633 | 3.89 GB (16.41%) |
| 1 | 1024 | 1024 | 231.504 | 122.197 | 4.40 GB (18.57%) |
| 1 | 2048 | 2048 | 228.307 | 121.468 | 5.92 GB (24.98%) |

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 557.714 | 118.567 | 3.65 GB (15.42%) |
| 1 | 64 | 64 | 2752.9 | 120.772 | 3.67 GB (15.48%) |
| 1 | 128 | 128 | 2982.67 | 119.52 | 3.70 GB (15.61%) |
| 1 | 256 | 256 | 3009.16 | 116.911 | 3.76 GB (15.88%) |
| 1 | 512 | 512 | 2901.91 | 111.607 | 3.95 GB (16.68%) |
| 1 | 1024 | 1024 | 2718.68 | 102.623 | 4.40 GB (18.57%) |
| 1 | 2048 | 2048 | 2363.61 | 101.368 | 5.92 GB (24.98%) |

Falcon 7B

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Command: python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
  • Version: GEMM

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 466.826 | 95.1413 | 4.47 GB (18.88%) |
| 1 | 64 | 64 | 1920.61 | 94.5963 | 4.48 GB (18.92%) |
| 1 | 128 | 128 | 2406.1 | 94.793 | 4.48 GB (18.92%) |
| 1 | 256 | 256 | 2521.08 | 94.1144 | 4.48 GB (18.92%) |
| 1 | 512 | 512 | 2478.28 | 93.4123 | 4.48 GB (18.92%) |
| 1 | 1024 | 1024 | 2256.22 | 94.0237 | 4.69 GB (19.78%) |
| 1 | 2048 | 2048 | 1831.71 | 94.2032 | 6.83 GB (28.83%) |

Aquila2 34B

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA A100-SXM4-40GB
  • Command: python examples/benchmark.py --model_path casperhansen/aquilachat2-34b-awq --quant_file pytorch_model.bin.index.json
  • Version: GEMM

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 36.7505 | 23.423 | 18.26 GB (46.12%) |
| 1 | 64 | 64 | 516.544 | 23.3536 | 18.26 GB (46.12%) |
| 1 | 128 | 128 | 643.968 | 23.3803 | 18.26 GB (46.12%) |
| 1 | 256 | 256 | 736.236 | 23.389 | 18.34 GB (46.32%) |
| 1 | 512 | 512 | 829.405 | 23.3889 | 18.54 GB (46.84%) |
| 1 | 1024 | 1024 | 836.023 | 23.3757 | 18.95 GB (47.87%) |
| 1 | 2048 | 2048 | 802.632 | 23.3777 | 20.25 GB (51.15%) |
| 1 | 4096 | 4096 | 722.49 | 23.4252 | 25.38 GB (64.12%) |

Reference

If you find AWQ useful or relevant to your research, you can cite the original paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
