
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |


AutoAWQ is an easy-to-use package for 4-bit quantized models. Compared to FP16, AutoAWQ speeds up models by 2x while reducing memory requirements by 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, and builds on and extends the original work from MIT.

Latest News 🔥

  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
  • [2023/08] PyPI package released and the AutoModel class made available.

Install

Requirements:

  • GPU with compute capability 8.0 (sm80) or higher; Ampere and later architectures are supported (a quick check is sketched below).
  • CUDA Toolkit 11.8 or later.
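
If you are unsure whether your GPU qualifies, a minimal check with PyTorch (assuming torch is already installed) looks like this:

import torch

# AWQ kernels require compute capability 8.0 (Ampere, sm80) or newer.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
assert major >= 8, "AutoAWQ requires compute capability 8.0 or higher"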

Install:

  • Use pip to install the awq package:
pip install autoawq

Using conda

CUDA dependencies can be hard to manage. We recommend using conda with AutoAWQ:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq

Build from source

Build AutoAWQ from source as follows. The build can take around 10 minutes, so you may want to download your model while AutoAWQ compiles.

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

| Models  | Sizes                       |
|---------|-----------------------------|
| LLaMA-2 | 7B/13B/70B                  |
| LLaMA   | 7B/13B/30B/65B              |
| Vicuna  | 7B/13B                      |
| MPT     | 7B/30B                      |
| Falcon  | 7B/40B                      |
| OPT     | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom   | 560m/3B/7B                  |
| GPT-J   | 6.7B                        |

Usage

The examples directory contains scripts showing how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two quantized kernel versions of AWQ: GEMM and GEMV. Both names refer to how the underlying matrix multiplication runs. We suggest the following (a sketch of selecting a version follows this list):

  • GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
  • GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
  • FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
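
On releases that expose the kernel choice through the quantization config (newer AutoAWQ versions accept a "version" key; treat this key as an assumption for your specific release), switching between GEMM and GEMV is a one-line change:

# Assumption: the "version" key selects the kernel ("GEMM" or "GEMV").
# Omit it if your installed release does not support it; GEMM is the common default.
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMV" }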

Examples

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
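
To sanity-check the claimed ~3x memory savings, you can compare the on-disk size of the quantized checkpoint against a local copy of the FP16 model. A minimal sketch follows; the quantized path is the one saved above, while the FP16 path is a hypothetical local download:

import os

def dir_size_gb(path):
    # Sum the sizes of all files under `path`, in GiB.
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    ) / 1024**3

print(f"quantized: {dir_size_gb('vicuna-7b-v1.5-awq'):.2f} GiB")
print(f"fp16:      {dir_size_gb('vicuna-7b-v1.5-fp16'):.2f} GiB")  # hypothetical local FP16 copy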

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)
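
When not streaming, the generated token ids can be decoded back to text with the tokenizer (standard transformers usage, continuing the example above):

# Decode the full sequence (prompt + completion) to text.
print(tokenizer.decode(generation_output[0], skip_special_tokens=True))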

AutoAWQForCausalLM.from_quantized accepts the following arguments (a usage sketch follows this list):

  • quant_path: Path to the folder containing the model files.
  • quant_filename: The filename of the model weights or the index.json file.
  • max_new_tokens: The max sequence length, used to allocate kv-cache for fused models.
  • fuse_layers: Whether or not to use fused layers.
  • batch_size: The batch size to initialize the AWQ model with.
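
A minimal usage sketch spelling these out as keyword arguments (assuming the parameter names in your installed release match the list above):

model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq",  # quant_path
    "awq_model_w4_g128.pt",             # quant_filename
    max_new_tokens=512,                 # controls kv-cache allocation for fused models
    fuse_layers=True,
    batch_size=1,
)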

Benchmarks

Vicuna 7B (LLaMA-2)

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 231.393          | 153.632         | 4.66 GB (19.68%) |
| 1          | 64             | 64            | 233.909          | 154.475         | 4.66 GB (19.68%) |
| 1          | 128            | 128           | 233.145          | 152.133         | 4.66 GB (19.68%) |
| 1          | 256            | 256           | 228.562          | 147.692         | 4.67 GB (19.72%) |
| 1          | 512            | 512           | 228.914          | 139.179         | 4.80 GB (20.26%) |
| 1          | 1024           | 1024          | 227.393          | 125.058         | 5.56 GB (23.48%) |
| 1          | 2048           | 2048          | 225.736          | 123.228         | 8.08 GB (34.09%) |
  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 521.444          | 126.51          | 4.55 GB (19.21%) |
| 1          | 64             | 64            | 2618.88          | 125.428         | 4.57 GB (19.31%) |
| 1          | 128            | 128           | 2808.09          | 123.865         | 4.61 GB (19.44%) |
| 1          | 256            | 256           | 2807.46          | 120.779         | 4.67 GB (19.72%) |
| 1          | 512            | 512           | 2769.9           | 115.08          | 4.80 GB (20.26%) |
| 1          | 1024           | 1024          | 2640.95          | 105.493         | 5.56 GB (23.48%) |
| 1          | 2048           | 2048          | 2341.36          | 104.188         | 8.08 GB (34.09%) |

MPT 7B

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
  • Version: GEMV
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 187.332          | 136.765         | 3.65 GB (15.42%) |
| 1          | 64             | 64            | 241.026          | 136.476         | 3.67 GB (15.48%) |
| 1          | 128            | 128           | 239.44           | 137.599         | 3.70 GB (15.61%) |
| 1          | 256            | 256           | 233.184          | 137.02          | 3.76 GB (15.88%) |
| 1          | 512            | 512           | 233.082          | 135.633         | 3.89 GB (16.41%) |
| 1          | 1024           | 1024          | 231.504          | 122.197         | 4.40 GB (18.57%) |
| 1          | 2048           | 2048          | 228.307          | 121.468         | 5.92 GB (24.98%) |
  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 557.714          | 118.567         | 3.65 GB (15.42%) |
| 1          | 64             | 64            | 2752.9           | 120.772         | 3.67 GB (15.48%) |
| 1          | 128            | 128           | 2982.67          | 119.52          | 3.70 GB (15.61%) |
| 1          | 256            | 256           | 3009.16          | 116.911         | 3.76 GB (15.88%) |
| 1          | 512            | 512           | 2901.91          | 111.607         | 3.95 GB (16.68%) |
| 1          | 1024           | 1024          | 2718.68          | 102.623         | 4.40 GB (18.57%) |
| 1          | 2048           | 2048          | 2363.61          | 101.368         | 5.92 GB (24.98%) |

Falcon 7B

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Command: python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
  • Version: GEMM
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)    |
|------------|----------------|---------------|------------------|-----------------|------------------|
| 1          | 32             | 32            | 466.826          | 95.1413         | 4.47 GB (18.88%) |
| 1          | 64             | 64            | 1920.61          | 94.5963         | 4.48 GB (18.92%) |
| 1          | 128            | 128           | 2406.1           | 94.793          | 4.48 GB (18.92%) |
| 1          | 256            | 256           | 2521.08          | 94.1144         | 4.48 GB (18.92%) |
| 1          | 512            | 512           | 2478.28          | 93.4123         | 4.48 GB (18.92%) |
| 1          | 1024           | 1024          | 2256.22          | 94.0237         | 4.69 GB (19.78%) |
| 1          | 2048           | 2048          | 1831.71          | 94.2032         | 6.83 GB (28.83%) |
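
To turn these throughput numbers into a rough end-to-end latency estimate, divide each phase's token count by its tokens/s and add the results (ignoring other overheads); for example, using the Vicuna 7B GEMM row for 2048/2048 above:

# Rough latency estimate from the benchmark tables (ignores setup overhead).
def generation_seconds(prefill_len, decode_len, prefill_tps, decode_tps):
    return prefill_len / prefill_tps + decode_len / decode_tps

# Vicuna 7B, GEMM, batch size 1, 2048 prefill + 2048 decode:
print(f"{generation_seconds(2048, 2048, 2341.36, 104.188):.1f} s")  # ~20.5 s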

Reference

If you find AWQ useful or relevant to your research, please cite the paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release.

Built Distributions

| File                                               | Size     | Python       | Platform             |
|----------------------------------------------------|----------|--------------|----------------------|
| autoawq-0.1.1-cp311-cp311-win_amd64.whl            | 223.6 kB | CPython 3.11 | Windows x86-64       |
| autoawq-0.1.1-cp311-cp311-manylinux2014_x86_64.whl | 17.5 MB  | CPython 3.11 | manylinux2014 x86-64 |
| autoawq-0.1.1-cp310-cp310-win_amd64.whl            | 222.6 kB | CPython 3.10 | Windows x86-64       |
| autoawq-0.1.1-cp310-cp310-manylinux2014_x86_64.whl | 17.4 MB  | CPython 3.10 | manylinux2014 x86-64 |
| autoawq-0.1.1-cp39-cp39-win_amd64.whl              | 222.7 kB | CPython 3.9  | Windows x86-64       |
| autoawq-0.1.1-cp39-cp39-manylinux2014_x86_64.whl   | 17.4 MB  | CPython 3.9  | manylinux2014 x86-64 |
| autoawq-0.1.1-cp38-cp38-win_amd64.whl              | 222.6 kB | CPython 3.8  | Windows x86-64       |
| autoawq-0.1.1-cp38-cp38-manylinux2014_x86_64.whl   | 17.4 MB  | CPython 3.8  | manylinux2014 x86-64 |

File details

Details for the file autoawq-0.1.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.1-cp311-cp311-win_amd64.whl
  • Size: 223.6 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | ee553872fc6f4fa8aedc769ce790a1f8dd34b18fa1fd6231ae809df8eba0509b |
| MD5         | 6229fe8c7819753392d9bcc22f66f3d8                                 |
| BLAKE2b-256 | 577103beec45208f5e0551819de31827dd9d187d0b198fe1a52569d066db5103 |
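
To verify a downloaded wheel against the SHA256 digest above before installing it, a minimal standard-library sketch (assuming the wheel sits in the current directory):

import hashlib

# Expected SHA256 for autoawq-0.1.1-cp311-cp311-win_amd64.whl (from the table above).
expected = "ee553872fc6f4fa8aedc769ce790a1f8dd34b18fa1fd6231ae809df8eba0509b"

with open("autoawq-0.1.1-cp311-cp311-win_amd64.whl", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == expected, "hash mismatch: do not install this file"
print("SHA256 verified")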


File details

Details for the file autoawq-0.1.1-cp311-cp311-manylinux2014_x86_64.whl.

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 8b0e8ced116fad64b7bfb61b44d72e46a8325ec11ed4cf924abc0fca7f866320 |
| MD5         | 34afdf10f0d66a994d0c681aa74df88a                                 |
| BLAKE2b-256 | e07219d1c44cc0fa1ed47d7a5bcd423f34b30420d49783671801c95f2648f78f |

File details

Details for the file autoawq-0.1.1-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.1-cp310-cp310-win_amd64.whl
  • Size: 222.6 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | b8df766ff3c44c269be81ab98b93299119dfbafd16dfb878e7aa61d79f474234 |
| MD5         | 8c126c060e5797eeeb74224e094ce2aa                                 |
| BLAKE2b-256 | e7bdf1d62360db7809932dc24cf574bc669d3e20b666818179ed4ff6cfbc7fdf |

File details

Details for the file autoawq-0.1.1-cp310-cp310-manylinux2014_x86_64.whl.

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | d66410614fd8b21b4e4244125250d1bb328eef0bffbf7bc83fd786df78c67254 |
| MD5         | 7804d753cda08a5e1a6d9eeda190808e                                 |
| BLAKE2b-256 | bdfdf605838479bd57c2ddb18ce0fb332a627b5e6506a81cd58b4ee4f4970b5d |

File details

Details for the file autoawq-0.1.1-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.1-cp39-cp39-win_amd64.whl
  • Size: 222.7 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 1c902d72aa92e7b3a9c3b84706293f35e85d33305ed1d04faa44ff0f747c1cf5 |
| MD5         | cd85d9c1521c5bb5e614bad50e76dae1                                 |
| BLAKE2b-256 | 97143bc0beaf53c1bd66cfaa52fc173e72eeadac4f33faf92d65eed7cee415d2 |

File details

Details for the file autoawq-0.1.1-cp39-cp39-manylinux2014_x86_64.whl.

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 90e18298c2ca52a9742147c835ac4b5ccaded490e6bca072f6965770530e266c |
| MD5         | b071c9fddec7657a3c13e29af3c5866f                                 |
| BLAKE2b-256 | 2c7e2460a4cc4f03f7d27e3707ba4799a3f422bcdf0cdef5debfaca49d9143ee |

File details

Details for the file autoawq-0.1.1-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.1-cp38-cp38-win_amd64.whl
  • Size: 222.6 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | f6187867d3eab3d8a47409b1e53ded9bfee62825ee6d356b3c08895da37b533f |
| MD5         | 0e4e740c87506487127801d79dd21ec0                                 |
| BLAKE2b-256 | ba4edc39f3215f71c106c47693fd26c39524dd5aad12321b57356afda8d312cf |

File details

Details for the file autoawq-0.1.1-cp38-cp38-manylinux2014_x86_64.whl.

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | e05307a931908e04b7d7670e451ddb25abeb04ba233c0394262298329d9ed6fc |
| MD5         | 9090a880f123166640f6482860ae6375                                 |
| BLAKE2b-256 | 19404041ca9df2227f1e2c58d1a960d0554e6157cb7053197b0039ff4a7261d0 |
