
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

| Roadmap | Examples | Issues: Help Wanted |


AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. It is based on, and improves upon, the original AWQ work from MIT.

Latest News 🔥

  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
  • [2023/08] PyPI package released and AutoModel class available.

Install

Requirements:

  • GPU with Compute Capability 8.0 (sm80) or higher: Ampere and later architectures are supported (a quick check is sketched below).
  • CUDA Toolkit 11.8 or later.
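
If you are unsure whether your GPU qualifies, a quick check with PyTorch (which AutoAWQ depends on) looks like this:

# Check that the GPU meets the sm80 requirement (assumes PyTorch with CUDA is installed)
import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
assert (major, minor) >= (8, 0), "AutoAWQ requires Compute Capability 8.0 (sm80) or newer"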

Install:

  • Install from PyPI with pip:
pip install autoawq

Using conda

CUDA dependencies can be hard to manage. Using conda with AutoAWQ is recommended:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq

Build from source

Building AutoAWQ from source can take around 10 minutes; consider downloading your model while it compiles:

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

| Models  | Sizes                       |
|---------|-----------------------------|
| LLaMA-2 | 7B/13B/70B                  |
| LLaMA   | 7B/13B/30B/65B              |
| Vicuna  | 7B/13B                      |
| MPT     | 7B/30B                      |
| Falcon  | 7B/40B                      |
| OPT     | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom   | 560m/3B/7B                  |
| GPTJ    | 6.7B                        |

Usage

Under examples, you can find scripts showing how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names relate to how matrix multiplication runs under the hood. We suggest the following (a configuration sketch follows the list):

  • GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
  • GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
  • FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
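
To pin one of the quantized kernels, newer AutoAWQ releases let you choose the kernel at quantization time. The "version" key below is an assumption based on later releases; check the bundled examples for the exact option in your installed version:

# Sketch: choosing the quantized kernel at quantization time.
# The "version" key is assumed from newer AutoAWQ releases; verify it
# against the examples shipped with your installed version.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # or "GEMV" for batch size 1 / small contexts
}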

Examples

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)
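
Besides the streamed text, generate returns the generated token ids (as with Transformers' generate), so the full output can also be decoded explicitly:

# Decode the full generated sequence in addition to the streamed output
text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(text)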

AutoAWQForCausalLM.from_quantized

  • quant_path: Path to the folder containing the model files.
  • quant_filename: The filename of the model weights or the index.json file.
  • max_new_tokens: The maximum sequence length, used to allocate the kv-cache for fused models.
  • fuse_layers: Whether or not to use fused layers.
  • batch_size: The batch size to initialize the AWQ model with.
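
Putting these arguments together, a fully explicit call might look like this sketch (paths reused from the inference example above; treat it as illustrative rather than an authoritative signature):

# Sketch: from_quantized with the documented arguments spelled out
model = AutoAWQForCausalLM.from_quantized(
    "casperhansen/vicuna-7b-v1.5-awq",  # quant_path
    "awq_model_w4_g128.pt",             # quant_filename
    max_new_tokens=512,                 # sizes the kv-cache for fused modules
    fuse_layers=True,
    batch_size=1,
)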

Benchmarks

Vicuna 7B (LLaMA-2)

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 231.393 | 153.632 | 4.66 GB (19.68%) |
| 1 | 64 | 64 | 233.909 | 154.475 | 4.66 GB (19.68%) |
| 1 | 128 | 128 | 233.145 | 152.133 | 4.66 GB (19.68%) |
| 1 | 256 | 256 | 228.562 | 147.692 | 4.67 GB (19.72%) |
| 1 | 512 | 512 | 228.914 | 139.179 | 4.80 GB (20.26%) |
| 1 | 1024 | 1024 | 227.393 | 125.058 | 5.56 GB (23.48%) |
| 1 | 2048 | 2048 | 225.736 | 123.228 | 8.08 GB (34.09%) |
  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 521.444 | 126.51 | 4.55 GB (19.21%) |
| 1 | 64 | 64 | 2618.88 | 125.428 | 4.57 GB (19.31%) |
| 1 | 128 | 128 | 2808.09 | 123.865 | 4.61 GB (19.44%) |
| 1 | 256 | 256 | 2807.46 | 120.779 | 4.67 GB (19.72%) |
| 1 | 512 | 512 | 2769.9 | 115.08 | 4.80 GB (20.26%) |
| 1 | 1024 | 1024 | 2640.95 | 105.493 | 5.56 GB (23.48%) |
| 1 | 2048 | 2048 | 2341.36 | 104.188 | 8.08 GB (34.09%) |

MPT 7B

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 187.332 | 136.765 | 3.65 GB (15.42%) |
| 1 | 64 | 64 | 241.026 | 136.476 | 3.67 GB (15.48%) |
| 1 | 128 | 128 | 239.44 | 137.599 | 3.70 GB (15.61%) |
| 1 | 256 | 256 | 233.184 | 137.02 | 3.76 GB (15.88%) |
| 1 | 512 | 512 | 233.082 | 135.633 | 3.89 GB (16.41%) |
| 1 | 1024 | 1024 | 231.504 | 122.197 | 4.40 GB (18.57%) |
| 1 | 2048 | 2048 | 228.307 | 121.468 | 5.92 GB (24.98%) |
  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 557.714 | 118.567 | 3.65 GB (15.42%) |
| 1 | 64 | 64 | 2752.9 | 120.772 | 3.67 GB (15.48%) |
| 1 | 128 | 128 | 2982.67 | 119.52 | 3.70 GB (15.61%) |
| 1 | 256 | 256 | 3009.16 | 116.911 | 3.76 GB (15.88%) |
| 1 | 512 | 512 | 2901.91 | 111.607 | 3.95 GB (16.68%) |
| 1 | 1024 | 1024 | 2718.68 | 102.623 | 4.40 GB (18.57%) |
| 1 | 2048 | 2048 | 2363.61 | 101.368 | 5.92 GB (24.98%) |

Falcon 7B

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---|---|---|---|---|---|
| 1 | 32 | 32 | 466.826 | 95.1413 | 4.47 GB (18.88%) |
| 1 | 64 | 64 | 1920.61 | 94.5963 | 4.48 GB (18.92%) |
| 1 | 128 | 128 | 2406.1 | 94.793 | 4.48 GB (18.92%) |
| 1 | 256 | 256 | 2521.08 | 94.1144 | 4.48 GB (18.92%) |
| 1 | 512 | 512 | 2478.28 | 93.4123 | 4.48 GB (18.92%) |
| 1 | 1024 | 1024 | 2256.22 | 94.0237 | 4.69 GB (19.78%) |
| 1 | 2048 | 2048 | 1831.71 | 94.2032 | 6.83 GB (28.83%) |

Reference

If you find AWQ useful or relevant to your research, you can cite the AWQ paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}


Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

| File | Size | Uploaded for |
|---|---|---|
| autoawq-0.1.2-cp311-cp311-win_amd64.whl | 224.5 kB | CPython 3.11, Windows x86-64 |
| autoawq-0.1.2-cp311-cp311-manylinux2014_x86_64.whl | 17.5 MB | CPython 3.11 |
| autoawq-0.1.2-cp310-cp310-win_amd64.whl | 223.4 kB | CPython 3.10, Windows x86-64 |
| autoawq-0.1.2-cp310-cp310-manylinux2014_x86_64.whl | 17.4 MB | CPython 3.10 |
| autoawq-0.1.2-cp39-cp39-win_amd64.whl | 223.5 kB | CPython 3.9, Windows x86-64 |
| autoawq-0.1.2-cp39-cp39-manylinux2014_x86_64.whl | 17.4 MB | CPython 3.9 |
| autoawq-0.1.2-cp38-cp38-win_amd64.whl | 223.4 kB | CPython 3.8, Windows x86-64 |
| autoawq-0.1.2-cp38-cp38-manylinux2014_x86_64.whl | 17.4 MB | CPython 3.8 |

File details

Details for the file autoawq-0.1.2-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.2-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 224.5 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.2-cp311-cp311-win_amd64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | b90f7a083364f9ecfe8edcebd87f873a6a4ac127fca5d0e11cfef1ab4312dba8 |
| MD5 | c3932f9e0e64562836dcf6fba7e26734 |
| BLAKE2b-256 | 00b6cce851c3cc088d0b322c6e8954845c800efb739dde0acaba59b611ad7f4d |
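
As one way to use these digests, a downloaded wheel can be verified locally before installing, for example:

# Verify a downloaded wheel against the published SHA256 digest above
import hashlib

expected = "b90f7a083364f9ecfe8edcebd87f873a6a4ac127fca5d0e11cfef1ab4312dba8"
with open("autoawq-0.1.2-cp311-cp311-win_amd64.whl", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == expected, "hash mismatch - do not install this file"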


File details

Details for the file autoawq-0.1.2-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.1.2-cp311-cp311-manylinux2014_x86_64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | a4c867f87a223f9ae4dadb04fa17b4b27298df0b9ac106e6371765bcd630403f |
| MD5 | de86ca99b988fbe55b3d0eac7cdcd64c |
| BLAKE2b-256 | 9931f75902b88d88c1997a8845717b523770b8507bccf41adbc059a692bd9eb0 |


File details

Details for the file autoawq-0.1.2-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.2-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 223.4 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.2-cp310-cp310-win_amd64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 7f6b21026d3d868a7ba84964616c0b7fc3bfde593ad7f9029bcd8d3240adf386 |
| MD5 | 6814dce8324c1d750a0e2591aa79f6cc |
| BLAKE2b-256 | 892ca63c2787b69f03b985051ff2a2926b113b0b084b07f2d1d277e84485f8b8 |


File details

Details for the file autoawq-0.1.2-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.1.2-cp310-cp310-manylinux2014_x86_64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 97568558d943de0922aef93e53b7006a44c40d7181e6a5892130b9a3a5d9c9c3 |
| MD5 | cca619c377739c9983e6a13286135842 |
| BLAKE2b-256 | 44f29e167e4cfe1def13fd4b5c5cbe0cbc8cb36aa9c39bb3763bbcea117d6e11 |


File details

Details for the file autoawq-0.1.2-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.2-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 223.5 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.2-cp39-cp39-win_amd64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | fdb017015bdda59e84907b384109daa5bda8e8b0d36cdb86b9633b0db787e4d6 |
| MD5 | 8a90dae711502256cbe90ba40906b0ff |
| BLAKE2b-256 | d5578d36f971dd2ad6da9a4abdcf657a5a02f4cb90eca8222776e77e0faa74c2 |


File details

Details for the file autoawq-0.1.2-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.1.2-cp39-cp39-manylinux2014_x86_64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 46452b116254cc9a3bcc9888ceeb3dd4c8a128bf30d42cc3afe367f4285e13cb |
| MD5 | f0a7180d4e3c910c7177de4170a0c524 |
| BLAKE2b-256 | 060800aaa831c1534ea930416b2fc0b170165394aa12fe1d84022fd7f84fa4f7 |


File details

Details for the file autoawq-0.1.2-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.2-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 223.4 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.2-cp38-cp38-win_amd64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 6abe7647403316309ea37d71fd147b2c85fef38144bf6b2bfd00fda78f099ad3 |
| MD5 | b8d6026457808485321d55494dd85931 |
| BLAKE2b-256 | 44d2f4ad135568af17acbe43ec9693adc008227f59331ae5c85ef3b5a17dc658 |


File details

Details for the file autoawq-0.1.2-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.1.2-cp38-cp38-manylinux2014_x86_64.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | a1a4bc3c05f774145afe58ed3d90d56f16891f4f4935997ea709da89d57f77d7 |
| MD5 | 442eefb3365bc0664e8701c1d7598d9e |
| BLAKE2b-256 | fe942df24a654400ce74289d453fca824bd1acd04c85edb7a4705082ab278fa6 |

