
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

AutoAWQ is a package that implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ speeds up your LLM by at least 2x compared to FP16. It builds on and extends the original AWQ work from MIT.

Roadmap:

  • Publish pip package
  • Refactor quantization code
  • Support more models
  • Optimize the speed of models

Install

Requirements:

  • Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
  • CUDA Toolkit 11.8 and later.

Install:

  • Use pip to install awq
pip install awq

Build source

Build AutoAWQ from scratch
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models    Sizes
LLaMA-2   7B/13B/70B
LLaMA     7B/13B/30B/65B
Vicuna    7B/13B
MPT       7B/30B
Falcon    7B/40B
OPT       125m/1.3B/2.7B/6.7B/13B/30B
Bloom     560m/3B/7B
LLaVA-v0  13B
GPTJ      6.7B

Usage

Below you will find examples of how to quantize a model and run inference.

Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
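As a rough sanity check on what the 4-bit setting buys you, the weight memory of a 7B model can be estimated by hand. This is a back-of-the-envelope sketch, not a measurement: it ignores unquantized layers such as embeddings, and the per-group overhead figure is an assumption derived from the q_group_size and zero_point settings above (one FP16 scale plus one 4-bit zero point per group of 128 weights).

```python
# Rough memory estimate for 4-bit quantization (illustrative only).

def model_size_gb(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # ~7B parameters (e.g. Vicuna-7B)

fp16 = model_size_gb(n, 16)
# 4-bit weights plus per-group overhead: a 16-bit scale and a 4-bit
# zero point per group of 128 weights adds ~20/128 extra bits per weight.
int4 = model_size_gb(n, 4 + 20 / 128)

print(f"FP16: {fp16:.1f} GB, INT4 (g128): {int4:.1f} GB")
```

The roughly 4x reduction in weight memory is also why 13B and 30B models that go OOM at FP16 in the benchmark tables below fit after quantization.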

Inference

Run inference on a quantized model from Huggingface:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
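The model.generate(...) call above is elided; a fuller sketch might look like the following. The prompt text and sampling parameters here are illustrative assumptions, the input must be on a CUDA device, and AutoAWQ's generate forwards to the underlying transformers generate():

```python
prompt = "What is weight quantization?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Forwards to transformers' generate(); sampling settings are placeholders.
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=64,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```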

Benchmarks

Benchmark speeds vary from server to server and also depend on your CPU. To minimize latency, rent a GPU/CPU combination with high memory bandwidth on both and a high single-core CPU speed.

Model        GPU    FP16 latency (ms)  INT4 latency (ms)  Speedup
LLaMA-2-7B   4090   19.97              8.66               2.31x
LLaMA-2-13B  4090   OOM                13.54              --
Vicuna-7B    4090   19.09              8.61               2.22x
Vicuna-13B   4090   OOM                12.17              --
MPT-7B       4090   17.09              12.58              1.36x
MPT-30B      4090   OOM                23.54              --
Falcon-7B    4090   29.91              19.84              1.51x
LLaMA-2-7B   A6000  27.14              12.44              2.18x
LLaMA-2-13B  A6000  47.28              20.28              2.33x
Vicuna-7B    A6000  26.06              12.43              2.10x
Vicuna-13B   A6000  44.91              17.30              2.60x
MPT-7B       A6000  22.79              16.87              1.35x
MPT-30B      A6000  OOM                31.57              --
Falcon-7B    A6000  39.44              27.34              1.44x
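The Speedup column is simply the ratio of the two latency columns, e.g. for LLaMA-2-7B on the 4090:

```python
# Latencies from the table above, in ms per token.
fp16_ms, int4_ms = 19.97, 8.66
print(f"{fp16_ms / int4_ms:.2f}x")  # -> 2.31x
```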
Detailed benchmark (CPU vs. GPU)

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):

  • CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
  • CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):

  • CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
  • CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
  • CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
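The two figures in each entry are (approximately) reciprocals of each other; small differences in the reported pairs come from measurement rounding:

```python
def ms_per_token(tokens_per_s):
    # Convert a throughput figure to per-token latency.
    return 1000.0 / tokens_per_s

print(f"{ms_per_token(134):.2f} ms/token")  # 134 tokens/s -> 7.46 ms/token
```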

Reference

If you find AWQ useful or relevant to your research, please cite the paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

autoawq-0.0.1-cp311-cp311-win_amd64.whl (175.5 kB)

Uploaded CPython 3.11, Windows x86-64

autoawq-0.0.1-cp311-cp311-manylinux2014_x86_64.whl (3.3 MB)

Uploaded CPython 3.11

autoawq-0.0.1-cp310-cp310-win_amd64.whl (174.8 kB)

Uploaded CPython 3.10, Windows x86-64

autoawq-0.0.1-cp310-cp310-manylinux2014_x86_64.whl (3.3 MB)

Uploaded CPython 3.10

autoawq-0.0.1-cp39-cp39-win_amd64.whl (174.8 kB)

Uploaded CPython 3.9, Windows x86-64

autoawq-0.0.1-cp39-cp39-manylinux2014_x86_64.whl (3.3 MB)

Uploaded CPython 3.9

autoawq-0.0.1-cp38-cp38-win_amd64.whl (174.5 kB)

Uploaded CPython 3.8, Windows x86-64

autoawq-0.0.1-cp38-cp38-manylinux2014_x86_64.whl (3.3 MB)

Uploaded CPython 3.8

File details

Details for the file autoawq-0.0.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.0.1-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 175.5 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.0.1-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 41ce2050e954aa385dfe940c71b4f4acaea898d51fd0d2c55c51c96fb5d2cff4
MD5 6920d53248ba74b1f77bbde399f51799
BLAKE2b-256 6ebdeb055041f2d7a1164dcb23863057b2a56c3bd4580619237d1b20029e2ca7

See more details on using hashes here.

File details

Details for the file autoawq-0.0.1-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.0.1-cp311-cp311-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 30a2d96dd14b734309eb06252df8fed030ebf206c96a9a89e13bde5af803bacc
MD5 6aa676bfe56ea03692d9b59233d05f02
BLAKE2b-256 a89b2c597d62b05e904234669eb8dc947d60098eb5192cc9e181837f518c5d5b

File details

Details for the file autoawq-0.0.1-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.0.1-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 174.8 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.0.1-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 71fb252a172db271402acb4c46c664951c305d93dc434e8843d5fe7b1cbc14de
MD5 b43266f991d481a43b3a3a1c316f9847
BLAKE2b-256 65ffcc33c1791b7900f3d07104a34490e566f053f8546c84a25ef45bde9f415a

File details

Details for the file autoawq-0.0.1-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.0.1-cp310-cp310-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 43f46f590619df7b3bb00ee6d833a743bb6b18e6e8cd5dac12e13f71bd040841
MD5 5f9a0ebde1176e1733f8000a0d4b48fa
BLAKE2b-256 4babee760d5c378d7c73f35dfb19a1d8622ea02f3e6b1a7ef24346abec8cebe0

File details

Details for the file autoawq-0.0.1-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.0.1-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 174.8 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.0.1-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 14ffe1c42582ea55814ce78dff1be5a073a50049a7498943f56d395091067d45
MD5 cfc323c1a86d877b9bcc59a2baa8f1b0
BLAKE2b-256 afd34a05a75d9e076717c4ee8aee62c1f97ddf3d5c8b5e0d820115f7e049b6ab

File details

Details for the file autoawq-0.0.1-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.0.1-cp39-cp39-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0aec22f22d0b1b02a0fd60bea0ff52b7f722f258497bd8795c61021636265b05
MD5 8a648e4dd6cf8811f887795fa65184a3
BLAKE2b-256 b25a9ccd8fa76cd60b40c0f224e0300c680f3e6080dd753718a07cbb2a5a1aad

File details

Details for the file autoawq-0.0.1-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.0.1-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 174.5 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.0.1-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 bdd811e50cfef89112d29791fed0a9ce3d84c5137a147685b1cb5ea07b27bea0
MD5 33366cd2b19d96cc47bf5924ff897570
BLAKE2b-256 5c6714dd990569d8a6595638564a3b480d3d67c96f742540218bdea69e55507f

File details

Details for the file autoawq-0.0.1-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.0.1-cp38-cp38-manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f04fe762cf142335557a0024699fb60ae6e061b7b86a6ad1792c118a5dc3e1d9
MD5 8bd3dbd38410ad563f76e29e51ceaf2b
BLAKE2b-256 63c8648df935b3452f831591cbc1ccdae2a9e66a8923960def1c8066e1e72bfa
