
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

AutoAWQ is a package that implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ speeds up your LLM by roughly 2x compared to FP16 (see the benchmarks below). AutoAWQ was created as an improved fork of the original AWQ work from MIT.

Roadmap:

  • Publish pip package
  • Refactor quantization code
  • Support more models
  • Optimize the speed of models

Install

Requirements:

  • An NVIDIA GPU with Compute Capability 8.0 (sm80) or higher; Ampere and later architectures are supported.
  • CUDA Toolkit 11.8 or later.

Install:

  • Use pip to install the autoawq package
pip install autoawq

Build from source

Build AutoAWQ from source:
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models Sizes
LLaMA-2 7B/13B/70B
LLaMA 7B/13B/30B/65B
Vicuna 7B/13B
MPT 7B/30B
Falcon 7B/40B
OPT 125m/1.3B/2.7B/6.7B/13B/30B
Bloom 560m/3B/7B
LLaVA-v0 13B
GPT-J 6.7B

Usage

Below are examples of how to quantize a model and run inference.

Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
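
For intuition, the quant_config above asks for 4-bit weights ("w_bit": 4) quantized in groups of 128 values ("q_group_size": 128) with an asymmetric zero point ("zero_point": True). The following is a minimal, hypothetical sketch of what group-wise zero-point quantization does to a single weight group; it only illustrates the storage format and is not AutoAWQ's internal implementation, which additionally applies activation-aware scaling before quantizing.

import torch

def quantize_group(w: torch.Tensor, w_bit: int = 4):
    # Asymmetric (zero-point) quantization of one group, e.g. 128 weights.
    qmax = 2 ** w_bit - 1                                # 15 for 4-bit: integer codes 0..15
    scale = (w.max() - w.min()).clamp(min=1e-5) / qmax   # per-group step size
    zero = torch.round(-w.min() / scale)                 # per-group integer zero point
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return q, scale, zero

# Round-trip check: the dequantized weights approximate the originals.
w = torch.randn(128)
q, scale, zero = quantize_group(w)
w_hat = (q - zero) * scale                               # per-weight error is at most about scale / 2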

Inference

Run inference on a quantized model from the Hugging Face Hub:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
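
A minimal sketch of a full generation call, assuming model.generate forwards to the underlying transformers generate() method and that the quantized weights are loaded on the GPU; the prompt and max_new_tokens below are illustrative, not part of AutoAWQ's API.

prompt = "What is activation-aware weight quantization?"             # illustrative prompt
tokens = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda") # move inputs to the GPU

# Standard transformers-style generation on the quantized model
output = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))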

Benchmarks

Benchmark speeds vary from server to server and also depend on your CPU. To minimize latency, rent a GPU/CPU combination with high memory bandwidth for both and high single-core CPU speed.

Model GPU FP16 latency (ms) INT4 latency (ms) Speedup
LLaMA-2-7B 4090 19.97 8.66 2.31x
LLaMA-2-13B 4090 OOM 13.54 --
Vicuna-7B 4090 19.09 8.61 2.22x
Vicuna-13B 4090 OOM 12.17 --
MPT-7B 4090 17.09 12.58 1.36x
MPT-30B 4090 OOM 23.54 --
Falcon-7B 4090 29.91 19.84 1.51x
LLaMA-2-7B A6000 27.14 12.44 2.18x
LLaMA-2-13B A6000 47.28 20.28 2.33x
Vicuna-7B A6000 26.06 12.43 2.10x
Vicuna-13B A6000 44.91 17.30 2.60x
MPT-7B A6000 22.79 16.87 1.35x
MPT-30B A6000 OOM 31.57 --
Falcon-7B A6000 39.44 27.34 1.44x
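
The Speedup column is the ratio of the two latencies, and a per-token latency converts to throughput as 1000 / latency_ms. A quick sanity check on the LLaMA-2-7B / 4090 row, using only the numbers from the table above:

fp16_ms, int4_ms = 19.97, 8.66      # LLaMA-2-7B on an RTX 4090, from the table above

speedup = fp16_ms / int4_ms         # ~2.31x, matching the Speedup column
tokens_per_s = 1000 / int4_ms       # ~115 tokens/s, assuming the latency is per token
print(f"{speedup:.2f}x speedup, {tokens_per_s:.0f} tokens/s")
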
Detailed benchmark (CPU vs. GPU)

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):

  • CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
  • CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):

  • CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
  • CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
  • CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
