AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

These details have not been verified by PyPI

Project links

Homepage

Project description

AutoAWQ

AutoAWQ is an easy-to-use package for 4-bit quantized models. Compared to FP16, AutoAWQ speeds up models by 2x while reducing memory requirements by 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. It builds on and improves the original AWQ work from MIT.
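As a rough illustration of where the memory saving comes from, the sketch below estimates weight storage only, comparing FP16 weights with 4-bit weights that carry one FP16 scale and one 4-bit zero point per group of 128 weights. The 7 billion parameter count, the per-group overhead layout, and the function names are assumptions made for this example and are not taken from AutoAWQ. Activations, the KV cache, and any layers left unquantized are ignored, so measured savings will generally be smaller than this weight-only ratio.

```python
def fp16_weight_bytes(num_params: int) -> float:
    """Bytes needed to store every weight as a 16-bit float."""
    return num_params * 2.0


def int4_weight_bytes(
    num_params: int,
    group_size: int = 128,
    w_bit: int = 4,
    scale_bits: int = 16,  # assumed: one FP16 scale per group
    zero_bits: int = 4,    # assumed: one 4-bit zero point per group
) -> float:
    """Bytes for w_bit weights plus the per-group scale and zero point."""
    bits_per_weight = w_bit + (scale_bits + zero_bits) / group_size
    return num_params * bits_per_weight / 8.0


if __name__ == "__main__":
    n = 7_000_000_000  # assumed parameter count for a "7B" model
    fp16 = fp16_weight_bytes(n)
    int4 = int4_weight_bytes(n)
    print(f"FP16 weights: {fp16 / 1e9:.2f} GB")
    print(f"INT4 weights: {int4 / 1e9:.2f} GB")
    print(f"Weight-only ratio: {fp16 / int4:.2f}x")
```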

Latest News 🔥

[2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
[2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
[2023/08] PyPI package released and AutoModel class available.

Install

Requirements:

Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
CUDA Toolkit 11.8 and later.
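The two requirements above can be checked programmatically. The following is a minimal sketch: the helper names are hypothetical, and the optional PyTorch part assumes a CUDA-enabled PyTorch build is already installed. Note that torch.version.cuda reports the CUDA version PyTorch was built against, which may differ from the toolkit installed on the system.

```python
from typing import Sequence, Tuple

MIN_COMPUTE_CAPABILITY = (8, 0)  # sm80, from the requirements list
MIN_CUDA_VERSION = (11, 8)       # CUDA Toolkit 11.8, from the requirements list


def parse_version(text: str) -> Tuple[int, int]:
    """Turn a version string such as '11.8' into (11, 8)."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def meets_requirements(compute_capability: Sequence[int], cuda_version: str) -> bool:
    """True when both the GPU and the CUDA version satisfy the minimums."""
    return (
        tuple(compute_capability) >= MIN_COMPUTE_CAPABILITY
        and parse_version(cuda_version) >= MIN_CUDA_VERSION
    )


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available() and torch.version.cuda:
        capability = torch.cuda.get_device_capability(0)  # e.g. (8, 6)
        print(meets_requirements(capability, torch.version.cuda))
    else:
        print("No CUDA-enabled PyTorch found; cannot check this machine.")
```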

Install:

Use pip to install AutoAWQ:

pip install autoawq
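To confirm which version ended up in the active environment, a small check with the standard library can help. This is only a sketch; the helper name installed_version is made up for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("autoawq")
    print(version or "autoawq is not installed in this environment")
```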

Using conda

CUDA dependencies can be hard to manage, so using conda with AutoAWQ is recommended:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq

Build source

Build AutoAWQ from scratch

The build can take about 10 minutes, so consider downloading your model while AutoAWQ installs.

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models	Sizes
LLaMA-2	7B/13B/70B
LLaMA	7B/13B/30B/65B
Vicuna	7B/13B
MPT	7B/30B
Falcon	7B/40B
OPT	125m/1.3B/2.7B/6.7B/13B/30B
Bloom	560m/3B/7B
GPT-J	6B

Usage

The examples directory contains scripts showing how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names relate to how matrix multiplication runs under the hood. We suggest the following:

GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
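The suggestions above can be written down as a small decision helper. This is a sketch, not part of AutoAWQ: the function name, the 512-token cutoff for a "small" context, and the choice to hand batch size 8 to FP16 (the guidance overlaps at exactly 8) are all assumptions made for illustration.

```python
def suggest_version(
    batch_size: int,
    context_tokens: int,
    small_context_limit: int = 512,  # assumed cutoff; the guidance gives no number
) -> str:
    """Map a workload to 'GEMV', 'GEMM', or 'FP16' following the guidance."""
    if batch_size < 1 or context_tokens < 0:
        raise ValueError("batch_size must be >= 1 and context_tokens >= 0")
    if batch_size >= 8:
        # Large batches: non-quantized FP16 gives the highest throughput.
        return "FP16"
    if batch_size == 1 and context_tokens <= small_context_limit:
        # Single request with a short prompt: GEMV decodes fastest.
        return "GEMV"
    # Larger context or batch sizes between 2 and 7.
    return "GEMM"


if __name__ == "__main__":
    for batch, context in [(1, 128), (1, 4096), (4, 128), (16, 128)]:
        print(batch, context, suggest_version(batch, context))
```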

Examples

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
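To give a feel for what the three quant_config keys control, here is a pure-Python sketch of group-wise round-to-nearest quantization with a zero point. It is not what model.quantize does internally: AWQ additionally searches for activation-aware scaling before rounding, which is omitted here. The function names are hypothetical, and the sketch always uses a zero point (the zero_point: True case).

```python
from typing import List, Tuple


def quantize_group(weights: List[float], w_bit: int = 4) -> Tuple[List[int], float, int]:
    """Quantize one group to integers in [0, 2**w_bit - 1] with a zero point."""
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        scale = 1.0  # constant group: avoid dividing by zero
    # The zero point is clamped so it fits in w_bit bits. Groups whose values
    # do not straddle zero lose accuracy in this simplified version.
    zero = max(0, min(q_max, round(-w_min / scale)))
    q = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q: List[int], scale: float, zero: int) -> List[float]:
    """Map the stored integers back to approximate float weights."""
    return [(value - zero) * scale for value in q]


def quantize_dequantize(
    weights: List[float], q_group_size: int = 128, w_bit: int = 4
) -> List[float]:
    """Quantize each group of q_group_size weights independently, then restore."""
    restored: List[float] = []
    for start in range(0, len(weights), q_group_size):
        group = weights[start:start + q_group_size]
        q, scale, zero = quantize_group(group, w_bit)
        restored.extend(dequantize_group(q, scale, zero))
    return restored


if __name__ == "__main__":
    import math

    weights = [math.sin(i) for i in range(256)]
    restored = quantize_dequantize(weights, q_group_size=128, w_bit=4)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"largest round-trip error: {worst:.4f}")
```

A smaller q_group_size gives each scale fewer weights to cover, which tends to reduce the rounding error at the cost of storing more scales and zero points.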

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)

AutoAWQForCausalLM.from_quantized

quant_path: Path to folder containing model files.
quant_filename: The filename of the model weights or the index.json file.
max_new_tokens: The max sequence length, used to allocate kv-cache for fused models.
fuse_layers: Whether or not to use fused layers.
batch_size: The batch size to initialize the AWQ model with.
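Because max_new_tokens and batch_size determine how much KV cache is allocated up front for fused models, it can help to estimate that cost before loading. The sketch below uses the standard size of a key/value cache for multi-head attention with FP16 entries. It is not AutoAWQ's allocation code, and the layer count (32) and hidden size (4096) in the usage example are assumed values for a LLaMA-2 7B style model.

```python
def kv_cache_bytes(
    batch_size: int,
    max_seq_len: int,
    num_layers: int,
    hidden_size: int,
    bytes_per_value: int = 2,  # assumed FP16 cache entries
) -> int:
    """Bytes for a preallocated key/value cache.

    The leading factor 2 accounts for one key tensor and one value tensor
    per layer, each of shape (batch_size, max_seq_len, hidden_size).
    """
    return 2 * num_layers * batch_size * max_seq_len * hidden_size * bytes_per_value


if __name__ == "__main__":
    # Assumed model shape: 32 layers, hidden size 4096.
    size = kv_cache_bytes(batch_size=1, max_seq_len=2048, num_layers=32, hidden_size=4096)
    print(f"{size / 1024 ** 3:.2f} GiB")
```

Under these assumptions the cache grows linearly with both batch_size and the maximum sequence length, so doubling either doubles the allocation.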

Benchmarks

Vicuna 7B (LLaMA-2)

Note: Blazing fast generation, slow context processing
GPU: NVIDIA GeForce RTX 3090
Version: GEMV
Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv

Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
1	32	32	231.393	153.632	4.66 GB (19.68%)
1	64	64	233.909	154.475	4.66 GB (19.68%)
1	128	128	233.145	152.133	4.66 GB (19.68%)
1	256	256	228.562	147.692	4.67 GB (19.72%)
1	512	512	228.914	139.179	4.80 GB (20.26%)
1	1024	1024	227.393	125.058	5.56 GB (23.48%)
1	2048	2048	225.736	123.228	8.08 GB (34.09%)

Note: Fast generation, fast context processing
GPU: NVIDIA GeForce RTX 3090
Version: GEMM
Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq

Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
1	32	32	521.444	126.51	4.55 GB (19.21%)
1	64	64	2618.88	125.428	4.57 GB (19.31%)
1	128	128	2808.09	123.865	4.61 GB (19.44%)
1	256	256	2807.46	120.779	4.67 GB (19.72%)
1	512	512	2769.9	115.08	4.80 GB (20.26%)
1	1024	1024	2640.95	105.493	5.56 GB (23.48%)
1	2048	2048	2341.36	104.188	8.08 GB (34.09%)
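To compare the two versions end to end, the prefill and decode rates from the Vicuna 7B tables can be combined into a single request time. The sketch below assumes prefill and decode run one after the other at the measured rates and ignores any other overhead; the function name and the ROWS dictionary are made up for this example, and the numbers are copied from the 32-token and 2048-token rows.

```python
def total_seconds(
    prefill_len: int, decode_len: int, prefill_tps: float, decode_tps: float
) -> float:
    """Time to process the prompt plus time to generate the output."""
    return prefill_len / prefill_tps + decode_len / decode_tps


# (prefill tokens/s, decode tokens/s) keyed by prefill/decode length,
# copied from the Vicuna 7B tables for batch size 1.
ROWS = {
    "GEMV": {32: (231.393, 153.632), 2048: (225.736, 123.228)},
    "GEMM": {32: (521.444, 126.51), 2048: (2341.36, 104.188)},
}

if __name__ == "__main__":
    for length in (32, 2048):
        for version, rows in ROWS.items():
            prefill_tps, decode_tps = rows[length]
            seconds = total_seconds(length, length, prefill_tps, decode_tps)
            print(f"{version} {length}/{length}: {seconds:.2f} s")
```

Under this simple model the 2048/2048 case takes roughly 25.7 s with GEMV and 20.5 s with GEMM: GEMV decodes faster, but its slow prefill outweighs that gain when the prompt is as long as the generated text.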

MPT 7B

Note: Blazing fast generation, slow context processing
GPU: NVIDIA GeForce RTX 3090
Version: GEMV
Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv

Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
1	32	32	187.332	136.765	3.65 GB (15.42%)
1	64	64	241.026	136.476	3.67 GB (15.48%)
1	128	128	239.44	137.599	3.70 GB (15.61%)
1	256	256	233.184	137.02	3.76 GB (15.88%)
1	512	512	233.082	135.633	3.89 GB (16.41%)
1	1024	1024	231.504	122.197	4.40 GB (18.57%)
1	2048	2048	228.307	121.468	5.92 GB (24.98%)

Note: Fast generation, fast context processing
GPU: NVIDIA GeForce RTX 3090
Version: GEMM
Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq

Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
1	32	32	557.714	118.567	3.65 GB (15.42%)
1	64	64	2752.9	120.772	3.67 GB (15.48%)
1	128	128	2982.67	119.52	3.70 GB (15.61%)
1	256	256	3009.16	116.911	3.76 GB (15.88%)
1	512	512	2901.91	111.607	3.95 GB (16.68%)
1	1024	1024	2718.68	102.623	4.40 GB (18.57%)
1	2048	2048	2363.61	101.368	5.92 GB (24.98%)

Falcon 7B

Note: Fast generation, fast context processing
GPU: NVIDIA GeForce RTX 3090
Version: GEMM
Command: python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt

Batch Size	Prefill Length	Decode Length	Prefill tokens/s	Decode tokens/s	Memory (VRAM)
1	32	32	466.826	95.1413	4.47 GB (18.88%)
1	64	64	1920.61	94.5963	4.48 GB (18.92%)
1	128	128	2406.1	94.793	4.48 GB (18.92%)
1	256	256	2521.08	94.1144	4.48 GB (18.92%)
1	512	512	2478.28	93.4123	4.48 GB (18.92%)
1	1024	1024	2256.22	94.0237	4.69 GB (19.78%)
1	2048	2048	1831.71	94.2032	6.83 GB (28.83%)

Reference

If you find AWQ useful or relevant to your research, you can cite the original paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

Project details

Release history

0.2.9

May 11, 2025

0.2.8

Jan 20, 2025

0.2.7.post3

Dec 6, 2024

0.2.7.post2

Nov 18, 2024

0.2.7.post1

Nov 16, 2024

0.2.7

Nov 16, 2024

0.2.6

Jul 23, 2024

0.2.5

May 2, 2024

0.2.4

Mar 24, 2024

0.2.3

Mar 2, 2024

0.2.2

Feb 17, 2024

0.2.1

Feb 16, 2024

0.2.0

Feb 15, 2024

0.1.8

Dec 23, 2023

0.1.7

Nov 16, 2023

0.1.6

Nov 4, 2023

0.1.5

Oct 28, 2023

0.1.4

Oct 6, 2023

This version

0.1.3

Oct 5, 2023

0.1.2

Oct 2, 2023

0.1.1

Oct 1, 2023

0.1.0

Sep 21, 2023

0.0.2

Sep 6, 2023

0.0.1

Sep 1, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

autoawq-0.1.3-cp311-cp311-win_amd64.whl (245.0 kB view details)

Uploaded Oct 5, 2023 CPython 3.11 Windows x86-64

autoawq-0.1.3-cp311-cp311-manylinux2014_x86_64.whl (20.0 MB view details)

Uploaded Oct 5, 2023 CPython 3.11

autoawq-0.1.3-cp310-cp310-win_amd64.whl (244.0 kB view details)

Uploaded Oct 5, 2023 CPython 3.10 Windows x86-64

autoawq-0.1.3-cp310-cp310-manylinux2014_x86_64.whl (20.0 MB view details)

Uploaded Oct 5, 2023 CPython 3.10

autoawq-0.1.3-cp39-cp39-win_amd64.whl (244.1 kB view details)

Uploaded Oct 5, 2023 CPython 3.9 Windows x86-64

autoawq-0.1.3-cp39-cp39-manylinux2014_x86_64.whl (20.0 MB view details)

Uploaded Oct 5, 2023 CPython 3.9

autoawq-0.1.3-cp38-cp38-win_amd64.whl (243.4 kB view details)

Uploaded Oct 5, 2023 CPython 3.8 Windows x86-64

autoawq-0.1.3-cp38-cp38-manylinux2014_x86_64.whl (19.9 MB view details)

Uploaded Oct 5, 2023 CPython 3.8

File details

Details for the file autoawq-0.1.3-cp311-cp311-win_amd64.whl.

File metadata

Download URL: autoawq-0.1.3-cp311-cp311-win_amd64.whl
Upload date: Oct 5, 2023
Size: 245.0 kB
Tags: CPython 3.11, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp311-cp311-win_amd64.whl
Algorithm	Hash digest
SHA256	`ca94be6dbe4d97e715b3c70017680a767264746e737a00b339455fb519280585`
MD5	`6ca9cc0c9ff3ae0b098f453c0cb7242b`
BLAKE2b-256	`fd1d0f3bea35aea9d8a646fefb098c78dc5af0260203549710f47d0a75420b04`

See more details on using hashes here.
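A downloaded wheel can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names are hypothetical, and the local file path in the usage part assumes the wheel was saved to the current directory under its original name.

```python
import hashlib
import hmac
import os


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large wheels are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published one (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


if __name__ == "__main__":
    wheel = "autoawq-0.1.3-cp311-cp311-win_amd64.whl"  # assumed local path
    expected = "ca94be6dbe4d97e715b3c70017680a767264746e737a00b339455fb519280585"
    if os.path.exists(wheel):
        print("digest matches" if matches_sha256(wheel, expected) else "digest MISMATCH")
    else:
        print(f"{wheel} not found; download it first")
```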

File details

Details for the file autoawq-0.1.3-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

Download URL: autoawq-0.1.3-cp311-cp311-manylinux2014_x86_64.whl
Upload date: Oct 5, 2023
Size: 20.0 MB
Tags: CPython 3.11
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp311-cp311-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`c674adcbfea3af574bd03da2fe2a74f1ca274b57fda0e59d8364425469a06783`
MD5	`a364087efbfb4edf77fb53a79521005f`
BLAKE2b-256	`15e871b6e90c495d0a53cde93cfde11e4c965614f890bc50de784958cc0b0fea`

See more details on using hashes here.

File details

Details for the file autoawq-0.1.3-cp310-cp310-win_amd64.whl.

File metadata

Download URL: autoawq-0.1.3-cp310-cp310-win_amd64.whl
Upload date: Oct 5, 2023
Size: 244.0 kB
Tags: CPython 3.10, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp310-cp310-win_amd64.whl
Algorithm	Hash digest
SHA256	`1918d8a26d01ff8127a2e9271026be342406b6d6c02141c3aa611617266bb4c2`
MD5	`90c5adf73e82350304870676b8f14e11`
BLAKE2b-256	`49785acb52da285bba7aef386b9e7d4773b52fcb2230a20da44ffde2c8e6511c`

See more details on using hashes here.

File details

Details for the file autoawq-0.1.3-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

Download URL: autoawq-0.1.3-cp310-cp310-manylinux2014_x86_64.whl
Upload date: Oct 5, 2023
Size: 20.0 MB
Tags: CPython 3.10
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp310-cp310-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`785c33d6493130aaec9d0af341d5aafbfa490df291c8b0118e68d74564360280`
MD5	`27d2756e221ca37eb88292b3bbf8c9b9`
BLAKE2b-256	`9621a66c36c632951facda8a9cfbb2a25bb28558bfb4dd1acec9899e8a60daa3`

See more details on using hashes here.

File details

Details for the file autoawq-0.1.3-cp39-cp39-win_amd64.whl.

File metadata

Download URL: autoawq-0.1.3-cp39-cp39-win_amd64.whl
Upload date: Oct 5, 2023
Size: 244.1 kB
Tags: CPython 3.9, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp39-cp39-win_amd64.whl
Algorithm	Hash digest
SHA256	`96876670078c6e83748553f5c659dfd7043d69839b0af0535b261c2e2ca192dd`
MD5	`075d12837aa0e39766c2df96aa2c74a3`
BLAKE2b-256	`6aea96432ac13696dd3c082896b3cf095e80c5b589e86067f7527b0008587eec`

See more details on using hashes here.

File details

Details for the file autoawq-0.1.3-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

Download URL: autoawq-0.1.3-cp39-cp39-manylinux2014_x86_64.whl
Upload date: Oct 5, 2023
Size: 20.0 MB
Tags: CPython 3.9
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp39-cp39-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`175334d5429779333a1881faa7cd0628ab3c3c0eb82af09094b97d103dfc1c08`
MD5	`88aa9409846025fc5f29d70230052a2a`
BLAKE2b-256	`13fef6c2ee6b1d8ae04ee2520bdcfc61414b8f98c5975e5eb2caa5c6700d1a59`

See more details on using hashes here.

File details

Details for the file autoawq-0.1.3-cp38-cp38-win_amd64.whl.

File metadata

Download URL: autoawq-0.1.3-cp38-cp38-win_amd64.whl
Upload date: Oct 5, 2023
Size: 243.4 kB
Tags: CPython 3.8, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp38-cp38-win_amd64.whl
Algorithm	Hash digest
SHA256	`afff79a385ba7384d8aab366585e5c2fd8179d0660266b2384859a89018d48a0`
MD5	`664b477691a5cd2dace67a1a45943994`
BLAKE2b-256	`8c29c893f096b6f29a72c743b7b6930e4fdca2e273c0a203ab3ccdf34d68da79`

See more details on using hashes here.

File details

Details for the file autoawq-0.1.3-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

Download URL: autoawq-0.1.3-cp38-cp38-manylinux2014_x86_64.whl
Upload date: Oct 5, 2023
Size: 19.9 MB
Tags: CPython 3.8
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp38-cp38-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`704127b46fce550669d9ac7b1cfb626b9d226bf42062ed946bdbd3f2d285afb6`
MD5	`fb33fdf4d5dde0d12de67360f0f57ffa`
BLAKE2b-256	`f45b50f79271a6f84b0acab172d6bd5478c72e6743e136fc5916f870e4c6a59b`

See more details on using hashes here.

autoawq 0.1.3
