
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ



AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 2x while reducing memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs, building on and improving the original work from MIT.

Latest News 🔥

  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
  • [2023/08] PyPI package released and AutoModel class available.

Install

Requirements:

  • Compute capability 8.0 (sm80) or higher. Ampere and later architectures are supported (see the check below).
  • CUDA Toolkit 11.8 or later.
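
If you are unsure whether a GPU meets the compute-capability requirement, PyTorch can report it (a quick check, assuming PyTorch is already installed):

import torch

# AutoAWQ requires compute capability 8.0 (sm80) or newer, i.e. Ampere onwards.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability {major}.{minor}; supported: {(major, minor) >= (8, 0)}")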

Install:

  • Install with pip:
pip install autoawq

Using conda

CUDA dependencies can sometimes be hard to manage, so it is recommended to use conda with AutoAWQ:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq
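
To sanity-check the environment afterwards, the following minimal snippet confirms the wheel imports and that the CUDA build of PyTorch is active (note that the import name is awq even though the PyPI package is autoawq):

from importlib.metadata import version

import torch
import awq  # import name differs from the PyPI package name

print("autoawq:", version("autoawq"))
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())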

Build from source

Build AutoAWQ from source as follows. The build can take around 10 minutes, so you may want to download your model while AutoAWQ compiles.

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models  | Sizes
------- | ---------------------------
LLaMA-2 | 7B/13B/70B
LLaMA   | 7B/13B/30B/65B
Vicuna  | 7B/13B
MPT     | 7B/30B
Falcon  | 7B/40B
OPT     | 125m/1.3B/2.7B/6.7B/13B/30B
Bloom   | 560m/3B/7B
GPTJ    | 6.7B

Usage

The examples directory shows how to quantize models, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ kernels: GEMM and GEMV. Both names refer to how the underlying matrix multiplication runs. We suggest the following (summarized in the sketch after this list):

  • GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
  • GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
  • FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
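
As a rough illustration, the recommendations above can be condensed into a small helper. This is a sketch only: the context cutoff is an assumption, not an AutoAWQ constant, and the kernel version is fixed at quantization time, so the choice in practice is which quantized artifact or serving stack to use.

def choose_backend(batch_size: int, context_len: int) -> str:
    """Heuristic sketch of the GEMV/GEMM/FP16 guidance above."""
    if batch_size >= 8:
        return "FP16"  # large batches: highest throughput, serve with TGI or vLLM
    if batch_size == 1 and context_len <= 512:  # assumed cutoff for "small context"
        return "GEMV"  # highest tokens/s at batch size 1
    return "GEMM"  # faster context processing, good up to batch size 8

print(choose_backend(batch_size=1, context_len=128))  # -> GEMV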

Examples

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }  # 4-bit weights, groups of 128, zero-point (asymmetric) quantization

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)
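
The streamer prints tokens as they are generated. To capture the completed text instead, decode the ids returned by generate() (a minimal sketch using the standard transformers API):

# generate() returns token ids that include the prompt; decode and drop special tokens.
output_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(output_text)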

AutoAWQForCausalLM.from_quantized

  • quant_path: Path to the folder containing the model files.
  • quant_filename: Filename of the model weights or the index.json file.
  • max_new_tokens: The maximum sequence length, used to allocate the kv-cache for fused models.
  • fuse_layers: Whether or not to use fused layers.
  • batch_size: The batch size to initialize the AWQ model with.
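
Putting the arguments together, a fully specified call might look like the sketch below (the keyword names follow the parameter list above and are assumed to match this release's signature):

model = AutoAWQForCausalLM.from_quantized(
    quant_path,            # folder containing the model files
    quant_file,            # filename of the weights or index.json
    max_new_tokens=512,    # max sequence length; sizes the kv-cache for fused models
    fuse_layers=True,      # enable fused layers
    batch_size=1,          # batch size to initialize the AWQ model with
)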

Benchmarks

Vicuna 7B (LLaMA-2)

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
---------- | -------------- | ------------- | ---------------- | --------------- | ----------------
1 | 32 | 32 | 231.393 | 153.632 | 4.66 GB (19.68%)
1 | 64 | 64 | 233.909 | 154.475 | 4.66 GB (19.68%)
1 | 128 | 128 | 233.145 | 152.133 | 4.66 GB (19.68%)
1 | 256 | 256 | 228.562 | 147.692 | 4.67 GB (19.72%)
1 | 512 | 512 | 228.914 | 139.179 | 4.80 GB (20.26%)
1 | 1024 | 1024 | 227.393 | 125.058 | 5.56 GB (23.48%)
1 | 2048 | 2048 | 225.736 | 123.228 | 8.08 GB (34.09%)
  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
---------- | -------------- | ------------- | ---------------- | --------------- | ----------------
1 | 32 | 32 | 521.444 | 126.51 | 4.55 GB (19.21%)
1 | 64 | 64 | 2618.88 | 125.428 | 4.57 GB (19.31%)
1 | 128 | 128 | 2808.09 | 123.865 | 4.61 GB (19.44%)
1 | 256 | 256 | 2807.46 | 120.779 | 4.67 GB (19.72%)
1 | 512 | 512 | 2769.9 | 115.08 | 4.80 GB (20.26%)
1 | 1024 | 1024 | 2640.95 | 105.493 | 5.56 GB (23.48%)
1 | 2048 | 2048 | 2341.36 | 104.188 | 8.08 GB (34.09%)

MPT 7B

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
---------- | -------------- | ------------- | ---------------- | --------------- | ----------------
1 | 32 | 32 | 187.332 | 136.765 | 3.65 GB (15.42%)
1 | 64 | 64 | 241.026 | 136.476 | 3.67 GB (15.48%)
1 | 128 | 128 | 239.44 | 137.599 | 3.70 GB (15.61%)
1 | 256 | 256 | 233.184 | 137.02 | 3.76 GB (15.88%)
1 | 512 | 512 | 233.082 | 135.633 | 3.89 GB (16.41%)
1 | 1024 | 1024 | 231.504 | 122.197 | 4.40 GB (18.57%)
1 | 2048 | 2048 | 228.307 | 121.468 | 5.92 GB (24.98%)
  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
---------- | -------------- | ------------- | ---------------- | --------------- | ----------------
1 | 32 | 32 | 557.714 | 118.567 | 3.65 GB (15.42%)
1 | 64 | 64 | 2752.9 | 120.772 | 3.67 GB (15.48%)
1 | 128 | 128 | 2982.67 | 119.52 | 3.70 GB (15.61%)
1 | 256 | 256 | 3009.16 | 116.911 | 3.76 GB (15.88%)
1 | 512 | 512 | 2901.91 | 111.607 | 3.95 GB (16.68%)
1 | 1024 | 1024 | 2718.68 | 102.623 | 4.40 GB (18.57%)
1 | 2048 | 2048 | 2363.61 | 101.368 | 5.92 GB (24.98%)

Falcon 7B

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM)
---------- | -------------- | ------------- | ---------------- | --------------- | ----------------
1 | 32 | 32 | 466.826 | 95.1413 | 4.47 GB (18.88%)
1 | 64 | 64 | 1920.61 | 94.5963 | 4.48 GB (18.92%)
1 | 128 | 128 | 2406.1 | 94.793 | 4.48 GB (18.92%)
1 | 256 | 256 | 2521.08 | 94.1144 | 4.48 GB (18.92%)
1 | 512 | 512 | 2478.28 | 93.4123 | 4.48 GB (18.92%)
1 | 1024 | 1024 | 2256.22 | 94.0237 | 4.69 GB (19.78%)
1 | 2048 | 2048 | 1831.71 | 94.2032 | 6.83 GB (28.83%)

Reference

If you find AWQ useful or relevant to your research, please cite the paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • autoawq-0.1.3-cp311-cp311-win_amd64.whl (245.0 kB): CPython 3.11, Windows x86-64
  • autoawq-0.1.3-cp311-cp311-manylinux2014_x86_64.whl (20.0 MB): CPython 3.11
  • autoawq-0.1.3-cp310-cp310-win_amd64.whl (244.0 kB): CPython 3.10, Windows x86-64
  • autoawq-0.1.3-cp310-cp310-manylinux2014_x86_64.whl (20.0 MB): CPython 3.10
  • autoawq-0.1.3-cp39-cp39-win_amd64.whl (244.1 kB): CPython 3.9, Windows x86-64
  • autoawq-0.1.3-cp39-cp39-manylinux2014_x86_64.whl (20.0 MB): CPython 3.9
  • autoawq-0.1.3-cp38-cp38-win_amd64.whl (243.4 kB): CPython 3.8, Windows x86-64
  • autoawq-0.1.3-cp38-cp38-manylinux2014_x86_64.whl (19.9 MB): CPython 3.8

File details

Details for the file autoawq-0.1.3-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.3-cp311-cp311-win_amd64.whl
  • Size: 245.0 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp311-cp311-win_amd64.whl:

Algorithm   | Hash digest
SHA256      | ca94be6dbe4d97e715b3c70017680a767264746e737a00b339455fb519280585
MD5         | 6ca9cc0c9ff3ae0b098f453c0cb7242b
BLAKE2b-256 | fd1d0f3bea35aea9d8a646fefb098c78dc5af0260203549710f47d0a75420b04

File details

Details for the file autoawq-0.1.3-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.1.3-cp311-cp311-manylinux2014_x86_64.whl:

Algorithm   | Hash digest
SHA256      | c674adcbfea3af574bd03da2fe2a74f1ca274b57fda0e59d8364425469a06783
MD5         | a364087efbfb4edf77fb53a79521005f
BLAKE2b-256 | 15e871b6e90c495d0a53cde93cfde11e4c965614f890bc50de784958cc0b0fea

File details

Details for the file autoawq-0.1.3-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.3-cp310-cp310-win_amd64.whl
  • Size: 244.0 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp310-cp310-win_amd64.whl:

Algorithm   | Hash digest
SHA256      | 1918d8a26d01ff8127a2e9271026be342406b6d6c02141c3aa611617266bb4c2
MD5         | 90c5adf73e82350304870676b8f14e11
BLAKE2b-256 | 49785acb52da285bba7aef386b9e7d4773b52fcb2230a20da44ffde2c8e6511c

File details

Details for the file autoawq-0.1.3-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.1.3-cp310-cp310-manylinux2014_x86_64.whl:

Algorithm   | Hash digest
SHA256      | 785c33d6493130aaec9d0af341d5aafbfa490df291c8b0118e68d74564360280
MD5         | 27d2756e221ca37eb88292b3bbf8c9b9
BLAKE2b-256 | 9621a66c36c632951facda8a9cfbb2a25bb28558bfb4dd1acec9899e8a60daa3

File details

Details for the file autoawq-0.1.3-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.3-cp39-cp39-win_amd64.whl
  • Size: 244.1 kB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp39-cp39-win_amd64.whl:

Algorithm   | Hash digest
SHA256      | 96876670078c6e83748553f5c659dfd7043d69839b0af0535b261c2e2ca192dd
MD5         | 075d12837aa0e39766c2df96aa2c74a3
BLAKE2b-256 | 6aea96432ac13696dd3c082896b3cf095e80c5b589e86067f7527b0008587eec

File details

Details for the file autoawq-0.1.3-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.1.3-cp39-cp39-manylinux2014_x86_64.whl:

Algorithm   | Hash digest
SHA256      | 175334d5429779333a1881faa7cd0628ab3c3c0eb82af09094b97d103dfc1c08
MD5         | 88aa9409846025fc5f29d70230052a2a
BLAKE2b-256 | 13fef6c2ee6b1d8ae04ee2520bdcfc61414b8f98c5975e5eb2caa5c6700d1a59

File details

Details for the file autoawq-0.1.3-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: autoawq-0.1.3-cp38-cp38-win_amd64.whl
  • Size: 243.4 kB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.12

File hashes

Hashes for autoawq-0.1.3-cp38-cp38-win_amd64.whl:

Algorithm   | Hash digest
SHA256      | afff79a385ba7384d8aab366585e5c2fd8179d0660266b2384859a89018d48a0
MD5         | 664b477691a5cd2dace67a1a45943994
BLAKE2b-256 | 8c29c893f096b6f29a72c743b7b6930e4fdca2e273c0a203ab3ccdf34d68da79

File details

Details for the file autoawq-0.1.3-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for autoawq-0.1.3-cp38-cp38-manylinux2014_x86_64.whl:

Algorithm   | Hash digest
SHA256      | 704127b46fce550669d9ac7b1cfb626b9d226bf42062ed946bdbd3f2d285afb6
MD5         | fb33fdf4d5dde0d12de67360f0f57ffa
BLAKE2b-256 | f45b50f79271a6f84b0acab172d6bd5478c72e6743e136fc5916f870e4c6a59b
