AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

AutoAWQ

AutoAWQ is an easy-to-use package for 4-bit quantized models. Compared to FP16, AutoAWQ speeds up models by 2x and reduces memory requirements by 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. It builds on and improves the original AWQ work from MIT.

Latest News 🔥

  • [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available.
  • [2023/08] PyPI package released and AutoModel class available.

Install

Requirements:

  • Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
  • CUDA Toolkit 11.8 and later.
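
The two requirements above can be checked before installing. The following is a minimal sketch, not part of AutoAWQ: the function names are hypothetical, and it assumes that the CUDA version PyTorch was built with (`torch.version.cuda`) is an acceptable stand-in for the installed CUDA Toolkit version.

```python
from typing import Optional, Tuple

MIN_COMPUTE_CAPABILITY = (8, 0)  # sm80, from the requirements list
MIN_CUDA_VERSION = (11, 8)       # CUDA Toolkit 11.8, from the requirements list


def parse_version(text: str) -> Tuple[int, int]:
    """Turn a version string such as '11.8' or '12.1.0' into (major, minor)."""
    parts = text.split(".")
    minor = int(parts[1]) if len(parts) > 1 else 0
    return int(parts[0]), minor


def meets_requirements(capability: Tuple[int, int], cuda_version: str) -> bool:
    """Return True if both the GPU and the CUDA version reach the listed minimums."""
    return (tuple(capability) >= MIN_COMPUTE_CAPABILITY
            and parse_version(cuda_version) >= MIN_CUDA_VERSION)


def check_local_gpu() -> Optional[bool]:
    """Check the first visible GPU. Returns None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return None
    # Assumption: torch.version.cuda is the version PyTorch was built against,
    # used here as a proxy for the installed toolkit.
    return meets_requirements(torch.cuda.get_device_capability(0), torch.version.cuda)


print(check_local_gpu())
```

A result of `None` means the check could not run on this machine, which is different from `False` (a GPU was found but does not meet the minimums).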

Install:

  • Use pip to install AutoAWQ:
pip install autoawq

Using conda

CUDA dependencies can be hard to manage, so we recommend using conda with AutoAWQ:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq

Build source

Build AutoAWQ from scratch

The build can take around 10 minutes, so consider downloading your model while AutoAWQ installs.

git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models    Sizes
LLaMA-2   7B/13B/70B
LLaMA     7B/13B/30B/65B
Vicuna    7B/13B
MPT       7B/30B
Falcon    7B/40B
OPT       125m/1.3B/2.7B/6.7B/13B/30B
Bloom     560m/3B/7B
GPTJ      6.7B

Usage

The examples directory contains scripts that show how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM and GEMV. Both names relate to how matrix multiplication runs under the hood. We suggest the following:

  • GEMV (quantized): Best for small context, batch size 1, highest number of tokens/s.
  • GEMM (quantized): Best for larger context, up to batch size 8, faster than GEMV on batch size > 1, slower than GEMV on batch size = 1.
  • FP16 (non-quantized): Best for large batch sizes of 8 or larger, highest throughput. We recommend TGI or vLLM.
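
The suggestions above can be written down as a small selection helper. This is only a sketch of the guidance in the list: `suggest_version` and its `long_context` flag are hypothetical names, not part of AutoAWQ. Because the list says GEMM works "up to batch size 8" while FP16 is best for "8 or larger", the behaviour at exactly 8 is a choice made here.

```python
def suggest_version(batch_size: int, long_context: bool = False) -> str:
    """Map a workload to one of the three options from the list above.

    long_context is a caller-supplied judgement; the list gives no token
    threshold for when a context counts as "larger".
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if batch_size >= 8:
        # Assumption: at exactly 8 the two suggestions overlap; FP16 is picked here.
        return "FP16"
    if batch_size == 1 and not long_context:
        return "GEMV"  # small context, batch size 1: highest tokens/s
    return "GEMM"      # larger context, or batch sizes between 2 and 7


for batch in (1, 4, 16):
    print(batch, suggest_version(batch))
```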

Examples

Quantization

Expect this to take 10-15 minutes on smaller 7B models, and around 1 hour for 70B models.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
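
To make the three `quant_config` fields more concrete, here is a plain-Python sketch of group-wise round-to-nearest quantization with a zero point. It is not the AWQ algorithm and not AutoAWQ's code: AWQ additionally uses activation statistics, which this sketch leaves out. The function names are hypothetical, and the random row is a stand-in for real weights.

```python
import random
from typing import List, Tuple

Group = Tuple[List[int], float, int]  # (integer codes, scale, zero point)


def quantize_group(weights: List[float], w_bit: int = 4) -> Group:
    """Quantize one group to w_bit-bit codes with a scale and integer zero point."""
    q_max = 2 ** w_bit - 1                    # largest code, 15 when w_bit is 4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        scale = 1.0                           # constant group: avoid dividing by zero
    zero = round(-w_min / scale)              # code that represents the real value 0.0
    # Clamp so that every code fits in w_bit bits. The zero point itself is not
    # clamped in this sketch; a packed storage format would need that as well.
    codes = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes: List[int], scale: float, zero: int) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row: List[float], q_group_size: int = 128, w_bit: int = 4) -> List[Group]:
    """Split a row into groups of q_group_size, each with its own scale and zero point."""
    return [quantize_group(row[i:i + q_group_size], w_bit)
            for i in range(0, len(row), q_group_size)]


rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(256)]  # illustrative weights only
groups = quantize_row(row, q_group_size=128, w_bit=4)
restored = [v for g in groups for v in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_error)
```

A smaller `q_group_size` gives each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales and zero points.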

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

# Load model
model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

# Convert prompt to tokens
prompt_template = """\
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:"""

tokens = tokenizer(
    prompt_template.format(prompt="How are you today?"), 
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens, 
    streamer=streamer,
    max_new_tokens=512
)

AutoAWQForCausalLM.from_quantized

  • quant_path: Path to the folder containing the model files.
  • quant_filename: The filename of the model weights or of the index.json file.
  • max_new_tokens: The maximum sequence length, used to allocate the KV cache for fused models.
  • fuse_layers: Whether or not to use fused layers.
  • batch_size: The batch size to initialize the AWQ model with.
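
Because max_new_tokens and batch_size determine how much KV cache is reserved for fused models, it can help to estimate that cost up front. The sketch below is a back-of-the-envelope formula, not AutoAWQ's allocation code. It assumes standard multi-head attention with one FP16 key tensor and one FP16 value tensor per layer, and uses the LLaMA-7B-class shape (32 layers, hidden size 4096) purely for illustration; `kv_cache_bytes` is a hypothetical name.

```python
def kv_cache_bytes(batch_size: int, max_new_tokens: int,
                   n_layers: int, hidden_size: int,
                   bytes_per_value: int = 2) -> int:
    """Estimate the size of a preallocated KV cache in bytes.

    The leading factor of 2 covers the separate key and value tensors;
    bytes_per_value=2 corresponds to FP16.
    """
    return 2 * n_layers * batch_size * max_new_tokens * hidden_size * bytes_per_value


# Assumed shape for a 7B LLaMA-style model: 32 layers, hidden size 4096.
size = kv_cache_bytes(batch_size=1, max_new_tokens=2048, n_layers=32, hidden_size=4096)
print(size / 2**30, "GiB")  # 1.0 GiB
```

Under these assumptions the estimate grows linearly with both parameters, so doubling either batch_size or max_new_tokens doubles the reserved memory.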

Benchmarks

Vicuna 7B (LLaMa-2)

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMV
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq-gemv

Batch Size  Prefill Length  Decode Length  Prefill tokens/s  Decode tokens/s  Memory (VRAM)
1           32              32             231.393           153.632          4.66 GB (19.68%)
1           64              64             233.909           154.475          4.66 GB (19.68%)
1           128             128            233.145           152.133          4.66 GB (19.68%)
1           256             256            228.562           147.692          4.67 GB (19.72%)
1           512             512            228.914           139.179          4.80 GB (20.26%)
1           1024            1024           227.393           125.058          5.56 GB (23.48%)
1           2048            2048           225.736           123.228          8.08 GB (34.09%)

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/vicuna-7b-v1.5-awq

Batch Size  Prefill Length  Decode Length  Prefill tokens/s  Decode tokens/s  Memory (VRAM)
1           32              32             521.444           126.51           4.55 GB (19.21%)
1           64              64             2618.88           125.428          4.57 GB (19.31%)
1           128             128            2808.09           123.865          4.61 GB (19.44%)
1           256             256            2807.46           120.779          4.67 GB (19.72%)
1           512             512            2769.9            115.08           4.80 GB (20.26%)
1           1024            1024           2640.95           105.493          5.56 GB (23.48%)
1           2048            2048           2341.36           104.188          8.08 GB (34.09%)
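
One way to read the two Vicuna 7B tables together is to combine the prefill and decode rates into an end-to-end time for a single request. The sketch below does this for the 2048/2048 rows. It assumes that each rate is an average over the whole phase, so that time is simply tokens divided by tokens/s; `total_seconds` is a hypothetical helper, not part of the benchmark script.

```python
def total_seconds(prefill_len: int, decode_len: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Time to process the prompt plus time to generate the output, in seconds."""
    return prefill_len / prefill_tps + decode_len / decode_tps


# Rates taken from the 2048/2048 rows of the Vicuna 7B tables (RTX 3090).
gemv_seconds = total_seconds(2048, 2048, prefill_tps=225.736, decode_tps=123.228)
gemm_seconds = total_seconds(2048, 2048, prefill_tps=2341.36, decode_tps=104.188)

print(f"GEMV: {gemv_seconds:.2f} s")
print(f"GEMM: {gemm_seconds:.2f} s")
```

Under this assumption, GEMV's faster decoding does not make up for its slower prefill at this prompt length, which matches the guidance that GEMM suits larger contexts.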

MPT 7B

  • Note: Blazing fast generation, slow context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq-gemv
  • Version: GEMV

Batch Size  Prefill Length  Decode Length  Prefill tokens/s  Decode tokens/s  Memory (VRAM)
1           32              32             187.332           136.765          3.65 GB (15.42%)
1           64              64             241.026           136.476          3.67 GB (15.48%)
1           128             128            239.44            137.599          3.70 GB (15.61%)
1           256             256            233.184           137.02           3.76 GB (15.88%)
1           512             512            233.082           135.633          3.89 GB (16.41%)
1           1024            1024           231.504           122.197          4.40 GB (18.57%)
1           2048            2048           228.307           121.468          5.92 GB (24.98%)

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Version: GEMM
  • Command: python examples/benchmark.py --model_path casperhansen/mpt-7b-8k-chat-awq

Batch Size  Prefill Length  Decode Length  Prefill tokens/s  Decode tokens/s  Memory (VRAM)
1           32              32             557.714           118.567          3.65 GB (15.42%)
1           64              64             2752.9            120.772          3.67 GB (15.48%)
1           128             128            2982.67           119.52           3.70 GB (15.61%)
1           256             256            3009.16           116.911          3.76 GB (15.88%)
1           512             512            2901.91           111.607          3.95 GB (16.68%)
1           1024            1024           2718.68           102.623          4.40 GB (18.57%)
1           2048            2048           2363.61           101.368          5.92 GB (24.98%)

Falcon 7B

  • Note: Fast generation, fast context processing
  • GPU: NVIDIA GeForce RTX 3090
  • Command: python examples/benchmark.py --model_path casperhansen/falcon-7b-awq --quant_file awq_model_w4_g64.pt
  • Version: GEMM

Batch Size  Prefill Length  Decode Length  Prefill tokens/s  Decode tokens/s  Memory (VRAM)
1           32              32             466.826           95.1413          4.47 GB (18.88%)
1           64              64             1920.61           94.5963          4.48 GB (18.92%)
1           128             128            2406.1            94.793           4.48 GB (18.92%)
1           256             256            2521.08           94.1144          4.48 GB (18.92%)
1           512             512            2478.28           93.4123          4.48 GB (18.92%)
1           1024            1024           2256.22           94.0237          4.69 GB (19.78%)
1           2048            2048           1831.71           94.2032          6.83 GB (28.83%)

Reference

If you find AWQ useful or relevant to your research, you can cite the original AWQ paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
