AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

AutoAWQ

AutoAWQ is an easy-to-use package for 4-bit quantized models. Compared to FP16, AutoAWQ speeds up models by 2x and reduces memory requirements by 3x. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. It builds on and improves the original AWQ work from MIT.
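
As a rough sanity check on the memory figure, the sketch below estimates weight storage only. It assumes a hypothetical 7-billion-parameter model, 4-bit weights, and one FP16 scale plus one 4-bit zero point per group of 128 weights. The function names are illustrative and not part of AutoAWQ. Real savings are likely smaller than this weights-only ratio, because activations, the KV cache, and any layers left in FP16 are not compressed.

```python
def fp16_weight_bytes(n_params: int) -> float:
    """Bytes needed to store n_params weights in FP16 (2 bytes each)."""
    return n_params * 2.0


def int4_weight_bytes(n_params: int, group_size: int = 128) -> float:
    """Approximate bytes for 4-bit weights with per-group metadata.

    Assumption: each group stores one FP16 scale (2 bytes) and one
    4-bit zero point (0.5 bytes).
    """
    n_groups = n_params / group_size
    return n_params * 0.5 + n_groups * (2.0 + 0.5)


n = 7_000_000_000  # hypothetical 7B-parameter model
ratio = fp16_weight_bytes(n) / int4_weight_bytes(n)
print(f"FP16 weights: {fp16_weight_bytes(n) / 1e9:.1f} GB")
print(f"INT4 weights: {int4_weight_bytes(n) / 1e9:.2f} GB")
print(f"Weights-only reduction: {ratio:.2f}x")
```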

Latest News 🔥

  • [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
  • [2023/08] PyPI package released and AutoModel class available

Install

Requirements:

  • Compute Capability 8.0 (sm80). Ampere and later architectures are supported.
  • CUDA Toolkit 11.8 and later.
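
The two requirements above can be checked programmatically. The following is a minimal sketch: `meets_requirements` is a hypothetical helper, not part of AutoAWQ, and the optional check at the bottom assumes PyTorch is installed. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from a separately installed toolkit.

```python
def meets_requirements(capability, cuda_version):
    """Return True if the GPU and CUDA versions satisfy the listed minimums.

    capability: (major, minor) compute capability, e.g. (8, 6)
    cuda_version: CUDA version string, e.g. "11.8"
    """
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return tuple(capability) >= (8, 0) and (major, minor) >= (11, 8)


if __name__ == "__main__":
    try:
        import torch  # optional: only needed for the live check
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available() and torch.version.cuda:
        ok = meets_requirements(
            torch.cuda.get_device_capability(), torch.version.cuda
        )
        print("Requirements met" if ok else "GPU or CUDA version too old")
    else:
        print("No CUDA-enabled PyTorch build found; cannot check")
```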

Install:

  • Use pip to install AutoAWQ:
pip install autoawq

Using conda

CUDA dependencies can be hard to manage. Using conda with AutoAWQ is recommended:

conda create --name autoawq python=3.10 -y
conda activate autoawq
conda install pytorch=2.0.1 torchvision torchaudio cudatoolkit=11.8 -c pytorch -c nvidia
pip install autoawq

Build from source

Build AutoAWQ from source:
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models    Sizes
LLaMA-2   7B/13B/70B
LLaMA     7B/13B/30B/65B
Vicuna    7B/13B
MPT       7B/30B
Falcon    7B/40B
OPT       125m/1.3B/2.7B/6.7B/13B/30B
Bloom     560m/3B/7B
GPTJ      6.7B

Usage

Below, you will find examples of how to easily quantize a model and run inference.

Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
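
To make the `quant_config` fields more concrete, here is a minimal pure-Python sketch of group-wise round-to-nearest quantization with a zero point. It only illustrates what `w_bit`, `q_group_size`, and `zero_point` control. It is not AutoAWQ's implementation, and it deliberately omits the activation-aware scaling that distinguishes AWQ. The function names and toy weights are made up for illustration.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of weights.

    Returns integer codes in [0, 2**w_bit - 1] plus the scale and zero
    point needed to map them back to floats.
    """
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 so a constant group does not divide by zero.
    scale = (w_max - w_min) / q_max or 1.0
    zero = round(-w_min / scale)
    codes = [min(q_max, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize(weights, q_group_size=128, w_bit=4):
    """Quantize a flat list of weights one group at a time."""
    groups = []
    for start in range(0, len(weights), q_group_size):
        group = weights[start:start + q_group_size]
        groups.append(quantize_group(group, w_bit))
    return groups


# Toy example: 8 weights, groups of 4, 4-bit codes.
toy = [-0.8, -0.1, 0.3, 0.7, -0.05, 0.01, 0.02, 0.04]
for codes, scale, zero in quantize(toy, q_group_size=4, w_bit=4):
    restored = dequantize_group(codes, scale, zero)
    print(codes, [round(r, 3) for r in restored])
```

Each group gets its own scale and zero point, so a smaller `q_group_size` tracks the weight distribution more closely at the cost of storing more metadata.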

Inference

Run inference on a quantized model from the Hugging Face Hub:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)

Benchmarks

Benchmark speeds vary from server to server and also depend on your CPU. To minimize latency, rent a GPU/CPU combination with high memory bandwidth on both and high single-core CPU speed.

Model        GPU    FP16 latency (ms)  INT4 latency (ms)  Speedup
LLaMA-2-7B   4090   19.97              8.66               2.31x
LLaMA-2-13B  4090   OOM                13.54              --
Vicuna-7B    4090   19.09              8.61               2.22x
Vicuna-13B   4090   OOM                12.17              --
MPT-7B       4090   17.09              12.58              1.36x
MPT-30B      4090   OOM                23.54              --
Falcon-7B    4090   29.91              19.84              1.51x
LLaMA-2-7B   A6000  27.14              12.44              2.18x
LLaMA-2-13B  A6000  47.28              20.28              2.33x
Vicuna-7B    A6000  26.06              12.43              2.10x
Vicuna-13B   A6000  44.91              17.30              2.60x
MPT-7B       A6000  22.79              16.87              1.35x
MPT-30B      A6000  OOM                31.57              --
Falcon-7B    A6000  39.44              27.34              1.44x

Detailed benchmark (CPU vs. GPU)

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):

  • CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
  • CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):

  • CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
  • CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
  • CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)
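
The benchmark figures are related by simple arithmetic: latency in ms/token is 1000 divided by tokens/s, and the speedup is the FP16 latency divided by the INT4 latency. A small sketch, with helper names that are illustrative only:

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert throughput (tokens/s) into latency (ms/token)."""
    return 1000.0 / tokens_per_second


def speedup(fp16_latency_ms: float, int4_latency_ms: float) -> float:
    """Ratio of FP16 latency to INT4 latency."""
    return fp16_latency_ms / int4_latency_ms


# First Intel i9 13900K run: 134 tokens/s
print(f"{ms_per_token(134):.2f} ms/token")  # 7.46 ms/token

# LLaMA-2-7B on the 4090: 19.97 ms (FP16) vs. 8.66 ms (INT4)
print(f"{speedup(19.97, 8.66):.2f}x")  # 2.31x
```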

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
