
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Project description

AutoAWQ

AutoAWQ is a package that implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ speeds up your LLM by roughly 2x compared to FP16 (see the benchmarks below). AutoAWQ was created as an improved fork of the original AWQ work from MIT.

Roadmap:

  • Publish pip package
  • Refactor quantization code
  • Support more models
  • Optimize the speed of models

Install

Requirements:

  • An NVIDIA GPU with Compute Capability 8.0 (sm80) or higher; Ampere and later architectures are supported.
  • CUDA Toolkit 11.8 or later.

Install:

  • Use pip to install the autoawq package
pip install autoawq

Build from source

Build AutoAWQ from source:
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .

Supported models

The detailed support list:

Models Sizes
LLaMA-2 7B/13B/70B
LLaMA 7B/13B/30B/65B
Vicuna 7B/13B
MPT 7B/30B
Falcon 7B/40B
OPT 125m/1.3B/2.7B/6.7B/13B/30B
Bloom 560m/3B/7B
LLaVA-v0 13B
GPT-J 6.7B

Usage

Below are examples of how to quantize a model and run inference.

Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'lmsys/vicuna-7b-v1.5'
quant_path = 'vicuna-7b-v1.5-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
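
For intuition, the quant_config above asks for 4-bit weights ("w_bit": 4) quantized in groups of 128 values ("q_group_size": 128) with an asymmetric zero point ("zero_point": True). The following is a minimal, hypothetical sketch of what group-wise zero-point quantization does to a single weight group; it only illustrates the storage format and is not AutoAWQ's internal implementation, which additionally applies activation-aware scaling before quantizing.

import torch

def quantize_group(w: torch.Tensor, w_bit: int = 4):
    # Asymmetric (zero-point) quantization of one group, e.g. 128 weights.
    qmax = 2 ** w_bit - 1                                # 15 for 4-bit: integer codes 0..15
    scale = (w.max() - w.min()).clamp(min=1e-5) / qmax   # per-group step size
    zero = torch.round(-w.min() / scale)                 # per-group integer zero point
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return q, scale, zero

# Round-trip check: the dequantized weights approximate the originals.
w = torch.randn(128)
q, scale, zero = quantize_group(w)
w_hat = (q - zero) * scale                               # per-weight error is at most about scale / 2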

Inference

Run inference on a quantized model from the Hugging Face Hub:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "casperhansen/vicuna-7b-v1.5-awq"
quant_file = "awq_model_w4_g128.pt"

model = AutoAWQForCausalLM.from_quantized(quant_path, quant_file)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

model.generate(...)
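
A minimal sketch of a full generation call, assuming model.generate forwards to the underlying transformers generate() method and that the quantized weights are loaded on the GPU; the prompt and max_new_tokens below are illustrative, not part of AutoAWQ's API.

prompt = "What is activation-aware weight quantization?"             # illustrative prompt
tokens = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda") # move inputs to the GPU

# Standard transformers-style generation on the quantized model
output = model.generate(tokens, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))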

Benchmarks

Benchmark speeds vary from server to server and also depend on your CPU. To minimize latency, rent a GPU/CPU combination with high memory bandwidth for both and high single-core CPU speed.

Model GPU FP16 latency (ms) INT4 latency (ms) Speedup
LLaMA-2-7B 4090 19.97 8.66 2.31x
LLaMA-2-13B 4090 OOM 13.54 --
Vicuna-7B 4090 19.09 8.61 2.22x
Vicuna-13B 4090 OOM 12.17 --
MPT-7B 4090 17.09 12.58 1.36x
MPT-30B 4090 OOM 23.54 --
Falcon-7B 4090 29.91 19.84 1.51x
LLaMA-2-7B A6000 27.14 12.44 2.18x
LLaMA-2-13B A6000 47.28 20.28 2.33x
Vicuna-7B A6000 26.06 12.43 2.10x
Vicuna-13B A6000 44.91 17.30 2.60x
MPT-7B A6000 22.79 16.87 1.35x
MPT-30B A6000 OOM 31.57 --
Falcon-7B A6000 39.44 27.34 1.44x
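
The Speedup column is the ratio of the two latencies, and a per-token latency converts to throughput as 1000 / latency_ms. A quick sanity check on the LLaMA-2-7B / 4090 row, using only the numbers from the table above:

fp16_ms, int4_ms = 19.97, 8.66      # LLaMA-2-7B on an RTX 4090, from the table above

speedup = fp16_ms / int4_ms         # ~2.31x, matching the Speedup column
tokens_per_s = 1000 / int4_ms       # ~115 tokens/s, assuming the latency is per token
print(f"{speedup:.2f}x speedup, {tokens_per_s:.0f} tokens/s")
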
Detailed benchmark (CPU vs. GPU)

Here is the difference between a fast and slow CPU on MPT-7B:

RTX 4090 + Intel i9 13900K (2 different VMs):

  • CUDA 12.0, Driver 525.125.06: 134 tokens/s (7.46 ms/token)
  • CUDA 12.0, Driver 525.125.06: 117 tokens/s (8.52 ms/token)

RTX 4090 + AMD EPYC 7-Series (3 different VMs):

  • CUDA 12.2, Driver 535.54.03: 53 tokens/s (18.6 ms/token)
  • CUDA 12.2, Driver 535.54.03: 56 tokens/s (17.71 ms/token)
  • CUDA 12.0, Driver 525.125.06: 55 tokens/s (18.15 ms/token)

Reference

If you find AWQ useful or relevant to your research, you can cite their paper:

@article{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
  journal={arXiv},
  year={2023}
}
