
Project description

AutoGPTQ

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).


English | 中文

News or Update

  • 2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, adding support for the Marlin int4*fp16 matrix-multiplication kernel, enabled with the argument use_marlin=True when loading models.
  • 2023-08-23 - (News) - 🤗 Transformers, optimum, and peft have integrated auto-gptq, so running and training GPTQ models is now accessible to everyone! See this blog and its resources for more details!

For earlier news, please see here

Performance Comparison

Inference Speed

The results are generated using this script with an input batch size of 1, beam-search decoding, and the model forced to generate 512 tokens. The speed metric is tokens/s (higher is better).

The quantized model is loaded with the setup that yields the fastest inference speed.

| model | GPU | num_beams | fp16 | gptq-int4 |
| --- | --- | --- | --- | --- |
| llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 6.83 | 6.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
| gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |
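
For reference, the decoding setup described above roughly corresponds to a generate call like the following sketch. The authoritative configuration lives in the linked script; the model path and prompt are placeholders, and num_beams/min_new_tokens/max_new_tokens are standard transformers generation arguments, not values taken from that script:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# placeholder quantized checkpoint; substitute the model you want to benchmark
model = AutoGPTQForCausalLM.from_quantized("path/to/quantized-model", device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("path/to/quantized-model", use_fast=True)

inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)  # input batch size of 1
output = model.generate(
    **inputs,
    num_beams=4,         # beam-search decoding (the num_beams column in the table)
    min_new_tokens=512,  # force the model to generate exactly 512 tokens...
    max_new_tokens=512,  # ...and no more
)
print(tokenizer.decode(output[0]))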

Perplexity

For a perplexity comparison, please refer to here and here

Installation

AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:

| CUDA/ROCm version | Installation | Built against PyTorch |
| --- | --- | --- |
| CUDA 11.8 | pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ | 2.2.0+cu118 |
| CUDA 12.1 | pip install auto-gptq | 2.2.0+cu121 |
| ROCm 5.7 | pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ | 2.2.0+rocm5.7 |

AutoGPTQ can be installed with the Triton dependency via pip install auto-gptq[triton] in order to use the Triton backend (currently Linux-only; 3-bit quantization is not supported).

For older AutoGPTQ, please refer to the previous releases installation table.

Install from source

Clone the source code:

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ

A few packages are required in order to build from source: pip install numpy gekko pandas.

Then, install locally from source:

pip install -vvv -e .

You can set BUILD_CUDA_EXT=0 to disable building the PyTorch extension, but this is strongly discouraged, as AutoGPTQ then falls back to a slow Python implementation.
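
For example, a sketch of this (discouraged) fallback install, combining the flag above with the editable install command:

BUILD_CUDA_EXT=0 pip install -vvv -e .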

On ROCm systems

To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example:

ROCM_VERSION=5.6 pip install -vvv -e .

Compilation can be sped up by specifying the PYTORCH_ROCM_ARCH variable (reference) in order to build for a single target device, for example gfx90a for MI200-series devices.
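
For example, a sketch combining both variables for an MI200-series (gfx90a) build:

ROCM_VERSION=5.6 PYTORCH_ROCM_ARCH=gfx90a pip install -vvv -e .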

For ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.

Quick Tour

Quantization and Inference

Warning: this is just a showcase of AutoGPTQ's basic APIs. It uses only one sample to quantize a very small model, and the quality of a model quantized with so few samples may not be good.

Below is an example of the simplest way to use auto_gptq to quantize a model and run inference after quantization:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)

# load the un-quantized model; by default, the model is always loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; the examples should be a list of dicts whose only keys are "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, log in first via huggingface-cli login.
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced model quantization features, please refer to this script

Customize Model

Below is an example of extending `auto_gptq` to support the `OPT` model; as you will see, it's very easy:

from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in the transformer layer module
    # normally there are four sub-lists; the modules in each sub-list can be treated as one operation,
    # and the order should match the order in which they are actually executed. In this case (and in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP input projection, MLP output projection
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and the other methods shown in the basic example above.
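
For example, the custom class can be dropped into the same quantization flow shown in the Quick Tour. The following is a minimal sketch; the output directory name is just an example:

from transformers import AutoTokenizer
from auto_gptq import BaseQuantizeConfig

pretrained_model_dir = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]

# use the custom class defined above in place of AutoGPTQForCausalLM
model = OPTGPTQForCausalLM.from_pretrained(
    pretrained_model_dir, BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
)
model.quantize(examples)
model.save_quantized("opt-125m-4bit-custom")  # example output directory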

Evaluation on Downstream Tasks

You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on a specific downstream task before and after quantization.

The predefined tasks support all causal-language-models implemented in 🤗 transformers and in this project.

Below is an example that evaluates `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:

from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


#  model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
        model=model,
        tokenizer=tokenizer,
        classes=LABELS,
        data_name_or_path=DATASET,
        prompt_col_name="prompt",
        label_col_name="label",
        **{
            "num_samples": 1000,  # how many samples will be sampled to evaluation
            "sample_max_len": 1024,  # max tokens for each sample
            "block_max_len": 2048,  # max tokens for each data block
            # function to load the dataset; it must accept only data_name_or_path as input
            # and return datasets.Dataset
            "load_fn": partial(datasets.load_dataset, name="english"),
            # function to preprocess dataset, which is used for datasets.Dataset.map,
            # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
            "preprocess_fn": ds_refactor_fn,
            # truncate the label when a sample's length exceeds sample_max_len
            "truncate_prompt": False
        }
    )

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)

Learn More

Tutorials provide step-by-step guidance for integrating auto_gptq into your own project, along with some best-practice principles.

Examples provide plenty of example scripts for using auto_gptq in different ways.

Supported Models

You can compare model.config.model_type against the list below to check whether the model you use is supported by auto_gptq (a quick way to do this is sketched right after the list).

For example, the model_type of WizardLM, Vicuna, and GPT4All is llama, hence they are all supported by auto_gptq.

All of the model types below support quantization and inference; peft-lora, peft-ada-lora, and peft-adaption_prompt support varies by model type:

  • bloom
  • gpt2
  • gpt_neox (peft integration requires this peft branch)
  • gptj (peft integration requires this peft branch)
  • llama
  • moss (peft integration requires this peft branch)
  • opt
  • gpt_bigcode
  • codegen
  • falcon (RefinedWebModel/RefinedWeb)
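
A quick way to perform this check with the standard transformers API (a minimal sketch; facebook/opt-125m is the model used in the Quick Tour above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/opt-125m")
print(config.model_type)  # -> "opt", which appears in the list above, so auto_gptq supports it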

Supported Evaluation Tasks

Currently, auto_gptq supports: LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more Tasks will come soon!
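
All three are importable as shown in the sketch below; only SequenceClassificationTask is demonstrated in this document, and the other two are assumed to live in the same auto_gptq.eval_tasks module:

from auto_gptq.eval_tasks import (
    LanguageModelingTask,
    SequenceClassificationTask,
    TextSummarizationTask,
)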

Running tests

Tests can be run with:

pytest tests/ -s

FAQ

Which kernel is used by default?

AutoGPTQ defaults to the exllamav2 int4*fp16 kernel for matrix multiplication.
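
For example, a minimal sketch with a placeholder model directory; use_triton is the argument shown in the Quick Tour above:

from auto_gptq import AutoGPTQForCausalLM

# default: the exllamav2 int4*fp16 kernel is used when the CUDA extension is available
model = AutoGPTQForCausalLM.from_quantized("path/to/quantized-model", device="cuda:0")

# opt into the Triton backend instead (Linux only, see Installation)
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model", device="cuda:0", use_triton=True
)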

How to use Marlin kernel?

Marlin is an optimized int4*fp16 kernel recently proposed at https://github.com/IST-DASLab/marlin. It is integrated into AutoGPTQ and used when loading a model with use_marlin=True. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
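
For example, a minimal sketch with a placeholder model directory; it requires an Ampere GPU and a 4-bit GPTQ checkpoint:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model",  # placeholder path to a 4-bit GPTQ model
    device="cuda:0",
    use_marlin=True,            # enable the Marlin int4*fp16 kernel
)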

Acknowledgement

  • Special thanks to Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh for proposing the GPTQ algorithm and open-sourcing the code, and for releasing the Marlin kernel for mixed-precision computation.
  • Special thanks to qwopqwop200; the quantization-related code in this project is mainly referenced from GPTQ-for-LLaMa.
  • Special thanks to turboderp for releasing the Exllama and Exllama v2 libraries with efficient mixed-precision kernels.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

| File | Size | Type |
| --- | --- | --- |
| auto_gptq-0.7.1.tar.gz | 126.1 kB | Source |

Built Distributions

| File | Size | Python | Platform |
| --- | --- | --- | --- |
| auto_gptq-0.7.1-cp311-cp311-win_amd64.whl | 4.6 MB | CPython 3.11 | Windows x86-64 |
| auto_gptq-0.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 23.5 MB | CPython 3.11 | manylinux: glibc 2.17+ x86-64 |
| auto_gptq-0.7.1-cp310-cp310-win_amd64.whl | 4.6 MB | CPython 3.10 | Windows x86-64 |
| auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 23.5 MB | CPython 3.10 | manylinux: glibc 2.17+ x86-64 |
| auto_gptq-0.7.1-cp39-cp39-win_amd64.whl | 4.6 MB | CPython 3.9 | Windows x86-64 |
| auto_gptq-0.7.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 23.5 MB | CPython 3.9 | manylinux: glibc 2.17+ x86-64 |
| auto_gptq-0.7.1-cp38-cp38-win_amd64.whl | 4.6 MB | CPython 3.8 | Windows x86-64 |
| auto_gptq-0.7.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 23.5 MB | CPython 3.8 | manylinux: glibc 2.17+ x86-64 |

File details

The release was uploaded via twine/5.0.0 (CPython 3.9.12); Trusted Publishing was not used.

File hashes

| File | SHA256 | MD5 | BLAKE2b-256 |
| --- | --- | --- | --- |
| auto_gptq-0.7.1.tar.gz | 5c61ad380e9b4c603757c254765e9083a90a820cd0aff1b5d2c6f7fd96c85e80 | 2e2f332fc20e335f5d1a958b21849783 | 90e5b22697903982284fe284568fb2663a2196694a8eee637f5cf4ccfe435a38 |
| auto_gptq-0.7.1-cp311-cp311-win_amd64.whl | 8d9d3fbfff94e129bcde0432fd5f6ba09f26aeceea0890a68c898210ba448135 | 938008d664b615b34dc4c30ea8feea55 | e28e6c90faf9c4058215341523c1fd0e70ac49c504023af2fcd0bd89a825dd72 |
| auto_gptq-0.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | a372d81d523b8d2f98639bbca63c0c427cab0ed55b8a57bb96706b0942aa01c9 | ac80ef8106760fcbe112763667862015 | 97327cc2f3bd401021fe9af085e78c2b8c24d6521bec324918c8c5660c650074 |
| auto_gptq-0.7.1-cp310-cp310-win_amd64.whl | 5d0f3c1b4905e1f14f5b88a33d3dfaf313ebfa138f73c43315f80b40a2d8bb1f | dbf2690d429fe2b53b8c870cd7e6a511 | e64f010daf242f5da59dce5dabf32aaf2a00411f8334e3a83759d0ed798127b3 |
| auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 84efeb9527b2d49f3555b2ba9451018c5ff41722c0b0e2fde74f5183ba0c008d | 6e1e8d5fdf104c2bf2e9f6f0f03659f0 | 5ba60d099a45b2f09dfc7dbd97c97b478d4db488f9168a019aa841b714b82fd7 |
| auto_gptq-0.7.1-cp39-cp39-win_amd64.whl | 57b4b284ca4ff4e060088252cbfb18a6f7332477fe57cf2b7ef9af158cce9730 | a29d4389debe2920f2f6c179b1b843a0 | ad63b17f250d017a31dd2f1d7dc3ade5466c4430fc4eb7a7d57e9bbf641f41b9 |
| auto_gptq-0.7.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 4a1e5972b8ff65956f207908d8789c56fc0d521678ff37bd57e98f392971b3fd | a230f6b5c5ef509df8341a30d252770e | 7da6a4e62a3e834654025347918ca042f7d69eb698609753d054da92f396dc2d |
| auto_gptq-0.7.1-cp38-cp38-win_amd64.whl | d87cd69532bfeb81150a0f8db5ebb5f5f295fa72b5a0403102e791ccac44b933 | 1072cd7c36b68b215cdcbec3f57504b8 | b69118af8fd1d499d745e3c82e77d1dfa5d53f1b8616af5a3c14b8289c879018 |
| auto_gptq-0.7.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | fc69a93ab8b24e7854abbb3adbdb3a8c28c8f4db9c82229341922a4b59e0eff1 | 4107ff86d36183b4e86060abdcb3fe91 | 1672de4c04fb0038681b64c2c6865b8c74ae4afdf2a7a73dab1f0ec7dc4c8214 |

See more details on using hashes here.
