
A general x-bit quantization engine for LLMs (2-8 bits), supporting AWQ/GPTQ/HQQ/VPTQ

Project description

QLLM


Keywords: Quantization, GPTQ, AWQ, HQQ, VPTQ, ONNX, ONNXRuntime, vLLM

Quantize any LLM in HuggingFace/Transformers with GPTQ/AWQ/HQQ/VPTQ in mixed bits (2-8 bit) and export it to an ONNX model.

QLLM is an out-of-the-box quantization toolbox for large language models. It is designed as an auto-quantization framework that processes any LLM layer by layer. It can also export the quantized model to ONNX with a single argument, `--export_onnx ./onnx_model`, for inference with ONNX Runtime. In addition, models quantized with different methods (GPTQ/AWQ/HQQ/VPTQ) can be loaded from huggingface/transformers and converted to one another without extra effort.
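
As a quick start, the same flow is available from Python (a minimal sketch reusing the api_quantize call documented in the "use QLLM with API" section below):

from qllm.auto_model_quantization import AutoModelQuantization

# Quantize Llama-2-7B to 4-bit GPTQ; the quantized model is returned in memory
# and can then be saved or exported via the CLI flags shown below.
quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path='meta-llama/Llama-2-7b-hf',
                                 method='gptq', wbits=4, groupsize=128)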

We already support:

  • GPTQ quantization
  • AWQ quantization
  • HQQ quantization
  • VPTQ quantization

Features:

  • GPTQ supports all LLM models in huggingface/transformers; the model type is detected automatically and quantized layer by layer.
  • Models can be quantized to 2-8 bits, and different layers can use different bit widths.
  • Automatic promotion of bits/group-size for better accuracy.
  • Export to an ONNX model and run inference with ONNX Runtime.

Latest News 🔥

  • [2024/03] ONNX model export API
  • [2024/01] Support for the HQQ algorithm
  • [2023/12] First PyPI package released

Installation

qllm is easy to install from PyPI [cu124]:

pip install qllm

Or install from a release package (CUDA 12.4 is supported, for py310/py311/py312): https://github.com/wejoncy/QLLM/releases

Build from Source

Set the environment variable EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 for a faster build.

If you are using CUDA 12.4:

pip install git+https://github.com/wejoncy/QLLM.git --no-build-isolation

Or for CUDA 11.8/12.1:

git clone https://github.com/wejoncy/QLLM.git
cd QLLM
python setup.py install
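
After building, a quick import check confirms that the package loads (a minimal sketch; the `__version__` attribute is an assumption and may not be present in every release):

import qllm

# Falls back to a plain message if the package does not expose __version__.
print(getattr(qllm, "__version__", "qllm imported OK"))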

How to use it

Quantize llama2

#  Quantize and Save compressed model, method can be one of [gptq/awq/hqq]
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --nsamples=64 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --dataset=pileval --nsamples=16 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=hqq --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit

Convert to an ONNX model

Use `--export_onnx ./onnx_model` to export and save the ONNX model:

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16  --save ./Llama-2-7b-chat-hf_awq_q4/ --export_onnx ./Llama-2-7b-chat-hf_awq_q4_onnx/

Or you can convert an existing model from the HF Hub:

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --export_onnx=./onnx
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --export_onnx=./onnx

(NEW) Quantize a model with mixed bits/group-size for higher accuracy (lower PPL)

#  Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential

NOTE:

  1. Only GPTQ is supported.
  2. The allow_mix_bits option is adopted from gptq-for-llama; QLLM makes it easier to use and more flexible.
  3. The difference from gptq-for-llama is that we grow the bit width by one instead of doubling it.
  4. All configurations are saved/loaded automatically instead of using the quant-table that gptq-for-llama relies on.
  5. If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.

Quantize model for vLLM

Due to the zeros difference, you need to set an environment variable if you set pack_mode to GPTQ, whether the method is awq or gptq:

COMPATIBLE_WITH_AUTOGPTQ=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ

If you use the GEMM pack_mode, you don't have to set the variable:

python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM

Conversion among AWQ, GPTQ and Marlin

Use --pack_mode=GPTQ to convert AWQ to GPTQ:

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ

Or you can use --pack_mode=GEMM to convert GPTQ to AWQ.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM

Or you can use --pack_mode=MARLIN to convert GPTQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Or you can use --pack_mode=MARLIN to convert AWQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Note:

Not all cases are supported. For example:

  1. If you quantized the model with different quantization bits for different layers, you can't convert it to AWQ.
  2. If a GPTQ model is quantized with the --allow_mix_bits option, you can't convert it to AWQ.
  3. If a GPTQ model is quantized with the --act_order option, you can't convert it to AWQ.

model inference with the saved model

python -m qllm --load ./Llama-2-7b-4bit --eval

model inference with ORT

You may want to use onnxruntime-genai to run generation with ORT. Alternatively, you can drive the exported model directly with an ONNX Runtime session:

import numpy as np
import onnxruntime
from transformers import AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'

tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
input_ids = sample_inputs["input_ids"]
attention_mask = sample_inputs.get("attention_mask")

onnx_model_path = onnx_path_str + '/decoder_merged.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

mask = (np.ones(input_ids.shape, dtype=np.int64) if attention_mask is None
        else attention_mask.cpu().numpy().astype(np.int64))
num_layers = 32  # Llama-2-7B has 32 decoder layers
inputs = {'input_ids': input_ids.cpu().numpy(),
          'attention_mask': mask,
          'position_ids': np.arange(input_ids.shape[1], dtype=np.int64).reshape(1, -1),
          'use_cache_branch': np.array([0], dtype=np.bool_)}
# Dummy KV-cache inputs; they are ignored when use_cache_branch is 0.
for i in range(num_layers):
    inputs[f'past_key_values.{i}.key'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'past_key_values.{i}.value'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
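
From here you can take a single greedy decoding step (a minimal sketch; it assumes the first output of decoder_merged.onnx is the logits tensor of shape [batch, seq_len, vocab_size], which may differ between exports):

# Pick the most likely next token from the logits of the last position.
logits = outputs[0]
next_token_id = int(np.argmax(logits[0, -1, :]))
print(tokenizer.decode([next_token_id]))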

Load a quantized model from huggingface/transformers

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
python -m qllm --load TheBloke/Mixtral-8x7B-v0.1-GPTQ  --use_plugin

start a chatbot

you may need to install fschat and accelerate with pip

pip install fschat accelerate

use --use_plugin to enable a chatbot plugin

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=awq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/

or

python -m qllm --model meta-llama/Llama-2-7b-chat-hf --quant_method=gptq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/

use QLLM with API

from qllm.auto_model_quantization import AutoModelQuantization

quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path='meta-llama/Llama-2-7b-hf', method='gptq', wbits=4, groupsize=128)

OR

import torch
from qllm.auto_model_quantization import AutoModelQuantization
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path=model, method='gptq', wbits=4, groupsize=128)
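
As a quick check of the quantized model (a sketch, assuming the returned q_model is still a regular transformers causal-LM that supports generate):

prompt = tokenizer("Hello, my dog is cute", return_tensors="pt").to(q_model.device)
with torch.no_grad():
    output_ids = q_model.generate(**prompt, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))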

For users who have connection issues reaching huggingface/transformers:

Set the environment variable PROXY_PORT to your HTTP proxy port.

PowerShell: $env:PROXY_PORT=1080

Bash: export PROXY_PORT=1080

Windows cmd: set PROXY_PORT=1080

Acknowledgements

  • GPTQ
  • GPTQ-triton
  • AutoGPTQ
  • llm-awq
  • AutoAWQ
  • HQQ
  • VPTQ

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

qllm-0.2.3-cp313-cp313-win_amd64.whl (1.6 MB)

Uploaded: CPython 3.13, Windows x86-64

qllm-0.2.3-cp313-cp313-manylinux2014_x86_64.whl (1.6 MB)

Uploaded: CPython 3.13

qllm-0.2.3-cp312-cp312-win_amd64.whl (1.6 MB)

Uploaded: CPython 3.12, Windows x86-64

qllm-0.2.3-cp312-cp312-manylinux2014_x86_64.whl (1.6 MB)

Uploaded: CPython 3.12

qllm-0.2.3-cp311-cp311-win_amd64.whl (1.6 MB)

Uploaded: CPython 3.11, Windows x86-64

qllm-0.2.3-cp311-cp311-manylinux2014_x86_64.whl (1.6 MB)

Uploaded: CPython 3.11

File details

Details for the file qllm-0.2.3-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.3-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for qllm-0.2.3-cp313-cp313-win_amd64.whl:

  • SHA256: 33a67ad7650a06b4faf1a0c5f597de8a85197c44c1a03c4034914c4b07aa37b9
  • MD5: 193d3c53fde889fefd8cf5d0e73e4cec
  • BLAKE2b-256: 0c34fd8269113bd1c07d09fd1e2b3921c1f42bde5e09593babf93f9c2672da86

Provenance

The following attestation bundles were made for qllm-0.2.3-cp313-cp313-win_amd64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3-cp313-cp313-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.3-cp313-cp313-manylinux2014_x86_64.whl:

  • SHA256: 456337fc02b496356897c23caa4d265abada3df08df4a56449aba767216af9e9
  • MD5: 9df3429ee1af2558b19798768d4a48e5
  • BLAKE2b-256: 1c3812a299f5b7faac6754912bf775288fbd07c705ed12acff83d5b96bd25594

Provenance

The following attestation bundles were made for qllm-0.2.3-cp313-cp313-manylinux2014_x86_64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.3-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for qllm-0.2.3-cp312-cp312-win_amd64.whl:

  • SHA256: 142b8366e5bbc76d4965e050f34c422b22716824b20b99d0337df97dca20f3fc
  • MD5: f5aba1514b0226f6465193e4d4468c1f
  • BLAKE2b-256: 4357be4b1b929862352cc75e04742a6dc25d08d408532aa4760aae9e419da7c3

Provenance

The following attestation bundles were made for qllm-0.2.3-cp312-cp312-win_amd64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.3-cp312-cp312-manylinux2014_x86_64.whl:

  • SHA256: 5a11b586e1cdfd8a60ba44f2bde86174237ec963911db2177aaa88335d2235ff
  • MD5: 84ea8e0d3fb0de3eaaa7bdfe5e85172c
  • BLAKE2b-256: 71c544288f9558db1c3caf42e4975ecd95d28b061a50407f2b78c6644815fa67

Provenance

The following attestation bundles were made for qllm-0.2.3-cp312-cp312-manylinux2014_x86_64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.3-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for qllm-0.2.3-cp311-cp311-win_amd64.whl:

  • SHA256: fda8cecd930f345eb13bf2e67b42d7473e5b8afa1a35f0403319088243c05ea9
  • MD5: fc291b01dd18b808bcf86091317a404f
  • BLAKE2b-256: 137889cbeb51bbbffc148e00af3f9aa062a59c7db5918d72ff437458bad80d42

Provenance

The following attestation bundles were made for qllm-0.2.3-cp311-cp311-win_amd64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.3-cp311-cp311-manylinux2014_x86_64.whl:

  • SHA256: 496bea733de05f83ad5d6874fecf5924e2ec4342065ea19cd74f6aa3d9144c0c
  • MD5: 36e70067aa1ec1c926d28c1c1ea35be8
  • BLAKE2b-256: 25d69987f7456122dc6a06ecc093cb392117bf64e4b5b2b16d054a840d507a07

Provenance

The following attestation bundles were made for qllm-0.2.3-cp311-cp311-manylinux2014_x86_64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
