
A general x-bit quantization engine for LLMs (2-8 bits): AWQ, GPTQ, HQQ, and VPTQ

Project description

QLLM


Keywords: Quantization, GPTQ, AWQ, HQQ, VPTQ, ONNX, ONNXRuntime, vLLM

Quantize any LLM in HuggingFace/Transformers with GPTQ/AWQ/HQQ/VPTQ in mixed bits (2-8 bit), and export it to an ONNX model.

QLLM is an out-of-the-box quantization toolbox for large language models. It is designed as an auto-quantization framework that processes any LLM layer by layer. It can also export the quantized model to ONNX with a single argument, `--export_onnx ./onnx_model`, and run inference with ONNX Runtime. In addition, models quantized with different methods (GPTQ/AWQ/HQQ/VPTQ) can be loaded from huggingface/transformers and converted to one another without extra effort.

We already support

  • GPTQ quantization
  • AWQ quantization
  • HQQ quantization
  • VPTQ quantization

Features:

  • GPTQ supports all LLM models in huggingface/transformers; it automatically detects the model type and quantizes it.
  • Models can be quantized with 2-8 bits, including different bit widths for different layers.
  • Automatic promotion of bits/group-size for better accuracy.
  • Export to an ONNX model and run inference with ONNX Runtime.

Latest News 🔥

  • [2026/03] CUDA 13.0 support, PyTorch 2.10, Python 3.11-3.13
  • [2026/03] Support H100/H200 (sm_90), B200/B300 (sm_100), RTX 5090 (sm_120)
  • [2024/03] ONNX Models export API
  • [2024/01] Support HQQ algorithm
  • [2023/12] The first PyPi package released

Installation

qllm is easy to install from PyPI:

pip install qllm

Or install from a release package (CUDA 13.0; Python 3.11-3.13): https://github.com/wejoncy/QLLM/releases
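
For example, you can point pip directly at a wheel from the releases page (the release tag below is a placeholder; pick the wheel that matches your Python version and platform):

pip install https://github.com/wejoncy/QLLM/releases/download/<TAG>/qllm-0.2.3.1-cp312-cp312-manylinux2014_x86_64.whl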

Build from Source

Please set the environment variable EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 for a faster build:

pip install git+https://github.com/wejoncy/QLLM.git --no-build-isolation
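
For example, setting the variable and installing in one line (bash):

EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 pip install git+https://github.com/wejoncy/QLLM.git --no-build-isolation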

How to use it

Quantize llama2

#  Quantize and Save compressed model, method can be one of [gptq/awq/hqq]
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --nsamples=64 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --dataset=pileval --nsamples=16 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=hqq --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit

Convert to onnx model

Use `--export_onnx ./onnx_model` to export and save an ONNX model:

python -m qllm --model meta-llama/Llama-2-7b-chat-hf --quant_method=gptq --dataset=pileval --nsamples=16 --save ./Llama-2-7b-chat-hf_gptq_q4/ --export_onnx ./Llama-2-7b-chat-hf_gptq_q4_onnx/

Or you can convert an existing model from the HF Hub:

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --export_onnx=./onnx
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --export_onnx=./onnx

(NEW) Quantize a model with mixed bits/group-size for higher precision (lower PPL)

#  Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential

NOTE:

  1. Only GPTQ is supported.
  2. The allow_mix_bits option is borrowed from gptq-for-llama; QLLM makes it easier to use and more flexible.
  3. The difference from gptq-for-llama is that we grow the bit width by one instead of doubling it.
  4. All configurations are saved/loaded automatically instead of using the quant-table that gptq-for-llama relies on.
  5. If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.

Quantize model for vLLM

Due to a difference in how zero points are handled, you need to set an environment variable when pack_mode is GPTQ, whether the quantization method is awq or gptq:

COMPATIBLE_WITH_AUTOGPTQ=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ

If you use the GEMM pack_mode, you don't need to set the variable:

python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM
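
Once saved with a vLLM-compatible pack_mode, the directory can be loaded directly in vLLM. A minimal sketch (assuming a recent vLLM build with GPTQ/AWQ kernels; vLLM normally auto-detects the quantization scheme from the saved config):

from vllm import LLM, SamplingParams

# Load the quantized checkpoint saved by QLLM above.
llm = LLM(model="./Llama-2-7b-4bit")
outputs = llm.generate(["Hello, my dog is cute"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)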

Conversion among GPTQ, AWQ and Marlin

Use --pack_mode=GPTQ to convert AWQ to GPTQ:

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ

Or you can use --pack_mode=GEMM to convert GPTQ to AWQ.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM

Or you can use --pack_mode=MARLIN to convert GPTQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Or you can use --pack_mode=MARLIN to convert AWQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Note:

Not all conversions are supported. For example:

  1. If the model was quantized with different bits for different layers, you can't convert it to AWQ.
  2. If a GPTQ model was quantized with the --allow_mix_bits option, you can't convert it to AWQ.
  3. If a GPTQ model was quantized with the --act_order option, you can't convert it to AWQ.

Model inference with the saved model

python -m qllm --load ./Llama-2-7b-4bit --eval

Model inference with ONNX Runtime (ORT)

You may want to use onnxruntime-genai (GenAI) for full generation with ORT; the snippet below runs a single forward pass with a plain onnxruntime session.

import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'

tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
input_ids = sample_inputs["input_ids"].astype(np.int64)
attention_mask = sample_inputs.get("attention_mask")
mask = (np.ones(input_ids.shape, dtype=np.int64) if attention_mask is None
        else attention_mask.astype(np.int64))

onnx_model_path = onnx_path_str + '/decoder_merged.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

# Assumes config.json was exported next to the ONNX model.
num_layers = AutoConfig.from_pretrained(onnx_path_str).num_hidden_layers
inputs = {'input_ids': input_ids,
          'attention_mask': mask,
          'position_ids': np.arange(input_ids.shape[1], dtype=np.int64).reshape(1, -1),
          'use_cache_branch': np.array([0], dtype=np.bool_)}
# Dummy (zeroed) KV cache as in the original example; layout (batch, num_kv_heads, past_seq_len, head_dim).
for i in range(num_layers):
    inputs[f'past_key_values.{i}.key'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'past_key_values.{i}.value'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
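
As a quick sanity check, you can greedily decode the next token from the returned logits. A minimal sketch, assuming the first output of the merged decoder is the logits tensor with shape (batch, seq_len, vocab_size):

# Continuing from the session.run(...) call above.
logits = outputs[0]                              # assumed to be (batch, seq_len, vocab_size)
next_token_id = int(np.argmax(logits[0, -1, :]))  # greedy pick for the last position
print(tokenizer.decode([next_token_id]))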

Load a quantized model from huggingface/transformers

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
python -m qllm --load TheBloke/Mixtral-8x7B-v0.1-GPTQ  --use_plugin

Start a chatbot

You may need to install fschat and accelerate with pip:

pip install fschat accelerate

Use --use_plugin to enable the chatbot plugin:

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=awq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/

or 
python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/

Use QLLM via the Python API

from qllm.auto_model_quantization import AutoModelQuantization

quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path='meta-llama/Llama-2-7b-hf', method='gptq', wbits=4, groupsize=128)

OR

import torch
from qllm.auto_model_quantization import AutoModelQuantization
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)

quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path=model, method='gptq', wbits=4, groupsize=128)
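
After quantization, you can run a quick generation smoke test through the standard transformers API. A minimal sketch, assuming api_quantize returns a ready-to-run transformers causal LM and that the inputs are moved to the model's device:

# Hypothetical smoke test; q_model is assumed to behave like a regular transformers CausalLM.
prompt = "Hello, my dog is cute"
gen_inputs = tokenizer(prompt, return_tensors="pt").to(q_model.device)
output_ids = q_model.generate(**gen_inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))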

If you have network issues connecting to huggingface/transformers, set the environment variable PROXY_PORT to your HTTP proxy port:

PowerShell: $env:PROXY_PORT=1080

Bash: export PROXY_PORT=1080

Windows cmd: set PROXY_PORT=1080

Acknowledgements

  • GPTQ
  • GPTQ-triton
  • AutoGPTQ
  • llm-awq
  • AutoAWQ
  • HQQ
  • VPTQ

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • qllm-0.2.3.1-cp313-cp313-win_amd64.whl (1.6 MB): CPython 3.13, Windows x86-64
  • qllm-0.2.3.1-cp313-cp313-manylinux2014_x86_64.whl (1.6 MB): CPython 3.13, manylinux2014 x86-64
  • qllm-0.2.3.1-cp312-cp312-win_amd64.whl (1.6 MB): CPython 3.12, Windows x86-64
  • qllm-0.2.3.1-cp312-cp312-manylinux2014_x86_64.whl (1.6 MB): CPython 3.12, manylinux2014 x86-64
  • qllm-0.2.3.1-cp311-cp311-win_amd64.whl (1.6 MB): CPython 3.11, Windows x86-64
  • qllm-0.2.3.1-cp311-cp311-manylinux2014_x86_64.whl (1.6 MB): CPython 3.11, manylinux2014 x86-64

File details

Details for the file qllm-0.2.3.1-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.3.1-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for qllm-0.2.3.1-cp313-cp313-win_amd64.whl
  • SHA256: 0352e43602b6f225d8671107298af42dd7c0855acd8577e363e53bbd3dadabd1
  • MD5: 64f76a14b1931f7dc0e29d6e487e0e56
  • BLAKE2b-256: 840461e31b3af8ccfc1c9b13f8c8fd3e92901497f2f6d51a85c861a92c5ed432


Provenance

The following attestation bundles were made for qllm-0.2.3.1-cp313-cp313-win_amd64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3.1-cp313-cp313-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.3.1-cp313-cp313-manylinux2014_x86_64.whl
  • SHA256: 37e7de689f804f0a0e0b6eb9750582d79667aa82839a89454e9117392e009819
  • MD5: 22c8c9fab479ecb5f1681e2028625c25
  • BLAKE2b-256: 63569f2f8c098b26e9bcf2057fa94976eab6d9e2d30bca6d42ec3470fdedcc91


Provenance

The following attestation bundles were made for qllm-0.2.3.1-cp313-cp313-manylinux2014_x86_64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3.1-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.3.1-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for qllm-0.2.3.1-cp312-cp312-win_amd64.whl
  • SHA256: a74abc8b346a4d310111e44e7025945791ad0583d51c629c3945936debd6d6ff
  • MD5: f6110b1b2a631f8fa4c8fff2198f24cb
  • BLAKE2b-256: 2203dbb024bad8c6919612220dbfc0da46d51db8f61276dd8dcaf37c8ed0a8c2


Provenance

The following attestation bundles were made for qllm-0.2.3.1-cp312-cp312-win_amd64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3.1-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.3.1-cp312-cp312-manylinux2014_x86_64.whl
  • SHA256: 18ce1acdd70bbeb3a00685698e8fb8f7dea384a68d77ac32f0c8eb2bc3f6ec2a
  • MD5: 429f4310540c5257d7a6cb8bd0bec9e0
  • BLAKE2b-256: 7dae94111afd5d9d65dc05c4266cbb00e24fefcc6b6b415f66e50bbf92086eba


Provenance

The following attestation bundles were made for qllm-0.2.3.1-cp312-cp312-manylinux2014_x86_64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.3.1-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 1.6 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for qllm-0.2.3.1-cp311-cp311-win_amd64.whl
  • SHA256: f2b93861fa3b914fd286bc4d1218024a99d04fbdd7873aed9ddd9e37c1004de9
  • MD5: 83ea7b317f554a584c0d13df68d7f248
  • BLAKE2b-256: 881bf6bfbc85c26c7552670fce68d6fda0f78550d1856c22d03b8ef864ccd87c


Provenance

The following attestation bundles were made for qllm-0.2.3.1-cp311-cp311-win_amd64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file qllm-0.2.3.1-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.3.1-cp311-cp311-manylinux2014_x86_64.whl
  • SHA256: a0e8797ea8778a0c1e6c8328c3f23af9af6c027b1b2522be599d425cf6aaad8f
  • MD5: 2d3a7116bb968d1d47fd6e67f5926e31
  • BLAKE2b-256: 62b738e2154690a808cc2c978c62301d0bba8ddf3a9d8873164e27b319f781b2


Provenance

The following attestation bundles were made for qllm-0.2.3.1-cp311-cp311-manylinux2014_x86_64.whl:

Publisher: deploy.yml on wejoncy/QLLM

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
