
A general x-bit quantization engine for LLMs (2-8 bits), supporting AWQ/GPTQ/HQQ

Project description

QLLM


Keywords: Quantization, GPTQ, AWQ, HQQ, ONNX, ONNXRuntime, vLLM

Quantize any LLM in HuggingFace/Transformers with GPTQ/AWQ/HQQ in mixed bits (2-8 bit), and export it to an ONNX model.

QLLM is an out-of-the-box quantization toolbox for large language models. It is designed as an auto-quantization framework that processes any LLM layer by layer. It can also export a quantized model to ONNX with a single argument, `--export_onnx ./onnx_model`, and run inference with ONNXRuntime. In addition, models quantized with different methods (GPTQ/AWQ/HQQ) can be loaded from huggingface/transformers and converted to one another without extra effort.

We already support:

  • GPTQ quantization
  • AWQ quantization
  • HQQ quantization

Features:

  • GPTQ quantization supports all LLM models in huggingface/transformers; the model type is detected automatically and quantized.
  • Quantization to 2-8 bits, including different bit widths for different layers (mixed bits).
  • Automatic promotion of bits/group size for better accuracy.
  • Export to an ONNX model and inference with ONNXRuntime.

Latest News 🔥

  • [2024/03] ONNX model export API
  • [2024/01] Support for the HQQ algorithm
  • [2023/12] First PyPI package released

Installation

Installing qllm from PyPI is easy (wheels built with CUDA 12.1):

pip install qllm

Or install from a release package (CUDA 11.8/12.1, Python 3.8/3.9/3.10): https://github.com/wejoncy/QLLM/releases

Build from Source

Set the environment variable EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 for a faster build.

If you are using CUDA 12.1:

pip install git+https://github.com/wejoncy/QLLM.git --no-build-isolation

Or with CUDA 11.8/11.7:

git clone https://github.com/wejoncy/QLLM.git
cd QLLM
python setup.py install

How to use it

Quantize llama2

#  Quantize and Save compressed model, method can be one of [gptq/awq/hqq]
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --nsamples=64 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --dataset=pileval --nsamples=16 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=hqq --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit

Convert to an ONNX model

Use `--export_onnx ./onnx_model` to export and save the ONNX model:

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16  --save ./Llama-2-7b-chat-hf_awq_q4/ --export_onnx ./Llama-2-7b-chat-hf_awq_q4_onnx/

Or you can convert an existing model from the HF Hub:

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --export_onnx=./onnx
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --export_onnx=./onnx
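To sanity-check an export, you can open the graph with ONNXRuntime and list its input/output names. This is a minimal sketch; it assumes the exported file is named decoder_merged.onnx inside the output folder, as in the inference example further below.

import onnxruntime as ort

# Load the exported decoder on CPU just to inspect its signature.
sess = ort.InferenceSession('./onnx/decoder_merged.onnx', providers=['CPUExecutionProvider'])
print([i.name for i in sess.get_inputs()])   # e.g. input_ids, attention_mask, position_ids, past_key_values.*
print([o.name for o in sess.get_outputs()])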

(NEW) Quantize a model with mixed bits/group size for better precision (PPL)

#  Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential

NOTE:

  1. Only GPTQ is supported.
  2. The allow_mix_bits option is adapted from gptq-for-llama; QLLM makes it easier to use and more flexible.
  3. Unlike gptq-for-llama, we grow the bit width by one instead of doubling it.
  4. All configurations are saved/loaded automatically, instead of the quant-table used by gptq-for-llama.
  5. If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.

Quantize model for vLLM

Due to a difference in how the zero points are stored, you need to set an environment variable if you set pack_mode to GPTQ, whether the method is AWQ or GPTQ:

COMPATIBLE_WITH_AUTOGPTQ=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ

If you use the GEMM pack_mode, you don't have to set the variable:

python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM
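The saved folder can then be served with vLLM like any other GPTQ checkpoint. A minimal sketch, assuming vLLM is installed and the model was packed with --pack_mode=GPTQ as above (vLLM may also infer the quantization scheme from the saved quant config):

from vllm import LLM, SamplingParams

# Load the quantized checkpoint; vLLM picks its GPTQ kernels based on the quant config.
llm = LLM(model="./Llama-2-7b-4bit", quantization="gptq")
outputs = llm.generate(["Hello, my dog is cute"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)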

Conversion among AWQ, GPTQ and Marlin

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ

Or you can use --pack_mode=GEMM to convert GPTQ to AWQ (GEMM packing).

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM

Or you can use --pack_mode=MARLIN to convert GPTQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Or you can use --pack_mode=MARLIN to convert AWQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Note:

Not all cases are supported. For example:

  1. If you quantized the model with different bits for different layers, you can't convert it to AWQ.
  2. If the GPTQ model was quantized with the --allow_mix_bits option, you can't convert it to AWQ.
  3. If the GPTQ model was quantized with the --act_order option, you can't convert it to AWQ.

Model inference with the saved model

python -m qllm --load ./Llama-2-7b-4bit --eval

Model inference with ORT

You may want to use onnxruntime-genai for generation with ORT. The example below runs the exported decoder directly with ONNXRuntime:

import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'

tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
config = AutoConfig.from_pretrained(onnx_path_str)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="np")

onnx_model_path = onnx_path_str + '/decoder_merged.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

input_ids = sample_inputs['input_ids'].astype(np.int64)
attention_mask = sample_inputs['attention_mask'].astype(np.int64)
seq_len = input_ids.shape[1]

inputs = {'input_ids': input_ids,
          'attention_mask': attention_mask,
          'position_ids': np.arange(seq_len, dtype=np.int64)[None, :],
          'use_cache_branch': np.array([0], dtype=np.bool_)}
# Placeholder past key/value tensors, shaped (batch, num_heads, past_len, head_dim) for Llama-2-7B.
num_layers = config.num_hidden_layers
for i in range(num_layers):
    inputs[f'past_key_values.{i}.key'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'past_key_values.{i}.value'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
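A quick greedy pick of the next token from the run above, assuming the first graph output is the logits tensor (an assumption about the exported graph, not stated in the example):

logits = outputs[0]                            # assumed shape: (batch, seq_len, vocab_size)
next_token_id = int(np.argmax(logits[0, -1]))  # greedy choice of the next token
print(tokenizer.decode([next_token_id]))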

Load a quantized model from huggingface/transformers

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
python -m qllm --load TheBloke/Mixtral-8x7B-v0.1-GPTQ  --use_plugin

Start a chatbot

You may need to install fschat and accelerate with pip:

pip install fschat accelerate

Use --use_plugin to enable the chatbot plugin:

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=awq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/

or 
python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/

Use QLLM via the API

from qllm import AutoModelQuantization

quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path='meta-llama/Llama-2-7b-hf', method='gptq', wbits=4, groupsize=128)

OR

import torch
from qllm import AutoModelQuantization
from transformers import AutoTokenizer, AutoModelForCausalLM

quantizer = AutoModelQuantization()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
q_model = quantizer.api_quantize(model_or_model_path=model, method='gptq', wbits=4, groupsize=128)
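As a quick, unofficial check (assuming q_model behaves like a regular transformers causal-LM module and is already on the right device), you can run generation with the same tokenizer:

# Assumption: q_model exposes the standard transformers generate() API.
prompt = tokenizer("Hello, my dog is cute", return_tensors="pt").to(q_model.device)
out = q_model.generate(**prompt, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))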

If you have connection issues when downloading models from huggingface/transformers, set the environment variable PROXY_PORT to your HTTP proxy port:

  • PowerShell: $env:PROXY_PORT=1080
  • Bash: export PROXY_PORT=1080
  • Windows cmd: set PROXY_PORT=1080

Acknowledgements

GPTQ

GPTQ-triton

AutoGPTQ

llm-awq

AutoAWQ

HQQ



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

qllm-0.2.1-cp312-cp312-win_amd64.whl (963.6 kB)
Uploaded: CPython 3.12, Windows x86-64

qllm-0.2.1-cp312-cp312-manylinux2014_x86_64.whl (933.2 kB)
Uploaded: CPython 3.12

qllm-0.2.1-cp311-cp311-win_amd64.whl (963.4 kB)
Uploaded: CPython 3.11, Windows x86-64

qllm-0.2.1-cp311-cp311-manylinux2014_x86_64.whl (932.9 kB)
Uploaded: CPython 3.11

qllm-0.2.1-cp310-cp310-win_amd64.whl (960.9 kB)
Uploaded: CPython 3.10, Windows x86-64

qllm-0.2.1-cp310-cp310-manylinux2014_x86_64.whl (931.2 kB)
Uploaded: CPython 3.10

File details

Details for the file qllm-0.2.1-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.1-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 963.6 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.17

File hashes

Hashes for qllm-0.2.1-cp312-cp312-win_amd64.whl
  • SHA256: 86a3f0c023a49bab49f77e1e416d2a6bbf02d012db88d729e24f7e12d2212abc
  • MD5: 2e220bf268db87c48a38309a21317f31
  • BLAKE2b-256: a60c6a4c903fb6597013633c68214ddda7cae10bbb5a67dbda636a4f4de5b19e


File details

Details for the file qllm-0.2.1-cp312-cp312-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.1-cp312-cp312-manylinux2014_x86_64.whl
  • SHA256: 73b013bf10443ac826587ddb451ec8c1af7b52a6f33cb75a9f90a899f0e14483
  • MD5: 6ee52bd6868d54e6cd455e87479d8fea
  • BLAKE2b-256: 6eeaa8cb6237075c16bd5b7843c5e578bf1a7761829a8b018bf3ad07b44e25bc


File details

Details for the file qllm-0.2.1-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.1-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 963.4 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.17

File hashes

Hashes for qllm-0.2.1-cp311-cp311-win_amd64.whl
  • SHA256: 7ba38a2ee47e142c7162ddda2c6172bf58c50fe8a17130046a4fd0195d57f9f4
  • MD5: f96fb53aac088185e5f12c6f9dccb17b
  • BLAKE2b-256: bb4e380e3105349891f8f13d6ae9d9c368634272cef9198236ef6b6835814b59


File details

Details for the file qllm-0.2.1-cp311-cp311-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.1-cp311-cp311-manylinux2014_x86_64.whl
  • SHA256: 5a6b78a293df5a9e0e2b13d8937d580b3518ad5bb62e4b765a7447877c2b5176
  • MD5: b20075621a02f6d11a46f751e87d80e0
  • BLAKE2b-256: 29c6434d6c79dc44fa6a1763edb66d086513d9279db1977bad036d719baef683


File details

Details for the file qllm-0.2.1-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: qllm-0.2.1-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 960.9 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.17

File hashes

Hashes for qllm-0.2.1-cp310-cp310-win_amd64.whl
  • SHA256: eaa499c9cf6c9669a25efcd796b527d4513690e63bdd43786b0b793bddab9e17
  • MD5: 1f4b16490e726ca3cf9a8973884dfae1
  • BLAKE2b-256: 5a6068ea098fbac69625d71748d83af8cd73edd0747841a92eef2e5300d9c241


File details

Details for the file qllm-0.2.1-cp310-cp310-manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for qllm-0.2.1-cp310-cp310-manylinux2014_x86_64.whl
  • SHA256: a784e8da52974e8bd6893aa6e03a707a5fad16168243ba2d7d01f4fe8b11ee7b
  • MD5: ebd4ebd35fa42e0ac693e912d5c3020e
  • BLAKE2b-256: 8946adf5e43674a6792f8d8475bc1cbb4cc209b586c50853b241639955065d1b

