A general x-bit quantization engine for LLMs (2-8 bits), supporting AWQ/GPTQ/HQQ

Project description

QLLM

Keywords: Quantization, GPTQ, AWQ, HQQ, ONNX, ONNXRuntime, vLLM

Quantize any LLM in HuggingFace/Transformers with GPTQ/AWQ/HQQ in mixed bits (2-8 bit), and export it to an ONNX model.

QLLM is an out-of-the-box quantization toolbox for large language models. It is designed as an auto-quantization framework that processes any LLM layer by layer. It can also export a quantized model to ONNX with a single argument, `--export_onnx ./onnx_model`, and run inference with ONNX Runtime. In addition, models quantized with different methods (GPTQ/AWQ/HQQ) can be loaded from huggingface/transformers and converted to one another without extra effort.

We already support:

  • GPTQ quantization
  • AWQ quantization
  • HQQ quantization

Features:

  • GPTQ supports all LLM models in huggingface/transformers; the model type is detected automatically and quantized.
  • Quantization to 2-8 bits, including different quantization bits for different layers.
  • Automatic promotion of bits/group size for better accuracy.
  • Export to an ONNX model and run it with ONNX Runtime.

Latest News 🔥

  • [2024/03] ONNX Models export API
  • [2024/01] Support HQQ algorithm
  • [2023/12] The first PyPi package released

Installation

qllm is easy to install from PyPI [cu121]:

pip install qllm

Or install from a release package; CUDA-118/121 is supported [py38, py39, py310]: https://github.com/wejoncy/QLLM/releases
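
For example, after downloading the wheel that matches your Python and CUDA version from the releases page (the filename below is illustrative only, not an exact release asset name):

pip install ./qllm-0.2.0+cu121-cp310-cp310-linux_x86_64.whl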

Build from Source

Please set the environment variable EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 for a fast build (see the example after the build commands below).

If you are using CUDA-121

pip install git+https://github.com/wejoncy/QLLM.git --no-build-isolation

OR CUDA-118/117

git clone https://github.com/wejoncy/QLLM.git
cd QLLM
python setup.py install
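
For example, to enable the fast build for a single invocation (a bash one-liner; the variable is read at build time as noted above and presumably skips compiling the optional extensions):

EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 python setup.py install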

How to use it

Quantize llama2

#  Quantize and Save compressed model, method can be one of [gptq/awq/hqq]
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --nsamples=64 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --dataset=pileval --nsamples=16 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=hqq --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit

Convert to onnx model

Use --export_onnx ./onnx_model to export and save the ONNX model:

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16  --save ./Llama-2-7b-chat-hf_awq_q4/ --export_onnx ./Llama-2-7b-chat-hf_awq_q4_onnx/

or you can convert an existing model from the HF Hub:

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --export_onnx=./onnx
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --export_onnx=./onnx

(NEW) Quantize a model with mixed bits/group size for higher precision (lower PPL)

#  Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential

NOTE:

  1. Only GPTQ is supported.
  2. The allow_mix_bits option is borrowed from gptq-for-llama; QLLM makes it easier to use and more flexible.
  3. What differs from gptq-for-llama is that we grow the bit width by one instead of doubling it.
  4. All configurations are saved and loaded automatically, instead of using the quant-table that gptq-for-llama relies on.
  5. If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.

Quantize model for vLLM

Due to the difference in how zero points are stored, you need to set an environment variable if you set pack_mode to GPTQ, whether the quantization method is awq or gptq:

COMPATIBLE_WITH_AUTOGPTQ=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ

If you use the GEMM pack_mode, you don't have to set the variable:

python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM
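
The saved checkpoint can then be loaded by vLLM. A minimal sketch, assuming vLLM is installed and the directory saved above is used; set quantization to match the pack format (e.g. "gptq" for the GPTQ pack_mode, "awq" for GEMM):

from vllm import LLM, SamplingParams

# Load the quantized checkpoint produced by the command above.
llm = LLM(model="./Llama-2-7b-4bit", quantization="gptq")
outputs = llm.generate(["Hello, my dog is cute"], SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)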

Conversion among AWQ, GPTQ and Marlin

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ

Or you can use --pack_mode=GEMM to convert GPTQ to AWQ.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM

Or you can use --pack_mode=MARLIN to convert GPTQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Or you can use --pack_mode=MARLIN to convert AWQ to Marlin.

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN

Note:

Not all cases are supported. For example:

  1. If you quantized the model with different quantization bits for different layers, you can't convert it to AWQ.
  2. If a GPTQ model was quantized with the --allow_mix_bits option, you can't convert it to AWQ.
  3. If a GPTQ model was quantized with the --act_order option, you can't convert it to AWQ.

Model inference with the saved model

python -m qllm --load ./Llama-2-7b-4bit --eval

Model inference with ORT

You may want to use genai (onnxruntime-genai) for generation with ORT. The snippet below runs the exported decoder directly with onnxruntime:

import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'

# The export directory also holds the tokenizer and config files.
tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
input_ids = sample_inputs['input_ids'].astype(np.int64)
mask = sample_inputs.get('attention_mask')
mask = np.ones(input_ids.shape, dtype=np.int64) if mask is None else mask.astype(np.int64)

onnx_model_path = onnx_path_str + '/decoder_merged.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

num_layers = AutoConfig.from_pretrained(onnx_path_str).num_hidden_layers
inputs = {'input_ids': input_ids,
          'attention_mask': mask,
          'position_ids': np.arange(input_ids.shape[1], dtype=np.int64)[None, :],
          'use_cache_branch': np.array([0], dtype=np.bool_)}
# use_cache_branch is False, so the past key/values below act as placeholders;
# their shape is (batch, num_heads, past_seq_len, head_dim) for Llama-2-7b.
for i in range(num_layers):
    inputs[f'past_key_values.{i}.key'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'past_key_values.{i}.value'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
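
The block above runs a single forward pass. Below is a minimal greedy-decoding sketch that continues it; it assumes the first output of the merged decoder is the logits tensor of shape (batch, seq_len, vocab), and it simply recomputes the whole sequence each step instead of reusing the KV cache:

# Naive greedy decoding: re-run the full sequence each step (no KV-cache reuse).
for _ in range(32):
    logits = outputs[0]                                # assumption: output 0 is the logits
    next_token = np.argmax(logits[:, -1, :], axis=-1)  # most likely next token
    input_ids = np.concatenate([input_ids, next_token[:, None]], axis=-1)
    inputs['input_ids'] = input_ids
    inputs['attention_mask'] = np.ones(input_ids.shape, dtype=np.int64)
    inputs['position_ids'] = np.arange(input_ids.shape[1], dtype=np.int64)[None, :]
    outputs = session.run(None, inputs)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))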

Load a quantized model from huggingface/transformers

python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
python -m qllm --load TheBloke/Mixtral-8x7B-v0.1-GPTQ  --use_plugin

Start a chatbot

You may need to install fschat and accelerate with pip:

pip install fschat accelerate

Use --use_plugin to enable the chatbot plugin:

python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=awq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/

or 
python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16  --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/

Use QLLM with the API

from qllm import AutoModelQuantization

quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path='meta-llama/Llama-2-7b-hf', method='gptq', wbits=4, groupsize=128)

OR

import torch
from qllm import AutoModelQuantization
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)

quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path=model, method='gptq', wbits=4, groupsize=128)
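
After quantization, q_model can be used for generation like a regular transformers causal LM (a minimal sketch; it assumes the quantized model keeps the standard generate API and sits on a CUDA device):

prompt = tokenizer("Hello, my dog is cute", return_tensors="pt").to(q_model.device)
output_ids = q_model.generate(**prompt, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))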

For users who have connection issues reaching huggingface/transformers:

Please set the environment variable PROXY_PORT to your HTTP proxy port.

  • PowerShell: $env:PROXY_PORT=1080
  • Bash: export PROXY_PORT=1080
  • Windows cmd: set PROXY_PORT=1080

Acknowledgements

  • GPTQ
  • GPTQ-triton
  • AutoGPTQ
  • llm-awq
  • AutoAWQ
  • HQQ

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distributions

  • qllm-0.2.0-cp311-cp311-win_amd64.whl (944.4 kB): CPython 3.11, Windows x86-64
  • qllm-0.2.0-cp311-cp311-manylinux2014_x86_64.whl (922.8 kB): CPython 3.11, manylinux2014 x86-64
  • qllm-0.2.0-cp310-cp310-win_amd64.whl (941.6 kB): CPython 3.10, Windows x86-64
  • qllm-0.2.0-cp310-cp310-manylinux2014_x86_64.whl (921.5 kB): CPython 3.10, manylinux2014 x86-64
  • qllm-0.2.0-cp39-cp39-win_amd64.whl (941.3 kB): CPython 3.9, Windows x86-64
  • qllm-0.2.0-cp39-cp39-manylinux2014_x86_64.whl (920.9 kB): CPython 3.9, manylinux2014 x86-64
  • qllm-0.2.0-cp38-cp38-win_amd64.whl (941.9 kB): CPython 3.8, Windows x86-64
  • qllm-0.2.0-cp38-cp38-manylinux2014_x86_64.whl (920.9 kB): CPython 3.8, manylinux2014 x86-64

File details

The Windows wheels were uploaded via twine/4.0.2 on CPython 3.8.17; Trusted Publishing was not used. Hashes for each file:

qllm-0.2.0-cp311-cp311-win_amd64.whl

  • SHA256: e3b9e4f35da64bcb03a6b7c54bf8d54ef1dd8a1e57391945b009981a8199c1d0
  • MD5: 7c4b78abcd1b7c1377aed7bcc81c73fc
  • BLAKE2b-256: 22bbc99546cfa4b389d62210da20431f1c98afd63a55cd47dd5fa151858da243

qllm-0.2.0-cp311-cp311-manylinux2014_x86_64.whl

  • SHA256: a46dabdd36f2a8979c9e2c64550db3059eeb7fa5f451c49512e250e11b24c241
  • MD5: 2d16ea3df5bf4d3fbe01f69ed292bfa1
  • BLAKE2b-256: a576e8553c5c5debdb58179cc71d95c303e8273d1927d8ad18db30f44c092eed

qllm-0.2.0-cp310-cp310-win_amd64.whl

  • SHA256: 49ea032119e5b8f6b58dccc9b34a371fd96eeb659fc468f2ea477f0f26134150
  • MD5: b603d9c43a8a315d4f748aef5a4a9345
  • BLAKE2b-256: 82d0d4a1fe3ad053772ac472ec513f24ddd2950e1a80f74c4db987bae111e139

qllm-0.2.0-cp310-cp310-manylinux2014_x86_64.whl

  • SHA256: 0736e2a3e7fbb791a11edb55d864940751d93cb265b65bc72e426b3a38b3c04d
  • MD5: 69c72f833b03f1ab48501f4eb82e04fa
  • BLAKE2b-256: d52549a9a81397fee7965b1142912805eeebcd50f14396723d424bf2e7bb40ee

qllm-0.2.0-cp39-cp39-win_amd64.whl

  • SHA256: 79af262bf9f16cf89b946d1322f1ed08aeab0e6b67e80fb2f5bbaad34dfde884
  • MD5: 95af0ddcf51cf9aad934468bd4cb71fc
  • BLAKE2b-256: 75623831df86a8520d397d9f51bf1ac42599618b50b251e37951246aa2249241

qllm-0.2.0-cp39-cp39-manylinux2014_x86_64.whl

  • SHA256: 7b109a9925e54168e9f5737a39b0b9f25916cff2109618c632b6ca7131583fc5
  • MD5: 3e8788e3ea12fb0d15eb1067bb3106ac
  • BLAKE2b-256: 1d9aee6f8968b075755f9a9aa54de1f6adc860e1ddd3b5bd98d5d9915f199d6c

qllm-0.2.0-cp38-cp38-win_amd64.whl

  • SHA256: 70a402b2e2fe8c1d753657e7e43c78f306cae910071b4b417c4f3d4711e751bb
  • MD5: ea01d5087a0ed2a533ba4174d1f6e3df
  • BLAKE2b-256: 57f7185e8cd155dbb342b2b741350b0618153fc54f559a6033cc59212aeb2478

qllm-0.2.0-cp38-cp38-manylinux2014_x86_64.whl

  • SHA256: 1fcdf0c7ff6316ad7cd14c0d56a3e740533187219f3ae7695cef7ece3722dc71
  • MD5: d8641263643ff57fbe053da7205aa2ff
  • BLAKE2b-256: 79dcee2643f5ab94af32b5975515a3ff477a25964f54aa9fc1cfaae448c084f4
