A general x-bit quantization engine for LLMs (2-8 bits), supporting AWQ/GPTQ/HQQ
Project description
QLLM
Keywords: Quantization, GPTQ, AWQ, HQQ, ONNX, ONNXRuntime, vLLM
Quantize all LLMs in huggingface/transformers with GPTQ/AWQ/HQQ in mixed bits (2-8 bit), and export them to ONNX.
QLLM is an out-of-the-box quantization toolbox for large language models. It is designed as an auto-quantization framework that processes any LLM layer by layer. It can also export the quantized model to ONNX with a single argument, `--export_onnx ./onnx_model`, and run inference with ONNX Runtime. In addition, models quantized with different methods (GPTQ/AWQ/HQQ) can be loaded from huggingface/transformers and converted to one another without extra effort.
We already support:
- GPTQ quantization
- AWQ quantization
- HQQ quantization
Features:
- GPTQ quantization supports all LLM models in huggingface/transformers; the model type is detected automatically before quantization.
- Models can be quantized to 2-8 bits, and different layers can use different bit widths.
- Bits/group size are promoted automatically for better accuracy.
- Export to an ONNX model and run it with ONNX Runtime.
Latest News 🔥
- [2024/03] ONNX Models export API
- [2024/01] Support HQQ algorithm
- [2023/12] The first PyPi package released
Installation
qllm is easy to install from PyPI (CUDA 12.1 build):
pip install qllm
Or install from a release package (CUDA 11.8/12.1, Python 3.8/3.9/3.10 supported): https://github.com/wejoncy/QLLM/releases
Build from Source
Set the environment variable EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 for a fast build.
If you are using CUDA 12.1:
pip install git+https://github.com/wejoncy/QLLM.git --no-build-isolation
Or, for CUDA 11.8/11.7:
git clone https://github.com/wejoncy/QLLM.git
cd QLLM
python setup.py install
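For example, the fast-build flag can be set inline for either install path (shell, following the same inline-variable style used for COMPATIBLE_WITH_AUTOGPTQ later on this page):
# inline form of the fast-build variable (bash-style); pick the line that matches your install path
EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 pip install git+https://github.com/wejoncy/QLLM.git --no-build-isolation
EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 python setup.py install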
How to use it
Quantize llama2
# Quantize and Save compressed model, method can be one of [gptq/awq/hqq]
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --nsamples=64 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --dataset=pileval --nsamples=16 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=hqq --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
Convert to ONNX model
Use `--export_onnx ./onnx_model` to export and save the ONNX model.
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --quant_method=gptq --dataset=pileval --nsamples=16 --save ./Llama-2-7b-chat-hf_awq_q4/ --export_onnx ./Llama-2-7b-chat-hf_awq_q4_onnx/
Or you can convert an existing model from the HF Hub:
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --export_onnx=./onnx
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --export_onnx=./onnx
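As a quick sanity check on an exported model, you can inspect the graph with the onnx package (a sketch; the decoder_merged.onnx file name is an assumption, taken from the ORT inference example further below):
import onnx
# load the graph structure only, skipping the (large) external weight data
model_proto = onnx.load("./onnx/decoder_merged.onnx", load_external_data=False)
print([inp.name for inp in model_proto.graph.input][:5])  # expected input names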
(NEW) Quantize model with mixed bits/group size for higher precision (lower PPL)
# Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential
NOTE:
- Only GPTQ is supported.
- The allow_mix_bits option is adapted from gptq-for-llama; QLLM makes it easier to use and more flexible.
- Unlike gptq-for-llama, we grow the bit width by one instead of doubling it (see the sketch below).
- All configurations are saved/loaded automatically, instead of using the quant-table that gptq-for-llama relies on.
- If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.
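As a rough illustration of the bit-growth difference noted above (hypothetical helpers for illustration only, not QLLM's internal API):
def promote_bits_qllm(bits: int, max_bits: int = 8) -> int:
    # QLLM-style promotion: grow by one bit, e.g. 4 -> 5 -> 6
    return min(bits + 1, max_bits)
def promote_bits_gptq_for_llama(bits: int, max_bits: int = 8) -> int:
    # gptq-for-llama-style promotion: double the bits, e.g. 4 -> 8
    return min(bits * 2, max_bits)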
Quantize model for vLLM
Due to the difference in how zero-points are stored, you need to set an environment variable if you set pack_mode to GPTQ, whichever method (awq or gptq) is used:
COMPATIBLE_WITH_AUTOGPTQ=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ
If you use the GEMM pack_mode, you don't have to set the variable:
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM
python -m qllm --model=meta-llama/Llama-2-7b-hf --quant_method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM
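The saved directory can then be loaded by vLLM. A minimal sketch, assuming vLLM is installed and the path/quantization value match the command you ran (quantization='gptq' for --pack_mode=GPTQ, 'awq' for GEMM):
from vllm import LLM, SamplingParams
# path and quantization type are assumptions; match them to the save command above
llm = LLM(model="./Llama-2-7b-4bit", quantization="gptq")
outputs = llm.generate(["Hello, my dog is cute"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)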
Conversion among AWQ, GPTQ and Marlin
To convert AWQ to the GPTQ pack format:
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ
Or you can convert GPTQ to AWQ (GEMM pack format):
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM
Or you can use `--pack_mode=MARLIN` to convert GPTQ to Marlin:
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN
Or you can use `--pack_mode=MARLIN` to convert AWQ to Marlin:
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_marlin_q4/ --pack_mode=MARLIN
Note: not all cases are supported. For example:
- If you quantized the model with different bits for different layers, you can't convert it to AWQ.
- If a GPTQ model was quantized with the `--allow_mix_bits` option, you can't convert it to AWQ.
- If a GPTQ model was quantized with the `--act_order` option, you can't convert it to AWQ.
Model inference with the saved model
python -m qllm --load ./Llama-2-7b-4bit --eval
Model inference with ORT
You may want to use genai (onnxruntime-genai) for generation with ORT; the snippet below runs the exported decoder directly with onnxruntime instead.
import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer
onnx_path_str = './Llama-2-7b-4bit-onnx'
# assumes the export directory also holds the tokenizer/config files
tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
config = AutoConfig.from_pretrained(onnx_path_str)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="np")
onnx_model_path = onnx_path_str + '/decoder_merged.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])
input_ids = sample_inputs['input_ids'].astype(np.int64)
mask = sample_inputs.get('attention_mask', np.ones_like(input_ids)).astype(np.int64)
num_layers = config.num_hidden_layers
inputs = {'input_ids': input_ids,
          'attention_mask': mask,
          'position_ids': np.arange(input_ids.shape[1], dtype=np.int64)[None, :],
          'use_cache_branch': np.array([0], dtype=np.bool_)}
# dummy KV cache for the no-cache (prefill) branch; shapes follow the example above
# for Llama-2-7B (32 KV heads, head_dim 128)
for i in range(num_layers):
    inputs[f'past_key_values.{i}.key'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'past_key_values.{i}.value'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
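A greedy next-token pick can then be read off the returned logits (a sketch; it assumes the first graph output is the logits tensor with shape [batch, seq_len, vocab_size]):
logits = outputs[0]  # assumed output order: logits come first
next_token = int(np.argmax(logits[0, -1]))  # greedy choice for the last position
print(tokenizer.decode([next_token]))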
Load a quantized model from huggingface/transformers
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
python -m qllm --load TheBloke/Mixtral-8x7B-v0.1-GPTQ --use_plugin
Start a chatbot
You may need to install fschat and accelerate with pip:
pip install fschat accelerate
Use `--use_plugin` to enable a chatbot plugin:
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --quant_method=awq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/
or
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --quant_method=gptq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/
Use QLLM with the API
from qllm import AutoModelQuantization
quantizer = AutoModelQuantization()
q_model = quantizer.api_quantize(model_or_model_path='meta-llama/Llama-2-7b-hf', method='gptq', wbits=4, groupsize=128)
OR
import torch
from qllm import AutoModelQuantization
from transformers import AutoTokenizer, AutoModelForCausalLM
quantizer = AutoModelQuantization()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=True, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16)
q_model = quantizer.api_quantize(model_or_model_path=model, method='gptq', wbits=4, groupsize=128)
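Since the second example starts from a regular AutoModelForCausalLM, the returned q_model can presumably be used like any transformers model; a quick generation check might look like this (a sketch, assuming the quantized model sits on a CUDA device and reusing the tokenizer from above):
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to(q_model.device)
output_ids = q_model.generate(**inputs, max_new_tokens=32)  # standard transformers generate
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))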
If huggingface/transformers downloads fail because of connection issues, set the environment variable PROXY_PORT to your HTTP proxy port.
PowerShell:
$env:PROXY_PORT=1080
Bash:
export PROXY_PORT=1080
Windows cmd:
set PROXY_PORT=1080
Acknowledgements
Download files
Download the file for your platform.
Source Distributions
Built Distributions
File details
Details for the file qllm-0.2.0-cp311-cp311-win_amd64.whl.
File metadata
- Download URL: qllm-0.2.0-cp311-cp311-win_amd64.whl
- Upload date:
- Size: 944.4 kB
- Tags: CPython 3.11, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | e3b9e4f35da64bcb03a6b7c54bf8d54ef1dd8a1e57391945b009981a8199c1d0
MD5 | 7c4b78abcd1b7c1377aed7bcc81c73fc
BLAKE2b-256 | 22bbc99546cfa4b389d62210da20431f1c98afd63a55cd47dd5fa151858da243
File details
Details for the file qllm-0.2.0-cp311-cp311-manylinux2014_x86_64.whl.
File metadata
- Download URL: qllm-0.2.0-cp311-cp311-manylinux2014_x86_64.whl
- Upload date:
- Size: 922.8 kB
- Tags: CPython 3.11
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | a46dabdd36f2a8979c9e2c64550db3059eeb7fa5f451c49512e250e11b24c241
MD5 | 2d16ea3df5bf4d3fbe01f69ed292bfa1
BLAKE2b-256 | a576e8553c5c5debdb58179cc71d95c303e8273d1927d8ad18db30f44c092eed
File details
Details for the file qllm-0.2.0-cp310-cp310-win_amd64.whl.
File metadata
- Download URL: qllm-0.2.0-cp310-cp310-win_amd64.whl
- Upload date:
- Size: 941.6 kB
- Tags: CPython 3.10, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 49ea032119e5b8f6b58dccc9b34a371fd96eeb659fc468f2ea477f0f26134150
MD5 | b603d9c43a8a315d4f748aef5a4a9345
BLAKE2b-256 | 82d0d4a1fe3ad053772ac472ec513f24ddd2950e1a80f74c4db987bae111e139
File details
Details for the file qllm-0.2.0-cp310-cp310-manylinux2014_x86_64.whl.
File metadata
- Download URL: qllm-0.2.0-cp310-cp310-manylinux2014_x86_64.whl
- Upload date:
- Size: 921.5 kB
- Tags: CPython 3.10
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0736e2a3e7fbb791a11edb55d864940751d93cb265b65bc72e426b3a38b3c04d
MD5 | 69c72f833b03f1ab48501f4eb82e04fa
BLAKE2b-256 | d52549a9a81397fee7965b1142912805eeebcd50f14396723d424bf2e7bb40ee
File details
Details for the file qllm-0.2.0-cp39-cp39-win_amd64.whl.
File metadata
- Download URL: qllm-0.2.0-cp39-cp39-win_amd64.whl
- Upload date:
- Size: 941.3 kB
- Tags: CPython 3.9, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 79af262bf9f16cf89b946d1322f1ed08aeab0e6b67e80fb2f5bbaad34dfde884
MD5 | 95af0ddcf51cf9aad934468bd4cb71fc
BLAKE2b-256 | 75623831df86a8520d397d9f51bf1ac42599618b50b251e37951246aa2249241
File details
Details for the file qllm-0.2.0-cp39-cp39-manylinux2014_x86_64.whl.
File metadata
- Download URL: qllm-0.2.0-cp39-cp39-manylinux2014_x86_64.whl
- Upload date:
- Size: 920.9 kB
- Tags: CPython 3.9
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7b109a9925e54168e9f5737a39b0b9f25916cff2109618c632b6ca7131583fc5
MD5 | 3e8788e3ea12fb0d15eb1067bb3106ac
BLAKE2b-256 | 1d9aee6f8968b075755f9a9aa54de1f6adc860e1ddd3b5bd98d5d9915f199d6c
File details
Details for the file qllm-0.2.0-cp38-cp38-win_amd64.whl.
File metadata
- Download URL: qllm-0.2.0-cp38-cp38-win_amd64.whl
- Upload date:
- Size: 941.9 kB
- Tags: CPython 3.8, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 70a402b2e2fe8c1d753657e7e43c78f306cae910071b4b417c4f3d4711e751bb
MD5 | ea01d5087a0ed2a533ba4174d1f6e3df
BLAKE2b-256 | 57f7185e8cd155dbb342b2b741350b0618153fc54f559a6033cc59212aeb2478
File details
Details for the file qllm-0.2.0-cp38-cp38-manylinux2014_x86_64.whl.
File metadata
- Download URL: qllm-0.2.0-cp38-cp38-manylinux2014_x86_64.whl
- Upload date:
- Size: 920.9 kB
- Tags: CPython 3.8
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.8.17
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1fcdf0c7ff6316ad7cd14c0d56a3e740533187219f3ae7695cef7ece3722dc71
MD5 | d8641263643ff57fbe053da7205aa2ff
BLAKE2b-256 | 79dcee2643f5ab94af32b5975515a3ff477a25964f54aa9fc1cfaae448c084f4