A general x-bit quantization engine for LLMs: 2-8 bits, AWQ/GPTQ/HQQ
QLLM
Supports any LLM in HuggingFace/Transformers, mixed bits (2-8 bit), GPTQ/AWQ/HQQ, and ONNX export. QLLM is an out-of-the-box quantization toolbox for large language models. It is not limited to a specific model and is designed to auto-quantize any LLM layer by layer. It can also export the quantized model to ONNX with a single argument, `--export_onnx ./onnx_model`, and run inference with onnxruntime. In addition, models quantized with different methods (GPTQ/AWQ/HQQ) can be loaded from huggingface/transformers and converted to one another without extra effort.
We already support:
- GPTQ quantization
- AWQ quantization
- HQQ quantization
Features:
- GPTQ quantization supports all LLM models in huggingface/transformers; the model type is detected automatically and quantized layer by layer.
- Quantization to 2-8 bits, including different bit widths for different layers.
- Automatic promotion of bits/group-size for better accuracy.
Latest News 🔥
- [2024/01] Added support for the HQQ algorithm
- [2023/12] First PyPI package released
Installation
qllm is easy to install from PyPI [cu121]:
pip install qllm
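To quickly confirm the install, you can print the package version (this assumes the package exposes a __version__ attribute; adjust if it does not):
python -c "import qllm; print(qllm.__version__)"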
Or install from a release package; CUDA 11.8/12.1 and Python 3.8/3.9/3.10 are supported: https://github.com/wejoncy/QLLM/releases
Build from Source
Set the environment variable EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 for a fast build (see the example after the build commands below).
If you are using CUDA 12.1:
pip install git+https://github.com/wejoncy/QLLM.git
Or for CUDA 11.8/11.7:
git clone https://github.com/wejoncy/QLLM.git
cd QLLM
python setup.py install
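For example, the fast-build switch mentioned above can be combined with a source install (bash syntax; skips building the optional extensions):
EXCLUDE_EXTENTION_FOR_FAST_BUILD=1 pip install git+https://github.com/wejoncy/QLLM.git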
How to use it
Quantize llama2
# Quantize and Save compressed model, method can be one of [gptq/awq/hqq]
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --nsamples=64 --wbits=4 --groupsize=128 --save ./Llama-2-7b-4bit
(NEW) Quantize a model with mixed bits/group-size for better perplexity (PPL)
# Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential
NOTE:
- Only GPTQ supports this.
- The allow_mix_bits option is borrowed from gptq-for-llama; QLLM makes it easier to use and more flexible.
- The difference from gptq-for-llama is that we grow the bit width by one instead of doubling it.
- All configurations are saved/loaded automatically instead of relying on the quant-table used by gptq-for-llama (see the example after this list).
- If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.
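Because the per-layer bit/group-size configuration is stored inside the checkpoint, a mixed-bits model is reloaded with the same command as any other QLLM checkpoint, for example:
python -m qllm --load ./Llama-2-7b-4bit --eval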
Quantize model for vLLM
Due to a difference in how the zero-points are packed, you need to set an environment variable when pack_mode is GPTQ, whether the method is awq or gptq:
COMPATIBLE_WITH_AUTOGPTQ=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ
If you use the GEMM pack_mode, you don't have to set the variable:
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM
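The GEMM-packed checkpoint can then be served with vLLM. The snippet below is a minimal sketch, not part of QLLM itself; it assumes a vLLM build with AWQ support and reuses the output path from the commands above.
from vllm import LLM, SamplingParams

# Load the AWQ/GEMM-packed checkpoint produced above; vLLM reads the
# quantization config stored in the model directory.
llm = LLM(model="./Llama-2-7b-4bit", quantization="awq")
outputs = llm.generate(["Hello, my dog is cute"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)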
Conversion between AWQ and GPTQ
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ
Or you can use --pack_mode=AWQ to convert GPTQ to AWQ:
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM
Note: not all cases are supported. For example:
- If you quantized the model with different bits for different layers, you can't convert it to AWQ.
- If a GPTQ model was quantized with the --allow_mix_bits option, you can't convert it to AWQ.
- If a GPTQ model was quantized with the --act_order option, you can't convert it to AWQ.
Convert to ONNX model
Use --export_onnx ./onnx_model to export and save the ONNX model:
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --save ./Llama-2-7b-chat-hf_awq_q4/ --export_onnx ./Llama-2-7b-chat-hf_awq_q4_onnx/
Or you can convert an existing model from the HF Hub:
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --export_onnx=./onnx
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --export_onnx=./onnx
Model inference with the saved model
python -m qllm --load ./Llama-2-7b-4bit --eval
Model inference with ORT
You may want to use onnxruntime-genai for full generation with ORT; the snippet below runs a single forward pass.
import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'
tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="np")

onnx_model_path = onnx_path_str + '/decoder_merged.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

input_ids = sample_inputs['input_ids'].astype(np.int64)
mask = sample_inputs['attention_mask'].astype(np.int64)
# Assumes the exported config.json sits next to the ONNX file.
num_layers = AutoConfig.from_pretrained(onnx_path_str).num_hidden_layers

inputs = {'input_ids': input_ids,
          'attention_mask': mask,
          'position_ids': np.arange(0, input_ids.shape[1], dtype=np.int64).reshape(1, -1),
          'use_cache_branch': np.array([0], dtype=np.bool_)}
# Dummy KV-cache inputs for the first (prefill) pass; (1, 32, 32, 128) corresponds to
# Llama-2-7B (batch, num_heads, past_seq_len, head_dim).
for i in range(num_layers):
    inputs[f'past_key_values.{i}.key'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'past_key_values.{i}.value'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
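Continuing the snippet above, and assuming the first session output is the logits tensor of shape (batch, seq_len, vocab_size), the greedy next token can be decoded like this:
# Pick the highest-probability token at the prompt's last position (greedy step).
logits = outputs[0]
next_token_id = int(np.argmax(logits[0, -1, :]))
print(tokenizer.decode([next_token_id]))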
Load quantized models from huggingface/transformers
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
python -m qllm --load TheBloke/Mixtral-8x7B-v0.1-GPTQ --use_plugin
Start a chatbot
You may need to install fschat and accelerate with pip:
pip install fschat accelerate
Use --use_plugin to enable the chatbot plugin:
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=awq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/
or
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/
If you have connection issues when transformers downloads models, set the environment variable PROXY_PORT to your HTTP proxy port:
PowerShell
$env:PROXY_PORT=1080
Bash
export PROXY_PORT=1080
Windows cmd
set PROXY_PORT=1080
Acknowledgements