A general x-bit (2-8 bit) quantization engine for LLMs, supporting AWQ and GPTQ.
QLLM
Supports any LLM in HuggingFace/Transformers, mixed bits (2-8 bit), GPTQ/AWQ, and ONNX export. QLLM is an out-of-the-box quantization toolbox for large language models. It is not limited to a specific model family; it is designed to auto-quantize any LLM layer by layer. It can also export the quantized model to ONNX with a single argument, `--export_onnx ./onnx_model`, and run inference with onnxruntime. In addition, models quantized by different methods (GPTQ/AWQ) can be loaded from huggingface/transformers and converted to each other without extra effort.
We already support:
- GPTQ quantization
- AWQ quantization
Features:
- GPTQ supports all LLM models in huggingface/transformers; the model type is detected automatically and quantized layer by layer.
- For GPTQ, models can be quantized to 2-8 bits, and different layers can be quantized with different bit widths.
- For AWQ, only the models covered by llm-awq/AutoAWQ are supported for now.
- Models quantized by AutoGPTQ and AutoAWQ can be loaded directly.
- Only the NVIDIA GPU platform is supported for now; AMD GPU support is under consideration.
Latest News 🔥
- [2023/12] The first PyPI package was released
Installation
Install qllm from PyPI (CUDA 12.1 build):
pip install qllm
Or install from a release package; CUDA 11.8 and 12.1 builds are available (Python 3.8/3.9/3.10): https://github.com/wejoncy/QLLM/releases
Build from source. If you are using CUDA 12.1:
pip install git+https://github.com/wejoncy/QLLM.git
Or, for CUDA 11.8:
git clone https://github.com/wejoncy/QLLM.git
cd QLLM
python setup.py install
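A quick way to confirm the install succeeded is to check the package metadata and that a CUDA build of PyTorch is visible; this is a minimal sketch that assumes nothing QLLM-specific beyond the package name.
# Minimal post-install sanity check.
from importlib.metadata import version
import torch

print("qllm version:", version("qllm"))              # installed qllm package version
print("torch version:", torch.__version__)           # should be >= 2.0.0
print("CUDA available:", torch.cuda.is_available())  # QLLM currently targets NVIDIA GPUs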
Dependencies
- torch: >= v2.0.0, cu118
- transformers: tested on v4.28.0.dev0
- onnxruntime: tested on v1.16.3
- onnx
How to use it
Quantize llama2
# Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit
(NEW) Quantize model with mixed bits/group size for higher precision (better PPL)
# Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential
NOTE:
- Only GPTQ is supported.
- The allow_mix_bits option is adapted from gptq-for-llama; QLLM makes it easier to use and more flexible.
- What differs from gptq-for-llama: we grow the bit width by one instead of doubling it.
- All configurations are saved and loaded automatically, instead of the quant-table used by gptq-for-llama.
- If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.
Quantize model for vLLM
Due to the difference in zero-point handling, you need to set an environment variable when pack_mode is GPTQ, regardless of whether the method is awq or gptq:
compatible_with_autogptq=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ
If you use the GEMM pack_mode, you don't have to set the variable:
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM
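If you drive the CLI from Python instead of the shell, the same environment variable can be passed to the subprocess. This is only a sketch around the documented command line above; no QLLM Python API is assumed.
# Sketch: run the documented CLI with compatible_with_autogptq=1 set for the GPTQ pack_mode.
import os
import subprocess

env = dict(os.environ, compatible_with_autogptq="1")
subprocess.run(
    ["python", "-m", "qllm",
     "--model=meta-llama/Llama-2-7b-hf", "--method=gptq",
     "--save", "./Llama-2-7b-4bit", "--pack_mode=GPTQ"],
    env=env, check=True,
)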
Conversion between AWQ and GPTQ
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ
Or you can use --pack_mode=AWQ to convert GPTQ to AWQ.
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM
Note:
Not all cases are supported. For example:
- If you quantized the model with different bits for different layers, you can't convert it to AWQ.
- If a GPTQ model was quantized with the --allow_mix_bits option, you can't convert it to AWQ.
- If a GPTQ model was quantized with the --act_order option, you can't convert it to AWQ.
Convert to ONNX model
Use --export_onnx ./onnx_model to export and save the ONNX model:
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --save ./Llama-2-7b-chat-hf_awq_q4/ --export_onnx ./Llama-2-7b-chat-hf_awq_q4_onnx/
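To sanity-check the exported graph, you can run the ONNX checker over the saved file. This is a sketch: the file name model_one_for_all.onnx matches the one used in the inference example below, and the path is the export directory from the command above.
# Sketch: validate the exported ONNX model. Large exports keep weights as external data,
# so pass the file path rather than a loaded proto.
import onnx

onnx.checker.check_model('./Llama-2-7b-chat-hf_awq_q4_onnx/model_one_for_all.onnx')
print("ONNX model check passed")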
model inference with the saved model
python -m qllm --load ./Llama-2-7b-4bit --eval
model inference with ORT
import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'
tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
onnx_model_path = onnx_path_str + '/model_one_for_all.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

input_ids = sample_inputs['input_ids'].cpu().numpy()
attention_mask = sample_inputs.get('attention_mask')
# fall back to an all-ones mask if the tokenizer did not return one
mask = np.ones(input_ids.shape, dtype=np.int64) if attention_mask is None else attention_mask.cpu().numpy()

# number of decoder layers, read from the model config saved alongside the export
num_layers = AutoConfig.from_pretrained(onnx_path_str).num_hidden_layers
inputs = {'input_ids': input_ids, 'attention_mask': mask, 'use_cache_branch': np.array([0], dtype=np.bool_)}
for i in range(num_layers):
    # empty KV-cache placeholders for the non-cached (prefill) branch
    inputs[f'present_key.{i}'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'present_values.{i}'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
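Continuing from the snippet above, you can greedily pick the next token from the returned logits. This assumes the first session output is the logits tensor of shape (batch, seq_len, vocab_size).
# Sketch, continuing from the snippet above: greedy next-token selection.
# Assumes outputs[0] is the logits with shape (batch, seq_len, vocab_size).
logits = outputs[0]
next_token_id = int(np.argmax(logits[0, -1, :]))
print(tokenizer.decode([next_token_id]))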
Load quantized model from huggingface/transformers
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
python -m qllm --load TheBloke/Mixtral-8x7B-v0.1-GPTQ --use_plugin
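These Hub checkpoints can also be loaded with plain transformers if you prefer a Python API. The sketch below uses the standard AutoModelForCausalLM path (not a QLLM API) and assumes the extra backends it needs (e.g. optimum and auto-gptq for GPTQ checkpoints) are installed.
# Sketch: load a GPTQ checkpoint with vanilla transformers (assumes optimum + auto-gptq are installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tokenizer("Hello, my dog is cute", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=32)[0]))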
start a chatbot
you may need to install fschat and accelerate with pip
pip install fschat accelerate
Use --use_plugin to enable a chatbot plugin:
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=awq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/
or
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/
If you have connection issues when transformers downloads models, set the environment variable PROXY_PORT to your HTTP proxy port.
PowerShell
$env:PROXY_PORT=1080
Bash
export PROXY_PORT=1080
windows cmd
set PROXY_PORT=1080
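The same variable can also be set from Python before QLLM starts downloading; this minimal sketch only sets PROXY_PORT for the current process, exactly as the shell commands above do.
# Sketch: set PROXY_PORT for the current process before invoking QLLM.
import os

os.environ["PROXY_PORT"] = "1080"  # your HTTP proxy port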
Acknowledgements