A general x-bit quantization engine for LLMs: 2-8 bits, AWQ/GPTQ.
Project description
QLLM
Supports any LLM in HuggingFace/Transformers, mixed bits (2-8 bit), GPTQ/AWQ, and ONNX export.
QLLM is an out-of-the-box quantization toolbox for large language models. It is not limited to a specific model family; it is designed to quantize any LLM automatically, layer by layer. It can also export the quantized model to ONNX with a single argument, `--export_onnx ./onnx_model`, and run inference with onnxruntime. In addition, models quantized with different methods (GPTQ/AWQ) can be loaded from huggingface/transformers and converted to each other without extra effort.
We already support:
- GPTQ quantization
- AWQ quantization
Features:
- GPTQ supports all LLM models in huggingface/transformers; the model type is detected automatically and quantized layer by layer.
- For GPTQ, models can be quantized to 2-8 bits, and different layers can use different quantization bit-widths.
- For AWQ, only the models covered by llm-awq/auto-awq are supported for now.
- Models quantized by AutoGPTQ and AutoAWQ can be loaded directly.
- Only the NVIDIA GPU platform is supported for now; AMD GPU support is under consideration.
Latest News 🔥
- [2023/12] The first PyPI package was released
Installation
qllm is easy to install from PyPI [cu121]:
pip install qllm
Or install from a release package; CUDA 11.8 and 12.1 are supported [py38, py39, py310]: https://github.com/wejoncy/QLLM/releases
Build from source if you are using CUDA 12.1:
pip install git+https://github.com/wejoncy/QLLM.git
Or, for CUDA 11.8:
git clone https://github.com/wejoncy/QLLM.git
cd QLLM
python setup.py install
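A quick sanity check after installation (a minimal sketch; it only assumes qllm, torch, and importlib.metadata are importable) verifies the installed version and that a CUDA device is visible:

```python
# Post-install sanity check: confirm qllm is installed and an NVIDIA GPU is visible.
from importlib.metadata import version

import torch

print("qllm version:", version("qllm"))
print("torch version:", torch.__version__)            # torch >= 2.0.0 is expected
print("CUDA available:", torch.cuda.is_available())   # QLLM currently targets NVIDIA GPUs only
```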
Dependencies
- torch: >= v2.0.0 and cu118
- transformers: tested on v4.28.0.dev0
- onnxruntime: tested on v1.16.3
- onnx
How to use it
Quantize llama2
# Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit
(NEW) Quantize a model with mixed bits/group-size for higher precision (PPL)
# Quantize and Save compressed model
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --allow_mix_bits --true-sequential
NOTE:
- Only GPTQ supports this option.
- The allow_mix_bits option is adapted from gptq-for-llama; QLLM makes it easier to use and more flexible.
- The difference from gptq-for-llama is that the bit-width grows by one instead of doubling (see the sketch after this list).
- All configurations are saved/loaded automatically, instead of the quant-table used by gptq-for-llama.
- If --allow_mix_bits is enabled, the saved model is not compatible with vLLM for now.
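The one-bit-at-a-time growth can be illustrated with a small, purely conceptual sketch (the error metric, tolerance, and function names below are hypothetical and are not QLLM's internal implementation):

```python
import torch

def quantization_error(w: torch.Tensor, bits: int) -> float:
    """Hypothetical metric: relative MSE after symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp((w / scale).round(), -qmax - 1, qmax) * scale
    return ((w - w_q) ** 2).mean().item() / ((w ** 2).mean().item() + 1e-12)

def choose_bits(w: torch.Tensor, base_bits: int = 4, max_bits: int = 8, tol: float = 1e-3) -> int:
    """Grow the bit-width by one (4 -> 5 -> 6 ...) until the error is acceptable,
    instead of jumping straight from 4 to 8 bits."""
    bits = base_bits
    while bits < max_bits and quantization_error(w, bits) > tol:
        bits += 1
    return bits

# Example: a random matrix standing in for one layer's weight.
print(choose_bits(torch.randn(1024, 1024)))
```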
Quantize model for vLLM
Due to a difference in how zero-points are packed, you need to set an environment variable if you set pack_mode to GPTQ, whichever quantization method (awq or gptq) you use:
compatible_with_autogptq=1 python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --pack_mode=GPTQ
If you use the GEMM pack_mode, you don't have to set the variable:
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit --pack_mode=GEMM
python -m qllm --model=meta-llama/Llama-2-7b-hf --method=awq --save ./Llama-2-7b-4bit --pack_mode=GEMM
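Once saved with a vLLM-compatible pack mode, the checkpoint can be loaded with vLLM's Python API (a minimal sketch; it assumes vLLM is installed and that the saved directory carries a quantization config vLLM understands):

```python
# Serve the vLLM-compatible checkpoint saved above (assumes `pip install vllm`).
# vLLM normally detects the quantization method from the saved config;
# it can also be forced with quantization="gptq" or quantization="awq".
from vllm import LLM, SamplingParams

llm = LLM(model="./Llama-2-7b-4bit")
outputs = llm.generate(["Hello, my dog is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```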
Conversion between AWQ and GPTQ
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval --save ./Llama-2-7b-chat-hf_gptq_q4/ --pack_mode=GPTQ
Or you can use --pack_mode=AWQ to convert a GPTQ model to AWQ.
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval --save ./Llama-2-7b-chat-hf_awq_q4/ --pack_mode=GEMM
Note:
Not all cases are supported. For example:
- If you quantized the model with different quantization bits for different layers, you can't convert it to AWQ.
- If a GPTQ model was quantized with the --allow_mix_bits option, you can't convert it to AWQ.
- If a GPTQ model was quantized with the --act_order option, you can't convert it to AWQ.
Convert to onnx model
Use --export_onnx ./onnx_model to export and save the ONNX model:
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --save ./Llama-2-7b-chat-hf_awq_q4/ --export_onnx ./Llama-2-7b-chat-hf_awq_q4_onnx/
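Before running inference, it can help to inspect the exported graph's input and output names with onnxruntime, since the past key/value input names and shapes are model specific (a small sketch; the file name follows the inference example below):

```python
# List the exported model's inputs/outputs; useful for wiring up the past key/value feeds.
import onnxruntime

session = onnxruntime.InferenceSession(
    './Llama-2-7b-chat-hf_awq_q4_onnx/model_one_for_all.onnx',
    providers=['CPUExecutionProvider'],
)
for inp in session.get_inputs():
    print('input :', inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print('output:', out.name, out.shape, out.type)
```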
model inference with the saved model
python -m qllm --load ./Llama-2-7b-4bit --eval
model inference with ORT
import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'
# The export folder also stores the tokenizer/config of the original model.
tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="np")

onnx_model_path = onnx_path_str + '/model_one_for_all.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

input_ids = sample_inputs['input_ids'].astype(np.int64)
mask = sample_inputs.get('attention_mask')
mask = np.ones(input_ids.shape, dtype=np.int64) if mask is None else mask.astype(np.int64)

num_layers = AutoConfig.from_pretrained(onnx_path_str).num_hidden_layers
# use_cache_branch=False selects the prefill branch, so the past key/value inputs
# below are only placeholders; the shape (batch, heads, seq, head_dim) is model specific.
inputs = {'input_ids': input_ids, 'attention_mask': mask,
          'use_cache_branch': np.array([0], dtype=np.bool_)}
for i in range(num_layers):
    inputs[f'present_key.{i}'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'present_values.{i}'] = np.zeros((1, 32, 32, 128), dtype=np.float16)

outputs = session.run(None, inputs)
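From there, the next token can be picked greedily and decoded (a small follow-up sketch continuing the snippet above; it assumes the first session output is the logits tensor with shape [batch, seq_len, vocab_size]):

```python
# Greedy next-token selection from the logits computed above.
logits = outputs[0]
next_token_id = int(np.argmax(logits[0, -1]))
print(tokenizer.decode([next_token_id]))
```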
Load a quantized model from huggingface/transformers
python -m qllm --load TheBloke/Llama-2-7B-Chat-AWQ --eval
python -m qllm --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
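These checkpoints can also be loaded outside QLLM through transformers' own GPTQ/AWQ integrations (a sketch; it assumes a recent transformers release with the auto-gptq/optimum or autoawq backend installed):

```python
# Load a GPTQ checkpoint directly with transformers (requires optimum + auto-gptq);
# this path does not go through QLLM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tokenizer("Hello, my dog is cute", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=32)[0]))
```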
start a chatbot
you may need to install fschat and accelerate with pip
pip install fschat accelerate
use --use_plugin to enable a chatbot plugin
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=awq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/
or
python -m qllm --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/
Some users may have connection issues when downloading from huggingface/transformers. If so, set the environment variable PROXY_PORT to your HTTP proxy port.
PowerShell
$env:PROXY_PORT=1080
Bash
export PROXY_PORT=1080
windows cmd
set PROXY_PORT=1080
Acknowledgements