quantkit
A CLI tool for downloading, converting, and quantizing HuggingFace models without drama.
Install
If you're on a machine with an NVIDIA/CUDA GPU and want AWQ/GPTQ support:
pip3 install llm-quantkit[cuda]
Otherwise, the default install works.
pip3 install llm-quantkit
Requirements
If you need a device-specific build of torch, install it first (see the example below).
This project depends on the torch, awq, exl2, gptq, and hqq libraries.
Some of these dependencies do not support Python 3.12 yet.
Supported Pythons: 3.8, 3.9, 3.10, and 3.11
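For example, to install a CUDA build of torch before installing quantkit, a minimal sketch (the cu121 index URL is an assumption; pick the wheel index that matches your CUDA version from pytorch.org):
pip3 install torch --index-url https://download.pytorch.org/whl/cu121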
Usage
Usage: quantkit [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
download Download model from huggingface.
safetensor Download and/or convert a pytorch model to safetensor format.
awq Download and/or convert a model to AWQ format.
exl2 Download and/or convert a model to EXL2 format.
gguf Download and/or convert a model to GGUF format.
gptq Download and/or convert a model to GPTQ format.
hqq Download and/or convert a model to HQQ format.
The first argument after the command should be an HF repo id (e.g. mistralai/Mistral-7B-v0.1) or a local directory that already contains model files.
The download command defaults to downloading into the HF cache and producing symlinks in the output dir, but there is a --no-cache option that places the model files directly in the output directory.
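For example, a cached download that symlinks the files into the output directory, a minimal sketch assuming -out works here the same way as in the examples below:
quantkit download mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1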
AWQ defaults to 4 bits, group size 128, zero-point True.
GPTQ defaults are 4 bits, group size 128, activation-order False.
EXL2 defaults to 8 head bits, but there is no default bitrate.
GGUF defaults to no imatrix, and there is no default quant type.
HQQ defaults to 4 bits, group size 64, zero_point=True.
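The Usage output above looks like a click-style CLI, so each subcommand should also accept --help to list the flags for overriding these defaults (an assumption, not verified against every release), e.g.:
quantkit gptq --help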
Examples
Download a model from HF and don't use HF cache:
quantkit download teknium/Hermes-Trismegistus-Mistral-7B --no-cache
Only download the safetensors version of a model (useful for repos that ship both pytorch and safetensors weights):
quantkit download mistralai/Mistral-7B-v0.1 --no-cache --safetensors-only -out mistral7b
Download a specific revision of a huggingface repo:
quantkit download turboderp/TinyLlama-1B-32k-exl2 --branch 6.0bpw --no-cache -out TinyLlama-1B-32k-exl2-b6
Download and convert a model to safetensor, deleting the original pytorch bins:
quantkit safetensor migtissera/Tess-10.7B-v1.5b --delete-original
Download and convert a model to GGUF (Q5_K):
quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-Q5_K.gguf Q5_K
Download and convert a model to GGUF using an imatrix, offloading 200 layers:
quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-IQ4_XS.gguf IQ4_XS --built-in-imatrix -ngl 200
Download and convert a model to AWQ:
quantkit awq mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-AWQ
Convert a model to GPTQ (4 bits / group-size 32):
quantkit gptq mistral7b -out Mistral-7B-v0.1-GPTQ -b 4 --group-size 32
Convert a model to exllamav2:
quantkit exl2 mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-exl2-b8-h8 -b 8 -hb 8
Convert a model to HQQ:
quantkit hqq mistralai/Mistral-7B-v0.1 -out Mistral-7B-HQQ-w4-gs64
Hardware Requirements
Here's what has worked for me in testing. Drop a PR or Issue with updates for what is possible on various size cards.
GGUF conversion doesn't need a GPU except when building an imatrix; Exllamav2 requires that the largest layer fit on a single GPU.
Model Size | Quant | VRAM | Successful |
---|---|---|---|
7B | AWQ | 24GB | ✅ |
7B | EXL2 | 24GB | ✅ |
7B | GGUF | 24GB | ✅ |
7B | GPTQ | 24GB | ✅ |
7B | HQQ | 24GB | ✅ |
13B | AWQ | 24GB | ✅ |
13B | EXL2 | 24GB | ✅ |
13B | GGUF | 24GB | ✅ |
13B | GPTQ | 24GB | ❌ |
13B | HQQ | 24GB | ? |
34B | AWQ | 24GB | ❌ |
34B | EXL2 | 24GB | ✅ |
34B | GGUF | 24GB | ✅ |
34B | GPTQ | 24GB | ❌ |
34B | HQQ | 24GB | ? |
70B | AWQ | 24GB | ❌ |
70B | EXL2 | 24GB | ✅ |
70B | GGUF | 24GB | ✅ |
70B | GPTQ | 24GB | ❌ |
70B | HQQ | 24GB | ? |
Notes
Still in beta. llama.cpp GPU offloading probably won't work on your platform unless you uninstall llama-cpp-conv and reinstall it with the proper build flags. See the llama-cpp-python documentation and follow the relevant install command, but replace llama-cpp-python with llama-cpp-conv.
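For example, a sketch of a CUDA-enabled rebuild following the llama-cpp-python install docs (the exact CMAKE_ARGS value is an assumption and depends on your version and platform; older releases use -DLLAMA_CUBLAS=on instead):
pip3 uninstall llama-cpp-conv
CMAKE_ARGS="-DGGML_CUDA=on" pip3 install --no-cache-dir llama-cpp-conv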
File details
Details for the file llm_quantkit-0.29.tar.gz.
File metadata
- Download URL: llm_quantkit-0.29.tar.gz
- Upload date:
- Size: 52.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.9
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | e51770d64ca0a7d95e9d2d6cf54a92f314460d0006d447c0ec06700a63ff528b |
MD5 | e8bc3926d1160faa9d1d76a4fccca881 |
BLAKE2b-256 | 764bc6c78ca7d44575a55ff92ce8927a27f96ba38f5220c55e1ba1fda6c93fd4 |
File details
Details for the file llm_quantkit-0.29-py3-none-any.whl.
File metadata
- Download URL: llm_quantkit-0.29-py3-none-any.whl
- Upload date:
- Size: 54.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.9
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | dfea6833c9cb77ac3da37002e9848fcdd149f585ec6cf559fe26cf4edc76d5b6 |
MD5 | fa17b7c1b5eae9546ab93332ad6760ad |
BLAKE2b-256 | a2373a5f0c9fd35c1fd63ece5552f75de2fdaa01fd4fe9f35bf57a14529ebd00 |