CLI tool for downloading and quantizing LLMs
Project description
quantkit
A tool for downloading and converting HuggingFace models without drama.
Install
pip3 install llm-quantkit
Requirements
This project depends on torch, awq, exl2, gptq, and hqq libraries, some of which are not compatible with Python 3.12.
If you need a device-specific torch build, install it first (see the example below).
Python: 3.8, 3.9, 3.10, and 3.11
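For example, installing a CUDA torch build before quantkit; the CUDA version shown here is only illustrative, so check pytorch.org for the index URL that matches your hardware:
pip3 install torch --index-url https://download.pytorch.org/whl/cu121
pip3 install llm-quantkit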
Usage
Usage: quantkit [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
download Download model from huggingface.
safetensor Download and/or convert a pytorch model to safetensor format.
awq Download and/or convert a model to AWQ format.
exl2 Download and/or convert a model to EXL2 format.
gguf Download and/or convert a model to GGUF format.
gptq Download and/or convert a model to GPTQ format.
hqq Download and/or convert a model to HQQ format.
The first argument after the command should be an HF repo id (e.g. mistralai/Mistral-7B-v0.1) or a local directory that already contains model files.
The download command defaults to downloading into the HF cache and producing symlinks in the output dir, but there is a --no-cache option which places the model files in the output directory.
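For example, the default behavior (model files go to the HF cache, symlinks land in the output dir); the repo id and output dir here are illustrative:
quantkit download mistralai/Mistral-7B-v0.1 -out mistral7b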
AWQ defaults to 4 bits, group size 128, zero-point True.
GPTQ defaults are 4 bits, group size 128, activation-order False.
EXL2 defaults to 8 head bits, but there is no default bitrate, so you must supply one.
GGUF defaults to no imatrix, and there is no default quant type, so you must supply one.
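A minimal sketch of passing those required values; the repo ids, output names, bitrate, and quant type below are only examples:
quantkit exl2 mistralai/Mistral-7B-v0.1 -out Mistral-7B-exl2-b6-h8 -b 6
quantkit gguf mistralai/Mistral-7B-v0.1 -out Mistral-7B-Q4_K_M.gguf Q4_K_M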
Examples
Download a model from HF and don't use HF cache:
quantkit download teknium/Hermes-Trismegistus-Mistral-7B --no-cache
Only download the safetensors version of a model (useful for models that have both pytorch and safetensor files):
quantkit download mistralai/Mistral-7B-v0.1 --no-cache --safetensors-only -out mistral7b
Download and convert a model to safetensor, deleting the original pytorch bins:
quantkit safetensor migtissera/Tess-10.7B-v1.5b --delete-original
Download and convert a model to GGUF (Q5_K):
quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-Q5_K.gguf Q5_K
Download and convert a model to GGUF using an imatrix, offloading 200 layers:
quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-IQ4_XS.gguf IQ4_XS --built-in-imatrix -ngl 200
Download and convert a model to AWQ:
quantkit awq mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-AWQ
Convert a model to GPTQ (4 bits / group-size 32):
quantkit gptq mistral7b -out Mistral-7B-v0.1-GPTQ -b 4 --group-size 32
Convert a model to exllamav2:
quantkit exl2 mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-exl2-b8-h8 -b 8 -hb 8
Convert a model to HQQ:
quantkit hqq mistralai/Mistral-7B-v0.1 -out Mistral-7B-HQQ-w4-gs64
Still in beta. Llama.cpp offloading is probably not going to work on your platform unless you uninstall llama-cpp-conv and reinstall it with the proper build flags. Look at the llama-cpp-python documentation and follow the relevant command, but replace llama-cpp-python with llama-cpp-conv.
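As a sketch, a CUDA rebuild might look like the following; the CMAKE_ARGS value is borrowed from the llama-cpp-python docs and varies by version and platform (older releases used -DLLAMA_CUBLAS=on), so treat it as an assumption and check the docs for your setup:
pip3 uninstall llama-cpp-conv
# build flag assumed from llama-cpp-python docs; adjust for your version and platform
CMAKE_ARGS="-DGGML_CUDA=on" pip3 install llama-cpp-conv --no-cache-dir --force-reinstall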
Project details
Release history
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for llm_quantkit-0.22-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ffd4007f4934850b13d540b75b57cab91b93eb986a3dbeac13ccc14da8aaf2e2
MD5 | 36f2a33e9c9a066b85f94ecffccc5b57
BLAKE2b-256 | 8cc2eb07e6a498954e2d52890ca5f8cab5e518c86f153dd3ba0cf85029f915a9