quantkit
A tool for downloading and converting HuggingFace models without drama.
Install
pip3 install llm-quantkit
Usage
Usage: quantkit [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  download    Download model from huggingface.
  safetensor  Download and/or convert a pytorch model to safetensor format.
  awq         Download and/or convert a model to AWQ format.
  exl2        Download and/or convert a model to EXL2 format.
  gguf        Download and/or convert a model to GGUF format.
  gptq        Download and/or convert a model to GPTQ format.
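Each subcommand also takes its own options; as the usage output above suggests, you can pass --help to a subcommand to list them (a quick example, assuming standard per-command help):

quantkit gguf --help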
The first argument after the command should be a HuggingFace repo id (mistralai/Mistral-7B-v0.1) or a local directory that already contains model files.
The download command defaults to downloading into the HF cache and producing symlinks in the output directory, but there is a --no-cache option that places the model files directly in the output directory.
AWQ defaults to 4 bits, group size 128, zero-point True.
GPTQ defaults are 4 bits, group size 128, activation-order False.
EXL2 defaults to 8 head bits but there is no default bitrate.
GGUF defaults to no imatrix, and there is no default quant type.
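Because EXL2 and GGUF have no default bitrate or quant type, you must supply one yourself. A minimal sketch reusing the flags from the examples below; the 6-bit EXL2 setting and the Q4_K_M quant type are illustrative choices, not recommendations:

quantkit exl2 mistralai/Mistral-7B-v0.1 -out Mistral-7B-exl2 -b 6
quantkit gguf mistralai/Mistral-7B-v0.1 -out Mistral-7B-gguf Q4_K_M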
Examples
Download a model from HF and don't use HF cache:
quantkit download teknium/Hermes-Trismegistus-Mistral-7B --no-cache
Only download the safetensors version of a model (useful for models that ship both PyTorch and safetensors weights):
quantkit download mistralai/Mistral-7B-v0.1 --no-cache --safetensors-only -out mistral7b
Download and convert a model to safetensor, deleting the original pytorch bins:
quantkit safetensor migtissera/Tess-10.7B-v1.5b --delete-original
Download and convert a model to GGUF (Q5_K):
quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B Q5_K
Download and convert a model to GGUF using an imatrix, offloading 200 layers:
quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B IQ4_XS --built-in-imatrix -ngl 200
Download and convert a model to AWQ:
quantkit awq mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-AWQ
Convert a model to GPTQ (4 bits / group-size 32):
quantkit gptq mistral7b -out Mistral-7B-v0.1-GPTQ -b 4 --group-size 32
Convert a model to exllamav2:
quantkit exl2 mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-exl2-b8-h8 -b 8 -hb 8
Still in beta. Llama.cpp GPU offloading will probably not work on your platform unless you uninstall llama-cpp-conv and reinstall it with the proper build flags; see the llama-cpp-python documentation.
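For example, on a CUDA system the reinstall might look like the following. This is a sketch based on the llama-cpp-python build instructions; the exact CMAKE_ARGS value for your backend (CUDA, Metal, ROCm, etc.) and whether llama-cpp-conv honors it identically are assumptions to verify against that documentation:

pip3 uninstall llama-cpp-conv
CMAKE_ARGS="-DGGML_CUDA=on" pip3 install llama-cpp-conv --no-cache-dir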