CLI tool for downloading and quantizing LLMs

Project description

quantkit

A tool for downloading and converting HuggingFace models without drama.


Install

If you're on a machine with an NVIDIA/CUDA GPU and want AWQ/GPTQ support:

pip3 install llm-quantkit[cuda]

Otherwise, the default install works.

pip3 install llm-quantkit
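After installing, you can confirm the CLI is on your path by printing the help text (the same output shown under Usage below):

quantkit --help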

Requirements

If you need a device-specific build of torch, install it first.
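For example, a CUDA 12.1 build of torch can be installed from the official PyTorch index before installing quantkit (adjust the index URL to match your CUDA or ROCm version):

pip3 install torch --index-url https://download.pytorch.org/whl/cu121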

This project depends on the torch, awq, exl2, gptq, and hqq libraries.
Some of these dependencies do not support Python 3.12 yet.
Supported Pythons: 3.8, 3.9, 3.10, and 3.11

Usage

Usage: quantkit [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  download    Download model from huggingface.
  safetensor  Download and/or convert a pytorch model to safetensor format.
  awq         Download and/or convert a model to AWQ format.
  exl2        Download and/or convert a model to EXL2 format.
  gguf        Download and/or convert a model to GGUF format.
  gptq        Download and/or convert a model to GPTQ format.
  hqq         Download and/or convert a model to HQQ format.

The first argument after the command should be a HuggingFace repo id (e.g. mistralai/Mistral-7B-v0.1) or a local directory that already contains model files.

The download command defaults to downloading into the HF cache and producing symlinks in the output dir, but there is a --no-cache option which places the model files in the output directory.
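For example, the following keeps the weights in the HF cache and only symlinks them into the output directory (assuming -out behaves the same with and without --no-cache, as in the examples below):

quantkit download mistralai/Mistral-7B-v0.1 -out mistral7b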

AWQ defaults to 4 bits, group size 128, zero-point True.
GPTQ defaults to 4 bits, group size 128, activation-order False.
EXL2 defaults to 8 head bits but has no default bitrate.
GGUF defaults to no imatrix but has no default quant type.
HQQ defaults to 4 bits, group size 64, zero-point True.
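Spelled out with the -b and --group-size flags used in the GPTQ example below, the GPTQ defaults should be equivalent to passing them explicitly:

quantkit gptq mistral7b -out Mistral-7B-v0.1-GPTQ -b 4 --group-size 128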


Examples

Download a model from HF and don't use HF cache:

quantkit download teknium/Hermes-Trismegistus-Mistral-7B --no-cache

Only download the safetensors version of a model (useful for repos that ship both pytorch bins and safetensors):

quantkit download mistralai/Mistral-7B-v0.1 --no-cache --safetensors-only -out mistral7b

Download from specific revision of a huggingface repo:

quantkit download turboderp/TinyLlama-1B-32k-exl2 --branch 6.0bpw --no-cache -out TinyLlama-1B-32k-exl2-b6

Download and convert a model to safetensor, deleting the original pytorch bins:

quantkit safetensor migtissera/Tess-10.7B-v1.5b --delete-original

Download and convert a model to GGUF (Q5_K):

quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-Q5_K.gguf Q5_K

Download and convert a model to GGUF using an imatrix, offloading 200 layers:

quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-IQ4_XS.gguf IQ4_XS --built-in-imatrix -ngl 200

Download and convert a model to AWQ:

quantkit awq mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-AWQ

Convert a model to GPTQ (4 bits / group-size 32):

quantkit gptq mistral7b -out Mistral-7B-v0.1-GPTQ -b 4 --group-size 32

Convert a model to exllamav2:

quantkit exl2 mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-exl2-b8-h8 -b 8 -hb 8

Convert a model to HQQ:

quantkit hqq mistralai/Mistral-7B-v0.1 -out Mistral-7B-HQQ-w4-gs64

Hardware Requirements

Here's what has worked for me in testing. Drop a PR or Issue with updates on what is possible on various sized cards.
GGUF conversion doesn't need a GPU except for imatrix generation, and Exllamav2 requires that the largest layer fit on a single GPU.

Model Size   Quant   VRAM   Successful
7B           AWQ     24GB   yes
7B           EXL2    24GB   yes
7B           GGUF    24GB   yes
7B           GPTQ    24GB   yes
7B           HQQ     24GB   yes
13B          AWQ     24GB   yes
13B          EXL2    24GB   yes
13B          GGUF    24GB   yes
13B          GPTQ    24GB   no
13B          HQQ     24GB   ?
34B          AWQ     24GB   no
34B          EXL2    24GB   yes
34B          GGUF    24GB   yes
34B          GPTQ    24GB   no
34B          HQQ     24GB   ?
70B          AWQ     24GB   no
70B          EXL2    24GB   yes
70B          GGUF    24GB   yes
70B          GPTQ    24GB   no
70B          HQQ     24GB   ?

Notes

Still in beta. Llama.cpp offloading is probably not going to work on your platform unless you uninstall llama-cpp-conv and reinstall it with the proper build flags. Look at the llama-cpp-python documentation and follow the relevant command but replace llama-cpp-python with llama-cpp-conv.
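For example, a CUDA-enabled rebuild might look like the following (the CMAKE_ARGS value is taken from the llama-cpp-python docs and is assumed to pass through to the llama-cpp-conv build in the same way; older releases use -DLLAMA_CUBLAS=on instead):

pip3 uninstall -y llama-cpp-conv
CMAKE_ARGS="-DGGML_CUDA=on" pip3 install --no-cache-dir llama-cpp-conv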

Project details


Download files


Source Distribution

llm_quantkit-0.29.tar.gz (52.3 kB)

Uploaded Source

Built Distribution

llm_quantkit-0.29-py3-none-any.whl (54.7 kB)

Uploaded Python 3

File details

Details for the file llm_quantkit-0.29.tar.gz.

File metadata

  • Download URL: llm_quantkit-0.29.tar.gz
  • Upload date:
  • Size: 52.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.9

File hashes

Hashes for llm_quantkit-0.29.tar.gz
Algorithm Hash digest
SHA256 e51770d64ca0a7d95e9d2d6cf54a92f314460d0006d447c0ec06700a63ff528b
MD5 e8bc3926d1160faa9d1d76a4fccca881
BLAKE2b-256 764bc6c78ca7d44575a55ff92ce8927a27f96ba38f5220c55e1ba1fda6c93fd4


File details

Details for the file llm_quantkit-0.29-py3-none-any.whl.

File metadata

  • Download URL: llm_quantkit-0.29-py3-none-any.whl
  • Upload date:
  • Size: 54.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.9

File hashes

Hashes for llm_quantkit-0.29-py3-none-any.whl
Algorithm Hash digest
SHA256 dfea6833c9cb77ac3da37002e9848fcdd149f585ec6cf559fe26cf4edc76d5b6
MD5 fa17b7c1b5eae9546ab93332ad6760ad
BLAKE2b-256 a2373a5f0c9fd35c1fd63ece5552f75de2fdaa01fd4fe9f35bf57a14529ebd00
