CLI tool for downloading and quantizing LLMs
quantkit
A tool for downloading and converting HuggingFace models without drama.
Install
If you're on a machine with an NVIDIA/CUDA GPU and want AWQ/GPTQ support:
pip3 install llm-quantkit[cuda]
Otherwise, the default install works.
pip3 install llm-quantkit
Requirements
If you need a device-specific torch, install it first (see the example below).
This project depends on the torch, awq, exl2, gptq, and hqq libraries.
Some of these dependencies do not support Python 3.12 yet.
Supported Pythons: 3.8, 3.9, 3.10, and 3.11
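For example, on a CUDA 12.1 machine you could pull the matching torch wheel first and then quantkit. This is only a sketch; the index URL below is the standard PyTorch wheel index and should be adjusted for your CUDA/ROCm version:
pip3 install torch --index-url https://download.pytorch.org/whl/cu121
pip3 install llm-quantkit[cuda]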
Usage
Usage: quantkit [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
download Download model from huggingface.
safetensor Download and/or convert a pytorch model to safetensor format.
awq Download and/or convert a model to AWQ format.
exl2 Download and/or convert a model to EXL2 format.
gguf Download and/or convert a model to GGUF format.
gptq Download and/or convert a model to GPTQ format.
hqq Download and/or convert a model to HQQ format.
The first argument after the command should be an HF repo id (mistralai/Mistral-7B-v0.1) or a local directory that already contains model files.
The download command defaults to downloading into the HF cache and producing symlinks in the output dir, but there is a --no-cache option which places the model files in the output directory.
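For example, the default behavior keeps the weights in the HF cache and only symlinks them into the output directory (the output name here is just illustrative):
quantkit download mistralai/Mistral-7B-v0.1 -out mistral7b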
AWQ defaults to 4 bits, group size 128, zero-point True.
GPTQ defaults are 4 bits, group size 128, activation-order False.
EXL2 defaults to 8 head bits but there is no default bitrate.
GGUF defaults to no imatrix, and there is no default quant-type.
HQQ defaults to 4 bits, group size 64, zero_point=True.
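These defaults can be overridden with per-command flags (see the Examples below). The usage output above looks like a Click-style CLI, so each subcommand should also accept --help to list its own flags and defaults; treat the exact flag set as version-dependent:
quantkit gguf --help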
Examples
Download a model from HF and don't use HF cache:
quantkit download teknium/Hermes-Trismegistus-Mistral-7B --no-cache
Only download the safetensors version of a model (useful for models that provide both PyTorch bins and safetensors):
quantkit download mistralai/Mistral-7B-v0.1 --no-cache --safetensors-only -out mistral7b
Download a specific revision of a Hugging Face repo:
quantkit download turboderp/TinyLlama-1B-32k-exl2 --branch 6.0bpw --no-cache -out TinyLlama-1B-32k-exl2-b6
Download and convert a model to safetensor, deleting the original pytorch bins:
quantkit safetensor migtissera/Tess-10.7B-v1.5b --delete-original
Download and convert a model to GGUF (Q5_K):
quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-Q5_K.gguf Q5_K
Download and convert a model to GGUF using an imatrix, offloading 200 layers:
quantkit gguf TinyLlama/TinyLlama-1.1B-Chat-v1.0 -out TinyLlama-1.1B-IQ4_XS.gguf IQ4_XS --built-in-imatrix -ngl 200
Download and convert a model to AWQ:
quantkit awq mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-AWQ
Convert a model to GPTQ (4 bits / group-size 32):
quantkit gptq mistral7b -out Mistral-7B-v0.1-GPTQ -b 4 --group-size 32
Convert a model to exllamav2 (8 bits per weight / 8 head bits):
quantkit exl2 mistralai/Mistral-7B-v0.1 -out Mistral-7B-v0.1-exl2-b8-h8 -b 8 -hb 8
Convert a model to HQQ:
quantkit hqq mistralai/Mistral-7B-v0.1 -out Mistral-7B-HQQ-w4-gs64
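Since the first argument can also be a local directory of model files (see above), the commands chain naturally. A sketch of a two-step workflow, with illustrative output names:
quantkit download mistralai/Mistral-7B-v0.1 --no-cache --safetensors-only -out mistral7b
quantkit gguf mistral7b -out Mistral-7B-v0.1-Q5_K.gguf Q5_K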
Hardware Requirements
Here's what has worked for me in testing. Drop a PR or Issue with updates for what is possible on cards of various sizes.
GGUF conversion doesn't need a GPU except when building an imatrix, and Exllamav2 conversion requires that the largest layer fit on a single GPU.
Model Size | Quant | VRAM | Successful |
---|---|---|---|
7B | AWQ | 24GB | ✅ |
7B | EXL2 | 24GB | ✅ |
7B | GGUF | 24GB | ✅ |
7B | GPTQ | 24GB | ✅ |
7B | HQQ | 24GB | ✅ |
13B | AWQ | 24GB | ✅ |
13B | EXL2 | 24GB | ✅ |
13B | GGUF | 24GB | ✅ |
13B | GPTQ | 24GB | ❌ |
13B | HQQ | 24GB | ? |
34B | AWQ | 24GB | ❌ |
34B | EXL2 | 24GB | ✅ |
34B | GGUF | 24GB | ✅ |
34B | GPTQ | 24GB | ❌ |
34B | HQQ | 24GB | ? |
70B | AWQ | 24GB | ❌ |
70B | EXL2 | 24GB | ✅ |
70B | GGUF | 24GB | ✅ |
70B | GPTQ | 24GB | ❌ |
70B | HQQ | 24GB | ? |
Notes
Still in beta. llama.cpp offloading probably won't work on your platform unless you uninstall llama-cpp-conv and reinstall it with the proper build flags. See the llama-cpp-python documentation, follow the relevant install command, and replace llama-cpp-python with llama-cpp-conv.
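For example, for CUDA offloading the llama-cpp-python documentation suggests a build command along these lines; the CMAKE_ARGS value is an assumption that depends on your backend and version (older releases used -DLLAMA_CUBLAS=on):
CMAKE_ARGS="-DGGML_CUDA=on" pip3 install llama-cpp-conv --force-reinstall --no-cache-dir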