
Auto-Quant-Tool

Automated quantization benchmarking suite for GGUF, GPTQ, and TFLite models. Pulls a model from Hugging Face, generates multiple quantized variants, benchmarks them on your hardware, and outputs a Pareto frontier showing the best accuracy-to-speed tradeoff.
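The Pareto step keeps only the variants that no other variant beats on both axes at once (lower perplexity and higher tokens/sec). A minimal sketch of that filter — the dict fields and sample numbers are illustrative, not the tool's actual result schema:

```python
# Pareto frontier over (perplexity, tokens/sec).
# A variant is dominated if some other variant is at least as good
# on both axes and strictly better on at least one.

def pareto_frontier(variants):
    """variants: list of dicts with 'name', 'ppl' (lower is better),
    'tok_s' (higher is better). Returns non-dominated variants,
    sorted by perplexity."""
    frontier = []
    for v in variants:
        dominated = any(
            (o["ppl"] <= v["ppl"] and o["tok_s"] >= v["tok_s"])
            and (o["ppl"] < v["ppl"] or o["tok_s"] > v["tok_s"])
            for o in variants
        )
        if not dominated:
            frontier.append(v)
    return sorted(frontier, key=lambda v: v["ppl"])

# Illustrative numbers only: Q5_0 is dominated by Q4_K_M here
# (worse perplexity AND slower), so it drops off the frontier.
runs = [
    {"name": "Q2_K",   "ppl": 14.1, "tok_s": 62.0},
    {"name": "Q4_K_M", "ppl": 9.8,  "tok_s": 48.0},
    {"name": "Q5_0",   "ppl": 9.9,  "tok_s": 39.0},
    {"name": "Q8_0",   "ppl": 9.5,  "tok_s": 30.0},
]
print([v["name"] for v in pareto_frontier(runs)])
# -> ['Q8_0', 'Q4_K_M', 'Q2_K']
```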

Supported formats

  • GGUF (Q2 through Q8) — for llama.cpp / Ollama local inference
  • GPTQ (INT4, INT8) — for GPU inference via gptqmodel
  • TFLite (FP32, FP16, INT8) — for mobile deployment

Quick start

1. Clone the repo

git clone --recurse-submodules https://github.com/YOUR_USERNAME/auto-quant-tool.git
cd auto-quant-tool

2. Base install (all platforms)

uv sync

3. Hardware backend (run once, auto-detects your system)

python setup/install_backends.py
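What "auto-detects" actually probes isn't documented here; as a rough illustration of the kind of checks such a script typically makes (this is not the real install_backends.py logic):

```python
import platform
import shutil

def detect_backend():
    """Guess the best inference backend for this machine.
    Priority: CUDA GPU -> Apple Metal -> CPU fallback."""
    if shutil.which("nvidia-smi"):  # NVIDIA driver tools on PATH
        return "cuda"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"              # Apple Silicon Mac
    return "cpu"

print(detect_backend())
```

The `--backend` flags shown in the per-platform sections below let you skip this detection and force a choice.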

4. Launch the web UI

uv run python -m auto_quant_tool.cli ui

Then open http://localhost:7860 in your browser.

5. Or run via CLI

uv run python -m auto_quant_tool.cli run --config sample_llm.yaml

Installation by platform

Windows + NVIDIA GPU

uv sync
python setup/install_backends.py --backend cuda

Requires Visual C++ Build Tools for llama.cpp compilation. Download: https://visualstudio.microsoft.com/visual-cpp-build-tools/

GPTQ quantization requires a GPU with 16GB+ VRAM. For systems with less VRAM, use the Kaggle notebook: notebooks/kaggle_gptq.ipynb

TFLite conversion is not supported on Windows. Use the Colab notebook instead: notebooks/colab_tflite.ipynb

macOS (Apple Silicon)

uv sync
python setup/install_backends.py --backend metal

Linux + NVIDIA GPU

uv sync
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

Note: recent llama-cpp-python releases renamed this CMake flag to -DGGML_CUDA=on; use whichever matches your installed version.

CPU only (any OS)

uv sync
python setup/install_backends.py --backend cpu

Configuration

Copy and edit a sample config:

cp sample_llm.yaml my_model.yaml

model:
  source: huggingface       # or local
  id: Qwen/Qwen2-0.5B
  modality: llm             # llm | vision | audio

quantize:
  formats: [gguf, gptq]
  gguf_levels: [Q2_K, Q4_K_M, Q5_0, Q8_0]
  gptq_levels: [int4]

benchmark:
  metrics: [perplexity, tok_s]
  full_mmlu: false
  soc_target: snapdragon_8_gen_3    # for TFLite sim benchmark
  dataset:
    name: wikitext
    split: test
    source: hf_datasets
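Before kicking off a long run it can be worth sanity-checking an edited config. A minimal loader sketch using PyYAML, with the required-section list assumed from the sample above (this is not the tool's real validator):

```python
import yaml

SAMPLE = """
model:
  source: huggingface
  id: Qwen/Qwen2-0.5B
  modality: llm
quantize:
  formats: [gguf, gptq]
  gguf_levels: [Q2_K, Q4_K_M]
benchmark:
  metrics: [perplexity, tok_s]
"""

REQUIRED = {"model", "quantize", "benchmark"}  # assumed top-level sections

def validate(cfg):
    """Fail fast if a top-level section is missing."""
    missing = REQUIRED - cfg.keys()
    if missing:
        raise ValueError(f"config missing sections: {sorted(missing)}")
    return cfg

cfg = validate(yaml.safe_load(SAMPLE))
print(cfg["model"]["id"])  # Qwen/Qwen2-0.5B
```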

Output structure

outputs/
├── models/          # cached HF model weights
├── gguf/            # GGUF quantized files per model
├── gptq/            # GPTQ quantized files per model
├── tflite/          # TFLite converted files per model
├── results/         # benchmark CSVs, unified JSON, Pareto HTML/PNG
└── best_model/      # knee-point model files copied here
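best_model/ holds the "knee" of the Pareto curve. The exact selection rule isn't documented; one common heuristic, sketched here purely as an illustration, picks the frontier point farthest from the straight line joining the curve's endpoints (in practice you would normalize the axes first, omitted for brevity):

```python
def knee_point(points):
    """points: list of (ppl, tok_s) pairs on the Pareto frontier,
    sorted by perplexity. Returns the interior point with the
    largest perpendicular distance to the endpoint-to-endpoint
    line; falls back to the first point if there is no interior."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5

    def dist(p):
        x, y = p
        # Distance from point (x, y) to the line through the endpoints.
        return abs(dy * (x - x0) - dx * (y - y0)) / norm

    return max(points[1:-1] or points[:1], key=dist)

# Illustrative frontier: (perplexity, tokens/sec), sorted by perplexity.
frontier = [(9.5, 30.0), (9.8, 48.0), (14.1, 62.0)]
print(knee_point(frontier))  # -> (9.8, 48.0)
```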

Notebooks

  • notebooks/kaggle_gptq.ipynb — GPTQ quantization on Kaggle T4 (16GB VRAM)
  • notebooks/colab_tflite.ipynb — TFLite conversion on Google Colab

Hardware requirements

Task                     Minimum      Recommended
GGUF conversion          8GB RAM      16GB RAM
GGUF inference (7B Q4)   8GB RAM      16GB RAM + any GPU
GPTQ quantization (7B)   16GB VRAM    A100 40GB
TFLite conversion        CPU only     CPU only
Simulated benchmark      CPU only     CPU only

Known limitations

  • TFLite conversion not supported on Windows (use Colab notebook)
  • GPTQ requires 16GB+ VRAM (use Kaggle notebook for smaller GPUs)
  • Perplexity measured on a short fixed corpus — use --full-mmlu for task-based accuracy (slower)
  • TurboQuant (KV cache quantization) deferred to v2

License

Apache 2.0

Download files

Source distribution: auto_quant_tool-0.1.1.1.tar.gz (301.8 kB)
Built distribution: auto_quant_tool-0.1.1.1-py3-none-any.whl (31.7 kB, Python 3)

File hashes

auto_quant_tool-0.1.1.1.tar.gz (Source, 301.8 kB):

Algorithm    Hash digest
SHA256       f085cea991d2d9a09e0b6c1065888863b7905299a347649512caade6c53f38bb
MD5          da6a643fa575f527ba3d1db1ebc2b4bd
BLAKE2b-256  f7e707367a871cb7c18adf68294208af883b0d9174d5a46e75072044888672bd

auto_quant_tool-0.1.1.1-py3-none-any.whl (Python 3, 31.7 kB):

Algorithm    Hash digest
SHA256       2396f90cef1b5b46d470a79547c696217f4678310e153378c235dd4c6d4d01ec
MD5          7ca8dff36c685d42576087e0654572bc
BLAKE2b-256  61a69b9d4e84c085c52cfbe651f6351b65dde43ab95b64bbf9e1c9958a6375ea
