
Auto-Quant-Tool

Automated quantization benchmarking suite for GGUF, GPTQ, and TFLite models. Pulls a model from Hugging Face, generates multiple quantized variants, benchmarks them on your hardware, and outputs a Pareto frontier showing the best accuracy-to-speed tradeoff.
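The Pareto step keeps only the variants that no other variant beats on both axes at once (lower perplexity and higher tokens/sec). A minimal sketch of that filter — the dict fields and sample numbers are illustrative, not the tool's actual result schema:

```python
# Pareto frontier over (perplexity, tokens/sec).
# A variant is dominated if some other variant is at least as good
# on both axes and strictly better on at least one.

def pareto_frontier(variants):
    """variants: list of dicts with 'name', 'ppl' (lower is better),
    'tok_s' (higher is better). Returns non-dominated variants,
    sorted by perplexity."""
    frontier = []
    for v in variants:
        dominated = any(
            (o["ppl"] <= v["ppl"] and o["tok_s"] >= v["tok_s"])
            and (o["ppl"] < v["ppl"] or o["tok_s"] > v["tok_s"])
            for o in variants
        )
        if not dominated:
            frontier.append(v)
    return sorted(frontier, key=lambda v: v["ppl"])

# Illustrative numbers only: Q5_0 is dominated by Q4_K_M here
# (worse perplexity AND slower), so it drops off the frontier.
runs = [
    {"name": "Q2_K",   "ppl": 14.1, "tok_s": 62.0},
    {"name": "Q4_K_M", "ppl": 9.8,  "tok_s": 48.0},
    {"name": "Q5_0",   "ppl": 9.9,  "tok_s": 39.0},
    {"name": "Q8_0",   "ppl": 9.5,  "tok_s": 30.0},
]
print([v["name"] for v in pareto_frontier(runs)])
# -> ['Q8_0', 'Q4_K_M', 'Q2_K']
```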

Supported formats

  • GGUF (Q2 through Q8) — for llama.cpp / Ollama local inference
  • GPTQ (INT4, INT8) — for GPU inference via gptqmodel
  • TFLite (FP32, FP16, INT8) — for mobile deployment

Quick start

1. Clone the repo

git clone --recurse-submodules https://github.com/YOUR_USERNAME/auto-quant-tool.git
cd auto-quant-tool

2. Base install (all platforms)

uv sync

3. Hardware backend (run once, auto-detects your system)

python setup/install_backends.py
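What "auto-detects" actually probes isn't documented here; as a rough illustration of the kind of checks such a script typically makes (this is not the real install_backends.py logic):

```python
import platform
import shutil

def detect_backend():
    """Guess the best inference backend for this machine.
    Priority: CUDA GPU -> Apple Metal -> CPU fallback."""
    if shutil.which("nvidia-smi"):  # NVIDIA driver tools on PATH
        return "cuda"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "metal"              # Apple Silicon Mac
    return "cpu"

print(detect_backend())
```

The `--backend` flags shown in the per-platform sections below let you skip this detection and force a choice.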

4. Launch the web UI

uv run python -m auto_quant_tool.cli ui

Then open http://localhost:7860 in your browser.

5. Or run via CLI

uv run python -m auto_quant_tool.cli run --config sample_llm.yaml

Installation by platform

Windows + NVIDIA GPU

uv sync
python setup/install_backends.py --backend cuda

Requires Visual C++ Build Tools for llama.cpp compilation. Download: https://visualstudio.microsoft.com/visual-cpp-build-tools/

GPTQ quantization requires a GPU with 16GB+ VRAM. For systems with less VRAM, use the Kaggle notebook: notebooks/kaggle_gptq.ipynb

TFLite conversion is not supported on Windows. Use the Colab notebook instead: notebooks/colab_tflite.ipynb

macOS (Apple Silicon)

uv sync
python setup/install_backends.py --backend metal

Linux + NVIDIA GPU

uv sync
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

Note: recent llama-cpp-python releases renamed this CMake flag to -DGGML_CUDA=on; use whichever matches your installed version.

CPU only (any OS)

uv sync
python setup/install_backends.py --backend cpu

Configuration

Copy and edit a sample config:

cp sample_llm.yaml my_model.yaml

model:
  source: huggingface       # or local
  id: Qwen/Qwen2-0.5B
  modality: llm             # llm | vision | audio

quantize:
  formats: [gguf, gptq]
  gguf_levels: [Q2_K, Q4_K_M, Q5_0, Q8_0]
  gptq_levels: [int4]

benchmark:
  metrics: [perplexity, tok_s]
  full_mmlu: false
  soc_target: snapdragon_8_gen_3    # for TFLite sim benchmark
  dataset:
    name: wikitext
    split: test
    source: hf_datasets
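Before kicking off a long run it can be worth sanity-checking an edited config. A minimal loader sketch using PyYAML, with the required-section list assumed from the sample above (this is not the tool's real validator):

```python
import yaml

SAMPLE = """
model:
  source: huggingface
  id: Qwen/Qwen2-0.5B
  modality: llm
quantize:
  formats: [gguf, gptq]
  gguf_levels: [Q2_K, Q4_K_M]
benchmark:
  metrics: [perplexity, tok_s]
"""

REQUIRED = {"model", "quantize", "benchmark"}  # assumed top-level sections

def validate(cfg):
    """Fail fast if a top-level section is missing."""
    missing = REQUIRED - cfg.keys()
    if missing:
        raise ValueError(f"config missing sections: {sorted(missing)}")
    return cfg

cfg = validate(yaml.safe_load(SAMPLE))
print(cfg["model"]["id"])  # Qwen/Qwen2-0.5B
```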

Output structure

outputs/
├── models/          # cached HF model weights
├── gguf/            # GGUF quantized files per model
├── gptq/            # GPTQ quantized files per model
├── tflite/          # TFLite converted files per model
├── results/         # benchmark CSVs, unified JSON, Pareto HTML/PNG
└── best_model/      # knee-point model files copied here
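best_model/ holds the "knee" of the Pareto curve. The exact selection rule isn't documented; one common heuristic, sketched here purely as an illustration, picks the frontier point farthest from the straight line joining the curve's endpoints (in practice you would normalize the axes first, omitted for brevity):

```python
def knee_point(points):
    """points: list of (ppl, tok_s) pairs on the Pareto frontier,
    sorted by perplexity. Returns the interior point with the
    largest perpendicular distance to the endpoint-to-endpoint
    line; falls back to the first point if there is no interior."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = (dx * dx + dy * dy) ** 0.5

    def dist(p):
        x, y = p
        # Distance from point (x, y) to the line through the endpoints.
        return abs(dy * (x - x0) - dx * (y - y0)) / norm

    return max(points[1:-1] or points[:1], key=dist)

# Illustrative frontier: (perplexity, tokens/sec), sorted by perplexity.
frontier = [(9.5, 30.0), (9.8, 48.0), (14.1, 62.0)]
print(knee_point(frontier))  # -> (9.8, 48.0)
```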

Notebooks

  • notebooks/kaggle_gptq.ipynb — GPTQ quantization on Kaggle T4 (16GB VRAM)
  • notebooks/colab_tflite.ipynb — TFLite conversion on Google Colab

Hardware requirements

Task                     Minimum      Recommended
GGUF conversion          8GB RAM      16GB RAM
GGUF inference (7B Q4)   8GB RAM      16GB RAM + any GPU
GPTQ quantization (7B)   16GB VRAM    A100 40GB
TFLite conversion        CPU only     CPU only
Simulated benchmark      CPU only     CPU only

Known limitations

  • TFLite conversion not supported on Windows (use Colab notebook)
  • GPTQ requires 16GB+ VRAM (use Kaggle notebook for smaller GPUs)
  • Perplexity measured on a short fixed corpus — use --full-mmlu for task-based accuracy (slower)
  • TurboQuant (KV cache quantization) deferred to v2

License

Apache 2.0

Download files

Source distribution: auto_quant_tool-0.1.1.1.tar.gz (301.8 kB)
Built distribution: auto_quant_tool-0.1.1.1-py3-none-any.whl (31.7 kB, Python 3)

File hashes

auto_quant_tool-0.1.1.1.tar.gz (Source, 301.8 kB):

Algorithm    Hash digest
SHA256       f085cea991d2d9a09e0b6c1065888863b7905299a347649512caade6c53f38bb
MD5          da6a643fa575f527ba3d1db1ebc2b4bd
BLAKE2b-256  f7e707367a871cb7c18adf68294208af883b0d9174d5a46e75072044888672bd

auto_quant_tool-0.1.1.1-py3-none-any.whl (Python 3, 31.7 kB):

Algorithm    Hash digest
SHA256       2396f90cef1b5b46d470a79547c696217f4678310e153378c235dd4c6d4d01ec
MD5          7ca8dff36c685d42576087e0654572bc
BLAKE2b-256  61a69b9d4e84c085c52cfbe651f6351b65dde43ab95b64bbf9e1c9958a6375ea
