# Auto-Quant-Tool
Automated quantization benchmarking suite for GGUF, GPTQ, and TFLite models. Pulls a model from HuggingFace, generates multiple quantized variants, benchmarks them on your hardware, and outputs a Pareto frontier showing the best accuracy-to-speed tradeoff.
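The Pareto frontier mentioned above keeps only the variants that no other variant beats on both accuracy and speed at once. A minimal sketch of that selection rule — the variant names, numbers, and the `pareto_frontier` helper are illustrative, not the tool's actual API:

```python
def pareto_frontier(variants):
    """variants: list of (name, accuracy, tok_per_s) tuples.
    Returns the subset not dominated by any other variant, i.e. the
    points where no other variant is at least as good on both axes
    and strictly better on one."""
    frontier = []
    for name, acc, tps in variants:
        dominated = any(
            (a >= acc and t >= tps) and (a > acc or t > tps)
            for _, a, t in variants
        )
        if not dominated:
            frontier.append((name, acc, tps))
    return frontier

# Made-up benchmark results for one model's quantized variants.
variants = [
    ("Q2_K",   0.52, 95.0),  # fastest, least accurate
    ("Q4_K_M", 0.61, 70.0),  # balanced
    ("Q5_0",   0.62, 55.0),
    ("Q8_0",   0.64, 40.0),  # most accurate, slowest
    ("Q3_bad", 0.50, 60.0),  # dominated by Q4_K_M on both axes
]
print([name for name, _, _ in pareto_frontier(variants)])
# → ['Q2_K', 'Q4_K_M', 'Q5_0', 'Q8_0']
```

Every surviving point represents a genuine tradeoff: moving along the frontier trades speed for accuracy, while dominated points (like `Q3_bad` here) are strictly worse choices.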
## Supported formats
- GGUF (Q2 through Q8) — for llama.cpp / Ollama local inference
- GPTQ (INT4, INT8) — for GPU inference via gptqmodel
- TFLite (FP32, FP16, INT8) — for mobile deployment
## Quick start
1. Clone the repo:

   ```sh
   git clone --recurse-submodules https://github.com/YOUR_USERNAME/auto-quant-tool.git
   cd auto-quant-tool
   ```

2. Base install (all platforms):

   ```sh
   uv sync
   ```

3. Install the hardware backend (run once; auto-detects your system):

   ```sh
   python setup/install_backends.py
   ```

4. Launch the web UI:

   ```sh
   uv run python -m auto_quant_tool.cli ui
   ```

   Then open http://localhost:7860 in your browser.

5. Or run via the CLI:

   ```sh
   uv run python -m auto_quant_tool.cli run --config sample_llm.yaml
   ```
## Installation by platform
### Windows + NVIDIA GPU

```sh
uv sync
python setup/install_backends.py --backend cuda
```

Requires the Visual C++ Build Tools to compile llama.cpp. Download: https://visualstudio.microsoft.com/visual-cpp-build-tools/

GPTQ quantization requires a GPU with 16GB+ VRAM. For systems with less VRAM, use the Kaggle notebook: `notebooks/kaggle_gptq.ipynb`.

TFLite conversion is not supported on Windows; use the Colab notebook instead: `notebooks/colab_tflite.ipynb`.
### macOS (Apple Silicon)

```sh
uv sync
python setup/install_backends.py --backend metal
```
### Linux + NVIDIA GPU

```sh
uv sync
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```
### CPU only (any OS)

```sh
uv sync
python setup/install_backends.py --backend cpu
```
## Configuration

Copy and edit a sample config:

```sh
cp sample_llm.yaml my_model.yaml
```

```yaml
model:
  source: huggingface  # or local
  id: Qwen/Qwen2-0.5B
  modality: llm        # llm | vision | audio
quantize:
  formats: [gguf, gptq]
  gguf_levels: [Q2_K, Q4_K_M, Q5_0, Q8_0]
  gptq_levels: [int4]
benchmark:
  metrics: [perplexity, tok_s]
  full_mmlu: false
  soc_target: snapdragon_8_gen_3  # for TFLite sim benchmark
dataset:
  name: wikitext
  split: test
  source: hf_datasets
```
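After parsing a config like the one above (e.g. with `yaml.safe_load`), a loader might run sanity checks along these lines. The `validate` helper and its rules are illustrative assumptions, not the tool's actual code; the dict mirrors the sample config's structure:

```python
# Parsed form of the sample config above (what yaml.safe_load would return).
cfg = {
    "model": {"source": "huggingface", "id": "Qwen/Qwen2-0.5B", "modality": "llm"},
    "quantize": {"formats": ["gguf", "gptq"],
                 "gguf_levels": ["Q2_K", "Q4_K_M", "Q5_0", "Q8_0"],
                 "gptq_levels": ["int4"]},
    "benchmark": {"metrics": ["perplexity", "tok_s"], "full_mmlu": False},
}

def validate(cfg):
    """Return a list of human-readable problems; empty means the config passed."""
    errors = []
    if cfg["model"].get("source") not in ("huggingface", "local"):
        errors.append("model.source must be 'huggingface' or 'local'")
    if cfg["model"].get("modality") not in ("llm", "vision", "audio"):
        errors.append("model.modality must be llm | vision | audio")
    unknown = set(cfg["quantize"].get("formats", [])) - {"gguf", "gptq", "tflite"}
    if unknown:
        errors.append(f"unknown quantize formats: {sorted(unknown)}")
    return errors

print(validate(cfg))  # → [] (config passed all checks)
```

Collecting all errors before reporting, rather than failing on the first one, lets the user fix a bad config in one pass.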
## Output structure

```
outputs/
├── models/      # cached HF model weights
├── gguf/        # GGUF quantized files per model
├── gptq/        # GPTQ quantized files per model
├── tflite/      # TFLite converted files per model
├── results/     # benchmark CSVs, unified JSON, Pareto HTML/PNG
└── best_model/  # knee-point model files copied here
```
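The "knee point" copied into `best_model/` is the frontier variant where trading more speed buys the least extra accuracy. One common heuristic (which may differ from the tool's actual rule) normalizes both axes to [0, 1] and picks the point farthest from the straight line joining the frontier's two extremes; the numbers below are made up:

```python
import math

def knee_point(points):
    """points: list of (name, accuracy, tok_per_s), sorted by accuracy
    and assumed Pareto-optimal. Returns the name of the knee variant."""
    accs = [a for _, a, _ in points]
    tpss = [t for _, _, t in points]

    def norm(v, lo, hi):
        return (v - lo) / (hi - lo) if hi > lo else 0.0

    # Normalize both axes so neither metric's scale dominates the distance.
    nd = [(name, norm(a, min(accs), max(accs)), norm(t, min(tpss), max(tpss)))
          for name, a, t in points]
    x1, y1 = nd[0][1], nd[0][2]
    x2, y2 = nd[-1][1], nd[-1][2]
    chord = math.hypot(x2 - x1, y2 - y1)

    def dist(x, y):
        # Perpendicular distance from (x, y) to the chord between extremes.
        return abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / chord

    return max(nd, key=lambda p: dist(p[1], p[2]))[0]

frontier = [("Q2_K", 0.52, 95.0), ("Q4_K_M", 0.61, 70.0),
            ("Q5_0", 0.62, 55.0), ("Q8_0", 0.64, 40.0)]
print(knee_point(frontier))  # → Q4_K_M
```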
## Notebooks

- `notebooks/kaggle_gptq.ipynb` — GPTQ quantization on Kaggle T4 (16GB VRAM)
- `notebooks/colab_tflite.ipynb` — TFLite conversion on Google Colab
## Hardware requirements
| Task | Minimum | Recommended |
|---|---|---|
| GGUF conversion | 8GB RAM | 16GB RAM |
| GGUF inference (7B Q4) | 8GB RAM | 16GB RAM + any GPU |
| GPTQ quantization (7B) | 16GB VRAM | A100 40GB |
| TFLite conversion | CPU only | CPU only |
| Simulated benchmark | CPU only | CPU only |
## Known limitations

- TFLite conversion is not supported on Windows (use the Colab notebook)
- GPTQ requires 16GB+ VRAM (use the Kaggle notebook for smaller GPUs)
- Perplexity is measured on a short fixed corpus — use `--full-mmlu` for task-based accuracy (slower)
- TurboQuant (KV cache quantization) is deferred to v2
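For context on the perplexity caveat above: perplexity is the exponential of the mean negative log-likelihood per token, so estimating it on a short corpus averages over few tokens and is correspondingly noisy. A minimal sketch of the definition, with made-up log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities the model assigned to
    each reference token. Lower perplexity means a better fit."""
    nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-likelihood
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is, on average, as uncertain as a uniform choice among 4 tokens.
logprobs = [math.log(0.25)] * 10
print(round(perplexity(logprobs), 2))  # → 4.0
```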
## License
Apache 2.0