TurboQuant — model quantization and optimization toolkit for edge and resource-constrained deployment.
INT4 · INT8 · FP16 · GPTQ · AWQ · bitsandbytes · Structured pruning · ONNX & TensorRT export
Why TurboQuant?
Modern open-source models are powerful but expensive to serve. Shipping a 7B-parameter LLM in FP16 demands ~14 GB of VRAM for the weights alone (7 billion parameters × 2 bytes each); a vision transformer that fits comfortably on a workstation may run out of memory on a Jetson Orin or a phone. TurboQuant gives you a single, consistent interface to compress, quantize, prune, export, and benchmark models, so you can ship them on the hardware you actually have.
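The ~14 GB figure follows from simple arithmetic. The sketch below (the helper name `weight_footprint_gb` is illustrative, not part of TurboQuant) estimates weight storage only; it ignores activations, the KV cache, and quantization metadata such as scales and zero points.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Same 7B-parameter model at three weight precisions.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"7B @ {name}: {weight_footprint_gb(7e9, bits):.1f} GB")
```

Footprints like these are what the quantization methods in TurboQuant aim to reduce.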
It is built around three principles:
- One API, many backends. Wrap bitsandbytes, auto-gptq, autoawq, native PyTorch quantization, and ONNX/TensorRT export behind a uniform quantize(model, method=...) interface.
- Reproducible benchmarks. Latency, peak memory, model size, and task accuracy (perplexity, classification top-1, etc.) are first-class citizens: every example ships with a comparable benchmark.
- No magic. Each technique is implemented as a small, readable module so it doubles as a reference for how the methods work.
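One common way to put many backends behind a single `quantize(model, method=...)` call is a small registry that maps method names to functions. The sketch below is a hypothetical illustration of that pattern, not TurboQuant's actual internals: the `register` decorator, the `_BACKENDS` dict, and the stand-in backends (which only tag a string) are all assumptions.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: method name -> backend function.
_BACKENDS: Dict[str, Callable[..., Any]] = {}


def register(method: str):
    """Decorator that files a backend function under a method name."""
    def wrap(fn):
        _BACKENDS[method] = fn
        return fn
    return wrap


@register("fp16")
def _to_fp16(model, **kwargs):
    return f"{model}@fp16"  # stand-in for a real FP16 cast


@register("int8-dynamic")
def _int8_dynamic(model, **kwargs):
    return f"{model}@int8-dynamic"  # stand-in for real INT8 dynamic PTQ


def quantize(model, method: str, **kwargs):
    """Look up the backend for `method` and apply it to `model`."""
    try:
        backend = _BACKENDS[method]
    except KeyError:
        raise ValueError(
            f"unknown method {method!r}; available: {sorted(_BACKENDS)}"
        ) from None
    return backend(model, **kwargs)


print(quantize("my-model", method="fp16"))
```

With this layout, adding a backend means adding one decorated function, and an unknown method name fails with a message that lists the valid choices.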
Features
| Category | Techniques |
|---|---|
| Weight quantization | INT8 dynamic & static PTQ, FP16/BF16 casting, INT4 (bitsandbytes NF4 / FP4), GPTQ, AWQ |
| Pruning | Magnitude (unstructured), L1 structured (channel/filter), N:M sparsity helpers |
| Export | ONNX (with onnxslim graph optimization), TensorRT engine builder, ORT quantization |
| Calibration | Per-tensor & per-channel, MinMax / Entropy / Percentile observers |
| Benchmark | Latency (warmup + median + p95), peak GPU/CPU memory, throughput, model size, perplexity, top-k accuracy |
| CLI | turboquant quantize, turboquant prune, turboquant export, turboquant bench |
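To make the Calibration row concrete, here is a minimal sketch of a symmetric MinMax observer and the difference between per-tensor and per-channel scales. It uses plain Python lists for readability; the function names are illustrative and are not claimed to match TurboQuant's observers.

```python
def minmax_scale(values, n_bits=8):
    """Symmetric MinMax observer: choose a scale so max |x| maps to qmax."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_row(values, scale, n_bits=8):
    """Round to the nearest integer code and clamp to [-qmax, qmax]."""
    qmax = 2 ** (n_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


weights = [
    [0.02, -0.01, 0.03],  # small-magnitude output channel
    [1.50, -2.00, 0.75],  # large-magnitude output channel
]

# Per-tensor: a single scale, dominated by the largest value anywhere.
per_tensor = minmax_scale([v for row in weights for v in row])
# Per-channel: one scale per output channel (row).
per_channel = [minmax_scale(row) for row in weights]

print(quantize_row(weights[0], per_tensor))      # few distinct codes
print(quantize_row(weights[0], per_channel[0]))  # uses the full INT8 range
```

With the per-tensor scale, the small channel collapses onto a handful of codes, while the per-channel scale spreads it across the full range. That is the usual argument for per-channel weight quantization.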
Installation
The PyPI package is named turboquant-ml (the unsuffixed turboquant
name was taken by an unrelated project). The Python import and CLI are still
just turboquant / tq:
# Core install
pip install turboquant-ml
# With ONNX export
pip install "turboquant-ml[onnx]"
# Full LLM compression stack (GPTQ + AWQ + bitsandbytes)
pip install "turboquant-ml[gptq,awq,bnb,eval]"
# Everything
pip install "turboquant-ml[all]"
import turboquant # import name unchanged
from turboquant import quantize # same API
Note: bitsandbytes, auto-gptq, autoawq, and tensorrt are heavy native dependencies. They are deliberately optional; TurboQuant degrades gracefully when they are missing.
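One way a library can degrade gracefully is to probe for optional packages without importing them, and raise a clear error only when a missing backend is actually requested. The sketch below assumes that approach; the `OPTIONAL_BACKENDS` mapping and both helper names are hypothetical, and TurboQuant's own mechanism may differ.

```python
from importlib.util import find_spec

# Backend key -> import name of the optional package (illustrative mapping).
OPTIONAL_BACKENDS = {
    "bnb": "bitsandbytes",
    "gptq": "auto_gptq",
    "awq": "awq",
    "tensorrt": "tensorrt",
}


def available_backends(candidates=OPTIONAL_BACKENDS):
    """Return backend keys whose package is installed, without importing it."""
    return sorted(k for k, mod in candidates.items() if find_spec(mod) is not None)


def require_backend(key, candidates=OPTIONAL_BACKENDS):
    """Raise a descriptive ImportError if the backend's package is missing."""
    if key not in available_backends(candidates):
        package = candidates.get(key, key)
        raise ImportError(
            f"backend {key!r} needs the optional package {package!r}; "
            "install the matching extra to enable it"
        )


print("installed optional backends:", available_backends())
```

Checking with `find_spec` keeps `import turboquant`-style startup cheap, because the heavy native packages are only loaded when a method that needs them is used.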
Quick start
Python API
from turboquant import quantize, benchmark
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
# One-line INT4 weight-only quantization via bitsandbytes
qmodel = quantize(model, method="bnb-nf4")
# Benchmark side-by-side
report = benchmark.compare(
    baseline=model,
    candidate=qmodel,
    tokenizer=tok,
    prompts=["Explain quantization in one sentence."],
    metrics=["latency", "memory", "size", "perplexity"],
)
print(report.as_table())
CLI
# Quantize a HuggingFace model to INT4 with GPTQ + W4A16
tq quantize meta-llama/Llama-3.2-1B \
    --method gptq \
    --bits 4 \
    --group-size 128 \
    --calib-dataset wikitext \
    --out ./outputs/llama-3.2-1b-gptq
# Structured prune a vision model and re-evaluate
tq prune microsoft/resnet-50 \
    --strategy l1-channel \
    --sparsity 0.30 \
    --eval imagenet-val \
    --out ./outputs/resnet50-pruned
# Export to ONNX with INT8 dynamic quantization
tq export ./outputs/resnet50-pruned \
    --format onnx \
    --quant int8-dynamic \
    --opset 17
# Benchmark FP16 vs INT8 vs INT4 on a model
tq bench meta-llama/Llama-3.2-1B --methods fp16,int8-dynamic,bnb-nf4 \
    --report ./benchmarks/results/llama32-1b.json
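To show what `--bits` and `--group-size` control in the GPTQ command, the sketch below quantizes a weight row group by group, giving each group its own scale and offset. It uses plain round-to-nearest, not GPTQ's error-compensating algorithm, and the function names are illustrative. A group size of 4 keeps the example small; the command above uses 128.

```python
def quantize_groupwise(weights, bits=4, group_size=4):
    """Asymmetric round-to-nearest with one (scale, offset) pair per group."""
    qmax = 2 ** bits - 1  # 15 for 4-bit codes
    codes, params = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        codes.extend(min(qmax, max(0, round((w - lo) / scale))) for w in group)
        params.append((scale, lo))
    return codes, params


def dequantize_groupwise(codes, params, group_size=4):
    """Map integer codes back to floats using each group's scale and offset."""
    return [
        params[i // group_size][1] + c * params[i // group_size][0]
        for i, c in enumerate(codes)
    ]


# First group has a narrow range, second a wide one.
row = [0.10, 0.12, 0.08, 0.11, 2.0, -1.5, 0.5, 1.0]
codes, params = quantize_groupwise(row, bits=4, group_size=4)
restored = dequantize_groupwise(codes, params, group_size=4)

for w, r in zip(row, restored):
    print(f"{w:+.3f} -> {r:+.3f}")
```

Smaller groups track local weight ranges more closely, at the cost of storing more scale and offset metadata per weight.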
Supported methods at a glance
| Method | Bits | Backend | Calibration | Typical use case |
|---|---|---|---|---|
| fp16 / bf16 | 16 | PyTorch | none | Fast, lossless-ish baseline |
| int8-dynamic | 8 | PyTorch | none | CPU inference, transformers |
| int8-static | 8 | PyTorch | required | CNNs, edge CPUs |
| bnb-int8 | 8 | bitsandbytes | none | LLM training & serving on GPU |
| bnb-nf4 / bnb-fp4 | 4 | bitsandbytes | none | LLM inference, QLoRA |
| gptq | 2–8 | auto-gptq | required | LLM weight-only, best accuracy/bit |
| awq | 4 | autoawq | required | LLM weight-only, fast inference |
Reference benchmarks
SmolLM2-135M on CPU (real measured numbers)
python benchmarks/scripts/sweep_cpu.py --model-id HuggingFaceTB/SmolLM2-135M --methods fp32,fp16,bf16,int8-dynamic
| Method | Size (MB) | Forward latency (ms) | Generation throughput (tok/s) |
|---|---|---|---|
| FP32 (baseline) | 513.2 | 31.3 | 32.6 |
| FP16 | 256.7 | 57.2 | 47.5 |
| BF16 | 256.7 | 55.4 | 48.9 |
| INT8 dynamic | 236.6 | 30.7 | 30.0 |
Read this carefully — the result is realistic, not flattering:
- FP16/BF16 cut size in half, and generation throughput goes up ~45-50% (smaller KV cache wins), but the per-step forward pass is roughly 1.8× slower because consumer CPUs have no fast FP16 matmul kernel. On a Tensor-Core GPU these numbers flip.
- INT8 dynamic is the smallest (≈54% off) and matches FP32 forward latency, but generation throughput is slightly below FP32 here (30.0 vs 32.6 tok/s): the small hidden size of a 135M model limits how much INT8 GEMM kernels can help.
- The right baseline matters: comparing INT8 to a poorly quantizable reference (e.g. GPT-2, which uses transformers.Conv1D instead of nn.Linear) makes INT8 look bad. Always check what your method actually rewrites; tq methods plus print(model) will tell you.
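The ratios discussed in these notes can be recomputed directly from the table. The snippet below copies the measured values and expresses each method relative to the FP32 baseline; the helper name `relative_to_baseline` is illustrative.

```python
# Values copied from the SmolLM2-135M CPU table above.
results = {
    "fp32": {"size_mb": 513.2, "latency_ms": 31.3, "tok_s": 32.6},
    "fp16": {"size_mb": 256.7, "latency_ms": 57.2, "tok_s": 47.5},
    "bf16": {"size_mb": 256.7, "latency_ms": 55.4, "tok_s": 48.9},
    "int8-dynamic": {"size_mb": 236.6, "latency_ms": 30.7, "tok_s": 30.0},
}


def relative_to_baseline(row, base):
    """Return size saved (fraction), latency ratio, and throughput ratio."""
    return {
        "size_saved": 1 - row["size_mb"] / base["size_mb"],
        "latency_x": row["latency_ms"] / base["latency_ms"],
        "throughput_x": row["tok_s"] / base["tok_s"],
    }


base = results["fp32"]
for name, row in results.items():
    rel = relative_to_baseline(row, base)
    print(
        f"{name:13s} size -{rel['size_saved']:.1%}  "
        f"latency x{rel['latency_x']:.2f}  throughput x{rel['throughput_x']:.2f}"
    )
```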
Reproduce
pip install -e ".[viz]" truststore
python benchmarks/scripts/sweep_cpu.py \
--model-id HuggingFaceTB/SmolLM2-135M \
--methods fp32,fp16,bf16,int8-dynamic \
--out benchmarks/results/smollm2_135m.json \
--plot benchmarks/results/smollm2_135m.png
GPU sweeps (Llama-class models with GPTQ / AWQ / NF4) will land here once a CUDA runner is added to CI — contributions welcome.
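For readers who want to sanity-check latency numbers outside the sweep script, the following is a minimal sketch of a warmup + median + p95 measurement. It is an assumption about how such a harness could look, not the code in `sweep_cpu.py`; `measure_latency` and its parameters are hypothetical. On a GPU, asynchronous kernels would also need an explicit synchronization (for example `torch.cuda.synchronize()`) before reading the clock.

```python
import math
import statistics
import time


def measure_latency(fn, warmup=3, iters=20):
    """Time `fn` after warmup runs; report median and nearest-rank p95 in ms."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization; not timed
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    samples_ms.sort()
    p95_index = max(0, math.ceil(0.95 * len(samples_ms)) - 1)
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }


# Stand-in workload; replace with a model forward pass.
stats = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(f"median {stats['median_ms']:.3f} ms, p95 {stats['p95_ms']:.3f} ms")
```

Reporting the median alongside p95 keeps a single slow outlier from distorting the headline number while still exposing tail latency.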
Architecture
turboquant/
├── quantization/ # Algorithms: int8, fp16, gptq, awq, bnb, observers
├── pruning/ # Magnitude + structured (L1, L2, taylor) + N:M
├── export/ # ONNX, TensorRT, ORT quantization
├── benchmark/ # Latency, memory, perplexity, classification, plot
├── calibration/ # Datasets, dataloaders, observer fitting
├── models/ # Convenience loaders + registry
└── cli.py # Typer-based CLI
Each algorithm lives in a single, readable file with a quantize_* / prune_* function and a short docstring referencing the original paper.
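As an illustration of that layout, a `prune_*` function for unstructured magnitude pruning could be as small as the sketch below. It operates on a flat Python list rather than a tensor, and the name and signature are hypothetical, not copied from the TurboQuant source.

```python
def prune_magnitude(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights (unstructured pruning).

    `sparsity` is the fraction of entries to zero, in [0, 1]. A real module
    would reference its source paper here.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


print(prune_magnitude([0.5, -0.05, 0.3, 0.01, -0.9, 0.2], sparsity=0.5))
```

At 50% sparsity the three smallest-magnitude entries are zeroed and the remaining weights are left untouched.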
Roadmap
- INT8 dynamic & static PTQ (PyTorch native)
- FP16/BF16 casting
- bitsandbytes INT8 / NF4 / FP4 wrappers
- GPTQ & AWQ integration
- L1 structured & magnitude pruning
- ONNX export with onnxslim
- Latency / memory / perplexity benchmarks
- TensorRT INT8 calibration cache
- SmoothQuant W8A8
- HQQ (Half-Quadratic Quantization)
- Distillation-aware quantization
- Mobile export (CoreML / TFLite)
- Web dashboard for benchmark comparison
Citing & related work
TurboQuant stands on the shoulders of giants. If you use it in research, please also cite the underlying algorithms:
- GPTQ — Frantar et al., 2023 (arXiv:2210.17323)
- AWQ — Lin et al., 2023 (arXiv:2306.00978)
- LLM.int8() / QLoRA — Dettmers et al., 2022 / 2023 (arXiv:2208.07339, 2305.14314)
- SmoothQuant — Xiao et al., 2022 (arXiv:2211.10438)
Contributing
Contributions are very welcome — see CONTRIBUTING.md. Good first issues are tagged on the issue tracker.
git clone https://github.com/Ademo93/turboquant
cd turboquant
pip install -e ".[dev,all]"
pre-commit install
pytest
License
MIT — do whatever you like, just keep the copyright notice.