TurboQuant — model quantization and optimization toolkit for edge and resource-constrained deployment.

INT4 · INT8 · FP16 · GPTQ · AWQ · bitsandbytes · Structured pruning · ONNX & TensorRT export

Why TurboQuant?

Modern open-source models are powerful but expensive to serve. Shipping a 7B-parameter LLM in FP16 takes ~14 GB of VRAM for the weights alone (7 billion parameters × 2 bytes each); a vision transformer that fits comfortably on a workstation may not fit in memory on a Jetson Orin or a phone. TurboQuant gives you a single, consistent interface to compress, quantize, prune, export, and benchmark models, so you can ship them on the hardware you actually have.
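The ~14 GB figure is plain arithmetic: parameter count times bytes per parameter. A minimal sketch of that estimate follows; the helper name `weight_memory_gb` is hypothetical (it is not part of TurboQuant), and it counts weights only, ignoring activations, the KV cache, and quantization metadata such as scales.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Weight footprint of a 7B-parameter model at several precisions.
for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(7e9, bits):.1f} GB")
```

Real 4-bit checkpoints are somewhat larger than this estimate because group-wise scales and zero points are stored alongside the weights.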

It is built around three principles:

One API, many backends. TurboQuant wraps bitsandbytes, auto-gptq, autoawq, native PyTorch quantization, and ONNX/TensorRT export behind a uniform quantize(model, method=...) interface.
Reproducible benchmarks. Latency, peak memory, model size, and task accuracy (perplexity, classification top-1, etc.) are first-class citizens — every example ships with a comparable benchmark.
No magic. Each technique is implemented as a small, readable module so it doubles as a reference for how the methods work.
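To make the "one API, many backends" idea concrete, here is a minimal sketch of a name-to-function registry behind a single `quantize(model, method=...)` entry point. It is an illustration under stated assumptions, not TurboQuant's actual implementation: the `_BACKENDS` dict, the `register` decorator, and the placeholder backends are all hypothetical.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: maps a method name to the function implementing it.
_BACKENDS: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that records a backend function under a method name."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        _BACKENDS[name] = fn
        return fn
    return decorator


@register("fp16")
def _to_fp16(model: Any, **kwargs: Any) -> Any:
    # Placeholder: a real backend would cast the weights to half precision.
    return ("fp16", model)


@register("int8-dynamic")
def _to_int8_dynamic(model: Any, **kwargs: Any) -> Any:
    # Placeholder: a real backend would apply dynamic INT8 quantization.
    return ("int8-dynamic", model)


def quantize(model: Any, method: str, **kwargs: Any) -> Any:
    """Dispatch to whichever backend is registered for `method`."""
    try:
        backend = _BACKENDS[method]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(f"Unknown method {method!r}; available: {known}") from None
    return backend(model, **kwargs)


print(quantize("my-model", method="fp16"))
```

With this shape, adding a backend means adding one decorated function, and callers never import backend-specific modules directly.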

Features

| Category | Techniques |
| --- | --- |
| Weight quantization | INT8 dynamic & static PTQ, FP16/BF16 casting, 4-bit (bitsandbytes NF4 / FP4), GPTQ, AWQ |
| Pruning | Magnitude (unstructured), L1 structured (channel/filter), N:M sparsity helpers |
| Export | ONNX (with `onnxslim` graph optimization), TensorRT engine builder, ORT quantization |
| Calibration | Per-tensor & per-channel, MinMax / Entropy / Percentile observers |
| Benchmark | Latency (warmup + median + p95), peak GPU/CPU memory, throughput, model size, perplexity, top-k accuracy |
| CLI | `turboquant quantize`, `turboquant prune`, `turboquant export`, `turboquant bench` |
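The difference between per-tensor and per-channel calibration in the table above is easiest to see on a tiny example. The sketch below implements a symmetric INT8 MinMax observer in plain Python; the function names are hypothetical and this is not TurboQuant's code. It compares the round-trip error on a small-range row when the scale is shared across the whole tensor versus fitted to that row alone.

```python
from typing import List

QMAX = 127  # largest magnitude in symmetric signed 8-bit quantization


def minmax_scale(values: List[float]) -> float:
    """MinMax observer: choose the scale so the largest |value| maps to QMAX."""
    peak = max(abs(v) for v in values)
    return peak / QMAX if peak > 0 else 1.0


def fake_quantize(values: List[float], scale: float) -> List[float]:
    """Quantize to INT8 with the given scale, then dequantize back to float."""
    quantized = [max(-QMAX, min(QMAX, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


def max_abs_error(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


# Two output channels with very different value ranges.
weights = [[0.02, -0.01, 0.03], [4.0, -2.0, 1.0]]
small_row = weights[0]

# Per-tensor: one scale for everything, dominated by the large-range row.
tensor_scale = minmax_scale([v for row in weights for v in row])
err_per_tensor = max_abs_error(small_row, fake_quantize(small_row, tensor_scale))

# Per-channel: the scale is fitted to this row only.
row_scale = minmax_scale(small_row)
err_per_channel = max_abs_error(small_row, fake_quantize(small_row, row_scale))

print(f"per-tensor error on small row : {err_per_tensor:.6f}")
print(f"per-channel error on small row: {err_per_channel:.6f}")
```

The per-channel error is much smaller because the small row is no longer forced onto a grid sized for the largest weight in the tensor. Entropy and percentile observers change only how the clipping range is chosen, not the quantize/dequantize step.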

Installation

The PyPI package is named turboquant-ml because the unsuffixed turboquant name was already taken by an unrelated project. The Python import name is still turboquant, and the CLI commands remain turboquant / tq:

# Core install
pip install turboquant-ml

# With ONNX export
pip install "turboquant-ml[onnx]"

# Full LLM compression stack (GPTQ + AWQ + bitsandbytes)
pip install "turboquant-ml[gptq,awq,bnb,eval]"

# Everything
pip install "turboquant-ml[all]"

import turboquant                  # import name unchanged
from turboquant import quantize    # same API

Note — bitsandbytes, auto-gptq, autoawq and tensorrt are heavy native dependencies. They are deliberately optional; TurboQuant degrades gracefully when they are missing.
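One common way to degrade gracefully is to probe for optional backends without importing them. The sketch below uses the standard library's `importlib.util.find_spec` for that. The method-to-module mapping and the helper names are assumptions made for illustration (in particular, the import names `auto_gptq` and `awq` are assumed for the auto-gptq and autoawq packages); TurboQuant's own mechanism may differ.

```python
import importlib.util
from typing import Dict

# Assumed mapping from a method name to the import name of its optional backend.
OPTIONAL_BACKENDS: Dict[str, str] = {
    "bnb-nf4": "bitsandbytes",
    "gptq": "auto_gptq",
    "awq": "awq",
    "tensorrt": "tensorrt",
}


def backend_available(module_name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_methods() -> Dict[str, bool]:
    """Report which optional methods could run in the current environment."""
    return {method: backend_available(mod) for method, mod in OPTIONAL_BACKENDS.items()}


for method, ok in available_methods().items():
    print(f"{method:>9}: {'available' if ok else 'missing (install the matching extra)'}")
```

Checking availability up front lets a tool raise a clear "install the extra" message at call time instead of failing with an ImportError at import time.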

Quick start

Python API

from turboquant import quantize, benchmark
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# One-line 4-bit (NF4) weight-only quantization via bitsandbytes
qmodel = quantize(model, method="bnb-nf4")

# Benchmark side-by-side
report = benchmark.compare(
    baseline=model,
    candidate=qmodel,
    tokenizer=tok,
    prompts=["Explain quantization in one sentence."],
    metrics=["latency", "memory", "size", "perplexity"],
)
print(report.as_table())
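The Features table describes the latency metric as "warmup + median + p95". A minimal standard-library sketch of that measurement pattern is shown below; `measure_latency` is a hypothetical helper, not the function that `benchmark.compare` uses internally.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(fn: Callable[[], object], warmup: int = 3, iters: int = 20) -> Dict[str, float]:
    """Time `fn` and return median and p95 latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup runs are discarded (caches, lazy initialization)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = statistics.quantiles(samples, n=20, method="inclusive")[-1]
    return {"median_ms": statistics.median(samples), "p95_ms": p95}


stats = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Median and p95 are less sensitive to one-off stalls than the mean. On a GPU, kernel launches are asynchronous, so a real benchmark would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.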

CLI

# Quantize a Hugging Face model to 4-bit weights with GPTQ (W4A16)
tq quantize meta-llama/Llama-3.2-1B \
    --method gptq \
    --bits 4 \
    --group-size 128 \
    --calib-dataset wikitext \
    --out ./outputs/llama-3.2-1b-gptq

# Structured prune a vision model and re-evaluate
tq prune microsoft/resnet-50 \
    --strategy l1-channel \
    --sparsity 0.30 \
    --eval imagenet-val \
    --out ./outputs/resnet50-pruned

# Export to ONNX with INT8 dynamic quantization
tq export ./outputs/resnet50-pruned \
    --format onnx \
    --quant int8-dynamic \
    --opset 17

# Benchmark FP16, INT8 dynamic, and 4-bit NF4 variants of one model
tq bench meta-llama/Llama-3.2-1B --methods fp16,int8-dynamic,bnb-nf4 \
    --report ./benchmarks/results/llama32-1b.json

Supported methods at a glance

| Method | Bits | Backend | Calibration | Typical use case |
| --- | --- | --- | --- | --- |
| `fp16` / `bf16` | 16 | PyTorch | none | Fast, lossless-ish baseline |
| `int8-dynamic` | 8 | PyTorch | none | CPU inference, transformers |
| `int8-static` | 8 | PyTorch | required | CNNs, edge CPUs |
| `bnb-int8` | 8 | bitsandbytes | none | LLM training & serving on GPU |
| `bnb-nf4` / `bnb-fp4` | 4 | bitsandbytes | none | LLM inference, QLoRA |
| `gptq` | 2–8 | auto-gptq | required | LLM weight-only, best accuracy/bit |
| `awq` | 4 | autoawq | required | LLM weight-only, fast inference |

Reference benchmarks

SmolLM2-135M on CPU (real measured numbers)

python benchmarks/scripts/sweep_cpu.py --model-id HuggingFaceTB/SmolLM2-135M --methods fp32,fp16,bf16,int8-dynamic

| Method | Size (MB) | Forward latency (ms) | Generation throughput (tok/s) |
| --- | --- | --- | --- |
| FP32 (baseline) | 513.2 | 31.3 | 32.6 |
| FP16 | 256.7 | 57.2 | 47.5 |
| BF16 | 256.7 | 55.4 | 48.9 |
| INT8 dynamic | 236.6 | 30.7 | 30.0 |

Read this carefully; the result is realistic, not flattering:

FP16/BF16 cut size in half, and generation throughput goes up by roughly 46 to 50% (smaller KV cache wins), but the per-step forward pass is about 1.8× slower because most consumer CPUs lack fast FP16/BF16 matmul kernels. On a Tensor Core GPU the forward-latency result typically flips in favor of FP16/BF16.
INT8 dynamic is the smallest (≈54% smaller than FP32) and matches FP32 forward latency, but generation throughput is slightly below FP32 here (30.0 vs 32.6 tok/s); the small hidden size of a 135M model limits how much INT8 GEMM kernels can help.
The right baseline matters: comparing INT8 to a poorly quantizable reference (e.g. GPT-2, whose projection layers are transformers.pytorch_utils.Conv1D rather than nn.Linear, so dynamic quantization that targets nn.Linear leaves them untouched) makes INT8 look bad. Always check which layers your method actually rewrites; tq methods plus print(model) will tell you.
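The relative figures in this discussion can be recomputed directly from the table. The sketch below copies the four measured rows and derives size savings, forward-latency ratio, and throughput change against the FP32 baseline; `relative_to` is a hypothetical helper written for this illustration.

```python
from typing import Dict, Tuple

# (size in MB, forward latency in ms, generation throughput in tok/s),
# copied from the SmolLM2-135M CPU table above.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "fp32": (513.2, 31.3, 32.6),
    "fp16": (256.7, 57.2, 47.5),
    "bf16": (256.7, 55.4, 48.9),
    "int8-dynamic": (236.6, 30.7, 30.0),
}


def relative_to(baseline: str, results: Dict[str, Tuple[float, float, float]]) -> Dict[str, Dict[str, float]]:
    """Express every row relative to the chosen baseline row."""
    base_size, base_latency, base_tput = results[baseline]
    return {
        name: {
            "size_saving": 1 - size / base_size,        # fraction of bytes saved
            "latency_ratio": latency / base_latency,    # >1 means slower forward pass
            "throughput_change": tput / base_tput - 1,  # >0 means more tokens per second
        }
        for name, (size, latency, tput) in results.items()
    }


for name, r in relative_to("fp32", RESULTS).items():
    print(
        f"{name:>12}: {r['size_saving']:.0%} smaller, "
        f"{r['latency_ratio']:.2f}x latency, "
        f"{r['throughput_change']:+.0%} throughput"
    )
```

Keeping the comparison in code makes it easy to swap the baseline, which is exactly the sensitivity the last point warns about.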

[Figure: SmolLM2 sweep]

Reproduce

pip install -e ".[viz]" truststore
python benchmarks/scripts/sweep_cpu.py \
    --model-id HuggingFaceTB/SmolLM2-135M \
    --methods fp32,fp16,bf16,int8-dynamic \
    --out benchmarks/results/smollm2_135m.json \
    --plot benchmarks/results/smollm2_135m.png

GPU sweeps (Llama-class models with GPTQ / AWQ / NF4) will land here once a CUDA runner is added to CI — contributions welcome.

Architecture

turboquant/
├── quantization/          # Algorithms: int8, fp16, gptq, awq, bnb, observers
├── pruning/               # Magnitude + structured (L1, L2, taylor) + N:M
├── export/                # ONNX, TensorRT, ORT quantization
├── benchmark/             # Latency, memory, perplexity, classification, plot
├── calibration/           # Datasets, dataloaders, observer fitting
├── models/                # Convenience loaders + registry
└── cli.py                 # Typer-based CLI

Each algorithm lives in a single, readable file with a quantize_* / prune_* function and a short docstring referencing the original paper.
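As an illustration of what a small, readable `prune_*` function could look like, here is a plain-Python sketch of N:M sparsity (keep the N largest-magnitude weights in every group of M). The name `prune_n_m` and its signature are hypothetical; the actual modules in the repository may be organized differently and operate on tensors rather than lists.

```python
from typing import List


def prune_n_m(weights: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude values in each group of m; zero the rest."""
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned: List[float] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n entries with the largest absolute value.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


print(prune_n_m([0.1, -0.9, 0.4, 0.05, 1.2, -0.3, 0.0, 0.7]))
# [0.0, -0.9, 0.4, 0.0, 1.2, 0.0, 0.0, 0.7]
```

With the default 2:4 pattern, exactly half of the weights in every group of four are zeroed, which is the layout that sparsity-aware hardware kernels expect.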

Roadmap

INT8 dynamic & static PTQ (PyTorch native)
FP16/BF16 casting
bitsandbytes INT8 / NF4 / FP4 wrappers
GPTQ & AWQ integration
L1 structured & magnitude pruning
ONNX export with onnxslim
Latency / memory / perplexity benchmarks
TensorRT INT8 calibration cache
SmoothQuant W8A8
HQQ (Half-Quadratic Quantization)
Distillation-aware quantization
Mobile export (CoreML / TFLite)
Web dashboard for benchmark comparison

Citing & related work

TurboQuant stands on the shoulders of giants. If you use it in research, please also cite the underlying algorithms:

GPTQ — Frantar et al., 2023 (arXiv:2210.17323)
AWQ — Lin et al., 2023 (arXiv:2306.00978)
LLM.int8() / QLoRA — Dettmers et al., 2022 / 2023 (arXiv:2208.07339, 2305.14314)
SmoothQuant — Xiao et al., 2022 (arXiv:2211.10438)

Contributing

Contributions are very welcome — see CONTRIBUTING.md. Good first issues are tagged on the issue tracker.

git clone https://github.com/Ademo93/turboquant
cd turboquant
pip install -e ".[dev,all]"
pre-commit install
pytest

License

MIT — do whatever you like, just keep the copyright notice.

Release history

0.1.0 (this version), Jun 18, 2026

Download files

Source Distribution

turboquant_ml-0.1.0.tar.gz (138.4 kB)

Uploaded Jun 18, 2026 Source

Built Distribution

turboquant_ml-0.1.0-py3-none-any.whl (34.6 kB)

Uploaded Jun 18, 2026 Python 3

File details

Details for the file turboquant_ml-0.1.0.tar.gz.

File metadata

Download URL: turboquant_ml-0.1.0.tar.gz
Upload date: Jun 18, 2026
Size: 138.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for turboquant_ml-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `cba93d26f7acb8e5b6319c6be1e538a3961e9cf925b0e7e41b8713efef58297a` |
| MD5 | `ff4cc2abb5d25b0a822e426f6269a9c3` |
| BLAKE2b-256 | `63e3dddb236317e6190821d1d31c7c80346bc8d825d3d646fb76dcc2a25c7fc4` |

File details

Details for the file turboquant_ml-0.1.0-py3-none-any.whl.

File metadata

Download URL: turboquant_ml-0.1.0-py3-none-any.whl
Upload date: Jun 18, 2026
Size: 34.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for turboquant_ml-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `d1a0b46661fa7a92d54ec8406eeca4b37cd2f45c52becf5b4869698fe2354696` |
| MD5 | `c62d10de2db796664c404d0c05e42f8a` |
| BLAKE2b-256 | `d0087212809d0914dadce4323d706a97a485c42d79a31674f4e4807af918abe5` |
