
quantize-rs (Python)

Fast, accurate neural network quantization for ONNX models. Powered by Rust.

Features

  • INT8/INT4 quantization with 4-8× compression
  • Activation-based calibration cuts the accuracy drop roughly 3× compared with weight-only quantization
  • Standard DequantizeLinear (QDQ-style) pattern for ONNX Runtime compatibility
  • Blazing fast — Rust implementation with Python bindings

Installation

pip install quantization-rs

Or build from source:

# Install Rust (if needed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install maturin
pip install maturin

# Build and install (run from a checkout of the project repository)
maturin develop --release --features python
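
Either way, a quick import confirms the Rust extension built correctly (a minimal check; it assumes only that the module imports as quantize_rs):

import quantize_rs

# ImportError here means the extension module did not build or install
print(callable(quantize_rs.quantize))  # True when the binding is available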

Quick Start

Basic Quantization

import quantize_rs

# Quantize to INT8
quantize_rs.quantize(
    input_path="model.onnx",
    output_path="model_int8.onnx",
    bits=8
)

# Quantize to INT4 (aggressive compression)
quantize_rs.quantize(
    input_path="model.onnx",
    output_path="model_int4.onnx",
    bits=4,
    per_channel=True  # Better accuracy for INT4
)

Activation-Based Calibration

For better accuracy, use real inference data:

import quantize_rs
import numpy as np

# Option 1: With calibration data
quantize_rs.quantize_with_calibration(
    input_path="resnet18.onnx",
    output_path="resnet18_int8.onnx",
    calibration_data="calibration_samples.npy",  # Shape: [N, C, H, W]
    method="minmax"
)

# Option 2: Auto-generate random samples
quantize_rs.quantize_with_calibration(
    input_path="resnet18.onnx",
    output_path="resnet18_int8.onnx",
    num_samples=100,
    sample_shape=[3, 224, 224],  # ImageNet shape
    method="percentile"
)

Model Info

import quantize_rs

info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")

API Reference

quantize()

Basic weight-based quantization.

Parameters:

  • input_path (str): Path to input ONNX model
  • output_path (str): Path to save quantized model
  • bits (int): Bit width — 4 or 8 (default: 8)
  • per_channel (bool): Per-channel quantization (default: False)

Returns: None

Example:

quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)

quantize_with_calibration()

Activation-based calibration quantization for better accuracy.

Parameters:

  • input_path (str): Path to input ONNX model
  • output_path (str): Path to save quantized model
  • calibration_data (str | None): Path to .npy calibration data, or None for random (default: None)
  • bits (int): Bit width — 4 or 8 (default: 8)
  • per_channel (bool): Per-channel quantization (default: False)
  • method (str): Calibration method — "minmax", "percentile", "entropy", "mse" (default: "minmax")
  • num_samples (int): Number of random samples if calibration_data is None (default: 100)
  • sample_shape (list[int] | None): Shape of random samples, auto-detected if None (default: None)

Returns: None

Example:

quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="samples.npy",
    method="minmax"
)

Calibration Methods:

  • minmax: Uses observed min/max values (fast, good baseline)
  • percentile: Clips at 99.9th percentile (reduces outlier impact)
  • entropy: Minimizes KL divergence (best for CNN activations)
  • mse: Minimizes mean squared error (best for Transformers)
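
As a rule of thumb, pick the method from the model family. A hedged sketch (the file names are placeholders; the parameters are exactly those documented above):

import quantize_rs

# CNNs: entropy calibration suits peaked activation histograms
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="cnn_samples.npy",
    method="entropy"
)

# Transformers: MSE calibration handles wider activation ranges
quantize_rs.quantize_with_calibration(
    "bert.onnx",
    "bert_int8.onnx",
    calibration_data="bert_samples.npy",
    method="mse"
)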

model_info()

Get model metadata.

Parameters:

  • input_path (str): Path to ONNX model

Returns: ModelInfo object with fields:

  • name (str): Model name
  • version (int): ONNX opset version
  • num_nodes (int): Number of computation nodes
  • inputs (list[str]): Input tensor names and shapes
  • outputs (list[str]): Output tensor names and shapes

Example:

info = quantize_rs.model_info("model.onnx")
print(f"{info.name}: {info.num_nodes} nodes")

Performance

Benchmarks on ResNet-18 (ImageNet):

Method               Accuracy   Compression   Speedup
FP32 (baseline)      69.76%     1.0×          1.0×
INT8 (weight-only)   69.52%     4.0×          2.8×
INT8 (calibrated)    69.68%     4.0×          2.8×
INT4 (calibrated)    68.94%     8.0×          3.2×

Activation-based calibration shrinks the accuracy drop roughly 3× relative to weight-only quantization (0.08% vs 0.24% below the FP32 baseline).
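
A sketch of how these accuracy numbers might be reproduced with ONNX Runtime; load_validation_batches() is a placeholder for your own labeled data loader, not part of this package:

import numpy as np
import onnxruntime as ort

def top1_accuracy(model_path, batches):
    # Top-1 accuracy over an iterable of (inputs, labels) numpy pairs
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    correct = total = 0
    for inputs, labels in batches:
        logits = session.run(None, {input_name: inputs})[0]
        correct += int((logits.argmax(axis=1) == labels).sum())
        total += len(labels)
    return correct / total

batches = list(load_validation_batches())  # placeholder: your ImageNet val loader
fp32 = top1_accuracy("resnet18.onnx", batches)
int8 = top1_accuracy("resnet18_int8.onnx", batches)
print(f"FP32 {fp32:.2%}  INT8 {int8:.2%}  drop {fp32 - int8:.2%}")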

Preparing Calibration Data

For best results, use ~100 representative samples from your validation set:

import numpy as np
import onnxruntime as ort
import quantize_rs

# Load the model to check the expected input shape
session = ort.InferenceSession("model.onnx")
print(session.get_inputs()[0].shape)  # e.g. [N, 3, 224, 224]

# Collect samples from your validation set
# (validation_dataset and preprocess are placeholders for your own pipeline)
samples = []
for img in validation_dataset[:100]:
    samples.append(preprocess(img))  # each sample shaped [C, H, W]

# Stack into [N, C, H, W] and save as float32
calibration_data = np.stack(samples).astype(np.float32)
np.save("calibration_samples.npy", calibration_data)

# Use in quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy"
)

Integration with ONNX Runtime

import onnxruntime as ort
import numpy as np

# Load quantized model
session = ort.InferenceSession("model_int8.onnx")

# Run inference (same API as FP32); a random tensor stands in for real input
input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # match your model's input shape
output = session.run(None, {input_name: dummy_input})

FAQ

Q: Which bit width should I use?
A: Start with INT8 for maximum compatibility. Use INT4 if you need aggressive compression and can tolerate 0.5-1% accuracy drop.

Q: Do I need calibration data?
A: Not required, but highly recommended. Random data gives 0.2-0.3% worse accuracy than real calibration samples.

Q: What's the speed improvement?
A: 2-3× faster inference on CPU, 3-5× on mobile/edge devices. GPU gains are smaller (1.5-2×).
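
A minimal way to check the speedup on your own hardware (the input shape is an assumption; adjust it to your model):

import time
import numpy as np
import onnxruntime as ort

def mean_latency_ms(model_path, runs=50):
    # Average single-batch CPU inference latency in milliseconds
    session = ort.InferenceSession(model_path)
    name = session.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
    session.run(None, {name: x})  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {name: x})
    return (time.perf_counter() - start) / runs * 1e3

print(f"FP32: {mean_latency_ms('model.onnx'):.1f} ms")
print(f"INT8: {mean_latency_ms('model_int8.onnx'):.1f} ms")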

Q: Will my model still run in ONNX Runtime?
A: Yes! We use the standard DequantizeLinear operator. Any ONNX Runtime version ≥1.10 supports it.
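
You can verify this by listing the operator types in the quantized graph (a quick sketch using the onnx package):

import onnx

model = onnx.load("model_int8.onnx")
ops = {node.op_type for node in model.graph.node}
print("DequantizeLinear" in ops)  # True for models produced by this tool
print(sorted(ops))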

Q: Can I quantize specific layers?
A: Currently quantizes all weights. Per-layer selection coming in v0.4.0.

Limitations

  • Input format: ONNX only (export PyTorch/TensorFlow models to ONNX first; see the sketch below)
  • Operator support: All standard ops supported; custom ops may fail
  • Opset version: Requires ONNX opset ≥13 (automatically upgraded if needed)
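
For PyTorch models, a minimal export sketch that also meets the opset requirement (torchvision's resnet18 stands in for your model):

import torch
import torchvision

# Any torch.nn.Module works; resnet18 is only an example
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    opset_version=13,  # satisfies the opset >= 13 requirement
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}}
)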

Contributing

Contributions welcome! Areas where we need help:

  • Testing - More model architectures and edge cases
  • Documentation - Tutorials, guides, examples
  • Performance - Optimization and profiling
  • Features - Dynamic quantization, mixed precision

License

MIT OR Apache-2.0
