
quantize-rs Python API

Python bindings for quantize-rs, a neural network quantization toolkit for ONNX models.

Installation

pip install quantization-rs

Build from source (requires Rust toolchain and maturin):

pip install maturin
maturin develop --release --features python

API reference

quantize(input_path, output_path, bits=8, per_channel=False)

Weight-based quantization. Loads the model, quantizes all weight tensors, and saves the result in ONNX QDQ format.

Parameters:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str` | required | Path to input ONNX model |
| `output_path` | `str` | required | Path to save quantized model |
| `bits` | `int` | `8` | Bit width: 4 or 8 |
| `per_channel` | `bool` | `False` | Use per-channel quantization (separate scale/zero-point per output channel) |

Example:

import quantize_rs

quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)
quantize_rs.quantize("model.onnx", "model_int4.onnx", bits=4, per_channel=True)
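Weight quantization of this kind boils down to choosing a scale and zero-point per tensor (or per output channel when `per_channel=True`) and rounding onto the integer grid. A minimal NumPy sketch of the round-trip, for intuition only (this is not the crate's actual implementation):

```python
import numpy as np

def affine_qparams(w, bits=8):
    """Per-tensor scale and zero-point for asymmetric quantization."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -128..127 for int8
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant tensors
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize_dequantize(w, bits=8):
    """Round-trip a tensor through the integer grid, as a QDQ pair would."""
    scale, zp = affine_qparams(w, bits)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale) + zp, qmin, qmax)
    return (q - zp) * scale

w = np.array([-1.0, -0.3, 0.0, 0.4, 1.2], dtype=np.float32)
err8 = np.max(np.abs(quantize_dequantize(w, bits=8) - w))
err4 = np.max(np.abs(quantize_dequantize(w, bits=4) - w))  # coarser grid, larger error
```

Per-channel mode repeats this computation with a separate (scale, zero-point) pair per output channel, which is why it usually recovers accuracy at 4 bits.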

quantize_with_calibration(input_path, output_path, ...)

Activation-based calibration quantization. Runs inference on calibration samples to determine optimal quantization ranges per layer, then quantizes using those ranges.

Parameters:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str` | required | Path to input ONNX model |
| `output_path` | `str` | required | Path to save quantized model |
| `calibration_data` | `str` or `None` | `None` | Path to a `.npy` file (shape `[N, ...]`), or `None` for random samples |
| `bits` | `int` | `8` | Bit width: 4 or 8 |
| `per_channel` | `bool` | `False` | Per-channel quantization |
| `method` | `str` | `"minmax"` | Calibration method (see below) |
| `num_samples` | `int` | `100` | Number of random samples when `calibration_data` is `None` |
| `sample_shape` | `list[int]` or `None` | `None` | Shape of random samples; auto-detected from the model if `None` |

Calibration methods:

| Method | Description |
| --- | --- |
| `"minmax"` | Uses the observed min/max of the activations |
| `"percentile"` | Clips at the 99.9th percentile to reduce outlier sensitivity |
| `"entropy"` | Selects the range minimizing KL divergence between the original and quantized distributions |
| `"mse"` | Selects the range minimizing mean squared error |
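The difference between the methods is how they pick a clipping range from the observed activations. A rough NumPy sketch of minmax versus percentile (illustrative only, not the crate's code):

```python
import numpy as np

def minmax_range(acts):
    # full observed range: a single outlier stretches the scale
    return float(acts.min()), float(acts.max())

def percentile_range(acts, pct=99.9):
    # clip the tails so extreme values don't dominate the range
    lo = float(np.percentile(acts, 100 - pct))
    hi = float(np.percentile(acts, pct))
    return lo, hi

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 10_000)
acts[0] = 50.0  # inject one outlier

mm = minmax_range(acts)       # dragged out to 50
pc = percentile_range(acts)   # stays near the bulk of the data
```

The `entropy` and `mse` methods go further and search over candidate ranges, scoring each by KL divergence or reconstruction error respectively.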

Example:

import quantize_rs

# With real calibration data
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

# With random samples (auto-detects input shape from model)
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    num_samples=100,
    sample_shape=[3, 224, 224],
    method="percentile"
)

model_info(input_path)

Returns metadata about an ONNX model.

Parameters:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str` | required | Path to ONNX model |

Returns: ModelInfo object with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `name` | `str` | Graph name |
| `version` | `int` | Model version |
| `num_nodes` | `int` | Number of computation nodes |
| `inputs` | `list[str]` | Input tensor names |
| `outputs` | `list[str]` | Output tensor names |

Example:

info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")

Preparing calibration data

For best results, use 50-200 representative samples from your validation or training set:

import numpy as np

# Collect preprocessed samples; validation_dataset and preprocess are
# placeholders for your own dataset and preprocessing pipeline
samples = []
for img in validation_dataset[:100]:
    preprocessed = preprocess(img)  # your preprocessing pipeline
    samples.append(preprocessed)

# Save as .npy (shape: [num_samples, channels, height, width])
calibration_data = np.stack(samples)
np.save("calibration_samples.npy", calibration_data)

# Use during quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

If you do not have calibration data, the function generates random samples. This is adequate for testing but will produce less accurate quantization than real data.
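To smoke-test the pipeline without a dataset, you can also write a random `.npy` file yourself and pass it as `calibration_data`. The NCHW shape below is an assumption; match your model's real input, and prefer 50-200 real samples in practice:

```python
import os
import tempfile
import numpy as np

# A handful of random NCHW samples (small here for brevity)
samples = np.random.rand(8, 3, 224, 224).astype(np.float32)

path = os.path.join(tempfile.gettempdir(), "random_calibration.npy")
np.save(path, samples)

loaded = np.load(path)  # shape (8, 3, 224, 224), dtype float32
```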

ONNX Runtime integration

Quantized models use the standard DequantizeLinear operator and load directly in ONNX Runtime:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_int8.onnx")
inp = session.get_inputs()[0]
# dynamic dimensions appear as strings in inp.shape; substitute 1 for a dummy run,
# or feed your real preprocessed input instead
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
output = session.run(None, {inp.name: np.random.rand(*shape).astype(np.float32)})

Limitations

  • ONNX format only. Export PyTorch/TensorFlow models to ONNX before quantizing.
  • Requires ONNX opset >= 13 (automatically upgraded if needed).
  • INT4 values are stored as INT8 bytes in the ONNX file (DequantizeLinear requires INT8 input in opsets < 21).
  • All weight tensors are quantized. Per-layer selection is not yet supported.
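The INT4 storage detail is easy to picture: values are rounded onto the 16-level grid [-8, 7], but each one still occupies a full int8 byte in the file. A toy NumPy illustration (not the serializer itself):

```python
import numpy as np

def int4_values_as_int8(w, scale):
    # round onto the 4-bit grid, then store one byte per value
    q = np.clip(np.round(w / scale), -8, 7)
    return q.astype(np.int8)

w = np.linspace(-1.0, 1.0, 9).astype(np.float32)
q = int4_values_as_int8(w, scale=1.0 / 7)
# dtype is int8 (range -128..127), but only 16 of those codes are ever used
```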

License

MIT
