# quantize-rs Python API
Python bindings for quantize-rs, a neural network quantization toolkit for ONNX models.
## Scope
quantize-rs is designed and validated primarily for computer-vision (CNN-style) ONNX models such as ResNet, MobileNet, and SqueezeNet. Weight-only quantization (`quantize()`) is model-agnostic and works on any FP32 ONNX file. Activation calibration (`quantize_with_calibration()`) runs inference through tract, whose op coverage is centered on CNNs; transformer, LLM, and RNN models may fail to load through tract or hit unsupported ops during calibration.
## Installation
```bash
pip install quantization-rs
```
Build from source (requires Rust toolchain and maturin):
```bash
pip install maturin
maturin develop --release --features python
```
## API reference
### `quantize(input_path, output_path, bits=8, per_channel=False, excluded_layers=None, min_elements=0, layer_bits=None, native_int4=False, symmetric=False)`
Weight-based quantization. Loads the model, quantizes all weight tensors, and saves the result in ONNX QDQ format.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| `input_path` | `str` | required | Path to the input ONNX model |
| `output_path` | `str` | required | Path to save the quantized model |
| `bits` | `int` | `8` | Bit width: 4 or 8 |
| `per_channel` | `bool` | `False` | Use per-channel quantization (separate scale/zero-point per output channel) |
| `excluded_layers` | `list[str]` or `None` | `None` | Initializer names to leave in FP32 |
| `min_elements` | `int` | `0` | Skip tensors with fewer than N elements (e.g., biases) |
| `layer_bits` | `dict[str, int]` or `None` | `None` | Per-layer bit-width overrides, e.g. `{"conv1.weight": 4}` |
| `native_int4` | `bool` | `False` | Store INT4 weights as ONNX `DataType.Int4` (opset 21). True 8x on-disk compression vs FP32, but requires an opset-21 runtime. No effect on INT8-only models. |
| `symmetric` | `bool` | `False` | Symmetric quantization (`zero_point == 0`). Required by most ORT / TensorRT INT8 matmul kernels for per-channel weights. |
Example:
```python
import quantize_rs

# Plain INT8
quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)

# INT4 with native opset-21 storage (8x on-disk)
quantize_rs.quantize("model.onnx", "model_int4.onnx", bits=4, native_int4=True)

# Symmetric per-channel INT8 for ORT INT8 matmul kernels
quantize_rs.quantize(
    "model.onnx",
    "model_int8_sym.onnx",
    bits=8,
    per_channel=True,
    symmetric=True,
)

# Mixed precision: some layers INT4, the rest INT8
quantize_rs.quantize(
    "model.onnx",
    "out.onnx",
    bits=8,
    layer_bits={"fc.weight": 4},
    excluded_layers=["embedding.weight"],
    min_elements=1024,  # skip small tensors (biases) and keep them FP32
)
```
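For intuition about the `symmetric` flag, here is a minimal pure-Python sketch (not quantize-rs internals) of how symmetric and asymmetric INT8 parameters are typically derived from a tensor's observed range, and why symmetric quantization always yields `zero_point == 0`:

```python
# Illustrative sketch only: the helper names below are hypothetical,
# not part of the quantize_rs API.

def asymmetric_params(w_min, w_max, bits=8):
    """Affine quantization: map the full [w_min, w_max] onto the int range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = round(qmin - w_min / scale)
    return scale, zero_point

def symmetric_params(w_min, w_max, bits=8):
    """Symmetric quantization: range centered on zero, zero_point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w_min), abs(w_max)) / qmax
    return scale, 0

scale_a, zp_a = asymmetric_params(-0.8, 1.2)  # zp_a is nonzero
scale_s, zp_s = symmetric_params(-0.8, 1.2)   # zp_s == 0
```

Symmetric quantization can waste part of the integer range on skewed tensors, but the fixed zero point is what lets INT8 matmul kernels skip the zero-point correction term.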
### `quantize_with_calibration(input_path, output_path, calibration_data=None, bits=8, per_channel=False, method="minmax", num_samples=100, sample_shape=None, native_int4=False, symmetric=False)`
Activation-based calibration quantization. Runs inference on calibration samples to determine quantization ranges per layer, then quantizes using those ranges. The layer-filter options (`excluded_layers`, `min_elements`, `layer_bits`) are not part of this function's signature; use `quantize()` directly if you need to skip layers explicitly.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| `input_path` | `str` | required | Path to the input ONNX model |
| `output_path` | `str` | required | Path to save the quantized model |
| `calibration_data` | `str` or `None` | `None` | Path to a `.npy` file (shape `[N, ...]`), or `None` for random samples |
| `bits` | `int` | `8` | Bit width: 4 or 8 |
| `per_channel` | `bool` | `False` | Per-channel quantization |
| `method` | `str` | `"minmax"` | Calibration method (see below) |
| `num_samples` | `int` | `100` | Number of random samples when `calibration_data` is `None` |
| `sample_shape` | `list[int]` or `None` | `None` | Shape of random samples; auto-detected from the model if `None`. The fallback default is `[3, 224, 224]` (CHW image); override for non-image inputs. |
| `native_int4` | `bool` | `False` | Store INT4 weights as ONNX `DataType.Int4` (opset 21) |
| `symmetric` | `bool` | `False` | Symmetric quantization (`zero_point == 0`) |
Calibration methods:
| Method | Description |
|---|---|
| `"minmax"` | Uses the observed min/max of activations |
| `"percentile"` | Clips at the 99.9th percentile to reduce outlier sensitivity |
| `"entropy"` | Selects the range minimizing KL divergence between the original and quantized distributions |
| `"mse"` | Selects the range minimizing mean squared error |
Example:
```python
import quantize_rs

# With real calibration data
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax",
)

# With random samples (auto-detects input shape from the model)
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    num_samples=100,
    sample_shape=[3, 224, 224],
    method="percentile",
)
```
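To see why percentile clipping reduces outlier sensitivity, here is an illustrative pure-Python sketch of the underlying idea (not quantize-rs internals, and using a nearest-rank percentile for simplicity):

```python
import math

def minmax_range(values):
    """Full observed range: faithful, but a single outlier stretches it."""
    return min(values), max(values)

def percentile_range(values, pct=99.9):
    """Symmetric range clipped at the pct-th percentile of |x| (nearest-rank)."""
    s = sorted(abs(v) for v in values)
    k = max(0, math.ceil(pct / 100.0 * len(s)) - 1)
    bound = s[k]
    return -bound, bound

# 1000 well-behaved activations plus one extreme outlier
acts = [0.01 * i for i in range(1000)] + [50.0]

lo, hi = minmax_range(acts)          # hi is dominated by the outlier (50.0)
p_lo, p_hi = percentile_range(acts)  # clipped near the bulk of the data
```

With minmax, the outlier forces a large scale and most values quantize into a narrow integer band; the percentile range sacrifices the outlier to keep resolution for the bulk of the distribution.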
### `model_info(input_path)`
Returns metadata about an ONNX model.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| `input_path` | `str` | required | Path to the ONNX model |
Returns: a `ModelInfo` object with the following fields:

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Graph name |
| `version` | `int` | Model version |
| `num_nodes` | `int` | Number of computation nodes |
| `inputs` | `list[str]` | Input tensor names |
| `outputs` | `list[str]` | Output tensor names |
Example:
```python
import quantize_rs

info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")
```
## Preparing calibration data
For best results, use 50-200 representative samples from your validation or training set:
```python
import numpy as np
import quantize_rs

# Collect preprocessed samples
samples = []
for img in validation_dataset[:100]:  # your dataset
    preprocessed = preprocess(img)    # your preprocessing pipeline
    samples.append(preprocessed)

# Save as .npy (shape: [num_samples, channels, height, width])
calibration_data = np.stack(samples)
np.save("calibration_samples.npy", calibration_data)

# Use during quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax",
)
```
If you do not have calibration data, the function generates random samples. This is adequate for testing but will produce less accurate quantization than real data.
## ONNX Runtime integration
Quantized models use the standard `DequantizeLinear` operator and load directly in ONNX Runtime:
```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model_int8.onnx")
inp = session.get_inputs()[0]

# Build a dummy input matching the model's declared shape
# (replace dynamic/None dims with 1; use real data in practice).
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.rand(*shape).astype(np.float32)

output = session.run(None, {inp.name: x})
```
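The per-tensor `DequantizeLinear` semantics are simple enough to sketch in plain Python for intuition (ONNX Runtime executes this natively; this is not the ORT kernel):

```python
def dequantize_linear(q_values, scale, zero_point):
    """ONNX per-tensor DequantizeLinear: x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]

# A quantized value equal to the zero point dequantizes exactly to 0.0
floats = dequantize_linear([-26, 0, 127], scale=0.05, zero_point=-26)
```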
## Limitations
- ONNX format only. Export PyTorch/TensorFlow models to ONNX before quantizing.
- Validated primarily on CNN-style vision models. Activation calibration uses tract for inference; transformer, LLM, and RNN architectures may report unsupported ops or shape mismatches in `quantize_with_calibration()`. The plain `quantize()` (weight-only) function does not use tract and works on any FP32 ONNX model.
- Requires ONNX opset >= 10 for per-tensor quantization and >= 13 for per-channel (automatically upgraded if needed).
- INT4 values are stored as INT8 bytes by default. Pass `native_int4=True` to write them as ONNX `DataType.Int4` (opset 21) for true 8x compression; this requires an ONNX runtime with opset-21 support.
- Random-sample auto shape detection assumes a single-input model; for multi-input graphs, pass `sample_shape` explicitly or supply real `calibration_data`.