
quantize-rs Python API

Python bindings for quantize-rs, a neural network quantization toolkit for ONNX models.

Installation

pip install quantization-rs

Build from source (requires Rust toolchain and maturin):

pip install maturin
maturin develop --release --features python

API reference

quantize(input_path, output_path, bits=8, per_channel=False)

Weight-based quantization. Loads the model, quantizes all weight tensors, and saves the result in ONNX QDQ format.

Parameters:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| input_path | str | required | Path to input ONNX model |
| output_path | str | required | Path to save quantized model |
| bits | int | 8 | Bit width: 4 or 8 |
| per_channel | bool | False | Use per-channel quantization (separate scale/zero-point per output channel; illustrated below) |
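
To make the per_channel option concrete, here is a minimal NumPy sketch of symmetric int8 quantization with one scale for the whole tensor versus one scale per output channel. It illustrates the general technique only; quantize-rs's exact rounding and zero-point handling may differ.

import numpy as np

# Hypothetical Conv weight: [out_channels, in_channels, kH, kW]
w = np.random.randn(16, 3, 3, 3).astype(np.float32)

# Per-tensor: a single scale for every value
scale_t = np.abs(w).max() / 127.0
q_t = np.clip(np.round(w / scale_t), -127, 127).astype(np.int8)

# Per-channel: one scale per output channel (axis 0)
scale_c = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / 127.0
q_c = np.clip(np.round(w / scale_c[:, None, None, None]), -127, 127).astype(np.int8)

# Per-channel usually reconstructs more accurately when channel magnitudes vary
print(np.abs(w - q_t * scale_t).mean())
print(np.abs(w - q_c * scale_c[:, None, None, None]).mean())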

Example:

import quantize_rs

quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)
quantize_rs.quantize("model.onnx", "model_int4.onnx", bits=4, per_channel=True)

quantize_with_calibration(input_path, output_path, ...)

Quantization with activation-based calibration. Runs inference on calibration samples to determine optimal per-layer quantization ranges, then quantizes using those ranges.

Parameters:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| input_path | str | required | Path to input ONNX model |
| output_path | str | required | Path to save quantized model |
| calibration_data | str or None | None | Path to a .npy file (shape [N, ...]), or None to use random samples |
| bits | int | 8 | Bit width: 4 or 8 |
| per_channel | bool | False | Per-channel quantization |
| method | str | "minmax" | Calibration method (see below) |
| num_samples | int | 100 | Number of random samples when calibration_data is None |
| sample_shape | list[int] or None | None | Shape of random samples; auto-detected from the model if None |

Calibration methods:

| Method | Description |
|--------|-------------|
| "minmax" | Uses the observed min/max of the activations |
| "percentile" | Clips at the 99.9th percentile to reduce sensitivity to outliers |
| "entropy" | Selects the range minimizing KL divergence between the original and quantized distributions |
| "mse" | Selects the range minimizing mean squared error |

Example:

import quantize_rs

# With real calibration data
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

# With random samples (auto-detects input shape from model)
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    num_samples=100,
    sample_shape=[3, 224, 224],
    method="percentile"
)

model_info(input_path)

Returns metadata about an ONNX model.

Parameters:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| input_path | str | required | Path to ONNX model |

Returns: ModelInfo object with the following fields:

| Field | Type | Description |
|-------|------|-------------|
| name | str | Graph name |
| version | int | Model version |
| num_nodes | int | Number of computation nodes |
| inputs | list[str] | Input tensor names |
| outputs | list[str] | Output tensor names |

Example:

import quantize_rs

info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")

Preparing calibration data

For best results, use 50-200 representative samples from your validation or training set:

import numpy as np
import quantize_rs

# Collect preprocessed samples
samples = []
for img in validation_dataset[:100]:
    preprocessed = preprocess(img)  # your preprocessing pipeline
    samples.append(preprocessed)

# Save as .npy (shape: [num_samples, channels, height, width])
calibration_data = np.stack(samples)
np.save("calibration_samples.npy", calibration_data)

# Use during quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

If you do not have calibration data, the function generates random samples. This is adequate for testing but will produce less accurate quantization than real data.

ONNX Runtime integration

Quantized models use the standard DequantizeLinear operator and load directly in ONNX Runtime:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_int8.onnx")
inp = session.get_inputs()[0]

# Dummy input matching the model's reported shape; symbolic dims
# (e.g. a dynamic batch axis) are replaced with 1
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.randn(*shape).astype(np.float32)

output = session.run(None, {inp.name: x})
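
To sanity-check accuracy after quantization, you can run the original and quantized models on the same input and compare outputs. A minimal sketch, assuming model.onnx and model_int8.onnx share the same input signature:

import onnxruntime as ort
import numpy as np

fp32 = ort.InferenceSession("model.onnx")
int8 = ort.InferenceSession("model_int8.onnx")

inp = fp32.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.random.randn(*shape).astype(np.float32)

y32 = fp32.run(None, {inp.name: x})[0]
y8 = int8.run(None, {inp.name: x})[0]

# Rough quantization-error measure on this one input
print("max abs diff:", np.abs(y32 - y8).max())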

Limitations

  • ONNX format only. Export PyTorch/TensorFlow models to ONNX before quantizing (see the export sketch after this list).
  • Requires ONNX opset >= 10 for per-tensor quantization, >= 13 for per-channel (automatically upgraded if needed).
  • INT4 values are stored as INT8 bytes in the ONNX file (DequantizeLinear requires INT8 input in opsets < 21).
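
For the first limitation, a typical PyTorch-to-ONNX export looks like the following sketch; the torchvision model and input shape are placeholders for your own network and preprocessing.

import torch
import torchvision

# Placeholder model; substitute your trained network
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

# Opset 13+ keeps the exported model eligible for per-channel quantization
torch.onnx.export(model, dummy, "model.onnx", opset_version=13,
                  input_names=["input"], output_names=["output"])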

License

MIT
