
quantize-rs Python API

Python bindings for quantize-rs, a neural network quantization toolkit for ONNX models.

Installation

pip install quantization-rs

Build from source (requires Rust toolchain and maturin):

pip install maturin
maturin develop --release --features python

API reference

quantize(input_path, output_path, bits=8, per_channel=False)

Weight-based quantization. Loads the model, quantizes all weight tensors, and saves the result in ONNX QDQ format.

Parameters:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str` | required | Path to input ONNX model |
| `output_path` | `str` | required | Path to save quantized model |
| `bits` | `int` | `8` | Bit width: 4 or 8 |
| `per_channel` | `bool` | `False` | Use per-channel quantization (separate scale/zero-point per output channel) |

Example:

import quantize_rs

quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)
quantize_rs.quantize("model.onnx", "model_int4.onnx", bits=4, per_channel=True)
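Weight quantization of this kind boils down to choosing a scale and zero-point per tensor (or per output channel when `per_channel=True`) and rounding onto the integer grid. A minimal NumPy sketch of the round-trip, for intuition only (this is not the crate's actual implementation):

```python
import numpy as np

def affine_qparams(w, bits=8):
    """Per-tensor scale and zero-point for asymmetric quantization."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -128..127 for int8
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant tensors
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize_dequantize(w, bits=8):
    """Round-trip a tensor through the integer grid, as a QDQ pair would."""
    scale, zp = affine_qparams(w, bits)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale) + zp, qmin, qmax)
    return (q - zp) * scale

w = np.array([-1.0, -0.3, 0.0, 0.4, 1.2], dtype=np.float32)
err8 = np.max(np.abs(quantize_dequantize(w, bits=8) - w))
err4 = np.max(np.abs(quantize_dequantize(w, bits=4) - w))  # coarser grid, larger error
```

Per-channel mode repeats this computation with a separate (scale, zero-point) pair per output channel, which is why it usually recovers accuracy at 4 bits.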

quantize_with_calibration(input_path, output_path, ...)

Activation-based calibration quantization. Runs inference on calibration samples to determine optimal quantization ranges per layer, then quantizes using those ranges.

Parameters:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str` | required | Path to input ONNX model |
| `output_path` | `str` | required | Path to save quantized model |
| `calibration_data` | `str` or `None` | `None` | Path to a `.npy` file (shape `[N, ...]`), or `None` for random samples |
| `bits` | `int` | `8` | Bit width: 4 or 8 |
| `per_channel` | `bool` | `False` | Per-channel quantization |
| `method` | `str` | `"minmax"` | Calibration method (see below) |
| `num_samples` | `int` | `100` | Number of random samples when `calibration_data` is `None` |
| `sample_shape` | `list[int]` or `None` | `None` | Shape of random samples; auto-detected from the model if `None` |

Calibration methods:

| Method | Description |
| --- | --- |
| `"minmax"` | Uses the observed min/max of the activations |
| `"percentile"` | Clips at the 99.9th percentile to reduce outlier sensitivity |
| `"entropy"` | Selects the range minimizing KL divergence between the original and quantized distributions |
| `"mse"` | Selects the range minimizing mean squared error |
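The difference between the methods is how they pick a clipping range from the observed activations. A rough NumPy sketch of minmax versus percentile (illustrative only, not the crate's code):

```python
import numpy as np

def minmax_range(acts):
    # full observed range: a single outlier stretches the scale
    return float(acts.min()), float(acts.max())

def percentile_range(acts, pct=99.9):
    # clip the tails so extreme values don't dominate the range
    lo = float(np.percentile(acts, 100 - pct))
    hi = float(np.percentile(acts, pct))
    return lo, hi

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 10_000)
acts[0] = 50.0  # inject one outlier

mm = minmax_range(acts)       # dragged out to 50
pc = percentile_range(acts)   # stays near the bulk of the data
```

The `entropy` and `mse` methods go further and search over candidate ranges, scoring each by KL divergence or reconstruction error respectively.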

Example:

import quantize_rs

# With real calibration data
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

# With random samples (auto-detects input shape from model)
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    num_samples=100,
    sample_shape=[3, 224, 224],
    method="percentile"
)

model_info(input_path)

Returns metadata about an ONNX model.

Parameters:

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str` | required | Path to ONNX model |

Returns: ModelInfo object with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `name` | `str` | Graph name |
| `version` | `int` | Model version |
| `num_nodes` | `int` | Number of computation nodes |
| `inputs` | `list[str]` | Input tensor names |
| `outputs` | `list[str]` | Output tensor names |

Example:

info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")

Preparing calibration data

For best results, use 50-200 representative samples from your validation or training set:

import numpy as np

# Collect preprocessed samples; validation_dataset and preprocess are
# placeholders for your own dataset and preprocessing pipeline
samples = []
for img in validation_dataset[:100]:
    preprocessed = preprocess(img)  # your preprocessing pipeline
    samples.append(preprocessed)

# Save as .npy (shape: [num_samples, channels, height, width])
calibration_data = np.stack(samples)
np.save("calibration_samples.npy", calibration_data)

# Use during quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

If you do not have calibration data, the function generates random samples. This is adequate for testing but will produce less accurate quantization than real data.
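To smoke-test the pipeline without a dataset, you can also write a random `.npy` file yourself and pass it as `calibration_data`. The NCHW shape below is an assumption; match your model's real input, and prefer 50-200 real samples in practice:

```python
import os
import tempfile
import numpy as np

# A handful of random NCHW samples (small here for brevity)
samples = np.random.rand(8, 3, 224, 224).astype(np.float32)

path = os.path.join(tempfile.gettempdir(), "random_calibration.npy")
np.save(path, samples)

loaded = np.load(path)  # shape (8, 3, 224, 224), dtype float32
```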

ONNX Runtime integration

Quantized models use the standard DequantizeLinear operator and load directly in ONNX Runtime:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_int8.onnx")
inp = session.get_inputs()[0]
# dynamic dimensions appear as strings in inp.shape; substitute 1 for a dummy run,
# or feed your real preprocessed input instead
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
output = session.run(None, {inp.name: np.random.rand(*shape).astype(np.float32)})

Limitations

  • ONNX format only. Export PyTorch/TensorFlow models to ONNX before quantizing.
  • Requires ONNX opset >= 13 (automatically upgraded if needed).
  • INT4 values are stored as INT8 bytes in the ONNX file (DequantizeLinear requires INT8 input in opsets < 21).
  • All weight tensors are quantized. Per-layer selection is not yet supported.
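The INT4 storage detail is easy to picture: values are rounded onto the 16-level grid [-8, 7], but each one still occupies a full int8 byte in the file. A toy NumPy illustration (not the serializer itself):

```python
import numpy as np

def int4_values_as_int8(w, scale):
    # round onto the 4-bit grid, then store one byte per value
    q = np.clip(np.round(w / scale), -8, 7)
    return q.astype(np.int8)

w = np.linspace(-1.0, 1.0, 9).astype(np.float32)
q = int4_values_as_int8(w, scale=1.0 / 7)
# dtype is int8 (range -128..127), but only 16 of those codes are ever used
```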

License

MIT
