
quantize-rs Python API

Python bindings for quantize-rs, a neural network quantization toolkit for ONNX models.

Installation

pip install quantization-rs

Build from source (requires Rust toolchain and maturin):

pip install maturin
maturin develop --release --features python

API reference

quantize(input_path, output_path, bits=8, per_channel=False)

Weight-based quantization. Loads the model, quantizes all weight tensors, and saves the result in ONNX QDQ format.
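For intuition, QDQ quantization maps each float weight to an integer via a scale and zero-point, and the stored DequantizeLinear node inverts that mapping at load time. A minimal numpy sketch of per-tensor 8-bit affine quantization (illustrative only, not the crate's exact implementation):

```python
import numpy as np

w = np.random.randn(64, 64).astype(np.float32)  # a weight tensor

# Per-tensor affine (asymmetric) int8 quantization:
# map the observed range [lo, hi] onto the 256 int8 levels.
lo, hi = float(w.min()), float(w.max())
scale = (hi - lo) / 255.0
zero_point = int(np.round(-128 - lo / scale))
q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)

# What DequantizeLinear computes when the model is loaded:
w_dq = (q.astype(np.float32) - zero_point) * scale
```

The reconstruction error per element stays within a couple of quantization steps (roughly `scale`), which is why 8 bits usually preserves accuracy well.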

Parameters:

  • input_path (str, required): Path to the input ONNX model
  • output_path (str, required): Path to save the quantized model
  • bits (int, default 8): Bit width: 4 or 8
  • per_channel (bool, default False): Use per-channel quantization (separate scale/zero-point per output channel)
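The per_channel option gives each output channel its own scale instead of one scale for the whole tensor, which helps when channel magnitudes vary widely. A hedged numpy sketch of per-channel symmetric int8 scales (an illustration, not the library's internals):

```python
import numpy as np

W = np.random.randn(8, 64).astype(np.float32)  # [out_channels, in_features]

# One symmetric scale per output channel (axis 0).
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
Q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

# Per-channel dequantization broadcasts each row's scale back.
W_dq = Q.astype(np.float32) * scales
```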

Example:

import quantize_rs

quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)
quantize_rs.quantize("model.onnx", "model_int4.onnx", bits=4, per_channel=True)

quantize_with_calibration(input_path, output_path, ...)

Activation-based calibration quantization. Runs inference on calibration samples to determine optimal quantization ranges per layer, then quantizes using those ranges.

Parameters:

  • input_path (str, required): Path to the input ONNX model
  • output_path (str, required): Path to save the quantized model
  • calibration_data (str or None, default None): Path to a .npy file (shape [N, ...]), or None to use random samples
  • bits (int, default 8): Bit width: 4 or 8
  • per_channel (bool, default False): Per-channel quantization
  • method (str, default "minmax"): Calibration method (see below)
  • num_samples (int, default 100): Number of random samples when calibration_data is None
  • sample_shape (list[int] or None, default None): Shape of random samples; auto-detected from the model if None

Calibration methods:

  • "minmax": Uses the observed min/max of activations
  • "percentile": Clips at the 99.9th percentile to reduce outlier sensitivity
  • "entropy": Selects the range minimizing KL divergence between the original and quantized distributions
  • "mse": Selects the range minimizing mean squared error

Example:

import quantize_rs

# With real calibration data
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

# With random samples (auto-detects input shape from model)
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    num_samples=100,
    sample_shape=[3, 224, 224],
    method="percentile"
)

model_info(input_path)

Returns metadata about an ONNX model.

Parameters:

  • input_path (str, required): Path to the ONNX model

Returns: ModelInfo object with the following fields:

  • name (str): Graph name
  • version (int): Model version
  • num_nodes (int): Number of computation nodes
  • inputs (list[str]): Input tensor names
  • outputs (list[str]): Output tensor names

Example:

import quantize_rs

info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")

Preparing calibration data

For best results, use 50-200 representative samples from your validation or training set:

import numpy as np

# Collect preprocessed samples
samples = []
for img in validation_dataset[:100]:
    preprocessed = preprocess(img)  # your preprocessing pipeline
    samples.append(preprocessed)

# Save as .npy (shape: [num_samples, channels, height, width])
calibration_data = np.stack(samples)
np.save("calibration_samples.npy", calibration_data)

# Use during quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

If you do not have calibration data, the function generates random samples. This is adequate for testing but will produce less accurate quantization than real data.

ONNX Runtime integration

Quantized models use the standard DequantizeLinear operator and load directly in ONNX Runtime:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_int8.onnx")
input_name = session.get_inputs()[0].name
# your_input: a numpy array matching the model's input shape and dtype
output = session.run(None, {input_name: your_input})

Limitations

  • ONNX format only. Export PyTorch/TensorFlow models to ONNX before quantizing.
  • Requires ONNX opset >= 13 (automatically upgraded if needed).
  • INT4 values are stored as INT8 bytes in the ONNX file (DequantizeLinear requires INT8 input in opsets < 21).
  • All weight tensors are quantized. Per-layer selection is not yet supported.
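To illustrate the INT4 storage point above: 4-bit values occupy the range [-8, 7], but each one is written as a full int8 byte. A hedged numpy sketch of the value range (illustrative, not the file format specification):

```python
import numpy as np

w = np.random.randn(32).astype(np.float32)

# Symmetric int4: representable values are -8..7.
scale = float(np.abs(w).max()) / 7.0
q4 = np.clip(np.round(w / scale), -8, 7)

# Stored as int8 bytes, since DequantizeLinear (opset < 21)
# requires an INT8 input tensor.
q_stored = q4.astype(np.int8)
```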

License

MIT
