Neural network quantization toolkit for ONNX models

Project description

quantize-rs Python API

Python bindings for quantize-rs, a neural network quantization toolkit for ONNX models.

Scope

quantize-rs is designed and validated primarily for computer-vision (CNN-style) ONNX models -- ResNet, MobileNet, SqueezeNet, and similar architectures. Weight-only quantization (quantize()) is model-agnostic and works on any FP32 ONNX file. Activation calibration (quantize_with_calibration()) runs inference through tract, whose op coverage is centered on CNNs; transformer / LLM / RNN models may fail to load through tract or hit unsupported ops during calibration.
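
For models outside that envelope, a pragmatic pattern is to attempt calibration and fall back to weight-only quantization. A minimal sketch, assuming tract failures surface as ordinary Python exceptions:

import quantize_rs

# Prefer activation calibration; fall back to weight-only quantization
# if tract cannot run the model (assumes failures raise Python exceptions)
try:
    quantize_rs.quantize_with_calibration("model.onnx", "model_int8.onnx")
except Exception:
    quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)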

Installation

pip install quantization-rs

Build from source (requires Rust toolchain and maturin):

pip install maturin
maturin develop --release --features python

API reference

quantize(input_path, output_path, bits=8, per_channel=False, excluded_layers=None, min_elements=0, layer_bits=None, native_int4=False, symmetric=False)

Weight-based quantization. Loads the model, quantizes all weight tensors, and saves the result in ONNX QDQ format.

Parameters:

Name Type Default Description
input_path str required Path to input ONNX model
output_path str required Path to save quantized model
bits int 8 Bit width: 4 or 8
per_channel bool False Use per-channel quantization (separate scale/zp per output channel)
excluded_layers list[str] or None None Initializer names to leave in FP32
min_elements int 0 Skip tensors with fewer than N elements (e.g., biases)
layer_bits dict[str, int] or None None Per-layer bit-width overrides, e.g. {"conv1.weight": 4}
native_int4 bool False Store INT4 weights as ONNX DataType.Int4 (opset 21). Gives true 8x on-disk compression but requires an opset-21 runtime. No effect on INT8-only models.
symmetric bool False Symmetric quantization (zero_point == 0). Required by most ORT / TensorRT INT8 matmul kernels for per-channel weights.

Example:

import quantize_rs

# Plain INT8
quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)

# INT4 with native opset-21 storage (8x on-disk)
quantize_rs.quantize("model.onnx", "model_int4.onnx", bits=4, native_int4=True)

# Symmetric per-channel INT8 for ORT INT8 matmul kernels
quantize_rs.quantize(
    "model.onnx",
    "model_int8_sym.onnx",
    bits=8,
    per_channel=True,
    symmetric=True,
)

# Mixed precision: some layers INT4, rest INT8
quantize_rs.quantize(
    "model.onnx",
    "out.onnx",
    bits=8,
    layer_bits={"fc.weight": 4},
    excluded_layers=["embedding.weight"],
    min_elements=1024,  # skip small tensors (biases) and keep them FP32
)
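
Because the output is standard ONNX QDQ, you can inspect the graph with the onnx package to confirm the inserted nodes. A minimal sketch:

import onnx

# Quantized weights become INT8/INT4 initializers feeding DequantizeLinear nodes
model = onnx.load("model_int8.onnx")
dq = [n for n in model.graph.node if n.op_type == "DequantizeLinear"]
print(f"{len(dq)} DequantizeLinear nodes inserted")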

quantize_with_calibration(input_path, output_path, calibration_data=None, bits=8, per_channel=False, method="minmax", num_samples=100, sample_shape=None, native_int4=False, symmetric=False)

Activation-based calibration quantization. Runs inference on calibration samples to determine optimal quantization ranges per layer, then quantizes using those ranges. The layer-filter options (excluded_layers, min_elements, layer_bits) are not part of this signature; use quantize() directly if you need to skip layers explicitly.

Parameters:

Name Type Default Description
input_path str required Path to input ONNX model
output_path str required Path to save quantized model
calibration_data str or None None Path to .npy file (shape [N, ...]), or None for random samples
bits int 8 Bit width: 4 or 8
per_channel bool False Per-channel quantization
method str "minmax" Calibration method (see below)
num_samples int 100 Number of random samples when calibration_data is None
sample_shape list[int] or None None Shape of random samples; auto-detected from model if None. Default fallback is [3, 224, 224] (CHW image) -- override for non-image inputs.
native_int4 bool False Store INT4 weights as ONNX DataType.Int4 (opset 21)
symmetric bool False Symmetric quantization (zero_point == 0)

Calibration methods:

Method Description
"minmax" Uses observed min/max from activations
"percentile" Clips at 99.9th percentile to reduce outlier sensitivity
"entropy" Selects range minimizing KL divergence between original and quantized distributions
"mse" Selects range minimizing mean squared error

Example:

import quantize_rs

# With real calibration data
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

# With random samples (auto-detects input shape from model)
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    num_samples=100,
    sample_shape=[3, 224, 224],
    method="percentile"
)
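
The best method is model- and data-dependent; a small sweep is often the quickest way to choose. A sketch, where evaluate is a placeholder for your own validation routine:

import quantize_rs

# Hypothetical sweep: quantize once per calibration method, then compare
for method in ["minmax", "percentile", "entropy", "mse"]:
    quantize_rs.quantize_with_calibration(
        "model.onnx",
        f"model_int8_{method}.onnx",
        calibration_data="calibration_samples.npy",
        method=method,
    )
    # evaluate(...) is a placeholder for your own accuracy check
    # print(method, evaluate(f"model_int8_{method}.onnx"))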

model_info(input_path)

Returns metadata about an ONNX model.

Parameters:

Name Type Default Description
input_path str required Path to ONNX model

Returns: ModelInfo object with the following fields:

Field Type Description
name str Graph name
version int Model version
num_nodes int Number of computation nodes
inputs list[str] Input tensor names
outputs list[str] Output tensor names

Example:

import quantize_rs

info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")

Preparing calibration data

For best results, use 50-200 representative samples from your validation or training set:

import numpy as np

# Collect preprocessed samples
samples = []
for img in validation_dataset[:100]:
    preprocessed = preprocess(img)  # your preprocessing pipeline
    samples.append(preprocessed)

# Save as .npy (shape: [num_samples, channels, height, width])
calibration_data = np.stack(samples)
np.save("calibration_samples.npy", calibration_data)

# Use during quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy",
    method="minmax"
)

If you do not have calibration data, the function generates random samples. This is adequate for testing but will produce less accurate quantization than real data.

ONNX Runtime integration

Quantized models use the standard DequantizeLinear operator and load directly in ONNX Runtime:

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("model_int8.onnx")
input_name = session.get_inputs()[0].name
# Replace the random tensor with your real preprocessed input
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = session.run(None, {input_name: x})
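
A quick sanity check is to run the FP32 and INT8 models on the same input and compare outputs. A sketch, assuming a single image-style input (the shape is an assumption):

import numpy as np
import onnxruntime as ort

# Compare FP32 vs INT8 outputs on one random input
fp32 = ort.InferenceSession("model.onnx")
int8 = ort.InferenceSession("model_int8.onnx")
name = fp32.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
ref = fp32.run(None, {name: x})[0]
out = int8.run(None, {name: x})[0]
print("max abs diff:", np.abs(ref - out).max())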

Limitations

  • ONNX format only. Export PyTorch/TensorFlow models to ONNX before quantizing (see the export sketch after this list).
  • Validated primarily on CNN-style vision models. Activation calibration uses tract for inference; transformer / LLM / RNN architectures may report unsupported ops or shape mismatches in quantize_with_calibration(). The plain quantize() (weight-only) function does not use tract and works on any FP32 ONNX model.
  • Requires ONNX opset >= 10 for per-tensor quantization, >= 13 for per-channel (automatically upgraded if needed).
  • INT4 values are stored as INT8 bytes by default. Pass native_int4=True to write them as ONNX DataType.Int4 (opset 21) for true 8x compression -- requires an ONNX runtime with opset-21 support.
  • Random-sample shape auto-detection assumes a single-input model; for multi-input graphs, pass sample_shape explicitly or supply real calibration_data.
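
For the first limitation, exporting from PyTorch is straightforward. A minimal sketch using torchvision's ResNet-18 (illustrative; any torch.nn.Module works):

import torch
import torchvision

# Export a PyTorch model to ONNX, then quantize the result
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "model.onnx", opset_version=13)

import quantize_rs
quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)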

License

MIT
