
quantize-rs (Python)

Fast, accurate neural network quantization for ONNX models. Powered by Rust.

Features

  • INT8/INT4 quantization with 4-8× compression
  • Activation-based calibration cuts the accuracy drop roughly 3× compared with weight-only quantization
  • Standard DequantizeLinear (QDQ-style) pattern for ONNX Runtime compatibility
  • Blazing fast — Rust implementation with Python bindings

Installation

pip install quantization-rs

Or build from source:

# Install Rust (if needed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install maturin
pip install maturin

# Build and install (run from a checkout of the project repository)
maturin develop --release --features python
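
Either way, a quick import confirms the Rust extension built correctly (a minimal check; it assumes only that the module imports as quantize_rs):

import quantize_rs

# ImportError here means the extension module did not build or install
print(callable(quantize_rs.quantize))  # True when the binding is available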

Quick Start

Basic Quantization

import quantize_rs

# Quantize to INT8
quantize_rs.quantize(
    input_path="model.onnx",
    output_path="model_int8.onnx",
    bits=8
)

# Quantize to INT4 (aggressive compression)
quantize_rs.quantize(
    input_path="model.onnx",
    output_path="model_int4.onnx",
    bits=4,
    per_channel=True  # Better accuracy for INT4
)

Activation-Based Calibration

For better accuracy, use real inference data:

import quantize_rs
import numpy as np

# Option 1: With calibration data
quantize_rs.quantize_with_calibration(
    input_path="resnet18.onnx",
    output_path="resnet18_int8.onnx",
    calibration_data="calibration_samples.npy",  # Shape: [N, C, H, W]
    method="minmax"
)

# Option 2: Auto-generate random samples
quantize_rs.quantize_with_calibration(
    input_path="resnet18.onnx",
    output_path="resnet18_int8.onnx",
    num_samples=100,
    sample_shape=[3, 224, 224],  # ImageNet shape
    method="percentile"
)

Model Info

import quantize_rs

info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")

API Reference

quantize()

Basic weight-based quantization.

Parameters:

  • input_path (str): Path to input ONNX model
  • output_path (str): Path to save quantized model
  • bits (int): Bit width — 4 or 8 (default: 8)
  • per_channel (bool): Per-channel quantization (default: False)

Returns: None

Example:

quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)

quantize_with_calibration()

Activation-based calibration quantization for better accuracy.

Parameters:

  • input_path (str): Path to input ONNX model
  • output_path (str): Path to save quantized model
  • calibration_data (str | None): Path to .npy calibration data, or None for random (default: None)
  • bits (int): Bit width — 4 or 8 (default: 8)
  • per_channel (bool): Per-channel quantization (default: False)
  • method (str): Calibration method — "minmax", "percentile", "entropy", "mse" (default: "minmax")
  • num_samples (int): Number of random samples if calibration_data is None (default: 100)
  • sample_shape (list[int] | None): Shape of random samples, auto-detected if None (default: None)

Returns: None

Example:

quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="samples.npy",
    method="minmax"
)

Calibration Methods:

  • minmax: Uses observed min/max values (fast, good baseline)
  • percentile: Clips at 99.9th percentile (reduces outlier impact)
  • entropy: Minimizes KL divergence (best for CNN activations)
  • mse: Minimizes mean squared error (best for Transformers)
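
As a rule of thumb, pick the method from the model family. A hedged sketch (the file names are placeholders; the parameters are exactly those documented above):

import quantize_rs

# CNNs: entropy calibration suits peaked activation histograms
quantize_rs.quantize_with_calibration(
    "resnet18.onnx",
    "resnet18_int8.onnx",
    calibration_data="cnn_samples.npy",
    method="entropy"
)

# Transformers: MSE calibration handles wider activation ranges
quantize_rs.quantize_with_calibration(
    "bert.onnx",
    "bert_int8.onnx",
    calibration_data="bert_samples.npy",
    method="mse"
)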

model_info()

Get model metadata.

Parameters:

  • input_path (str): Path to ONNX model

Returns: ModelInfo object with fields:

  • name (str): Model name
  • version (int): ONNX opset version
  • num_nodes (int): Number of computation nodes
  • inputs (list[str]): Input tensor names and shapes
  • outputs (list[str]): Output tensor names and shapes

Example:

info = quantize_rs.model_info("model.onnx")
print(f"{info.name}: {info.num_nodes} nodes")

Performance

Benchmarks on ResNet-18 (ImageNet):

Method               Accuracy   Compression   Speedup
FP32 (baseline)      69.76%     1.0×          1.0×
INT8 (weight-only)   69.52%     4.0×          2.8×
INT8 (calibrated)    69.68%     4.0×          2.8×
INT4 (calibrated)    68.94%     8.0×          3.2×

Activation-based calibration shrinks the accuracy drop roughly 3× relative to weight-only quantization (0.08% vs 0.24% below the FP32 baseline).
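
A sketch of how these accuracy numbers might be reproduced with ONNX Runtime; load_validation_batches() is a placeholder for your own labeled data loader, not part of this package:

import numpy as np
import onnxruntime as ort

def top1_accuracy(model_path, batches):
    # Top-1 accuracy over an iterable of (inputs, labels) numpy pairs
    session = ort.InferenceSession(model_path)
    input_name = session.get_inputs()[0].name
    correct = total = 0
    for inputs, labels in batches:
        logits = session.run(None, {input_name: inputs})[0]
        correct += int((logits.argmax(axis=1) == labels).sum())
        total += len(labels)
    return correct / total

batches = list(load_validation_batches())  # placeholder: your ImageNet val loader
fp32 = top1_accuracy("resnet18.onnx", batches)
int8 = top1_accuracy("resnet18_int8.onnx", batches)
print(f"FP32 {fp32:.2%}  INT8 {int8:.2%}  drop {fp32 - int8:.2%}")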

Preparing Calibration Data

For best results, use ~100 representative samples from your validation set:

import numpy as np
import onnxruntime as ort
import quantize_rs

# Load the model to check the expected input shape
session = ort.InferenceSession("model.onnx")
print(session.get_inputs()[0].shape)  # e.g. [N, 3, 224, 224]

# Collect samples from your validation set
# (validation_dataset and preprocess are placeholders for your own pipeline)
samples = []
for img in validation_dataset[:100]:
    samples.append(preprocess(img))  # each sample shaped [C, H, W]

# Stack into [N, C, H, W] and save as float32
calibration_data = np.stack(samples).astype(np.float32)
np.save("calibration_samples.npy", calibration_data)

# Use in quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy"
)

Integration with ONNX Runtime

import onnxruntime as ort
import numpy as np

# Load quantized model
session = ort.InferenceSession("model_int8.onnx")

# Run inference (same API as FP32); a random tensor stands in for real input
input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # match your model's input shape
output = session.run(None, {input_name: dummy_input})

FAQ

Q: Which bit width should I use?
A: Start with INT8 for maximum compatibility. Use INT4 if you need aggressive compression and can tolerate 0.5-1% accuracy drop.

Q: Do I need calibration data?
A: Not required, but highly recommended. Random data gives 0.2-0.3% worse accuracy than real calibration samples.

Q: What's the speed improvement?
A: 2-3× faster inference on CPU, 3-5× on mobile/edge devices. GPU gains are smaller (1.5-2×).
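
A minimal way to check the speedup on your own hardware (the input shape is an assumption; adjust it to your model):

import time
import numpy as np
import onnxruntime as ort

def mean_latency_ms(model_path, runs=50):
    # Average single-batch CPU inference latency in milliseconds
    session = ort.InferenceSession(model_path)
    name = session.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
    session.run(None, {name: x})  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {name: x})
    return (time.perf_counter() - start) / runs * 1e3

print(f"FP32: {mean_latency_ms('model.onnx'):.1f} ms")
print(f"INT8: {mean_latency_ms('model_int8.onnx'):.1f} ms")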

Q: Will my model still run in ONNX Runtime?
A: Yes! We use the standard DequantizeLinear operator. Any ONNX Runtime version ≥1.10 supports it.
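
You can verify this by listing the operator types in the quantized graph (a quick sketch using the onnx package):

import onnx

model = onnx.load("model_int8.onnx")
ops = {node.op_type for node in model.graph.node}
print("DequantizeLinear" in ops)  # True for models produced by this tool
print(sorted(ops))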

Q: Can I quantize specific layers?
A: Currently quantizes all weights. Per-layer selection coming in v0.4.0.

Limitations

  • Input format: ONNX only (export PyTorch/TensorFlow models to ONNX first; see the sketch below)
  • Operator support: All standard ops supported; custom ops may fail
  • Opset version: Requires ONNX opset ≥13 (automatically upgraded if needed)
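
For PyTorch models, a minimal export sketch that also meets the opset requirement (torchvision's resnet18 stands in for your model):

import torch
import torchvision

# Any torch.nn.Module works; resnet18 is only an example
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    opset_version=13,  # satisfies the opset >= 13 requirement
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}}
)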

Contributing

Contributions welcome! Areas where we need help:

  • Testing - More model architectures and edge cases
  • Documentation - Tutorials, guides, examples
  • Performance - Optimization and profiling
  • Features - Dynamic quantization, mixed precision

License

MIT OR Apache-2.0
