quantize-rs (Python)
Fast, accurate neural network quantization for ONNX models. Powered by Rust.
Features
- INT8/INT4 quantization with 4-8× compression
- Activation-based calibration for a 3× smaller accuracy drop than weight-only methods
- Standard DequantizeLinear (QDQ-style) pattern for ONNX Runtime compatibility
- Blazing fast — Rust implementation with Python bindings
Installation
pip install quantization-rs
Or build from source:
# Install Rust (if needed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install maturin
pip install maturin
# Build and install
maturin develop --release --features python
Quick Start
Basic Quantization
import quantize_rs
# Quantize to INT8
quantize_rs.quantize(
input_path="model.onnx",
output_path="model_int8.onnx",
bits=8
)
# Quantize to INT4 (aggressive compression)
quantize_rs.quantize(
input_path="model.onnx",
output_path="model_int4.onnx",
bits=4,
per_channel=True # Better accuracy for INT4
)
Activation-Based Calibration
For better accuracy, use real inference data:
import quantize_rs
import numpy as np
# Option 1: With calibration data
quantize_rs.quantize_with_calibration(
input_path="resnet18.onnx",
output_path="resnet18_int8.onnx",
calibration_data="calibration_samples.npy", # Shape: [N, C, H, W]
method="minmax"
)
# Option 2: Auto-generate random samples
quantize_rs.quantize_with_calibration(
input_path="resnet18.onnx",
output_path="resnet18_int8.onnx",
num_samples=100,
sample_shape=[3, 224, 224], # ImageNet shape
method="percentile"
)
Model Info
import quantize_rs
info = quantize_rs.model_info("model.onnx")
print(f"Name: {info.name}")
print(f"Nodes: {info.num_nodes}")
print(f"Inputs: {info.inputs}")
print(f"Outputs: {info.outputs}")
API Reference
quantize()
Basic weight-based quantization.
Parameters:
- input_path (str): Path to input ONNX model
- output_path (str): Path to save quantized model
- bits (int): Bit width, 4 or 8 (default: 8)
- per_channel (bool): Per-channel quantization (default: False)
Returns: None
Example:
quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)
quantize_with_calibration()
Activation-based calibration quantization for better accuracy.
Parameters:
- input_path (str): Path to input ONNX model
- output_path (str): Path to save quantized model
- calibration_data (str | None): Path to .npy calibration data, or None for random (default: None)
- bits (int): Bit width, 4 or 8 (default: 8)
- per_channel (bool): Per-channel quantization (default: False)
- method (str): Calibration method, one of "minmax", "percentile", "entropy", "mse" (default: "minmax")
- num_samples (int): Number of random samples if calibration_data is None (default: 100)
- sample_shape (list[int] | None): Shape of random samples, auto-detected if None (default: None)
Returns: None
Example:
quantize_rs.quantize_with_calibration(
"resnet18.onnx",
"resnet18_int8.onnx",
calibration_data="samples.npy",
method="minmax"
)
Calibration Methods:
- minmax: Uses observed min/max values (fast, good baseline)
- percentile: Clips at the 99.9th percentile (reduces outlier impact)
- entropy: Minimizes KL divergence (best for CNN activations)
- mse: Minimizes mean squared error (best for Transformers)
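The best method is model-dependent. One practical way to choose, sketched below, is to quantize once per method and evaluate each result on a held-out set (evaluate_accuracy is a hypothetical helper you would write around your own evaluation loop):
import quantize_rs
for method in ["minmax", "percentile", "entropy", "mse"]:
    output_path = f"model_int8_{method}.onnx"
    quantize_rs.quantize_with_calibration(
        input_path="model.onnx",
        output_path=output_path,
        calibration_data="calibration_samples.npy",
        method=method,
    )
    accuracy = evaluate_accuracy(output_path)  # hypothetical: your own eval loop
    print(f"{method}: {accuracy:.2%}")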
model_info()
Get model metadata.
Parameters:
- input_path (str): Path to ONNX model
Returns: ModelInfo object with fields:
- name (str): Model name
- version (int): ONNX opset version
- num_nodes (int): Number of computation nodes
- inputs (list[str]): Input tensor names and shapes
- outputs (list[str]): Output tensor names and shapes
Example:
info = quantize_rs.model_info("model.onnx")
print(f"{info.name}: {info.num_nodes} nodes")
Performance
Benchmarks on ResNet-18 (ImageNet):
| Method | Accuracy | Compression | Speedup |
|---|---|---|---|
| FP32 (baseline) | 69.76% | 1.0× | 1.0× |
| INT8 (weight-only) | 69.52% | 4.0× | 2.8× |
| INT8 (calibrated) | 69.68% | 4.0× | 2.8× |
| INT4 (calibrated) | 68.94% | 8.0× | 3.2× |
Activation-based calibration cuts the accuracy drop by 3× compared with weight-only quantization (0.08% vs 0.24%).
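To sanity-check the compression column on your own model, compare file sizes before and after quantization (a minimal sketch; the paths assume the examples above, and the on-disk ratio should land close to the table):
import os
fp32_size = os.path.getsize("resnet18.onnx")
int8_size = os.path.getsize("resnet18_int8.onnx")
print(f"Compression: {fp32_size / int8_size:.1f}x "
      f"({fp32_size / 1e6:.1f} MB -> {int8_size / 1e6:.1f} MB)")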
Preparing Calibration Data
For best results, use ~100 representative samples from your validation set:
import numpy as np
import onnxruntime as ort
import quantize_rs
# Check the model's expected input shape before collecting samples
session = ort.InferenceSession("model.onnx")
print(session.get_inputs()[0].shape)
# Collect samples from your validation set
samples = []
for img in validation_dataset[:100]:  # your dataset
    preprocessed = preprocess(img)  # your preprocessing
    samples.append(preprocessed)
# Stack into a float32 [N, C, H, W] array and save
calibration_data = np.stack(samples).astype(np.float32)
np.save("calibration_samples.npy", calibration_data)
# Use in quantization
quantize_rs.quantize_with_calibration(
    "model.onnx",
    "model_int8.onnx",
    calibration_data="calibration_samples.npy"
)
Integration with ONNX Runtime
import onnxruntime as ort
import numpy as np
# Load the quantized model
session = ort.InferenceSession("model_int8.onnx")
# Run inference (same API as FP32)
input_name = session.get_inputs()[0].name
your_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in; use real preprocessed data
output = session.run(None, {input_name: your_input})
FAQ
Q: Which bit width should I use?
A: Start with INT8 for maximum compatibility. Use INT4 if you need aggressive compression and can tolerate 0.5-1% accuracy drop.
Q: Do I need calibration data?
A: Not required, but highly recommended. Random data gives 0.2-0.3% worse accuracy than real calibration samples.
Q: What's the speed improvement?
A: 2-3× faster inference on CPU, 3-5× on mobile/edge devices. GPU gains are smaller (1.5-2×).
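To measure the speedup on your own hardware, a rough latency comparison with ONNX Runtime looks like this (a sketch assuming a [1, 3, 224, 224] input; adjust the shape to your model):
import time
import numpy as np
import onnxruntime as ort
def avg_latency(path, runs=50):
    session = ort.InferenceSession(path)
    input_name = session.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    session.run(None, {input_name: x})  # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {input_name: x})
    return (time.perf_counter() - start) / runs
print(f"Speedup: {avg_latency('model.onnx') / avg_latency('model_int8.onnx'):.1f}x")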
Q: Will my model still run in ONNX Runtime?
A: Yes! We use the standard DequantizeLinear operator. Any ONNX Runtime version ≥1.10 supports it.
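You can confirm this by inspecting the quantized graph with the onnx package (a small sketch; it just lists the operator types present):
import onnx
model = onnx.load("model_int8.onnx")
op_types = {node.op_type for node in model.graph.node}
print("DequantizeLinear" in op_types)  # expect True per the answer above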
Q: Can I quantize specific layers?
A: Currently quantizes all weights. Per-layer selection coming in v0.4.0.
Limitations
- Input format: ONNX only (export PyTorch/TensorFlow models to ONNX first; see the sketch after this list)
- Operator support: All standard ops supported; custom ops may fail
- Opset version: Requires ONNX opset ≥13 (automatically upgraded if needed)
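For example, a PyTorch model can be exported with torch.onnx.export before quantizing (a minimal sketch; torchvision's ResNet-18 stands in for your own model, and opset 13 matches the requirement above):
import torch
import torchvision
# Any PyTorch model works; ResNet-18 is used here for illustration
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx", opset_version=13)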
Contributing
Contributions welcome! Areas where we need help:
- Testing - More model architectures and edge cases
- Documentation - Tutorials, guides, examples
- Performance - Optimization and profiling
- Features - Dynamic quantization, mixed precision
License
MIT OR Apache-2.0
Download files
File details
Details for the file quantization_rs-0.3.0.tar.gz.
File metadata
- Download URL: quantization_rs-0.3.0.tar.gz
- Upload date:
- Size: 78.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 415fc3cbabfe9abe4af2e72b1a2de903a2de804533949c626b9e916452977e89 |
| MD5 | 7db6c59f5c359f655ad758211e415798 |
| BLAKE2b-256 | 2c7560b8512aa9df34e9f1da9e576fff1168cb67d4f5a8628416bce04461ad6d |
File details
Details for the file quantization_rs-0.3.0-cp313-cp313-win_amd64.whl.
File metadata
- Download URL: quantization_rs-0.3.0-cp313-cp313-win_amd64.whl
- Upload date:
- Size: 6.5 MB
- Tags: CPython 3.13, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 28e306703fc4c9ab79ece41ae8c19b2fbcec3e24c5598416de08c60194a77e84 |
| MD5 | e10d9fa81b3a034d5f6463682e004830 |
| BLAKE2b-256 | 1d98a43ed5c8d4388d333e1d47bf307bc337f22144ce362c7e8861a2566a39e5 |