onnx-conv2matmul
Utilities to rewrite ONNX convolution patterns into matrix multiplication forms.
onnx-conv2matmul is a pre-quantization rewrite tool for ONNX models.
It converts compatible Conv 1x1 layers into equivalent MatMul subgraphs so your
quantization stack can treat more model weights with MatMul-focused kernels
(for example int4 weight-only flows).
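The equivalence the tool exploits can be checked in a few lines of NumPy (an illustrative sketch, not the tool's actual rewrite code): a 1x1 convolution applies the same [M, C] linear map at every spatial position, so flattening the spatial dimensions and multiplying by the reshaped weight matrix reproduces it.

```python
import numpy as np

# A 1x1 Conv2D with weight of shape [M, C, 1, 1] applied to input
# x of shape [N, C, H, W] is the same linear map as a MatMul:
# flatten the spatial dims, multiply by the [M, C] matrix, restore shape.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8, 4, 4)).astype(np.float32)   # [N, C, H, W]
w = rng.standard_normal((16, 8, 1, 1)).astype(np.float32)  # [M, C, 1, 1]

# Reference: the pointwise convolution computed directly.
conv = np.einsum("nchw,mc->nmhw", x, w[:, :, 0, 0])

# Equivalent MatMul form: [N*H*W, C] @ [C, M], then reshape back.
n, c, h, wd = x.shape
flat = x.transpose(0, 2, 3, 1).reshape(-1, c)              # [N*H*W, C]
mm = (flat @ w[:, :, 0, 0].T).reshape(n, h, wd, -1).transpose(0, 3, 1, 2)

print(np.abs(conv - mm).max())  # only float32 rounding noise
```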
Why this matters in practice:
- Many backends optimize/quantize MatMul more aggressively than Conv
- Pointwise convs often dominate encoder blocks and are good candidates for rewrite
- You can reduce model size and improve throughput potential without changing model semantics
The tool is conservative by default, keeps unsupported layers untouched, and emits a detailed report of converted vs skipped nodes with explicit reasons.
Install
From PyPI:
pip install onnx-conv2matmul
Basic Usage
Rewrite to a new file:
onnx-conv2matmul input.onnx output.onnx
Overwrite in-place:
onnx-conv2matmul input.onnx --inplace
Rewrite and verify in one command (CPU):
onnx-conv2matmul input.onnx output.onnx \
--extended-conv1x1 \
--verify \
--verify-deterministic-cpu \
--verify-lengths 64,80,97,128,191
Parakeet Example (Pre-Quantization)
Reference model page:
Typical pre-quantization step for the encoder graph:
onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
--extended-conv1x1 \
--allow-non-unit-dilation \
--max-dilation 4 \
--skip-checker \
--report-json encoder-model.preq.report.json
This converts compatible pointwise conv layers (Conv1D k=1 and Conv2D 1x1) to MatMul and writes a JSON report
with converted/skipped nodes and reasons.
Parakeet: From Original NeMo Checkpoint to Pre-Quantization
If you want to start from the original FP32 PyTorch release, you can export the ONNX graphs directly from the official .nemo checkpoint using the nemo_toolkit library.
1. Install NeMo:
pip install nemo_toolkit[asr]
2. Export the FP32 ONNX Encoder Graph:
import nemo.collections.asr as nemo_asr
# Download from HF and load the PyTorch NeMo model
model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v3")
model.eval()
# Export the encoder directly to match the filename convention
model.encoder.export("encoder-model.onnx")
(Note: pre-exported ONNX repositories like istupakov/parakeet-tdt-0.6b-v3-onnx are convenient and also contain the decoder/joint graphs, but the script above ensures you are exporting directly from NVIDIA's official PyTorch checkpoint.)
3. Run pre-quantization rewrite on the exported encoder:
onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
--extended-conv1x1 \
--allow-non-unit-dilation \
--max-dilation 4 \
--skip-checker \
--report-json encoder-model.preq.report.json
4. Check conversion summary
cat encoder-model.preq.report.json
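For a quick programmatic look at the report, you can summarize any JSON file schema-agnostically; the "converted"/"skipped" keys in the sample below are hypothetical, so inspect your own report for the exact layout.

```python
import json

def summarize_report(text):
    """Map each top-level key of a JSON report to its entry count (or raw value)."""
    report = json.loads(text)
    return {k: (len(v) if isinstance(v, (list, dict)) else v) for k, v in report.items()}

# Hypothetical report shape, for illustration only:
sample = '{"converted": ["Conv_3", "Conv_7"], "skipped": [{"node": "Conv_9", "reason": "group != 1"}]}'
print(summarize_report(sample))  # {'converted': 2, 'skipped': 1}
```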
5. Verify I/O equivalence (robust numerical check)
The CLI includes built-in strict, deterministic CPU verification. It runs the same inputs through both models and checks that the numerical deviation stays within tight float32 bounds (max_abs <= 3e-5, mean_abs <= 2e-6).
onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
--verify \
--verify-signal-input-name audio_signal \
--verify-length-input-name length \
--verify-output-index 0 \
--verify-length-output-index 1 \
--verify-deterministic-cpu \
--verify-lengths 64,80,97,128,191
If it prints Verification PASSED, the rewrite is numerically transparent.
Tip: the source model includes encoder-model.onnx.data as external weights; keep it in
the same directory as encoder-model.onnx while rewriting.
The CLI automatically loads external data and, when needed, writes rewritten artifacts as
<output>.onnx plus <output>.onnx.data.
Hybrid Quantization Workflow (Aligned with Parakeet INT4 Release)
If your goal is to reproduce the published Parakeet hybrid pipeline, use this sequence:
- Rewrite compatible pointwise conv layers (Conv1D k=1 and Conv2D 1x1) to MatMul (this tool).
- Quantize encoder linear + pointwise layers with int4 MatMulNBits (block_size=64, asymmetric).
- Keep depthwise conv layers in FP32 (the ONNX backend manages this automatically if left unconverted).
- Quantize decoder/joint with int8 dynamic quantization.
Example Step 2 (INT4 Quantization in Python):
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer
# Must use block_size=64 and is_symmetric=False for audio/speech models
# (like Parakeet) to avoid severe degradation in embedding Cosine Similarity.
quant = MatMulNBitsQuantizer(
'encoder-model.preq.onnx',
block_size=64,
is_symmetric=False,
accuracy_level=0
)
quant.process()
# Crucial: save with use_external_data_format=True so weight data goes to a
# separate .data file instead of being duplicated inside the Protobuf;
# this saves roughly 20 MB compared to a single-file save.
quant.model.save_model_to_file('encoder-model.int4.onnx', use_external_data_format=True)
Reference release and details:
Important: this repository covers step 1 (pre-quantization rewrite). The final 409 MB hybrid artifact depends on the downstream quantization stack and settings used in steps 2-4.
Why this matters
MatMulBnb4/NF4 and MatMulNBits are not equivalent quantization paths. If you compare
with the wrong quantizer family, you can get size and coverage numbers that do not match
the published hybrid result.
Compare package size after full pipeline
After you have produced final artifacts (encoder-model.int4.onnx, decoder_joint-model.int8.onnx,
optional nemo128.int8.onnx, plus vocab.txt and config.json), you can measure bundle size:
python - <<'PY'
from pathlib import Path
files = [
Path("encoder-model.int4.onnx"),
Path("decoder_joint-model.int8.onnx"),
Path("nemo128.int8.onnx"),
Path("vocab.txt"),
Path("config.json"),
]
existing = [p for p in files if p.exists()]
total = sum(p.stat().st_size for p in existing)
print("files used:")
for p in existing:
print(f"- {p.name}: {p.stat().st_size / (1024 * 1024):.2f} MB")
print("---")
print(f"bundle total: {total / (1024 * 1024):.2f} MB")
PY
Useful Options
Enable extended guarded mode:
Use this when you want to convert more Conv1x1 layers than strict mode.
It enables extra safe patterns (for example explicit padding or stride > 1),
while keeping guardrails: unsupported or risky layers are skipped, not forced.
onnx-conv2matmul input.onnx output.onnx --extended-conv1x1
Allow non-unit dilation (explicit opt-in):
Enable this only if your model uses dilated Conv1x1 and you want those layers
to be considered for rewrite too. It is opt-in because dilation can increase
conversion risk; keep it off unless you need that extra coverage.
onnx-conv2matmul input.onnx output.onnx --allow-non-unit-dilation
Set a max dilation guardrail:
onnx-conv2matmul input.onnx output.onnx --allow-non-unit-dilation --max-dilation 4
Write a detailed JSON report:
onnx-conv2matmul input.onnx output.onnx --report-json report.json
Print JSON report to stdout:
onnx-conv2matmul input.onnx output.onnx --report-json-stdout
Skip ONNX checker (useful for very large/external-data models):
onnx-conv2matmul input.onnx output.onnx --skip-checker
Compatibility Rules
Strict mode rewrites only conservative cases:
- Conv with constant pointwise weights of shape [M, C, 1] (Conv1D) or [M, C, 1, 1] (Conv2D)
- group = 1
- unit stride, unit dilation, zero explicit pads, auto_pad=NOTSET
- Optional bias is supported only when constant
Extended mode (--extended-conv1x1) also supports compatible pointwise conv with explicit
padding and/or stride > 1, with guardrails.
All unsupported Conv nodes are left unchanged and reported with a reason.
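The strict-mode rules can be paraphrased as a small predicate; this is an illustrative sketch of the checklist above, not the tool's actual implementation (attribute defaults follow the ONNX Conv operator spec).

```python
def strict_mode_eligible(weight_shape, group=1, strides=(1,), dilations=(1,),
                         pads=(0,), auto_pad="NOTSET"):
    """Strict-mode check: pointwise constant weight of rank 3 (Conv1D) or
    rank 4 (Conv2D), group=1, unit stride/dilation, zero pads, auto_pad=NOTSET."""
    pointwise = len(weight_shape) in (3, 4) and all(k == 1 for k in weight_shape[2:])
    return (pointwise
            and group == 1
            and all(s == 1 for s in strides)
            and all(d == 1 for d in dilations)
            and all(p == 0 for p in pads)
            and auto_pad == "NOTSET")

print(strict_mode_eligible((16, 8, 1, 1)))             # Conv2D 1x1 -> True
print(strict_mode_eligible((16, 8, 3, 3)))             # 3x3 kernel -> False
print(strict_mode_eligible((16, 8, 1), strides=(2,)))  # stride 2 is skipped in strict mode
```

Extended mode (--extended-conv1x1) would relax the stride/pads conditions under its own guardrails.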