onnx-conv2matmul

Utilities to rewrite ONNX convolution patterns into matrix multiplication forms.

Project description

onnx-conv2matmul

onnx-conv2matmul is a pre-quantization rewrite tool for ONNX models.

It converts compatible Conv 1x1 layers into equivalent MatMul subgraphs so your quantization stack can treat more model weights with MatMul-focused kernels (for example int4 weight-only flows).

Why this matters in practice:

Many backends optimize/quantize MatMul more aggressively than Conv
Pointwise convs often dominate encoder blocks and are good candidates for rewrite
You can reduce model size and improve throughput potential without changing model semantics

The tool is conservative by default, keeps unsupported layers untouched, and emits a detailed report of converted vs skipped nodes with explicit reasons.

Install

From PyPI:

pip install onnx-conv2matmul

Basic Usage

Rewrite to a new file:

onnx-conv2matmul input.onnx output.onnx

Overwrite in-place:

onnx-conv2matmul input.onnx --inplace

Rewrite and verify in one command (CPU):

onnx-conv2matmul input.onnx output.onnx \
	--extended-conv1x1 \
	--verify \
	--verify-deterministic-cpu \
	--verify-lengths 64,80,97,128,191

Parakeet Example (Pre-Quantization)

Reference model page:

https://huggingface.co/efederici/parakeet-tdt-0.6b-v3-onnx-int4

Typical pre-quantization step for the encoder graph:

onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
	--extended-conv1x1 \
	--allow-non-unit-dilation \
	--max-dilation 4 \
	--skip-checker \
	--report-json encoder-model.preq.report.json

This converts compatible pointwise conv layers (Conv1D k=1 and Conv2D 1x1) to MatMul and writes a JSON report with converted/skipped nodes and reasons.

Hugging Face FP32 -> Pre-Quantization Workflow

If you start from the original FP32 ONNX release:

https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx

you can download only the files needed for encoder pre-quantization and run the rewrite.

Download required files from Hugging Face

pip install "huggingface_hub[cli]"
hf download istupakov/parakeet-tdt-0.6b-v3-onnx \
	--local-dir ./hf_parakeet_fp32 \
	--include encoder-model.onnx encoder-model.onnx.data config.json vocab.txt

Run pre-quantization rewrite on encoder

cd hf_parakeet_fp32
onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
	--extended-conv1x1 \
	--allow-non-unit-dilation \
	--max-dilation 4 \
	--skip-checker \
	--report-json encoder-model.preq.report.json

Check conversion summary

cat encoder-model.preq.report.json

Verify I/O equivalence (robust numerical check)

The CLI includes a built-in strict CPU deterministic verification. It runs inputs through both models to ensure the maximum numerical deviation remains within safe float32 bounds (max_abs <= 3e-5, mean_abs <= 2e-6).

onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
	--verify \
	--verify-signal-input-name audio_signal \
	--verify-length-input-name length \
	--verify-output-index 0 \
	--verify-length-output-index 1 \
	--verify-deterministic-cpu \
	--verify-lengths 64,80,97,128,191

If it prints Verification PASSED, the rewrite is numerically transparent.

Tip: the source model includes encoder-model.onnx.data as external weights; keep it in the same directory as encoder-model.onnx while rewriting.

The CLI automatically loads external data and, when needed, writes rewritten artifacts as <output>.onnx plus <output>.onnx.data.

Hybrid Quantization Workflow (Aligned with Parakeet INT4 Release)

If your goal is to reproduce the published Parakeet hybrid pipeline, use this sequence:

Rewrite compatible pointwise conv layers (Conv1D k=1 and Conv2D 1x1) to MatMul (this tool).
Quantize encoder linear + pointwise layers with int4 MatMulNBits (block_size=64, asymmetric).
Keep depthwise conv layers in FP32 (the ONNX backend manages this automatically if left unconverted).
Quantize decoder/joint with int8 dynamic quantization.

Example Step 2 (INT4 Quantization in Python):

from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

# Must use block_size=64 and is_symmetric=False for audio/speech models
# (like Parakeet) to avoid severe degradation in embedding Cosine Similarity.
quant = MatMulNBitsQuantizer(
    'encoder-model.preq.onnx',
    block_size=64,
    is_symmetric=False,
    accuracy_level=0
)
quant.process()

# Crucial: Save with use_external_data_format=True to de-duplicate Protobuf
# serialization overhead. This saves an extra ~20MB compared to a single file.
quant.model.save_model_to_file('encoder-model.int4.onnx', use_external_data_format=True)

Reference release and details:

https://huggingface.co/efederici/parakeet-tdt-0.6b-v3-onnx-int4

Important: this repository covers step 1 (pre-quantization rewrite). The final 409 MB hybrid artifact depends on the downstream quantization stack and settings used in steps 2-4.

Why this matters

MatMulBnb4/NF4 and MatMulNBits are not equivalent quantization paths. If you compare with the wrong quantizer family, you can get size and coverage numbers that do not match the published hybrid result.

Compare package size after full pipeline

After you have produced final artifacts (encoder-model.int4.onnx, decoder_joint-model.int8.onnx, optional nemo128.int8.onnx, plus vocab.txt and config.json), you can measure bundle size:

python - <<'PY'
from pathlib import Path

files = [
    Path("encoder-model.int4.onnx"),
    Path("decoder_joint-model.int8.onnx"),
    Path("nemo128.int8.onnx"),
    Path("vocab.txt"),
    Path("config.json"),
]

existing = [p for p in files if p.exists()]
total = sum(p.stat().st_size for p in existing)

print("files used:")
for p in existing:
    print(f"- {p.name}: {p.stat().st_size / (1024 * 1024):.2f} MB")
print("---")
print(f"bundle total: {total / (1024 * 1024):.2f} MB")
PY

Useful Options

Enable extended guarded mode:

Use this when you want to convert more Conv1x1 layers than strict mode. It enables extra safe patterns (for example explicit padding or stride > 1), while keeping guardrails: unsupported or risky layers are skipped, not forced.

onnx-conv2matmul input.onnx output.onnx --extended-conv1x1

Allow non-unit dilation (explicit opt-in):

Enable this only if your model uses dilated Conv1x1 and you want those layers to be considered for rewrite too. It is opt-in because dilation can increase conversion risk; keep it off unless you need that extra coverage.

onnx-conv2matmul input.onnx output.onnx --allow-non-unit-dilation

Set a max dilation guardrail:

onnx-conv2matmul input.onnx output.onnx --allow-non-unit-dilation --max-dilation 4

Write a detailed JSON report:

onnx-conv2matmul input.onnx output.onnx --report-json report.json

Print JSON report to stdout:

onnx-conv2matmul input.onnx output.onnx --report-json-stdout

Skip ONNX checker (useful for very large/external-data models):

onnx-conv2matmul input.onnx output.onnx --skip-checker

Compatibility Rules

Strict mode rewrites only conservative cases:

Conv with constant pointwise weights of shape [M, C, 1] (Conv1D) or [M, C, 1, 1] (Conv2D)
group=1
unit stride, unit dilation, zero explicit pads, auto_pad=NOTSET
Optional bias is supported only when constant

Extended mode (--extended-conv1x1) also supports compatible pointwise conv with explicit padding and/or stride > 1, with guardrails.

All unsupported Conv nodes are left unchanged and reported with a reason.

Project details

Release history Release notifications | RSS feed

0.1.1

Apr 17, 2026

This version

0.1.0

Apr 16, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

onnx_conv2matmul-0.1.0.tar.gz (12.7 kB view details)

Uploaded Apr 16, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

onnx_conv2matmul-0.1.0-py3-none-any.whl (15.0 kB view details)

Uploaded Apr 16, 2026 Python 3

File details

Details for the file onnx_conv2matmul-0.1.0.tar.gz.

File metadata

Download URL: onnx_conv2matmul-0.1.0.tar.gz
Upload date: Apr 16, 2026
Size: 12.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.4

File hashes

Hashes for onnx_conv2matmul-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3dbb51bef1c0a0e5dd644bfd22402c22f088605875a8e9bfdc587fd69ba3fdad`
MD5	`74690f01e645c8aaac2bffb75014c61d`
BLAKE2b-256	`4a6b22fc5f4b0d81e9d03b39815acc7300428aee11127d2761a98ab13dfa6b28`

See more details on using hashes here.

File details

Details for the file onnx_conv2matmul-0.1.0-py3-none-any.whl.

File metadata

Download URL: onnx_conv2matmul-0.1.0-py3-none-any.whl
Upload date: Apr 16, 2026
Size: 15.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.4

File hashes

Hashes for onnx_conv2matmul-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5b1a3a5190e14bfb1d60902b37e3953bbcdd26faa0aa4504f8df6c08c794956d`
MD5	`ab1c992d64e2a6ef3a6407f5b0234022`
BLAKE2b-256	`920fef3a07a0f3a80a4f07dcc031d06696f1417cb50896a05eac4e138d032e63`

See more details on using hashes here.

onnx-conv2matmul 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

onnx-conv2matmul

Install

Basic Usage

Parakeet Example (Pre-Quantization)

Hugging Face FP32 -> Pre-Quantization Workflow

Hybrid Quantization Workflow (Aligned with Parakeet INT4 Release)

Why this matters

Compare package size after full pipeline

Useful Options

Compatibility Rules

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes