Utilities to rewrite ONNX convolution patterns into matrix multiplication forms.

onnx-conv2matmul

onnx-conv2matmul is a pre-quantization rewrite tool for ONNX models.

It converts compatible Conv 1x1 layers into equivalent MatMul subgraphs so your quantization stack can cover more model weights with MatMul-focused kernels (for example, int4 weight-only flows).

Why this matters in practice:

  • Many backends optimize/quantize MatMul more aggressively than Conv
  • Pointwise convs often dominate encoder blocks and are good candidates for rewrite
  • You can reduce model size and unlock potential throughput gains without changing model semantics

The tool is conservative by default, keeps unsupported layers untouched, and emits a detailed report of converted vs skipped nodes with explicit reasons.
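The equivalence the tool exploits can be checked numerically: a pointwise Conv (kernel 1x1, group=1, unit stride/dilation, no padding) is just a per-position matrix multiply over the channel dimension. A minimal NumPy sketch (not the tool's implementation):

```python
import numpy as np

# A 1x1 Conv2D with M output channels over C input channels
N, C, H, W = 1, 8, 4, 5
M = 16
x = np.random.rand(N, C, H, W).astype(np.float32)
w = np.random.rand(M, C, 1, 1).astype(np.float32)

# Conv 1x1 computed directly: out[n,m,h,w] = sum_c x[n,c,h,w] * w[m,c]
conv = np.einsum('nchw,mc->nmhw', x, w[:, :, 0, 0])

# Same result via MatMul: flatten spatial dims, multiply, reshape back
x2 = x.transpose(0, 2, 3, 1).reshape(-1, C)                 # (N*H*W, C)
mm = (x2 @ w[:, :, 0, 0].T).reshape(N, H, W, M).transpose(0, 3, 1, 2)

assert np.allclose(conv, mm, atol=1e-5)
```

The rewrite is therefore semantics-preserving up to float32 rounding, which is why backends can swap in their MatMul quantization kernels afterwards.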

Install

From PyPI:

pip install onnx-conv2matmul

Basic Usage

Rewrite to a new file:

onnx-conv2matmul input.onnx output.onnx

Overwrite in-place:

onnx-conv2matmul input.onnx --inplace

Rewrite and verify in one command (CPU):

onnx-conv2matmul input.onnx output.onnx \
	--extended-conv1x1 \
	--verify \
	--verify-deterministic-cpu \
	--verify-lengths 64,80,97,128,191

Parakeet Example (Pre-Quantization)

Typical pre-quantization step for the encoder graph:

onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
	--extended-conv1x1 \
	--allow-non-unit-dilation \
	--max-dilation 4 \
	--skip-checker \
	--report-json encoder-model.preq.report.json

This converts compatible pointwise conv layers (Conv1D k=1 and Conv2D 1x1) to MatMul and writes a JSON report with converted/skipped nodes and reasons.

Parakeet: From Original NeMo Checkpoint to Pre-Quantization

If you want to start from the original FP32 PyTorch release, you can export the ONNX graphs directly from the official .nemo checkpoint using the nemo_toolkit library.

1. Install NeMo:

pip install nemo_toolkit[asr]

2. Export the FP32 ONNX Encoder Graph:

import nemo.collections.asr as nemo_asr

# Download from HF and load the PyTorch NeMo model
model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v3")
model.eval()

# Export the encoder directly to match the filename convention
model.encoder.export("encoder-model.onnx")

(Note: Pre-exported ONNX repositories like istupakov/parakeet-tdt-0.6b-v3-onnx are convenient and also contain the decoder/joint graphs, but the script above exports directly from NVIDIA's official PyTorch checkpoint.)

3. Run pre-quantization rewrite on the exported encoder:

onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
	--extended-conv1x1 \
	--allow-non-unit-dilation \
	--max-dilation 4 \
	--skip-checker \
	--report-json encoder-model.preq.report.json

4. Check conversion summary

cat encoder-model.preq.report.json

5. Verify I/O equivalence (robust numerical check)

The CLI includes a built-in strict deterministic CPU verification. It runs the same inputs through both models and checks that the numerical deviation stays within safe float32 bounds (max_abs <= 3e-5, mean_abs <= 2e-6).
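The pass/fail criterion can be sketched as follows (a simplified illustration of the thresholds above, not the tool's actual implementation):

```python
import numpy as np

# Thresholds from the CLI's deterministic CPU check
MAX_ABS_TOL = 3e-5
MEAN_ABS_TOL = 2e-6

def within_float32_bounds(ref: np.ndarray, out: np.ndarray) -> bool:
    """Compare reference vs rewritten model outputs elementwise."""
    diff = np.abs(ref.astype(np.float64) - out.astype(np.float64))
    return diff.max() <= MAX_ABS_TOL and diff.mean() <= MEAN_ABS_TOL

ref = np.random.rand(4, 128).astype(np.float32)
assert within_float32_bounds(ref, ref.copy())       # identical outputs pass
assert not within_float32_bounds(ref, ref + 1e-3)   # large deviation fails
```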

onnx-conv2matmul encoder-model.onnx encoder-model.preq.onnx \
	--verify \
	--verify-signal-input-name audio_signal \
	--verify-length-input-name length \
	--verify-output-index 0 \
	--verify-length-output-index 1 \
	--verify-deterministic-cpu \
	--verify-lengths 64,80,97,128,191

If it prints Verification PASSED, the rewrite is numerically transparent.

Tip: the source model includes encoder-model.onnx.data as external weights; keep it in the same directory as encoder-model.onnx while rewriting.

The CLI automatically loads external data and, when needed, writes rewritten artifacts as <output>.onnx plus <output>.onnx.data.

Hybrid Quantization Workflow (Aligned with Parakeet INT4 Release)

If your goal is to reproduce the published Parakeet hybrid pipeline, use this sequence:

  1. Rewrite compatible pointwise conv layers (Conv1D k=1 and Conv2D 1x1) to MatMul (this tool).
  2. Quantize encoder linear + pointwise layers with int4 MatMulNBits (block_size=64, asymmetric).
  3. Keep depthwise conv layers in FP32 (the ONNX backend manages this automatically if left unconverted).
  4. Quantize decoder/joint with int8 dynamic quantization.

Example Step 2 (INT4 Quantization in Python):

from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

# Must use block_size=64 and is_symmetric=False for audio/speech models
# (like Parakeet) to avoid severe degradation in embedding Cosine Similarity.
quant = MatMulNBitsQuantizer(
    'encoder-model.preq.onnx',
    block_size=64,
    is_symmetric=False,
    accuracy_level=0
)
quant.process()

# Crucial: Save with use_external_data_format=True to de-duplicate Protobuf
# serialization overhead. This saves an extra ~20MB compared to a single file.
quant.model.save_model_to_file('encoder-model.int4.onnx', use_external_data_format=True)

Important: this repository covers step 1 (pre-quantization rewrite). The final 409 MB hybrid artifact depends on the downstream quantization stack and settings used in steps 2-4.

Why this matters

MatMulBnb4/NF4 and MatMulNBits are not equivalent quantization paths. Comparing against the wrong quantizer family yields size and coverage numbers that do not match the published hybrid result.

Compare package size after full pipeline

After you have produced final artifacts (encoder-model.int4.onnx, decoder_joint-model.int8.onnx, optional nemo128.int8.onnx, plus vocab.txt and config.json), you can measure bundle size:

python - <<'PY'
from pathlib import Path

files = [
    Path("encoder-model.int4.onnx"),
    Path("decoder_joint-model.int8.onnx"),
    Path("nemo128.int8.onnx"),
    Path("vocab.txt"),
    Path("config.json"),
]

existing = [p for p in files if p.exists()]
total = sum(p.stat().st_size for p in existing)

print("files used:")
for p in existing:
    print(f"- {p.name}: {p.stat().st_size / (1024 * 1024):.2f} MB")
print("---")
print(f"bundle total: {total / (1024 * 1024):.2f} MB")
PY

Useful Options

Enable extended guarded mode:

Use this when you want to convert more Conv1x1 layers than strict mode. It enables extra safe patterns (for example explicit padding or stride > 1), while keeping guardrails: unsupported or risky layers are skipped, not forced.

onnx-conv2matmul input.onnx output.onnx --extended-conv1x1

Allow non-unit dilation (explicit opt-in):

Enable this only if your model uses dilated Conv1x1 and you want those layers to be considered for rewrite too. It is opt-in because dilation can increase conversion risk; keep it off unless you need that extra coverage.

onnx-conv2matmul input.onnx output.onnx --allow-non-unit-dilation

Set a max dilation guardrail:

onnx-conv2matmul input.onnx output.onnx --allow-non-unit-dilation --max-dilation 4

Write a detailed JSON report:

onnx-conv2matmul input.onnx output.onnx --report-json report.json

Print JSON report to stdout:

onnx-conv2matmul input.onnx output.onnx --report-json-stdout

Skip ONNX checker (useful for very large/external-data models):

onnx-conv2matmul input.onnx output.onnx --skip-checker

Compatibility Rules

Strict mode rewrites only conservative cases:

  • Conv with constant pointwise weights of shape [M, C, 1] (Conv1D) or [M, C, 1, 1] (Conv2D)
  • group=1
  • unit stride, unit dilation, zero explicit pads, auto_pad=NOTSET
  • Optional bias is supported only when constant

Extended mode (--extended-conv1x1) also supports compatible pointwise conv with explicit padding and/or stride > 1, with guardrails.
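The stride > 1 case that extended mode handles still reduces to a MatMul, because a strided 1x1 conv just subsamples the spatial grid before the per-position channel multiply. A NumPy sketch of the idea (not the tool's implementation):

```python
import numpy as np

# Stride-2 1x1 conv: subsample spatially, then do the channel matmul
N, C, H, W, M, s = 1, 4, 6, 6, 8, 2
x = np.random.rand(N, C, H, W).astype(np.float32)
w = np.random.rand(M, C).astype(np.float32)

# Direct computation: 1x1 conv with stride s touches only every s-th position
conv = np.einsum('nchw,mc->nmhw', x[:, :, ::s, ::s], w)

# Same result as a MatMul over the subsampled positions
xs = x[:, :, ::s, ::s].transpose(0, 2, 3, 1).reshape(-1, C)
mm = (xs @ w.T).reshape(N, H // s, W // s, M).transpose(0, 3, 1, 2)

assert np.allclose(conv, mm, atol=1e-5)
```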

All unsupported Conv nodes are left unchanged and reported with a reason.
