
ONNX Shrink Ray

Shrinks the size of ONNX files by quantizing large float constants into eight bit equivalents, while leaving all calculations in floating point.

Installation

The easiest way to get started is to install this package in Python using pip:

pip install onnx_shrink_ray

You can also download this repository and run the shrink.py script directly.

Usage

To reduce the size of a single file

python -m onnx_shrink_ray.shrink myfile.onnx

This will convert all of the weights in the ONNX file from 32-bit floating point to 8-bit integers, followed by a DequantizeLinear operation to linearly scale those into approximations of the original values for later calculations. The resulting ONNX file is typically less than 30% of the input's size.
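The effect of this transformation can be sketched in a few lines of NumPy. This is an illustrative simplification, not the module's actual code: it quantizes a tensor to eight-bit values with a scale and zero point, then undoes it the way a DequantizeLinear node would at inference time.

```python
import numpy as np

def quantize(w):
    """Map a float32 array to uint8 plus a (scale, zero_point) pair."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0  # guard against constant tensors
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """What a DequantizeLinear node computes at inference time."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale, zp = quantize(w)
restored = dequantize(q, scale, zp)
print(q.nbytes, w.nbytes)                         # 1000 4000 -- 4x smaller raw storage
print(bool(np.abs(w - restored).max() <= scale))  # True -- error within one quantization step
```

The four-to-one byte ratio is why the resulting file lands well under a third of the original size: the quantized weights dominate the file, with only small scale and zero-point constants added per tensor.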

To reduce the compressed size of a file

python -m onnx_shrink_ray.shrink --method "float_weights" --float_levels 256 myfile.onnx

A lot of downloads and app bundles are automatically compressed using a standard like gzip or brotli. Neural network weights often don't compress well when they're stored as floating point numbers, since there is very little repetition in the values: they're usually all slightly different from one another. If we know our model will be compressed for delivery, we can reduce the actual download size by making the weight values (which normally make up the majority of the file) easier to compress.

This tool does this by rounding all the float values in a weight array to the nearest of a limited number of quantized steps, but then storing the results back into a 32-bit floating point tensor. This means the uncompressed size on disk remains the same, but the compressed version is often several times smaller: because each weight tensor now contains only a limited number of distinct values, there's a lot more repetition in the byte stream for the compression algorithm to take advantage of.

By default, each weight tensor is quantized to 256 levels, but since the results are stored as floating point values, you can modify this to trade off compressed file size for accuracy. For example, increasing the --float_levels argument to 1,000 can improve accuracy at the cost of a larger compressed file, whereas 100 would shrink the size, but could negatively impact quality.
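The idea can be demonstrated with NumPy and zlib (standing in for gzip; a toy sketch, not the module's internals): round each weight to one of 256 evenly spaced levels, keep float32 storage, and compare how well the raw bytes compress.

```python
import zlib
import numpy as np

def quantize_float_weights(w, levels=256):
    """Round each value to the nearest of `levels` evenly spaced steps,
    but keep the result stored as float32."""
    lo, hi = float(w.min()), float(w.max())
    step = (hi - lo) / (levels - 1)
    return (lo + np.round((w - lo) / step) * step).astype(np.float32)

w = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
wq = quantize_float_weights(w, levels=256)

raw_size = len(zlib.compress(w.tobytes(), 9))
quantized_size = len(zlib.compress(wq.tobytes(), 9))
print(w.nbytes, wq.nbytes)        # 400000 400000 -- uncompressed size unchanged
print(quantized_size < raw_size)  # True -- the quantized bytes compress far better
```

Changing `levels` here mirrors the `--float_levels` trade-off: more levels means more distinct byte patterns, so less repetition for the compressor but a smaller rounding error per weight.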

What Shrink Ray does

Standard ONNX quantization is focused on converting all calculations to eight bit, which can reduce latency dramatically on some platforms. However, this approach can also cause accuracy problems, and it often requires some manual work to achieve the best results.

Sometimes though, the biggest problem is not speeding up the execution of a network, but reducing the size of the model data. This can be the case when a model has to be downloaded, where the size determines the loading time before it can be used, or when it's part of a mobile app bundle or other edge device with limited storage space.

The standard ONNX quantization does offer some file size benefits, but the potential impact on accuracy means it can take time and effort to achieve these savings. As an alternative, this module implements "weight-only quantization", where all calculations and activation layers are left in their initial precision, and only the weights are stored in a lower-fidelity format.

This approach has the advantage that it is much less likely to significantly impact accuracy, and so can usually be applied quickly, with no manual tweaking or fixups required. It will not speed up latency (and some of the methods may actually slow execution by a small amount) but it can offer significant file size savings.

Though this method is designed to have a minimal impact on the accuracy of the model, there are networks that may be adversely affected. The heuristic used to identify weights simply searches for constants or initializers that are larger than 16,384 elements, with the assumption that smaller constants are more likely to be non-weight parameters, and won't contribute much to the overall size of the model on disk.
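That heuristic amounts to a simple element-count check. Here is a minimal sketch of the idea; the function and constant names are made up for illustration, and only the 16,384 figure comes from the description above.

```python
import math

WEIGHT_ELEMENT_THRESHOLD = 16_384  # tensors at or below this count are left untouched

def looks_like_a_weight(dims):
    """Treat a constant/initializer as a weight only if it has more than
    16,384 elements; smaller tensors are assumed to be biases, shape
    parameters, etc. that contribute little to the file size."""
    return math.prod(dims) > WEIGHT_ELEMENT_THRESHOLD

print(looks_like_a_weight([288]))        # False: likely a bias or shape parameter
print(looks_like_a_weight([512, 2048]))  # True: a large matmul weight
```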

Results

The initial reason for creating this project was to reduce the download size for the Moonshine models on the web, so I've done the most extensive testing on those networks. Here are the size and accuracy results when running against the LibriSpeech clean English-language dataset.

Moonshine Tiny

|                              | WER    | File Size | GZIP Size | Brotli Size | Latency |
|------------------------------|--------|-----------|-----------|-------------|---------|
| Original                     | 4.51%  | 272MB     | 251MB     | 226MB       | 307ms   |
| Integer Weights              | 4.69%  | 69MB      | 53MB      | 46MB        | 466ms   |
| Float Weights (100 levels)   | 11.34% | 272MB     | 60MB      | 46MB        |         |
| Float Weights (256 levels)   | 4.69%  | 272MB     | 75MB      | 59MB        |         |
| Float Weights (1,000 levels) | 4.47%  | 272MB     | 108MB     | 79MB        |         |
| ONNX Dynamic Quantization    | 30.99% | 113MB     | 95MB      | 71MB        |         |

Moonshine Base

|                              | WER    | File Size | GZIP Size | Brotli Size | Latency |
|------------------------------|--------|-----------|-----------|-------------|---------|
| Original                     | 3.29%  | 556MB     | 515MB     | 469MB       |         |
| Integer Weights              | 3.28%  | 141MB     | 105MB     | 92MB        |         |
| Float Weights (100 levels)   | 3.55%  | 556MB     | 120MB     | 94MB        |         |
| Float Weights (256 levels)   | 3.28%  | 556MB     | 155MB     | 121MB       |         |
| Float Weights (1,000 levels) | 3.29%  | 556MB     | 217MB     |             |         |
| ONNX Dynamic Quantization    | 19.06% | 264MB     | 225MB     | 180MB       |         |

Notes

The compressed file sizes were calculated by checking the archive size after running tar --use-compress-program="<brotli|gzip> --best" -cvf archive.tbz <folder of model files>. The --best flag is used here to ensure the compression is as effective as possible by running multiple passes.

Latency values were calculated by running a ten second audio clip through each model on a Microsoft Surface Pro with an x86 CPU, using the moonshine_onnx.benchmark() function included in the library.

ONNX dynamic quantization results are included for reference. These are models produced by the onnxruntime.quantization.quantize_dynamic() function with default arguments. For convenience you can invoke this through the --method "integer_activations" option.

Some interesting patterns are visible:

  • The float weight quantization has no effect on the uncompressed file size, but dramatically decreases the compressed file size, as expected. It also makes no statistically significant difference to the latency.

  • The integer weight quantization is a lot slower than float weights. This is a bit surprising, since the only difference is a DequantizeLinear operation for each weight constant, but my best guess is that the op hasn't been optimized, on this platform at least.

  • ONNX quantization produces models that are fast, but much less accurate. In my experience this is a common outcome, and can be fixed with some investigation into exactly where the accuracy loss is occurring, but it tends to be a time-consuming process, hence my desire for something easier when file size is the biggest obstacle.

  • ONNX quantization doesn't shrink the raw files as much as I'd expect. If the weights were being stored as 8-bit integers, I'd expect the file size to be the same as the integer_weights version, but they're about twice as large. I wonder if the weights are actually stored as 16-bit in this case, or if there's somehow an extra copy?

  • Different models can tolerate different levels of float quantization. The base model only loses a fraction of a percent at 100 levels, whereas the tiny model loses several points.

  • Brotli does a better job at compressing these files than gzip, though the compression process takes significantly longer in my experience. Since brotli is now widely supported by browsers, it seems like the best method to use overall.
