
ONNX Shrink Ray

Shrinks the size of ONNX files by quantizing large float constants into eight bit equivalents, while leaving all calculations in floating point.

Installation

The easiest way to get started is to install this package in Python using pip:

pip install onnx_shrink_ray

You can also download this repository and run the shrink.py script directly.

Usage

To reduce the size of a single file

python -m onnx_shrink_ray.shrink myfile.onnx

This will convert all of the weights in the ONNX file from 32-bit floating point to 8-bit integers, followed by a DequantizeLinear operation to linearly scale those into approximations of the original values for later calculations. The resulting ONNX file is typically less than 30% of the input's size.
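The effect of this transformation can be sketched in a few lines of NumPy. This is an illustrative simplification, not the module's actual code: it quantizes a tensor to eight-bit values with a scale and zero point, then undoes it the way a DequantizeLinear node would at inference time.

```python
import numpy as np

def quantize(w):
    """Map a float32 array to uint8 plus a (scale, zero_point) pair."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0  # guard against constant tensors
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """What a DequantizeLinear node computes at inference time."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, scale, zp = quantize(w)
restored = dequantize(q, scale, zp)
print(q.nbytes, w.nbytes)                         # 1000 4000 -- 4x smaller raw storage
print(bool(np.abs(w - restored).max() <= scale))  # True -- error within one quantization step
```

The four-to-one byte ratio is why the resulting file lands well under a third of the original size: the quantized weights dominate the file, with only small scale and zero-point constants added per tensor.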

To reduce the compressed size of a file

python -m onnx_shrink_ray.shrink --method "float_weights" --float_levels 256 myfile.onnx

A lot of downloads and app bundles are automatically compressed using a standard like gzip or brotli. Neural network weights often don't compress well when they're stored as floating point numbers, since there is very little repetition in the values: they're usually all slightly different from one another. If we know our model will be compressed for delivery, we can reduce the actual download size by making the weight values (which normally make up the majority of the file) easier to compress.

This tool does this by rounding all the float values in a weight array to the nearest of a limited number of quantized steps, but then storing the results back into a 32-bit floating point tensor. This means the uncompressed size on disk remains the same, but the compressed version is often several times smaller: because each weight tensor now contains only a limited number of distinct values, there's a lot more repetition in the byte stream for the compression algorithm to take advantage of.

By default, each weight tensor is quantized to 256 levels, but since the results are stored as floating point values, you can modify this to trade off compressed file size for accuracy. For example, increasing the --float_levels argument to 1,000 can improve accuracy at the cost of a larger compressed file, whereas 100 would shrink the size, but could negatively impact quality.
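The idea can be demonstrated with NumPy and zlib (standing in for gzip; a toy sketch, not the module's internals): round each weight to one of 256 evenly spaced levels, keep float32 storage, and compare how well the raw bytes compress.

```python
import zlib
import numpy as np

def quantize_float_weights(w, levels=256):
    """Round each value to the nearest of `levels` evenly spaced steps,
    but keep the result stored as float32."""
    lo, hi = float(w.min()), float(w.max())
    step = (hi - lo) / (levels - 1)
    return (lo + np.round((w - lo) / step) * step).astype(np.float32)

w = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
wq = quantize_float_weights(w, levels=256)

raw_size = len(zlib.compress(w.tobytes(), 9))
quantized_size = len(zlib.compress(wq.tobytes(), 9))
print(w.nbytes, wq.nbytes)        # 400000 400000 -- uncompressed size unchanged
print(quantized_size < raw_size)  # True -- the quantized bytes compress far better
```

Changing `levels` here mirrors the `--float_levels` trade-off: more levels means more distinct byte patterns, so less repetition for the compressor but a smaller rounding error per weight.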

What Shrink Ray does

Standard ONNX quantization is focused on converting all calculations to eight bit, which can reduce latency dramatically on some platforms. However, this approach can also cause accuracy problems, and it often requires some manual work to achieve the best results.

Sometimes though, the biggest problem is not speeding up the execution of a network, but reducing the size of the model data. This can be the case when a model has to be downloaded, where the size determines the loading time before it can be used, or when it's part of a mobile app bundle or other edge device with limited storage space.

The standard ONNX quantization does offer some file size benefits, but the potential impact on accuracy means it can take time and effort to achieve these savings. As an alternative, this module implements "weight-only quantization", where all calculations and activation layers are left in their initial precision, and only the weights are stored in a lower-fidelity format.

This approach has the advantage that it is much less likely to significantly impact accuracy, and so can usually be applied quickly, with no manual tweaking or fixups required. It will not speed up latency (and some of the methods may actually slow execution by a small amount) but it can offer significant file size savings.

Though this method is designed to have a minimal impact on the accuracy of the model, there are networks that may be adversely affected. The heuristic used to identify weights simply searches for constants or initializers that are larger than 16,384 elements, with the assumption that smaller constants are more likely to be non-weight parameters, and won't contribute much to the overall size of the model on disk.
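That heuristic amounts to a simple element-count check. Here is a minimal sketch of the idea; the function and constant names are made up for illustration, and only the 16,384 figure comes from the description above.

```python
import math

WEIGHT_ELEMENT_THRESHOLD = 16_384  # tensors at or below this count are left untouched

def looks_like_a_weight(dims):
    """Treat a constant/initializer as a weight only if it has more than
    16,384 elements; smaller tensors are assumed to be biases, shape
    parameters, etc. that contribute little to the file size."""
    return math.prod(dims) > WEIGHT_ELEMENT_THRESHOLD

print(looks_like_a_weight([288]))        # False: likely a bias or shape parameter
print(looks_like_a_weight([512, 2048]))  # True: a large matmul weight
```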

Results

The initial reason for creating this project was to reduce the download size for the Moonshine models on the web, so I've done the most extensive testing on those networks. Here are the size and accuracy results when running against the LibriSpeech clean English-language dataset.

Moonshine Tiny

|                              | WER    | File Size | GZIP Size | Brotli Size | Latency |
|------------------------------|--------|-----------|-----------|-------------|---------|
| Original                     | 4.51%  | 272MB     | 251MB     | 226MB       | 307ms   |
| Integer Weights              | 4.69%  | 69MB      | 53MB      | 46MB        | 466ms   |
| Float Weights (100 levels)   | 11.34% | 272MB     | 60MB      | 46MB        |         |
| Float Weights (256 levels)   | 4.69%  | 272MB     | 75MB      | 59MB        |         |
| Float Weights (1,000 levels) | 4.47%  | 272MB     | 108MB     | 79MB        |         |
| ONNX Dynamic Quantization    | 30.99% | 113MB     | 95MB      | 71MB        |         |

Moonshine Base

|                              | WER    | File Size | GZIP Size | Brotli Size | Latency |
|------------------------------|--------|-----------|-----------|-------------|---------|
| Original                     | 3.29%  | 556MB     | 515MB     | 469MB       |         |
| Integer Weights              | 3.28%  | 141MB     | 105MB     | 92MB        |         |
| Float Weights (100 levels)   | 3.55%  | 556MB     | 120MB     | 94MB        |         |
| Float Weights (256 levels)   | 3.28%  | 556MB     | 155MB     | 121MB       |         |
| Float Weights (1,000 levels) | 3.29%  | 556MB     | 217MB     |             |         |
| ONNX Dynamic Quantization    | 19.06% | 264MB     | 225MB     | 180MB       |         |

Notes

The compressed file sizes were calculated by checking the archive size after running tar --use-compress-program="<brotli|gzip> --best" -cvf archive.tbz <folder of model files>. The --best flag is used here to ensure the compression is as effective as possible by running multiple passes.

Latency values were calculated by running a ten second audio clip through each model on a Microsoft Surface Pro with an x86 CPU, using the moonshine_onnx.benchmark() function included in the library.

ONNX dynamic quantization results are included for reference. These are models produced by the onnxruntime.quantization.quantize_dynamic() function with default arguments. For convenience you can invoke this through the --method "integer_activations" option.

Some interesting patterns are visible:

  • The float weight quantization has no effect on the uncompressed file size, but dramatically decreases the compressed file size, as expected. It also makes no statistically significant difference to the latency.

  • The integer weight quantization is a lot slower than float weights. This is a bit surprising, since the only difference is a DequantizeLinear operation for each weight constant, but my best guess is that the op hasn't been optimized, on this platform at least.

  • ONNX quantization produces models that are fast, but much less accurate. In my experience this is a common outcome, and can be fixed with some investigation into exactly where the accuracy loss is occurring, but it tends to be a time-consuming process, hence my desire for something easier when file size is the biggest obstacle.

  • ONNX quantization doesn't shrink the raw files as much as I'd expect. If the weights were being stored as 8-bit integers, I'd expect the file size to be the same as the integer_weights version, but they're about twice as large. I wonder if the weights are actually stored as 16-bit in this case, or if there's somehow an extra copy?

  • Different models can tolerate different levels of float quantization. The base model only loses a fraction of a percent at 100 levels, whereas the tiny model loses several points.

  • Brotli does a better job at compressing these files than gzip, though the compression process takes significantly longer in my experience. Since brotli is now widely supported by browsers, it seems like the best method to use overall.
