AI Edge Quantizer

A quantizer for advanced developers to quantize converted LiteRT models, aimed at helping them achieve optimal performance on resource-demanding models (e.g., GenAI models).

Build Status

  • Unit Tests (Linux)
  • Nightly Release
  • Nightly Colab

Installation

Requirements and Dependencies

  • Python versions: 3.10, 3.11, 3.12, 3.13
  • Operating system: Linux, macOS
  • TensorFlow: tf-nightly

Install

Nightly PyPI package:

pip install ai-edge-quantizer-nightly
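
As a quick sanity check that the package is importable (the module name matches the API examples below):

python -c "import ai_edge_quantizer"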

API Usage

The quantizer requires two inputs:

  1. An unquantized source LiteRT model (FP32, in the FlatBuffer format with the .tflite extension)
  2. A quantization recipe (details below)

and outputs a quantized LiteRT model that's ready for deployment on edge devices.

Basic Usage

In a nutshell, the quantizer works according to the following steps:

  1. Instantiate the Quantizer class. This is the user's entry point to the quantizer's functionality.
  2. Load a desired quantization recipe (details below).
  3. Quantize (and save) the model. This is where most of the quantizer's internal logic runs.
from ai_edge_quantizer import quantizer, recipe

# 1. Instantiate the quantizer with the unquantized FP32 source model.
qt = quantizer.Quantizer("path/to/input.tflite")
# 2. Load a pre-defined recipe: dynamic quantization, INT8 weights, FP32 activations.
qt.load_quantization_recipe(recipe.dynamic_wi8_afp32())
# 3. Quantize and save the quantized model.
qt.quantize().export_model("/path/to/output.tflite")
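
The recipe name encodes the scheme: dynamic_wi8_afp32 denotes dynamic quantization with INT8 weights and FP32 activations, matching the DYNAMIC_WI8_AFP32 recipe in the Operator coverage section below.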

Please see the getting started colab for a quick-start guide covering these three steps, and the selective quantization colab for more details on advanced features.

LiteRT Model

Please refer to the LiteRT documentation for ways to generate LiteRT models from JAX, PyTorch, and TensorFlow. The input source model should be an FP32 (unquantized) model in the FlatBuffer format with the .tflite extension.

Quantization Recipe

The user specifies a quantization recipe through AI Edge Quantizer's API, which is then applied to the source model. The recipe encodes all information on how a model is to be quantized, such as the number of bits, data type, symmetry, and scope name.

Essentially, a quantization recipe is defined as a collection of commands of the following type:

“Apply Quantization Algorithm X on Operator Y under Scope Z with ConfigN”.

For example:

"Uniformly quantize the FullyConnected op under scope 'dense1/' with INT8 symmetric with Dynamic Quantization".

All unspecified ops are kept as FP32 (unquantized). The scope of an operator in TFLite is defined as the output tensor name of the op, which preserves hierarchical model information from the source model (e.g., scope in TF). The best way to obtain scope names is to visualize the model with Model Explorer.
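
For illustration, the example command above could be expressed as a recipe entry along these lines (a sketch only: the field names are inferred from this section and the configuration list below, so treat them as assumptions and consult OpQuantizationRecipe in recipe_manager.py for the authoritative schema):

example_recipe = [
    {
        # Match ops whose output tensor name falls under scope 'dense1/'.
        "regex": ".*/dense1/.*",
        "operation": "FULLY_CONNECTED",
        # Assumed name for the uniform quantization algorithm.
        "algorithm_key": "min_max_uniform_quantize",
        "op_config": {
            "weight_tensor_config": {
                "num_bits": 8,
                "symmetric": True,
                "granularity": "CHANNELWISE",
                "dtype": "INT",
            },
            "compute_precision": "INTEGER",  # dynamic quantization
            "explicit_dequantize": False,
        },
    },
]

Such a recipe can then be passed to Quantizer.load_quantization_recipe().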

Currently, there are three ways to quantize an operator:

  • dynamic quantization (recommended): weights are quantized while activations remain in a float format and are not processed by AI Edge Quantizer (AEQ). The runtime kernel handles on-the-fly quantization of these activations, as identified by compute_precision=INTEGER and explicit_dequantize=False.

    • Pros: reduced model size and memory usage. Latency improvement due to integer computation. No sample data requirement (calibration).
    • Cons: on-the-fly quantization of activation tensors can affect model quality. Not supported on all hardware (e.g., some GPUs and NPUs).
  • weight-only quantization: only model weights are quantized, not activations. The actual op computation remains in float; the quantized weight is explicitly dequantized before being fed into the consuming op, via a dequantize op inserted between the two. To enable this, compute_precision is set to FLOAT and explicit_dequantize to True.

    • Pros: reduced model size and memory usage. No sample data requirement (calibration). Usually has the best model quality.
    • Cons: no latency benefit (it may even be worse) due to float computation with explicit dequantization.
  • static quantization: both weights and activations are quantized. This requires a calibration phase to estimate the quantization parameters of runtime tensors (activations); see the sketch after this list.

    • Pros: reduced model size, memory usage, and latency.
    • Cons: requires sample data for calibration, and imposing static quantization parameters (derived from calibration) on runtime tensors can compromise quality.
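
For the static path, a minimal sketch might look as follows (hedged: the static_wi8_ai8 preset name, the calibrate() method, and the calibration-data format are assumptions inferred from this section; the colabs show the exact API):

import numpy as np

from ai_edge_quantizer import quantizer, recipe

qt = quantizer.Quantizer("path/to/input.tflite")
# Assumed preset: static quantization, INT8 weights and INT8 activations.
qt.load_quantization_recipe(recipe.static_wi8_ai8())

# Representative sample inputs for calibration (format is an assumption:
# a list of feeds mapping the model's input name to a sample array).
calibration_data = [{"input": np.zeros((1, 224, 224, 3), dtype=np.float32)}]

# Calibration estimates quantization parameters for activation tensors,
# which the quantize step then bakes into the model.
calibration_result = qt.calibrate(calibration_data)
qt.quantize(calibration_result).export_model("/path/to/output.tflite")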

Generally, we recommend dynamic quantization for CPU/GPU deployment and static quantization for NPU deployment.

We include commonly used recipes in recipe.py. This is demonstrated in the getting started colab example. Advanced users can build their own recipe through the quantizer API.

Deployment

Please refer to the LiteRT deployment documentation for ways to deploy a quantized LiteRT model.

Advanced Recipes

There are many ways the user can configure and customize the quantization recipe beyond using a template in recipe.py. For example, the user can configure the recipe to achieve these features:

  • Selective quantization (exclude selected ops from being quantized)
  • Flexible mixed-scheme quantization (a mixture of different precisions, compute precisions, scopes, ops, configs, etc.)
  • 4-bit weight quantization

The selective quantization colab shows some of these more advanced features.

For specifics of the recipe schema, please refer to OpQuantizationRecipe in recipe_manager.py.

For advanced usage involving mixed quantization, the following API may be useful:

  • Use Quantizer.load_quantization_recipe() in quantizer.py to load a custom recipe.
  • Use Quantizer.update_quantization_recipe() in quantizer.py to extend or override specific parts of the recipe, as sketched below.
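
For example, mixed quantization could be sketched like this (the keyword arguments to update_quantization_recipe, and op_config shown as a plain dict, are assumptions inferred from the recipe fields described above; verify the actual signature in quantizer.py):

from ai_edge_quantizer import quantizer, recipe

qt = quantizer.Quantizer("path/to/input.tflite")
# Start from a pre-defined dynamic-quantization recipe...
qt.load_quantization_recipe(recipe.dynamic_wi8_afp32())
# ...then override one scope: drop FULLY_CONNECTED weights under
# 'attention/' to 4 bits (argument names are assumptions).
qt.update_quantization_recipe(
    regex=".*/attention/.*",
    operation_name="FULLY_CONNECTED",
    op_config={
        "weight_tensor_config": {
            "num_bits": 4,
            "symmetric": True,
            "granularity": "CHANNELWISE",
            "dtype": "INT",
        },
    },
)
qt.quantize().export_model("/path/to/output.tflite")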

Operator coverage

The following outlines the allowed configurations for each available recipe:

  • DYNAMIC_WI8_AFP32 / DYNAMIC_WI4_AFP32: activations not quantized (num_bits, symmetric, granularity, dtype all None); weights: num_bits 8 / 4, symmetric TRUE, granularity [CHANNELWISE, TENSORWISE], dtype INT; explicit_dequantize FALSE; compute_precision INTEGER.
  • STATIC_WI8_AI16 / STATIC_WI4_AI16: activations: num_bits 16, symmetric TRUE, granularity TENSORWISE, dtype INT; weights: num_bits 8 / 4, symmetric TRUE, granularity [CHANNELWISE, TENSORWISE], dtype INT; explicit_dequantize FALSE; compute_precision INTEGER.
  • STATIC_WI8_AI8 / STATIC_WI4_AI8: activations: num_bits 8, symmetric [TRUE, FALSE], granularity TENSORWISE, dtype INT; weights: num_bits 8 / 4, symmetric TRUE, granularity [CHANNELWISE, TENSORWISE], dtype INT; explicit_dequantize FALSE; compute_precision INTEGER.
  • WEIGHTONLY_WI8_AFP32 / WEIGHTONLY_WI4_AFP32: activations not quantized (num_bits, symmetric, granularity, dtype all None); weights: num_bits 8 / 4, symmetric [TRUE, FALSE], granularity [CHANNELWISE, TENSORWISE], dtype INT; explicit_dequantize TRUE; compute_precision FLOAT.

Operators Supporting Quantization

The following operators support quantization (coverage varies by recipe; see the recipe configurations above):

FULLY_CONNECTED, CONV_2D, BATCH_MATMUL, EMBEDDING_LOOKUP, DEPTHWISE_CONV_2D, AVERAGE_POOL_2D, RESHAPE, SOFTMAX, TANH, TRANSPOSE, GELU, ADD, CONV_2D_TRANSPOSE, SUB, MUL, MEAN, RSQRT, CONCATENATION, STRIDED_SLICE, SPLIT, LOGISTIC, SLICE, SELECT, SELECT_V2, SUM, PAD, PADV2, MIRROR_PAD, SQUARED_DIFFERENCE, MAX_POOL_2D, RESIZE_BILINEAR, RESIZE_NEAREST_NEIGHBOR, GATHER_ND, PACK, UNPACK, DIV, SQRT, GATHER, HARD_SWISH, MAXIMUM, REDUCE_MIN, EQUAL, NOT_EQUAL, SPACE_TO_DEPTH, RELU
