
Quantization Techniques


FMS Model Optimizer


Introduction

FMS Model Optimizer is a framework for developing reduced-precision neural network models. It supports quantization techniques such as quantization-aware training (QAT) and post-training quantization (PTQ), along with several other optimization techniques, for popular deep learning workloads.

Highlights

  • Python API to enable model quantization: adding a few lines of code performs module-level and/or function-level operation replacement (a conceptual sketch follows this list).
  • Robust: verified for INT 8-bit/4-bit quantization on important vision, speech, NLP, object detection, and LLM workloads.
  • Flexible: options to analyze the network using PyTorch Dynamo and to apply best practices during quantization, such as clip_val initialization, layer-level precision settings, and optimizer parameter group settings.
  • State-of-the-art INT and FP quantization techniques for weights and activations, such as SmoothQuant, SAWB+, and PACT+.
  • Supports key compute-intensive operations like Conv2d, Linear, LSTM, MM, and BMM.
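
To make "module-level operation replacement" concrete, here is a minimal, framework-agnostic sketch in plain PyTorch (not the FMS Model Optimizer API; FakeQuantLinear and replace_linear are hypothetical names) that swaps every nn.Linear in a model for a fake-quantized variant:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinear(nn.Linear):
    # Simulates symmetric INT8 weight quantization in the forward pass (illustration only).
    def forward(self, x):
        scale = self.weight.abs().max() / 127.0
        w_q = torch.clamp(torch.round(self.weight / scale), -127, 127) * scale
        return F.linear(x, w_q, self.bias)

def replace_linear(module):
    # Module-level replacement: recursively swap nn.Linear children for the quantized version.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            q = FakeQuantLinear(child.in_features, child.out_features, bias=child.bias is not None)
            q.load_state_dict(child.state_dict())
            setattr(module, name, q)
        else:
            replace_linear(child)
    return module

model = replace_linear(nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4)))
print(model)

FMS Model Optimizer automates this kind of rewrite (and the corresponding function-level replacements) behind its Python API.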

Supported Models

| Model        | GPTQ | FP8 | PTQ | QAT |
|--------------|------|-----|-----|-----|
| Granite      | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| Llama        | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| Mixtral      | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| BERT/RoBERTa | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |

Note: Direct QAT on LLMs is not recommended.

Getting Started

Requirements

  1. 🐧 Linux system with Nvidia GPU (V100/A100/H100)
  2. Python 3.10 to Python 3.12
  3. CUDA >=12

Optional packages based on optimization functionality required:

  • GPTQ is a popular compression method for LLMs: install the gptq optional dependency (see Optional Dependencies below).
  • If you want to experiment with INT8 deployment in the QAT and PTQ examples:
    • Nvidia GPU with compute capability >= 8.0 (A100 family or higher); a quick environment check follows this list.
    • Option 1:
      • Ninja
      • Clone the CUTLASS repository
      • PyTorch 2.3.1 (newer versions will cause issues for the custom CUDA kernel used in these examples)
    • Option 2:
      • Use the included Triton kernel. Note that this kernel is currently not faster than FP16.
  • FP8 is a reduced-precision format like INT8: install the fp8 optional dependency.
  • To enable the compute graph plotting function (mostly for troubleshooting purposes): see the visualize optional dependency below.
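
If you are unsure whether your environment meets the GPU and PyTorch constraints above, a quick check with PyTorch (assuming it is already installed) looks like this:

import torch

print("PyTorch:", torch.__version__)      # the external INT8 kernel examples expect < 2.4
print("CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability()
    print("Compute capability:", cap, "(need >= (8, 0) for the INT8 deployment examples)")
else:
    print("No CUDA device visible")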

[!NOTE] PyTorch version should be < 2.4 if you would like to experiment with deployment using the external INT8 kernel.

Installation

We recommend using a Python virtual environment with Python 3.10 or newer. Here is how to set up a virtual environment using Python venv:

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate

[!TIP] If you use pyenv, Conda Miniforge or other such tools for Python version management, create the virtual environment with that tool instead of venv. Otherwise, you may have issues with installed packages not being found as they are linked to your Python version management tool and not venv.

There are two ways to install FMS Model Optimizer:

From Release

To install from release (PyPI package):

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate
pip install fms-model-optimizer

From Source

To install from source (GitHub repository):

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate
git clone https://github.com/foundation-model-stack/fms-model-optimizer
cd fms-model-optimizer
pip install -e .

Optional Dependencies

The following optional dependencies are available:

  • fp8: llmcompressor package for fp8 quantization
  • gptq: GPTQModel package for W4A16 quantization
  • mx: microxcaling package for MX quantization
  • opt: Shortcut for fp8, gptq, and mx installs
  • aiu: ibm-fms package for AIU model deployment
  • torchvision: torchvision package for image recognition training and inference
  • triton: triton package for matrix multiplication kernels
  • examples: Dependencies needed for examples
  • visualize: Dependencies for visualizing models and performance data
  • test: Dependencies needed for unit testing
  • dev: Dependencies needed for development

To install an optional dependency, append a bracketed list of these names to the pip install commands above. The example below installs llmcompressor and torchvision alongside FMS Model Optimizer:

From release:

pip install fms-model-optimizer[fp8,torchvision]

From source:

pip install -e .[fp8,torchvision]

If you have already installed FMS Model Optimizer, then only the optional packages will be installed.
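
To confirm what actually landed in your environment, you can query the installed distributions (here llmcompressor and torchvision correspond to the fp8 and torchvision extras used in the example above):

from importlib.metadata import version, PackageNotFoundError

for pkg in ("fms-model-optimizer", "llmcompressor", "torchvision"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")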

Try It Out!

To help you get up and running quickly, check out the following resources, which demonstrate how to use FMS Model Optimizer with different quantization techniques:

  • Jupyter notebook tutorials (recommended starting point):
    • Quantization tutorial:
      • Visualizes a random Gaussian tensor step by step through the quantization process
      • Builds a quantizer and a quantized convolution module based on this process (a minimal standalone sketch follows this list)
  • Python script examples
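
As a taste of what the quantization tutorial walks through, here is a minimal, self-contained sketch (not the tutorial's code) that quantizes a random Gaussian tensor to INT8 and measures the round-trip error:

import torch

torch.manual_seed(0)
x = torch.randn(1024)                                   # random Gaussian tensor

# Symmetric per-tensor quantization to signed 8-bit integers
scale = x.abs().max() / 127.0
x_int8 = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
x_dequant = x_int8.float() * scale                      # dequantize back to float

print("max abs round-trip error:", (x - x_dequant).abs().max().item())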

Docs

Dive into the design document to get a better understanding of the framework's motivation and concepts.

Contributing

Check out our contributing guide to learn how to contribute.
