
FMS Model Optimizer


Introduction

FMS Model Optimizer is a framework for developing reduced-precision neural network models. It supports quantization techniques such as quantization-aware training (QAT) and post-training quantization (PTQ), along with several other optimization techniques, for popular deep learning workloads.

Highlights

  • Python API to enable model quantization: adding a few lines of code triggers module-level and/or function-level operation replacement.
  • Robust: verified for INT8 and INT4 quantization on important vision, speech, NLP, object-detection, and LLM workloads.
  • Flexible: options to analyze the network using PyTorch Dynamo and to apply best practices during quantization, such as clip_val initialization, layer-level precision settings, and optimizer parameter-group settings.
  • State-of-the-art INT and FP quantization techniques for weights and activations, such as SmoothQuant, SAWB+, and PACT+.
  • Supports key compute-intensive operations such as Conv2d, Linear, LSTM, MM, and BMM.
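To illustrate the module-level replacement idea, here is a hand-rolled sketch in plain PyTorch (not fms-mo's actual API or quantizer implementation; `QuantLinear` and `swap_linear_modules` are hypothetical names for this example) that swaps every `nn.Linear` for a fake-quantized version:

```python
import torch
import torch.nn as nn


class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized to INT8 on the fly.

    Illustrative stand-in only; fms-mo performs this kind of replacement
    automatically through its Python API.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.weight.abs().max() / 127.0           # symmetric per-tensor scale
        w_int8 = torch.clamp((self.weight / scale).round(), -127, 127)
        w_dq = w_int8 * scale                             # dequantize for FP compute
        return nn.functional.linear(x, w_dq, self.bias)


def swap_linear_modules(model: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with a QuantLinear copy."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            q = QuantLinear(child.in_features, child.out_features,
                            bias=child.bias is not None)
            q.load_state_dict(child.state_dict())  # keep the trained weights
            setattr(model, name, q)
        else:
            swap_linear_modules(child)
    return model


model = swap_linear_modules(nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)))
```

This is the essence of "a few lines of code": the caller keeps the original model definition, and a traversal substitutes quantized counterparts for the compute-intensive modules.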

Supported Models

| Model | GPTQ | FP8 | PTQ | QAT |
| --- | --- | --- | --- | --- |
| Granite | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| Llama | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| Mixtral | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| BERT/Roberta | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |

Note: Direct QAT on LLMs is not recommended

Getting Started

Requirements

  1. 🐧 Linux system with Nvidia GPU (V100/A100/H100)
  2. Python 3.10 to Python 3.12
  3. CUDA >=12

Optional packages, depending on the optimization functionality required:

  • GPTQ, a popular weight-compression method for LLMs: install the gptq optional dependency (see Optional Dependencies below)
  • To experiment with INT8 deployment in the QAT and PTQ examples:
    • Nvidia GPU with compute capability >= 8.0 (A100 family or higher)
    • Option 1 (custom CUDA kernel):
      • Ninja
      • Clone the CUTLASS repository
      • PyTorch 2.3.1 (newer versions break the custom CUDA kernel used in these examples)
    • Option 2 (bundled Triton kernel): no extra setup, but this kernel is currently not faster than FP16
  • FP8, a reduced-precision floating-point format analogous to INT8: install the fp8 optional dependency
  • To enable the compute-graph plotting function (mostly for troubleshooting): install the visualize optional dependency

[!NOTE] PyTorch version should be < 2.4 if you would like to experiment with deployment using the external INT8 kernel.
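To check whether a GPU meets the compute-capability requirement for the INT8 deployment examples, a quick check with PyTorch (assuming torch is already installed; the helper name is ours, not part of fms-mo):

```python
import torch


def supports_int8_examples() -> bool:
    """Return True if the current GPU has compute capability >= 8.0
    (A100 family or newer), as required by the INT8 deployment examples."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 0)


print(supports_int8_examples())
```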

Installation

We recommend using a Python virtual environment with Python 3.10 or newer (matching the requirements above). Here is how to set up a virtual environment using Python venv:

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate

[!TIP] If you use pyenv, Conda Miniforge or other such tools for Python version management, create the virtual environment with that tool instead of venv. Otherwise, you may have issues with installed packages not being found as they are linked to your Python version management tool and not venv.

There are two ways to install FMS Model Optimizer:

From Release

To install from release (PyPi package):

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate
pip install fms-model-optimizer

From Source

To install from source (GitHub repository):

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate
git clone https://github.com/foundation-model-stack/fms-model-optimizer
cd fms-model-optimizer
pip install -e .

Optional Dependencies

The following optional dependencies are available:

  • fp8: llmcompressor package for fp8 quantization
  • gptq: GPTQModel package for W4A16 quantization
  • mx: microxcaling package for MX quantization
  • opt: Shortcut for fp8, gptq, and mx installs
  • aiu: ibm-fms package for AIU model deployment
  • torchvision: torch package for image recognition training and inference
  • triton: triton package for matrix multiplication kernels
  • examples: Dependencies needed for examples
  • visualize: Dependencies for visualizing models and performance data
  • test: Dependencies needed for unit testing
  • dev: Dependencies needed for development

To install an optional dependency, append a bracketed, comma-separated list of these names to the pip install commands above. For example, to install llm-compressor and torchvision alongside FMS Model Optimizer from release:

pip install fms-model-optimizer[fp8,torchvision]

or from source:

pip install -e .[fp8,torchvision]

If you have already installed FMS Model Optimizer, only the optional packages will be installed.

Try It Out!

To help you get up and running as quickly as possible with the FMS Model Optimizer framework, check out the following resources, which demonstrate how to use the framework with different quantization techniques:

  • Jupyter notebook tutorials (recommended starting point):
    • Quantization tutorial:
      • Visualizes a random Gaussian tensor at each step of the quantization process
      • Builds a quantizer and a quantized convolution module based on this process
  • Python script examples
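The steps the quantization tutorial visualizes can be sketched in plain PyTorch (an illustrative sketch, not the tutorial's code; `fake_quantize` and the clip value are our own choices): clip the tensor to a range, scale onto the integer grid, round, and dequantize.

```python
import torch


def fake_quantize(x: torch.Tensor, num_bits: int = 8, clip_val: float = 3.0):
    """Symmetric fake quantization: clip -> scale -> round -> dequantize.

    Returns the dequantized tensor and the per-element quantization error.
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for INT8
    scale = clip_val / qmax                   # step size of the integer grid
    x_clipped = x.clamp(-clip_val, clip_val)  # clip outliers to [-clip_val, clip_val]
    x_int = (x_clipped / scale).round()       # map onto the integer grid
    x_dq = x_int * scale                      # back to floating point
    return x_dq, (x - x_dq).abs()


# A random Gaussian tensor, as in the tutorial
x = torch.randn(10_000)
x_dq, err = fake_quantize(x, num_bits=8, clip_val=3.0)
```

Values inside the clip range incur at most half a quantization step of error, while clipped outliers can lose more; this trade-off is why clip_val initialization (mentioned under Highlights) matters.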

Docs

Dive into the design document to get a better understanding of the framework's motivation and concepts.

Contributing

Check out our contributing guide to learn how to contribute.

Download files

Download the file for your platform.

Source Distribution

fms_model_optimizer-0.8.1.tar.gz (5.2 MB)

Built Distribution

fms_model_optimizer-0.8.1-py3-none-any.whl (361.6 kB)

File details

Details for the file fms_model_optimizer-0.8.1.tar.gz.

File metadata

  • Download URL: fms_model_optimizer-0.8.1.tar.gz
  • Upload date:
  • Size: 5.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | e5a79d34fd537d1f672f5f8dcb565cc381f644f6e2393078bfa37df0f908b6e9 |
| MD5 | 4166959327cb4c5fdd13b2ac91c4abf3 |
| BLAKE2b-256 | b8a48cb6e7f6856d6d86f7db4a6b3e238587c886372cb61cc39cc614c28dc9b4 |

Provenance

The following attestation bundles were made for fms_model_optimizer-0.8.1.tar.gz:

Publisher: pypi.yml on foundation-model-stack/fms-model-optimizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fms_model_optimizer-0.8.1-py3-none-any.whl.

File metadata

File hashes

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 0cef467a823836ed5e1dfc4500924d6e9983750a78b97232f49de350014f27cd |
| MD5 | 396ebf0e64c7ef62e1b9618cb2ca7977 |
| BLAKE2b-256 | 87f549fe496a4e530d13e7a0cf24c4436c714c67dff3f8a0b628a22757d17580 |

Provenance

The following attestation bundles were made for fms_model_optimizer-0.8.1-py3-none-any.whl:

Publisher: pypi.yml on foundation-model-stack/fms-model-optimizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
