
Quantization Techniques

Project description

FMS Model Optimizer


Introduction

FMS Model Optimizer is a framework for developing reduced-precision neural network models. It supports quantization techniques such as quantization-aware training (QAT) and post-training quantization (PTQ), as well as several other optimization techniques, on popular deep learning workloads.

Highlights

  • Python API to enable model quantization: adding a few lines of code performs module-level and/or function-level operation replacement.
  • Robust: verified for INT 8-bit and 4-bit quantization on important vision, speech, NLP, object-detection, and LLM workloads.
  • Flexible: options to analyze the network using PyTorch Dynamo and to apply best practices during quantization, such as clip_val initialization, layer-level precision setting, and optimizer param group setting.
  • State-of-the-art INT and FP quantization techniques for weights and activations, such as SmoothQuant, SAWB+, and PACT+.
  • Supports key compute-intensive operations such as Conv2d, Linear, LSTM, MM, and BMM.
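The module-replacement idea behind the Python API can be illustrated with a small, self-contained sketch. This is not the FMS Model Optimizer API; the class and function names below (`Linear`, `QuantLinear`, `quantize_model`) are illustrative stand-ins showing how a framework can walk a model and swap full-precision layers for quantized drop-in replacements:

```python
# Conceptual sketch (not the fms-mo API): how a quantization framework can
# replace compute-intensive modules in place. All names are illustrative.

class Linear:
    """Stand-in for a full-precision layer (e.g. a torch.nn.Linear)."""
    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        # Toy elementwise "matmul" so the example stays dependency-free.
        return [xi * w for xi, w in zip(x, self.weight)]

class QuantLinear(Linear):
    """Quantized drop-in replacement: fake-quantizes weights to the INT8 range."""
    def __init__(self, fp_layer, n_bits=8):
        q_max = 2 ** (n_bits - 1) - 1                      # 127 for INT8
        scale = max(abs(w) for w in fp_layer.weight) / q_max
        # Symmetric round-to-nearest quantization, then dequantize.
        super().__init__([round(w / scale) * scale for w in fp_layer.weight])

def quantize_model(model, target_types=(Linear,)):
    """Walk the model and swap matching modules for quantized versions."""
    return {name: QuantLinear(m) if isinstance(m, target_types) else m
            for name, m in model.items()}

model = {"fc1": Linear([0.5, -1.0, 0.127]), "act": "relu"}
qmodel = quantize_model(model)
print(type(qmodel["fc1"]).__name__)  # QuantLinear
```

In the real framework this replacement happens on PyTorch modules, and graph analysis (e.g. via Dynamo) decides which operations qualify.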

Supported Models

| Model        | GPTQ               | FP8                | PTQ                | QAT                   |
|--------------|--------------------|--------------------|--------------------|-----------------------|
| Granite      | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| Llama        | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| Mixtral      | :white_check_mark: | :white_check_mark: | :white_check_mark: | :black_square_button: |
| BERT/RoBERTa | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark:    |

Note: Direct QAT on LLMs is not recommended.

Getting Started

Requirements

  1. 🐧 Linux system with Nvidia GPU (V100/A100/H100)
  2. Python 3.10 to Python 3.12
  3. CUDA >=12

Optional packages based on optimization functionality required:

  • GPTQ is a popular compression method for LLMs.
  • If you want to experiment with INT8 deployment in the QAT and PTQ examples:
    • Nvidia GPU with compute capability > 8.0 (A100 family or higher)
    • Option 1:
      • Ninja
      • Clone the CUTLASS repository
      • PyTorch 2.3.1 (newer versions cause issues with the custom CUDA kernel used in these examples)
    • Option 2:
      • Use the included Triton kernel. Note that this kernel is currently not faster than FP16.
  • FP8 is a reduced-precision format like INT8.
  • A compute-graph plotting function is available (mostly for troubleshooting purposes).

[!NOTE] PyTorch version should be < 2.4 if you would like to experiment with deployment using the external INT8 kernel.
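The version constraint in the note above can be checked programmatically. A minimal stdlib-only sketch, where the `2.4` threshold comes from the note and the helper names are illustrative (in practice you would pass `torch.__version__`):

```python
# Check whether an installed PyTorch version satisfies the "< 2.4" constraint
# required by the custom CUDA INT8 kernel (per the note above).

def version_tuple(version):
    """Parse a dotted version string like '2.3.1' into a comparable tuple."""
    parts = []
    for piece in version.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def int8_kernel_compatible(torch_version):
    """True if the external INT8 kernel is expected to work (torch < 2.4)."""
    return version_tuple(torch_version) < (2, 4)

print(int8_kernel_compatible("2.3.1"))  # True
print(int8_kernel_compatible("2.4.0"))  # False
```

For production code, `packaging.version.Version` handles pre-release and local version segments more robustly than this sketch.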

Installation

We recommend using a Python virtual environment with Python 3.10+. Here is how to set up a virtual environment using Python venv:

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate

[!TIP] If you use pyenv, Conda Miniforge or other such tools for Python version management, create the virtual environment with that tool instead of venv. Otherwise, you may have issues with installed packages not being found as they are linked to your Python version management tool and not venv.

There are two ways to install FMS Model Optimizer:

From Release

To install from release (PyPI package):

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate
pip install fms-model-optimizer

From Source

To install from source (GitHub repository):

python3 -m venv fms_mo_venv
source fms_mo_venv/bin/activate
git clone https://github.com/foundation-model-stack/fms-model-optimizer
cd fms-model-optimizer
pip install -e .

Optional Dependencies

The following optional dependencies are available:

  • fp8: llmcompressor package for fp8 quantization
  • gptq: GPTQModel package for W4A16 quantization
  • mx: microxcaling package for MX quantization
  • opt: Shortcut for fp8, gptq, and mx installs
  • aiu: ibm-fms package for AIU model deployment
  • torchvision: torch package for image recognition training and inference
  • triton: triton package for matrix multiplication kernels
  • examples: Dependencies needed for examples
  • visualize: Dependencies for visualizing models and performance data
  • test: Dependencies needed for unit testing
  • dev: Dependencies needed for development

To install optional dependencies, append a bracketed, comma-separated list of these names to the pip install commands above. The examples below install the fp8 (llmcompressor) and torchvision extras alongside FMS Model Optimizer, first from PyPI and then from source:

pip install "fms-model-optimizer[fp8,torchvision]"

pip install -e ".[fp8,torchvision]"

If you have already installed FMS Model Optimizer, then only the optional packages will be installed.

Try It Out!

To help you get up and running as quickly as possible with the FMS Model Optimizer framework, check out the following resources, which demonstrate how to use the framework with different quantization techniques:

  • Jupyter notebook tutorials (recommended starting point):
    • Quantization tutorial:
      • Visualizes a random Gaussian tensor at each step of the quantization process
      • Builds a quantizer and a quantized convolution module based on this process
  • Python script examples
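The core of the quantization walk-through can be approximated in a few lines of plain Python. This is a simplified symmetric INT8 scheme, not the tutorial's exact code: scale a random Gaussian "tensor" into the integer range, round, and map back.

```python
import random

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]  # random Gaussian "tensor"

# Symmetric INT8 quantization: map [-max|x|, max|x|] onto [-127, 127].
q_max = 127
scale = max(abs(v) for v in x) / q_max

q = [max(-q_max, min(q_max, round(v / scale))) for v in x]  # integer codes
dq = [qi * scale for qi in q]                               # dequantized values

# Round-to-nearest keeps every element within half a quantization step
# (scale / 2) of its original value.
max_err = max(abs(a - b) for a, b in zip(x, dq))
print(f"scale={scale:.4f}, max quantization error={max_err:.4f}")
```

The tutorial builds on the same idea, adding learned clip values (as in PACT+) rather than using the raw tensor maximum, and visualizing the distribution at each step.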

Docs

Dive into the design document to get a better understanding of the framework motivation and concepts.

Contributing

Check out our contributing guide to learn how to contribute.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fms_model_optimizer-0.5.0.tar.gz (5.1 MB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fms_model_optimizer-0.5.0-py3-none-any.whl (284.3 kB)

Uploaded Python 3

File details

Details for the file fms_model_optimizer-0.5.0.tar.gz.

File metadata

  • Download URL: fms_model_optimizer-0.5.0.tar.gz
  • Upload date:
  • Size: 5.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for fms_model_optimizer-0.5.0.tar.gz

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | e38e09a483f4f23fe7ee7173b539c5ecba70cd8a2019be461780545139159751 |
| MD5         | e2436753559b02afa0bded538849224f                                 |
| BLAKE2b-256 | 76259e3744e95bf15dd02b6e403a34db17ae53441c20608ff7b8e92779e12313 |

See more details on using hashes here.

Provenance

The following attestation bundles were made for fms_model_optimizer-0.5.0.tar.gz:

Publisher: pypi.yml on foundation-model-stack/fms-model-optimizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fms_model_optimizer-0.5.0-py3-none-any.whl.

File metadata

File hashes

Hashes for fms_model_optimizer-0.5.0-py3-none-any.whl

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 114ce5e67d649fcbc45832ca94ad558ac75f9d5119010a9f1e1606d72bde6a2a |
| MD5         | b063548292abc6f7860a0d940b0764df                                 |
| BLAKE2b-256 | e5e4135088a0ede812e3f409d0e2635452d240239f894aace0fcbda9b71e7b7c |

See more details on using hashes here.

Provenance

The following attestation bundles were made for fms_model_optimizer-0.5.0-py3-none-any.whl:

Publisher: pypi.yml on foundation-model-stack/fms-model-optimizer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
