
Package for applying ao techniques to GPU models

Project description

torchao: PyTorch Architecture Optimization

Note: This repository is currently under heavy development. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue.

Introduction

torchao is a PyTorch-native library for optimizing your models using lower-precision dtypes, techniques such as quantization and sparsity, and performant kernels.

The library provides:

  1. Support for lower-precision dtypes such as nf4 and uint4 that are torch.compile friendly
  2. Quantization algorithms such as dynamic quantization, SmoothQuant, and GPTQ that run on CPU, GPU, and mobile
  3. Sparsity algorithms such as Wanda that help improve the accuracy of sparse networks
  4. Integration with other PyTorch native libraries like torchtune and ExecuTorch

Key Features

  • Native PyTorch techniques, composable with torch.compile
  • High-level autoquant API and kernel auto-tuner targeting SOTA performance across varying model shapes on consumer and enterprise GPUs.
  • Quantization techniques and kernels that work with both eager and torch.compile
    • Int8 dynamic activation quantization
    • Int8 and int4 weight-only quantization
    • Int8 dynamic activation quantization with int4 weight quantization
    • GPTQ and SmoothQuant

Interoperability with PyTorch Libraries

torchao has been integrated with other repositories to ease usage

  • torchtune is integrated with our 8-bit and 4-bit weight-only quantization techniques, with and without GPTQ.
  • ExecuTorch is integrated with GPTQ for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization.

Success stories

Our kernels have been used to achieve SOTA inference performance on:

  1. Image segmentation models with sam-fast
  2. Language models with gpt-fast
  3. Diffusion models with sd-fast

Installation

Note: this library makes liberal use of several new features in PyTorch; it's recommended to use it with the current PyTorch nightly if you want full feature coverage. Otherwise, the subclass APIs may not work, though the module-swap APIs will still work.

  1. From PyPI:
pip install torchao
  2. From Source:
git clone https://github.com/pytorch-labs/ao
cd ao
pip install -e .

Our Goals

torchao embodies PyTorch’s design philosophy, especially "usability over everything else". Our vision for this repository is the following:

  • Composability: Native solutions for optimization techniques that compose with both torch.compile and FSDP
    • For example, supporting new dtypes for QLoRA
  • Interoperability: Work with the rest of the PyTorch ecosystem such as torchtune, gpt-fast and ExecuTorch
  • Transparent Benchmarks: Regularly run performance benchmarking of our APIs across a suite of Torchbench models and across hardware backends
  • Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based server (w/ torch.compile) and mobile backends (w/ ExecuTorch).
  • Infrastructure Support: Release packaging solution for kernels and a CI/CD setup that runs these kernels on different backends.

Examples

Quantization algorithms typically differ in how the activations and weights are quantized; A16W8, for instance, means the activations are kept at 16 bits whereas the weights are quantized to 8 bits. Trying out different quantization schemes in torchao is generally a one-line change.

Autoquantization

The autoquant API can be used to quickly and accurately quantize your model. When used as in the example below, the API first identifies the shapes of the activations that the different linear layers see. It then benchmarks these shapes across different types of quantized and non-quantized layers in order to pick the fastest one, attempting to take fusions into account where possible. Finally, once the best class is found for each layer, it swaps the linear layers. Currently this API chooses between no quantization, int8 dynamic quantization, and int8 weight-only quantization for each layer.

import torch
import torchao

# inductor settings which improve torch.compile performance for quantized modules
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

# Plug in your model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# perform autoquantization
torchao.autoquant(model, (input))

# compile the model to improve performance
model = torch.compile(model, mode='max-autotune')
model(input)

A8W8 Dynamic Quantization

# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
torch._inductor.config.force_fuse_int_mm_with_mul = True
from torchao.quantization import quant_api
# convert linear modules to quantized tensor subclasses
quant_api.change_linear_weights_to_int8_dqtensors(model)
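
For context, a minimal end-to-end sketch might look like the following; the toy model and example input mirror the autoquant example above and are stand-ins for your own:

import torch
from torchao.quantization import quant_api

# Fuse the int8*int8 -> int32 matmul with the subsequent mul
torch._inductor.config.force_fuse_int_mm_with_mul = True

# Stand-in model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# Swap linear weights for int8 dynamically quantized tensor subclasses
quant_api.change_linear_weights_to_int8_dqtensors(model)

# Compile and run
model = torch.compile(model, mode='max-autotune')
model(input)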

A16W8 WeightOnly Quantization

from torchao.quantization import quant_api
quant_api.change_linear_weights_to_int8_woqtensors(model)

This technique works best when the torch._inductor.config.use_mixed_mm option is enabled. Rather than dequantizing the weight tensor before the matmul, the dequantization is fused into the matmul itself, avoiding materialization of a large floating-point weight tensor.
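
For example, the flag just needs to be set before the model is quantized and compiled. A minimal sketch, reusing the same toy-model pattern as above:

import torch
from torchao.quantization import quant_api

# Fuse the weight dequantization into the matmul instead of materializing a bf16 weight
torch._inductor.config.use_mixed_mm = True

# Stand-in model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# Int8 weight-only quantization
quant_api.change_linear_weights_to_int8_woqtensors(model)

model = torch.compile(model, mode='max-autotune')
model(input)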

A16W4 WeightOnly Quantization

from torchao.quantization import quant_api
quant_api.change_linear_weights_to_int4_woqtensors(model)

Note: The quantization error incurred by applying int4 quantization to your model can be fairly significant, so using external techniques like GPTQ may be necessary to obtain a usable model.

A8W8 Dynamic Quantization with Smoothquant

We've also implemented a version of SmoothQuant with the same GEMM format as above. Because it requires calibration, the API is more involved.

Example

import torch
from torchao.quantization.smoothquant import swap_linear_with_smooth_fq_linear, smooth_fq_linear_to_inference
from torch.utils.data import DataLoader

# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
torch._inductor.config.force_fuse_int_mm_with_mul = True

# plug in your model
model = get_model()

# convert linear modules to smoothquant
# linear module in calibration mode
swap_linear_with_smooth_fq_linear(model)

# Create a data loader for calibration (get_calibration_data and MyDataset are stand-ins for your own data pipeline)
calibration_data = get_calibration_data()
calibration_dataset = MyDataset(calibration_data)
calibration_loader = DataLoader(calibration_dataset, batch_size=32, shuffle=True)

# Calibrate the model
model.train()
for batch in calibration_loader:
    inputs = batch
    model(inputs)

# set it to inference mode
smooth_fq_linear_to_inference(model)

# compile the model to improve performance
model = torch.compile(model, mode='max-autotune')
model(input)  # run on an example input from your workload

Notes

  1. APIs have been hardware-tested on A100 and T4 (Colab) GPUs.
  2. While these techniques are designed to improve model performance, in some cases the opposite can occur. This is because quantization adds overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization) or faster weight loading (weight-only quantization). If your matmuls are small enough, or your non-quantized performance isn't bottlenecked by weight load time, these techniques may reduce performance; see the benchmarking sketch after this list.
  3. Use the PyTorch nightlies so you can leverage tensor subclasses, which are preferred over the older module-swap-based methods because they don't modify the graph and are generally more composable and flexible.
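
A simple way to check whether a given technique helps on your shapes is to time the compiled model before and after quantization. Below is a minimal sketch using torch.utils.benchmark; the model, input, and shapes are stand-ins for your own workload:

import copy
import torch
import torch.utils.benchmark as benchmark
from torchao.quantization import quant_api

def bench_ms(model, input):
    # Median latency of a forward pass, in milliseconds
    timer = benchmark.Timer(stmt="model(input)", globals={"model": model, "input": input})
    return timer.blocked_autorange().median * 1e3

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
input = torch.randn(64, 1024, dtype=torch.bfloat16, device='cuda')

# Baseline: compiled bf16 model
baseline = torch.compile(copy.deepcopy(model), mode='max-autotune')
baseline(input)  # warm up / trigger compilation
print("bf16:", bench_ms(baseline, input), "ms")

# Int8 weight-only quantization applied to a separate copy of the model
quantized = copy.deepcopy(model)
quant_api.change_linear_weights_to_int8_woqtensors(quantized)
quantized = torch.compile(quantized, mode='max-autotune')
quantized(input)
print("int8 weight-only:", bench_ms(quantized, input), "ms")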

License

torchao is released under the BSD 3-Clause license.

