
alma

A Python library for benchmarking PyTorch model speed for different conversion options 🚀


The motivation of alma is to make it easy to benchmark models across different conversion options, e.g. eager, tracing, scripting, torch.compile, torch.export, ONNX, TensorRT, etc. The library is designed to be simple to use, with benchmarking provided via a single API call, and to be easily extensible for adding new conversion options.

Beyond just benchmarking, alma is designed to be a one-stop-shop for all model conversion options, so that one can learn about the different conversion options, how to implement them, and how they affect model speed and performance.


Getting Started

Installation

alma is available as a Python package.

One can install the package from the Python Package Index (PyPI) by running:

pip install alma-torch

Alternatively, it can be installed from the root of this repository (same level as this README) by running:

pip install -e .

Docker

We recommend building the provided Dockerfile to ensure an easy installation of all of the system dependencies and the alma pip package.

  1. Build the Docker Image

    bash scripts/build_docker.sh
    
  2. Run the Docker Container
    Create and start a container named alma:

    bash scripts/run_docker.sh
    
  3. Access the Running Container
    Enter the container's shell:

    docker exec -it alma bash
    
  4. Mount Your Repository
    By default, the run_docker.sh script mounts your /home directory to /home inside the container.
    If your alma repository is in a different location, update the bind mount, for example:

    -v /Users/myuser/alma:/home/alma
    

Basic usage

The core API is `benchmark_model`, which benchmarks the speed of a model across different conversion options. Usage is as follows:

import torch

from alma import benchmark_model
from alma.benchmark import BenchmarkConfig
from alma.benchmark.log import display_all_results

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Load the model
model = ...

# Load the dataloader used in benchmarking
data_loader = ...

# Set the configuration (this can also be passed in as a dict)
config = BenchmarkConfig(
    n_samples=2048,
    batch_size=64,
    device=device,  # The device to run the model on
)

# Choose which conversions to benchmark
conversions = ["EAGER", "EXPORT+EAGER"]

# Benchmark the model
results = benchmark_model(model, config, conversions, data_loader=data_loader)

# Print all results
display_all_results(results)

The results will look something like this, with exact numbers depending on one's model, dataloader, and hardware:

EAGER results:
Device: cuda
Total elapsed time: 0.0211 seconds
Total inference time (model only): 0.0073 seconds
Total samples: 2048 - Batch size: 64
Throughput: 282395.70 samples/second


EXPORT+EAGER results:
Device: cuda
Total elapsed time: 0.0209 seconds
Total inference time (model only): 0.0067 seconds
Total samples: 2048 - Batch size: 64
Throughput: 305974.83 samples/second

Examples:

For extensive examples on how to use alma, as well as simple examples on how to train a model and quantize it, see the MNIST example directory. It contains code examples for all of alma's features.

For a short working example on a simple Linear+ReLU model, see the linear example. We also have a Jupyter notebook here.

Advanced Features and Design Decisions

alma is designed to be simple to use, with a single API call to benchmark a model across different conversion options. Below are some features and design decisions we have made, all of which are configurable by the user. For examples of how to use these features, see the MNIST example.

Implicitly initialise a data loader inside of `benchmark_model`
Rather than initializing and passing in a data loader as in the example above, one can instead pass in a `data` tensor (with no batch dimension); `benchmark_model` will automatically create a data loader that produces random tensors of the same shape as the input tensor, with the batch size controlled via the `config` dictionary. This is convenient if one does not want to create a data loader.

See here for details.
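For example, a minimal sketch based on the description above (the `data` keyword argument and the tensor shape are illustrative):

import torch

from alma import benchmark_model
from alma.benchmark import BenchmarkConfig

model = ...  # your model

config = BenchmarkConfig(n_samples=2048, batch_size=64, device=torch.device("cpu"))

# Pass a single sample tensor (no batch dimension) instead of a data loader;
# benchmark_model will generate random tensors of the same shape internally,
# batched according to config.batch_size.
results = benchmark_model(model, config, ["EAGER"], data=torch.randn(3, 28, 28))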

Pre-defined argparser for easy control and experimentation
We provide an argparser that allows one to easily select conversion methods by numerical index or string name. It also allows one to set the batch size, number of samples, and device easily, as well as other commonly used parameters like model weights path.

See here for details.
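A rough sketch of how this can look in a benchmarking script (the import path, the helper name `parse_benchmark_args`, and its return values are assumptions; see the MNIST example for the actual interface):

import torch

# Hypothetical import path and signature -- check the MNIST example for the real one.
from alma.arguments.benchmark_args import parse_benchmark_args
from alma import benchmark_model
from alma.benchmark import BenchmarkConfig

args, device = parse_benchmark_args()  # e.g. --conversions "0,2" or "EAGER,ONNX_CPU"

model = ...
data_loader = ...

config = BenchmarkConfig(
    n_samples=args.n_samples,
    batch_size=args.batch_size,
    device=device,
)
results = benchmark_model(model, config, args.conversions, data_loader=data_loader)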

Graceful or fast failure
By default, `alma` will fail fast if any conversion method fails. This is because we want to know if a conversion method fails, so that we can fix it. However, if one wants to continue benchmarking other options even if a conversion method fails, one can set `fail_on_error` to False in the config dictionary. `alma` will then fail gracefully for that method. One can then access the associated error messages and full tracebacks for the failed methods from the returned object.

See here for details.
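For example (continuing from the basic usage example above; the exact shape of the returned error information is an assumption, so see the linked details for the real structure):

config = BenchmarkConfig(
    n_samples=2048,
    batch_size=64,
    device=device,
    fail_on_error=False,  # fail gracefully per conversion method
)

results = benchmark_model(model, config, ["EAGER", "COMPILE_TENSORRT"], data_loader=data_loader)

# Hypothetical inspection of a failed method -- the field names are illustrative.
for name, result in results.items():
    if result.get("status") == "error":
        print(f"{name} failed: {result.get('error')}")
        print(result.get("traceback"))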

Isolated environments for each conversion method via multi-processing
By default, `alma` will run each conversion method in a separate process (one at a time), so that one can benchmark each conversion method in isolation. This ensures that each conversion method is benchmarked in a fair and isolated environment, and is relevant because some of the methods (e.g. optimum quanto) can affect the global torch state and break other methods (e.g. by overwriting tensor defaults in the C++ backend).

To disable multiprocessing, set `multiprocessing` to False in the config dictionary.

See here for details and discussion.
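For example (continuing from the basic usage example above):

config = BenchmarkConfig(
    n_samples=2048,
    batch_size=64,
    device=device,
    multiprocessing=False,  # benchmark all methods in the current process
)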

Using a dict for the config
We give the option for users to pass in a dictionary for the config, rather than a BenchmarkConfig object.

See here for details.
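For example, the config from the basic usage example as a plain dict:

config = {
    "n_samples": 2048,
    "batch_size": 64,
    "device": device,
}
results = benchmark_model(model, config, ["EAGER"], data_loader=data_loader)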

Device fallbacks
Certain conversion options only work on certain hardware, so we provide the option to gracefully fall back to the required device. This can be controlled via the config, and is discussed more here.
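As a rough sketch (the `allow_device_override` field name is an assumption; see the linked discussion for the actual option):

config = BenchmarkConfig(
    n_samples=2048,
    batch_size=64,
    device=torch.device("cuda"),
    allow_device_override=True,  # hypothetical flag: fall back to a supported device if needed
)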

Logging, debugging, and CI integration
Many of the conversion methods have verbose internal logging. We have opted to mostly silence those logs. However, if one wants access to them, one should use the `setup_logging` function and set the logging level to `DEBUG`.

See here for details.
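For example (the import path and argument form are assumptions; only the `setup_logging` name and the `DEBUG` level come from the description above):

# Hypothetical import path -- see the linked details for the real one.
from alma.utils.setup_logging import setup_logging

setup_logging(log_level="DEBUG")  # surface the conversion methods' internal logs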

Conversion Options

Naming conventions

The naming convention for conversion options is to use short but descriptive names, e.g. EAGER, EXPORT+EAGER, EXPORT+TENSORRT, etc. If multiple "techniques" are used in a single conversion option, the names are separated by a + sign in chronological order of operation. Underscores _ are used within each technique name to separate words for readability, e.g. EXPORT+AOT_INDUCTOR, where EXPORT and AOT_INDUCTOR are considered separate steps.

Code

All conversion options are located in the src/alma/conversions/ directory. In this directory:

  • The options/ subdirectory contains one Python file per conversion option (or a closely related family of options, e.g. torch.compile backends).
  • The main selection logic for these options is found in select.py. This is just a glorified match-case statement that returns the forward call of each converted model to the benchmarking loop (see the sketch below). It is that simple!

At the risk of some code duplication, we have chosen to keep the conversion options separate, so that one can easily add new conversion options without having to modify the existing ones. It also makes it easier for the user to see what conversion options are available, and to understand what each conversion option does.
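To illustrate the dispatch pattern mentioned above, here is a simplified, hypothetical sketch of a select.py-style selector (not the actual file contents):

import torch

def select_forward_call(option: str, model: torch.nn.Module, example_input: torch.Tensor):
    # Convert the model once per option, then hand its forward call
    # back to the benchmarking loop.
    match option:
        case "EAGER":
            return model.forward
        case "COMPILE_INDUCTOR_DEFAULT":
            return torch.compile(model, backend="inductor")
        case "JIT_TRACE":
            return torch.jit.trace(model, example_input)
        case _:
            raise ValueError(f"Unknown conversion option: {option}")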

Options Summary

Below is a table summarizing the currently supported conversion options and their identifiers:

| ID | Conversion Option |
|----|-------------------|
| 0  | EAGER |
| 1  | EXPORT+EAGER |
| 2  | ONNX_CPU |
| 3  | ONNX_GPU |
| 4  | ONNX+DYNAMO_EXPORT |
| 5  | COMPILE_CUDAGRAPHS |
| 6  | COMPILE_INDUCTOR_DEFAULT |
| 7  | COMPILE_INDUCTOR_REDUCE_OVERHEAD |
| 8  | COMPILE_INDUCTOR_MAX_AUTOTUNE |
| 9  | COMPILE_INDUCTOR_EAGER_FALLBACK |
| 10 | COMPILE_ONNXRT |
| 11 | COMPILE_OPENXLA |
| 12 | COMPILE_TVM |
| 13 | EXPORT+AI8WI8_FLOAT_QUANTIZED |
| 14 | EXPORT+AI8WI8_FLOAT_QUANTIZED+RUN_DECOMPOSITION |
| 15 | EXPORT+AI8WI8_STATIC_QUANTIZED |
| 16 | EXPORT+AI8WI8_STATIC_QUANTIZED+RUN_DECOMPOSITION |
| 17 | EXPORT+AOT_INDUCTOR |
| 18 | EXPORT+COMPILE_CUDAGRAPHS |
| 19 | EXPORT+COMPILE_INDUCTOR_DEFAULT |
| 20 | EXPORT+COMPILE_INDUCTOR_REDUCE_OVERHEAD |
| 21 | EXPORT+COMPILE_INDUCTOR_MAX_AUTOTUNE |
| 22 | EXPORT+COMPILE_INDUCTOR_DEFAULT_EAGER_FALLBACK |
| 23 | EXPORT+COMPILE_ONNXRT |
| 24 | EXPORT+COMPILE_OPENXLA |
| 25 | EXPORT+COMPILE_TVM |
| 26 | NATIVE_CONVERT_AI8WI8_STATIC_QUANTIZED |
| 27 | NATIVE_FAKE_QUANTIZED_AI8WI8_STATIC |
| 28 | COMPILE_TENSORRT |
| 29 | EXPORT+COMPILE_TENSORRT |
| 30 | JIT_TRACE |
| 31 | TORCH_SCRIPT |
| 32 | OPTIMUM_QUANTO_AI8WI8 |
| 33 | OPTIMUM_QUANTO_AI8WI4 |
| 34 | OPTIMUM_QUANTO_AI8WI2 |
| 35 | OPTIMUM_QUANTO_WI8 |
| 36 | OPTIMUM_QUANTO_WI4 |
| 37 | OPTIMUM_QUANTO_WI2 |
| 38 | OPTIMUM_QUANTO_Wf8E4M3N |
| 39 | OPTIMUM_QUANTO_Wf8E4M3NUZ |
| 40 | OPTIMUM_QUANTO_Wf8E5M2 |
| 41 | OPTIMUM_QUANTO_Wf8E5M2+COMPILE_CUDAGRAPHS |

These conversion options are all hard-coded in the src/alma/conversions/select.py file, which is the source of truth.

Testing

We use pytest for testing. Simply run:

pytest

We currently don't have extensive tests, but we are working on adding more to ensure that the conversion options work as expected in known environments (e.g. the Docker container).

Future work:

  • Add more conversion options. This is a work in progress, and we are always looking for new ones.
  • Multi-device benchmarking. Currently alma only supports single-device benchmarking, but ideally a model could be split across multiple devices.
  • Integrating conversion options beyond PyTorch, e.g. HuggingFace, JAX, llama.cpp, etc.

How to contribute:

Contributions are welcome! If you have a new conversion option, feature, or other improvement you would like to add, so that the whole community can benefit, please open a pull request! We are always looking for new conversion options, and we are happy to help you get started!

See the CONTRIBUTING.md file for more detailed information on how to contribute.

Citation

@Misc{alma,
  title =        {Alma: PyTorch model speed benchmarking across all conversion types},
  author =       {Oscar Savolainen and Saif Haq},
  howpublished = {\url{https://github.com/saifhaq/alma}},
  year =         {2024}
}
