Trainy: An observability tool for profiling PyTorch training on demand

Trainy on-demand profiler

This is the trainy CLI and daemon for setting up on-demand tracing of PyTorch programs in pure Python. It lets you extract traces in the middle of training.

Installation

You can install either from PyPI or from source:

# install from pypi
pip install trainy

# install from source
git clone https://github.com/Trainy-ai/trainy
pip install -e trainy

Quickstart

If you haven't already, set up Ray head and worker nodes. This can be configured to happen automatically using [SkyPilot](https://skypilot.readthedocs.io/en/latest/index.html) or Kubernetes.

# on the head node 
$ ray start --head --port 6380

# on the worker nodes
$ ray start --address ${HEAD_IP}:6380

In your train code, initialize the trainy daemon before running your train loop.

import trainy

# Start the trainy daemon before entering the training loop
trainy.init()
Trainer.train()  # your existing training entry point

While your model is training, capture traces on all the nodes by running:

$ trainy trace --logdir ~/my-traces

This saves the traces for each process locally into ~/my-traces. We recommend using a shared file system such as NFS or an S3-backed store so that all of your traces end up in the same place. An example of how to do this and scale it up on AWS is under examples/resnet_mnist.

How It Works

Trainy registers a hook on whatever PyTorch optimizer is present in your code to count optimizer iterations, and registers the program with the Ray head node. A separate HTTP server runs concurrently in a daemon thread and waits for a trigger POST request to start profiling.
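
The pattern described above can be illustrated with a small standard-library sketch. This is not trainy's actual implementation: `ProfilerState`, `on_optimizer_step`, `start_trigger_server`, and the `/trace` path are hypothetical names chosen for illustration, and the "profiling" here only records which step was flagged. In a real PyTorch program the per-step callback would presumably be attached with something like `Optimizer.register_step_post_hook`, and the flagged step would start a `torch.profiler` session.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProfilerState:
    """State shared by the training thread and the HTTP trigger thread."""

    def __init__(self):
        self.step_count = 0
        self.trace_requested = threading.Event()
        self.traced_steps = []

    def on_optimizer_step(self):
        """Call once per optimizer step (hypothetical stand-in for a hook)."""
        self.step_count += 1
        if self.trace_requested.is_set():
            # A real tool would start the profiler here; we only record the step.
            self.traced_steps.append(self.step_count)
            self.trace_requested.clear()


def make_handler(state):
    class TriggerHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Any POST request flags the next optimizer step for tracing.
            state.trace_requested.set()
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

        def log_message(self, format, *args):
            pass  # keep the training logs free of HTTP access lines

    return TriggerHandler


def start_trigger_server(state, port=0):
    """Serve trigger requests from a daemon thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


state = ProfilerState()
server = start_trigger_server(state)
port = server.server_address[1]

state.on_optimizer_step()  # step 1: no trace requested yet

# Simulate the CLI sending the trigger request while training is running.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("POST", "/trace")
conn.getresponse().read()
conn.close()

state.on_optimizer_step()  # step 2: picks up the trigger
state.on_optimizer_step()  # step 3: back to normal

server.shutdown()
server.server_close()
```

Running the server in a daemon thread means it never blocks the training loop and exits with the main process. The `threading.Event` keeps the handoff between the HTTP thread and the training thread simple.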

Need help?

We offer support for both setting up trainy and analyzing program traces. If you are interested, please email us.
