Trainy: An observability tool for profiling PyTorch training on demand
Trainy on-demand profiler
This is the Trainy CLI and daemon for setting up on-demand tracing of PyTorch programs in pure Python. It lets you extract traces in the middle of a training run.
Installation
You can install from PyPI or from source:
# install from pypi
pip install trainy
# install from source
git clone https://github.com/Trainy-ai/trainy
pip install -e trainy
Quickstart
If you haven't already, set up Ray head and worker nodes. This can be configured to happen automatically using [SkyPilot](https://skypilot.readthedocs.io/en/latest/index.html) or Kubernetes.
# on the head node
$ ray start --head --port 6380
# on the worker nodes (the address must include the head node's port)
$ ray start --address ${HEAD_IP}:6380
In your training code, initialize the Trainy daemon before running your training loop.
import trainy
trainy.init()  # initialize the Trainy daemon before the training loop
Trainer.train()  # your existing training entry point
While your model is training, run the following to capture traces on all the nodes:
$ trainy trace --logdir ~/my-traces
This saves the traces for each process locally into ~/my-traces. We recommend running a shared file system such as NFS or an S3-backed store so that all of your traces end up in the same place. An example of how to do this and scale it up on AWS is under examples/resnet_mnist.
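After a capture, it can be handy to confirm that every node actually wrote something to the log directory. The helper below is a minimal sketch and not part of Trainy: `summarize_logdir` is a hypothetical name, and it makes no assumption about how trace files are named, it simply lists every file under the directory with its size.

```python
from pathlib import Path


def summarize_logdir(logdir):
    """Return (relative path, size in bytes) for every file under logdir.

    Hypothetical helper for illustration; it does not rely on any
    particular trace file naming scheme.
    """
    root = Path(logdir).expanduser()
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by relative path so the listing is stable across runs.
    return sorted((str(p.relative_to(root)), p.stat().st_size) for p in files)


if __name__ == "__main__":
    for rel_path, size in summarize_logdir("~/my-traces"):
        print(f"{size:>12}  {rel_path}")
```

On a shared file system, running this once from any node should show the files from all processes; with purely local directories it would have to be run on each node.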
How It Works
Trainy registers a hook on whatever PyTorch optimizer is present in your code to count optimizer iterations, and registers the program with the Ray head node. A separate HTTP server runs concurrently in a daemon thread and waits for a trigger POST request to start profiling.
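The mechanism described above can be sketched with the standard library alone. This is an illustration of the general pattern, not Trainy's actual implementation: `StepCounter`, `start_trigger_server`, and `send_trigger` are hypothetical names, the Ray registration is left out, and "profiling" is reduced to recording the step at which it would start. In a real PyTorch program the `on_optimizer_step` callback would be attached to the optimizer (PyTorch 2.0+ offers `Optimizer.register_step_post_hook` as one way to do that; whether Trainy uses it is not stated here).

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StepCounter:
    """Counts optimizer steps; stands in for a hook on a PyTorch optimizer."""

    def __init__(self):
        self.steps = 0
        self.profile_requested = threading.Event()
        self.profiled_at = []  # steps at which profiling would have started

    def on_optimizer_step(self):
        # Called once per optimizer step by the training loop (or a hook).
        self.steps += 1
        if self.profile_requested.is_set():
            self.profile_requested.clear()
            # A real tool would start a profiler here.
            self.profiled_at.append(self.steps)


def make_handler(counter):
    class TriggerHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Any POST means "start profiling at the next optimizer step".
            counter.profile_requested.set()
            body = json.dumps({"steps_so_far": counter.steps}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the training logs quiet

    return TriggerHandler


def start_trigger_server(counter, host="127.0.0.1", port=0):
    """Serve in a daemon thread; port 0 lets the OS pick a free port."""
    server = HTTPServer((host, port), make_handler(counter))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send_trigger(port, host="127.0.0.1"):
    """Send the POST that requests a trace and return the JSON reply."""
    req = urllib.request.Request(f"http://{host}:{port}/", data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    counter = StepCounter()
    server = start_trigger_server(counter)
    for _ in range(3):
        counter.on_optimizer_step()  # three ordinary training steps
    print(send_trigger(server.server_address[1]))
    counter.on_optimizer_step()  # the next step picks up the request
    print(counter.profiled_at)
    server.shutdown()
```

Because the server thread is a daemon thread, it never blocks the training process from exiting, and the training loop only pays for a cheap event check on each step.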
Need help
We offer support for both setting up trainy and analyzing program traces. If you are interested, please email us