Run distributed training in TractoAI
🚜 Tractorun
tractorun is a powerful tool for distributed ML operations on the Tracto.ai platform. It helps manage and run workflows across multiple nodes with minimal changes in the user's code.
Besides machine learning, tractorun can also run arbitrary gang operations on the Tracto.ai platform.
Core features
- Simple distributed training setup on JAX and PyTorch with minimal code changes
- Convenient ways to run and configure: CLI, YAML config, and Python SDK
- Integration with the Tracto.ai platform
Getting started
To use these examples, you'll need a Tracto account. If you don't have one yet, please sign up at tracto.ai.
Install tractorun into your python3 environment:
pip install --upgrade tractorun
Configure the client to work with your cluster:
mkdir -p ~/.yt
cat <<EOF > ~/.yt/config
"proxy"={
"url"="$YT_PROXY";
};
"token"="$YT_TOKEN";
EOF
Before running these commands, set $YT_PROXY to your actual Tracto.ai cluster address and $YT_TOKEN to your token.
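The same file can also be written from Python, for instance from a notebook where a heredoc is awkward. The following is a minimal sketch, assuming the client reads its configuration from a file named config inside ~/.yt; the helper name write_yt_config is hypothetical and not part of tractorun.

```python
from pathlib import Path


def write_yt_config(proxy: str, token: str, config_dir: Path) -> Path:
    """Write the proxy URL and token in the same format as the heredoc above.

    Assumption: the client picks up a file called "config" in config_dir
    (normally ~/.yt). This helper is illustrative, not a tractorun API.
    """
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config"
    config_path.write_text(
        '"proxy"={\n'
        f'  "url"="{proxy}";\n'
        "};\n"
        f'"token"="{token}";\n'
    )
    # The file contains a secret token, so restrict it to the current user.
    config_path.chmod(0o600)
    return config_path


# Typical call, reading the values from the environment:
#   import os
#   write_yt_config(os.environ["YT_PROXY"], os.environ["YT_TOKEN"], Path.home() / ".yt")
```

Passing the directory as an argument keeps the helper easy to try against a temporary directory before touching the real configuration.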
How to try
Run an example script:
tractorun \
--yt-path "//tmp/$USER/tractorun_getting_started" \
--bind-local './examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py:/lightning_mnist_ddp_script.py' \
--bind-local-lib ./tractorun \
--docker-image ghcr.io/tractoai/tractorun-examples-runtime:2025-01-15-20-21-21 \
python3 /lightning_mnist_ddp_script.py
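When the command is launched from a script rather than typed by hand, it can help to assemble it as an argument list. The sketch below uses only the flags shown in the example above; the function name build_tractorun_command is hypothetical, and repeating a flag to pass several bindings is an assumption rather than documented behaviour.

```python
import shlex
from typing import List, Sequence


def build_tractorun_command(
    yt_path: str,
    docker_image: str,
    user_command: Sequence[str],
    bind_local: Sequence[str] = (),
    bind_local_lib: Sequence[str] = (),
) -> List[str]:
    """Assemble a tractorun invocation as an argv list for subprocess.run."""
    argv = ["tractorun", "--yt-path", yt_path]
    # Assumption: one flag occurrence per binding.
    for binding in bind_local:
        argv += ["--bind-local", binding]
    for lib in bind_local_lib:
        argv += ["--bind-local-lib", lib]
    argv += ["--docker-image", docker_image]
    # Everything after the options is the command executed on the platform.
    argv += list(user_command)
    return argv


command = build_tractorun_command(
    yt_path="//tmp/alice/tractorun_getting_started",
    docker_image="ghcr.io/tractoai/tractorun-examples-runtime:2025-01-15-20-21-21",
    user_command=["python3", "/lightning_mnist_ddp_script.py"],
    bind_local=[
        "./examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py"
        ":/lightning_mnist_ddp_script.py"
    ],
    bind_local_lib=["./tractorun"],
)
# Shell-quoted form, handy for logging before calling subprocess.run(command).
printable = shlex.join(command)
```

Building a list instead of a single string avoids quoting problems when paths contain spaces; the user name alice in the path is a placeholder.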
How to run
CLI
tractorun --help
or with a YAML config:
tractorun --run-config-path config.yaml
You can find relevant examples in the repository.
Python SDK
SDK is convenient to use from Jupyter notebooks for development purposes.
You can find a relevant example in the repository.
WARNING: to use the SDK, the local environment must match the remote Docker image on the TractoAI platform.
- This requirement is met in Jupyter Notebook on the Tracto.ai platform.
- For local use, it is recommended to run the code in the same container as specified in the docker_image parameter in tractorun.
How to adapt code for tractorun
CLI
- Wrap all training/inference code in a function.
- Initialize the environment and Toolbox with tractorun.run.prepare_and_get_toolbox.
An example of adapting the MNIST training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/cli
SDK
- Wrap all training/inference code in a function with a toolbox: tractorun.toolbox.Toolbox parameter.
- Run this function with tractorun.run.run.
An example of adapting the MNIST training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/sdk
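The first step in both variants, moving the training code into one function that receives the toolbox, can be sketched without tractorun installed. This is an illustration under stated assumptions: the toy loop stands in for real training, and the SimpleNamespace object is a local stand-in that only mirrors the attribute names of the Toolbox, not its behaviour.

```python
from types import SimpleNamespace
from typing import Any


def train(toolbox: Any, steps: int = 10, learning_rate: float = 0.25) -> float:
    """Toy training loop: minimise (w - 3)^2 with plain gradient descent.

    On the platform the parameter would be a tractorun.toolbox.Toolbox;
    Any keeps this sketch importable without tractorun. The toolbox is not
    used by the toy loop; real code would reach platform services through it.
    """
    weight = 0.0
    for _ in range(steps):
        gradient = 2.0 * (weight - 3.0)
        weight -= learning_rate * gradient
    return weight


# Local dry run with a stand-in object (NOT a real Toolbox).
fake_toolbox = SimpleNamespace(
    yt_client=None,
    checkpoint_manager=None,
    description_manager=None,
    coordinator=None,
)
final_weight = train(fake_toolbox)
```

Once the code has this shape, the function is what gets passed to tractorun.run.run in the SDK variant. The arguments of that call are omitted here because they depend on your setup; the linked examples show complete invocations.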
Features
Toolbox
tractorun.toolbox.Toolbox provides extra integrations with the Tracto.ai platform:
- Preconfigured client via toolbox.yt_client
- Basic checkpoints via toolbox.checkpoint_manager
- Control over the operation description in the UI via toolbox.description_manager
- Access to coordination information via toolbox.coordinator
Backends
Backends configure tractorun to work with a specific ML framework.
Tractorun supports multiple backends:
- Tractorch for PyTorch
- Tractorax for JAX
- Generic: a non-specialized backend that can be used as a basis for other backends
Options and settings
The options reference page provides an overview of all available tractorun options, explaining their purpose and usage. Options can be defined by:
- CLI parameters
- YAML config
- Python options
Development
Install local environment
- Install pyenv
- Create and activate a new env:
pyenv virtualenv 3.10 tractorun && pyenv activate tractorun
- Install all dependencies:
pip install ".[all]"
Build new image for tests
./run_build.sh generic
./run_build.sh tractorch_tests
./run_build.sh tractorax_tests
./run_build.sh tensorproxy_tests
and update images in ./run_tests and tests/utils.py
Build and push a new image for examples
./run_build.sh examples_runtime --push
and update the image in ./examples/run_example
Update current image tag for tests and examples
./run_update_tag.sh new_tag
Run tests
To run the tests with pytest on a local YT:
./run_tests.sh all . -s
To run the tests on a remote cluster:
./run_tests.sh general . -s
./run_tests.sh tensorproxy . -s
Extra pytest options, such as a test file or a single test, can be passed as well:
./run_tests.sh generic test_sidecars.py
./run_tests.sh generic test_sidecars.py::test_sidecar_run
Build and upload
- Run Create release
- Run Build and upload to external pypi. Specify the latest tag from the list to upload the latest version.