Run distributed training in TractoAI

🚜 Tractorun

tractorun is a powerful tool for distributed ML operations on the Tracto.ai platform. It helps manage and run workflows across multiple nodes with minimal changes in the user's code.

Besides machine learning, tractorun can also run arbitrary gang operations on the Tracto.ai platform.

Core features

  • Simple distributed training setup on JAX and PyTorch with minimal code changes
  • Convenient ways to run and configure: CLI, YAML config, and Python SDK
  • Integration with the Tracto.ai platform

Getting started

To use these examples, you'll need a Tracto account. If you don't have one yet, please sign up at tracto.ai.

Install tractorun into your Python 3 environment:

pip install --upgrade tractorun

Configure the client to work with your cluster:

mkdir -p ~/.yt
cat <<EOF > ~/.yt/config
"proxy"={
  "url"="$YT_PROXY";
};
"token"="$YT_TOKEN";
EOF

Before running the snippet, set $YT_PROXY to your Tracto.ai cluster address and $YT_TOKEN to your token.
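For setups where a shell heredoc is inconvenient, the same file can be produced from Python. This is a minimal sketch that assumes the client reads its configuration from ~/.yt/config; the helper names render_yt_config and write_yt_config are illustrative and not part of tractorun.

```python
import os
from pathlib import Path


def render_yt_config(proxy_url: str, token: str) -> str:
    """Render the client config in the same format as the shell heredoc."""
    return (
        '"proxy"={\n'
        f'  "url"="{proxy_url}";\n'
        '};\n'
        f'"token"="{token}";\n'
    )


def write_yt_config(path: Path, proxy_url: str, token: str) -> Path:
    """Create the parent directory if needed and write the config file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_yt_config(proxy_url, token))
    path.chmod(0o600)  # the file holds a token, so keep it readable by the owner only
    return path
```

Calling write_yt_config(Path.home() / ".yt" / "config", os.environ["YT_PROXY"], os.environ["YT_TOKEN"]) should give the same result as the shell version, provided both variables are set.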

How to try

Run an example script:

tractorun \
    --yt-path "//tmp/$USER/tractorun_getting_started" \
    --bind-local './examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py:/lightning_mnist_ddp_script.py' \
    --bind-local-lib ./tractorun \
    --docker-image ghcr.io/tractoai/tractorun-examples-runtime:2025-01-15-20-21-21 \
    python3 /lightning_mnist_ddp_script.py
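When the same command has to be launched from a script or CI job, it can help to assemble the argument list in Python. The sketch below uses only the flags that appear in the command above; build_tractorun_command is a hypothetical helper, not a tractorun API.

```python
from typing import List, Optional, Sequence


def build_tractorun_command(
    yt_path: str,
    docker_image: str,
    user_command: Sequence[str],
    bind_local: Optional[str] = None,
    bind_local_lib: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the tractorun CLI call."""
    cmd = ["tractorun", "--yt-path", yt_path]
    if bind_local is not None:
        # Format taken from the example: "<local path>:<path inside the job>"
        cmd += ["--bind-local", bind_local]
    if bind_local_lib is not None:
        cmd += ["--bind-local-lib", bind_local_lib]
    cmd += ["--docker-image", docker_image]
    # The user command goes last, after all tractorun options.
    cmd += list(user_command)
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), or printed with shlex.join(cmd) to inspect the exact command line first.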

How to run

CLI

tractorun --help

or with a YAML config:

tractorun --run-config-path config.yaml

Relevant examples are available in the repository.

Python SDK

The SDK is convenient to use from Jupyter notebooks during development.

You can find a relevant example in the repository.

WARNING: to use the SDK, the local environment must match the remote docker image on the TractoAI platform.

  • This requirement is met in Jupyter Notebook on the Tracto.ai platform.
  • For local use, it is recommended to run the code in the same container as the one specified in tractorun's docker_image parameter.

How to adapt code for tractorun

CLI

  1. Wrap all training/inference code in a function.
  2. Initialize the environment and the Toolbox with tractorun.run.prepare_and_get_toolbox.

An example of adapting the mnist training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/cli

SDK

  1. Wrap all training/inference code in a function that takes a toolbox: tractorun.toolbox.Toolbox parameter.
  2. Run this function with tractorun.run.run.

An example of adapting the mnist training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/sdk
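The shape of step 1 can be illustrated without tractorun installed. In the sketch below the toy loop stands in for real training code, and fake_toolbox is a hypothetical local stand-in for tractorun.toolbox.Toolbox. The arguments of tractorun.run.run are not described in this document, so the call itself is left to the linked example.

```python
from types import SimpleNamespace
from typing import Any, List


def train(toolbox: Any) -> List[float]:
    """Entry point that holds all training code and receives the toolbox.

    In a real job the parameter would be a tractorun.toolbox.Toolbox; it is
    typed as Any here so that the sketch runs without tractorun.
    """
    weight = 0.0
    losses = []
    # Toy gradient descent on (weight - 3)^2, standing in for a training loop.
    for _ in range(20):
        grad = 2.0 * (weight - 3.0)
        weight -= 0.1 * grad
        losses.append((weight - 3.0) ** 2)
    return losses


# Hypothetical stand-in exposing the attribute names listed in the Toolbox section.
fake_toolbox = SimpleNamespace(
    yt_client=None,
    checkpoint_manager=None,
    description_manager=None,
    coordinator=None,
)
losses = train(fake_toolbox)
```

Keeping everything inside one function means the same code can be smoke-tested locally with a stand-in and then handed to tractorun.run.run unchanged.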

Features

Toolbox

tractorun.toolbox.Toolbox provides extra integrations with the Tracto.ai platform:

  • A preconfigured client via toolbox.yt_client
  • Basic checkpoints via toolbox.checkpoint_manager
  • Control over the operation description in the UI via toolbox.description_manager
  • Access to coordination information via toolbox.coordinator
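One common use of coordination information in a gang operation is to give each worker its own slice of the input. The sketch below shows round-robin sharding as a plain function. This document does not list the coordinator's attributes, so self_index and peer_count are hypothetical inputs that would be derived from toolbox.coordinator.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_peer(items: Sequence[T], self_index: int, peer_count: int) -> List[T]:
    """Return the items assigned to one peer using round-robin assignment."""
    if peer_count <= 0:
        raise ValueError("peer_count must be positive")
    if not 0 <= self_index < peer_count:
        raise ValueError("self_index must be in [0, peer_count)")
    # Peer k takes items k, k + peer_count, k + 2 * peer_count, ...
    return list(items[self_index::peer_count])
```

Every item lands on exactly one peer, and shard sizes differ by at most one, which keeps the work balanced when items have similar cost.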

Backends

Backends configure tractorun to work with a specific ML framework.

Tractorun supports multiple backends, including PyTorch and JAX.

Options and settings

The options reference page gives an overview of all available tractorun options and explains their purpose and usage. Options can be defined by:

  • CLI parameters
  • YAML config
  • Python options

Development

Install local environment

  1. Install pyenv.
  2. Create and activate a new environment: pyenv virtualenv 3.10 tractorun && pyenv activate tractorun
  3. Install all dependencies: pip install ".[all]"

Build new images for tests

./run_build.sh generic
./run_build.sh tractorch_tests
./run_build.sh tractorax_tests
./run_build.sh tensorproxy_tests

and update images in ./run_tests and tests/utils.py

Build and push a new image for examples

./run_build.sh examples_runtime --push

and update the image in ./examples/run_example

Update current image tag for tests and examples

./run_update_tag.sh new_tag

Run tests

To run the tests against a local YT instance:

./run_tests.sh all . -s

To run the tests on a remote cluster:

./run_tests.sh general . -s
./run_tests.sh tensorproxy . -s

Extra pytest options can be passed after the suite name:

./run_tests.sh generic test_sidecars.py
./run_tests.sh generic test_sidecars.py::test_sidecar_run

Build and upload

  1. Run Create release
  2. Run Build and upload to external pypi. Specify the latest tag from the list to upload the latest version.
