Run distributed training in TractoAI

🚜 Tractorun

tractorun is a powerful tool for distributed ML operations on the Tracto.ai platform. It helps manage and run workflows across multiple nodes with minimal changes in the user's code.

Besides machine learning, tractorun can also run arbitrary gang operations on the Tracto.ai platform.

Core features

  • Simple distributed training setup on JAX and PyTorch with minimal code changes
  • Convenient ways to run and configure: CLI, YAML config, and Python SDK
  • Integration with the Tracto.ai platform

Getting started

To use these examples, you'll need a Tracto account. If you don't have one yet, please sign up at tracto.ai.

Install tractorun into your python3 environment:

pip install --upgrade tractorun

Configure the client to work with your cluster:

mkdir -p ~/.yt
cat <<EOF > ~/.yt/config
"proxy"={
  "url"="$YT_PROXY";
};
"token"="$YT_TOKEN";
EOF

Before running the command, set $YT_PROXY to your actual Tracto.ai cluster address and $YT_TOKEN to your token; the shell expands both variables when it writes the file.
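The config is a short text file with two keys. As a minimal sketch, the same text can be produced from Python, which makes it easy to fail early when a value is missing. The helper name render_yt_config is hypothetical and is not part of tractorun or the YT client; it only mirrors the format of the heredoc above.

```python
def render_yt_config(proxy_url: str, token: str) -> str:
    """Return config text in the same format as the shell heredoc."""
    # Refuse to produce a config with empty credentials, which the
    # shell version would silently write if a variable were unset.
    if not proxy_url or not token:
        raise ValueError("both the proxy URL and the token must be non-empty")
    return (
        '"proxy"={\n'
        f'  "url"="{proxy_url}";\n'
        "};\n"
        f'"token"="{token}";\n'
    )
```

A caller would typically pass os.environ["YT_PROXY"] and os.environ["YT_TOKEN"] and write the result to the client config file.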

How to try

Run an example script:

tractorun \
    --yt-path "//tmp/$USER/tractorun_getting_started" \
    --bind-local './examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py:/lightning_mnist_ddp_script.py' \
    --bind-local-lib ./tractorun \
    --docker-image ghcr.io/tractoai/tractorun-examples-runtime:2025-01-15-20-21-21 \
    python3 /lightning_mnist_ddp_script.py
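When the same launch has to be scripted, the command line can be assembled as an argument list and handed to subprocess. The sketch below uses only the flags that appear in the command above; the helper name build_tractorun_command and the user name "alice" are placeholders for illustration.

```python
import shlex
from typing import Sequence


def build_tractorun_command(
    yt_path: str,
    docker_image: str,
    command: Sequence[str],
    bind_local: Sequence[str] = (),
    bind_local_lib: Sequence[str] = (),
) -> list[str]:
    """Assemble a tractorun argv list from the flags shown in the README."""
    argv = ["tractorun", "--yt-path", yt_path]
    for binding in bind_local:
        # Each binding has the form "local_path:remote_path".
        argv += ["--bind-local", binding]
    for lib in bind_local_lib:
        argv += ["--bind-local-lib", lib]
    argv += ["--docker-image", docker_image]
    # The user command goes last, after all tractorun options.
    argv += list(command)
    return argv


example = build_tractorun_command(
    yt_path="//tmp/alice/tractorun_getting_started",  # "alice" stands in for $USER
    docker_image="ghcr.io/tractoai/tractorun-examples-runtime:2025-01-15-20-21-21",
    command=["python3", "/lightning_mnist_ddp_script.py"],
    bind_local=[
        "./examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py"
        ":/lightning_mnist_ddp_script.py"
    ],
    bind_local_lib=["./tractorun"],
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(example))
```

Passing the list to subprocess.run(example, check=True) would start the operation without going through a shell, so no extra quoting is needed.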

How to run

CLI

tractorun --help

or with a YAML config:

tractorun --run-config-path config.yaml

You can find relevant examples in the repository.

Python SDK

The SDK is convenient to use from Jupyter notebooks during development.

You can find a relevant example in the repository.

WARNING: to use the SDK, the local environment must match the remote Docker image on the Tracto.ai platform.

  • This requirement is met in Jupyter Notebook on the Tracto.ai platform.
  • For local use, it is recommended to run the code in the same container as the one specified in the docker_image parameter of tractorun.

How to adapt code for tractorun

CLI

  1. Wrap all training/inference code in a function.
  2. Initialize the environment and the Toolbox with tractorun.run.prepare_and_get_toolbox.

An example of adapting the mnist training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/cli
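The two steps can be sketched roughly as follows. Only prepare_and_get_toolbox is named in this README; the Tractorch backend class, its import path, and the backend= keyword are assumptions, so check the linked example for the exact call. The training loop is a pure-Python placeholder.

```python
def train(toolbox) -> list[float]:
    """Step 1: all training code lives in one function."""
    # Placeholder "training": minimise (w - 3)^2 by gradient descent.
    # A real script would build its model here and could use the toolbox.
    w, lr = 0.0, 0.1
    losses = []
    for _ in range(50):
        grad = 2.0 * (w - 3.0)
        w -= lr * grad
        losses.append((w - 3.0) ** 2)
    return losses


def main() -> None:
    """Step 2: initialise the environment and the Toolbox, then train."""
    # Imports stay inside main() so train() can be tested without tractorun.
    # Assumption: backend class, import path, and keyword are not confirmed here.
    from tractorun.backend.tractorch import Tractorch
    from tractorun.run import prepare_and_get_toolbox

    toolbox = prepare_and_get_toolbox(backend=Tractorch())
    train(toolbox)
```

A script submitted with the tractorun CLI would call main() as its last statement; the call is left out here so that train can be exercised on its own.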

SDK

  1. Wrap all training/inference code in a function that takes a toolbox: tractorun.toolbox.Toolbox parameter.
  2. Run this function with tractorun.run.run.

An example of adapting the mnist training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/sdk
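A rough sketch of the SDK variant is shown below. The names tractorun.run.run and tractorun.toolbox.Toolbox come from this README; the Tractorch backend and the keyword arguments passed to run are assumptions modelled on the CLI options, and submit is a hypothetical helper. The linked example remains the reference.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type checking only, so the module loads without tractorun.
    from tractorun.toolbox import Toolbox


def train(toolbox: "Toolbox") -> float:
    """Step 1: the whole job is a function that receives the toolbox."""
    # Placeholder computation standing in for real training code.
    w = 0.0
    for _ in range(20):
        w -= 0.25 * 2.0 * (w - 1.0)
    return (w - 1.0) ** 2


def submit(yt_path: str, docker_image: str) -> None:
    """Step 2: hand the function to tractorun.run.run."""
    # Assumption: backend class and keyword names are not confirmed by this README.
    from tractorun.backend.tractorch import Tractorch
    from tractorun.run import run

    run(train, backend=Tractorch(), yt_path=yt_path, docker_image=docker_image)
```

From a notebook, one would call submit with a working directory on the cluster and the Docker image that matches the local environment.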

Features

Toolbox

tractorun.toolbox.Toolbox provides extra integrations with the Tracto.ai platform:

  • A preconfigured client via toolbox.yt_client
  • Basic checkpoints via toolbox.checkpoint_manager
  • Control over the operation description in the UI via toolbox.description_manager
  • Access to coordination information via toolbox.coordinator

Backends

Backends configure tractorun to work with a specific ML framework.

Tractorun supports multiple backends:

Options and settings

The options reference page provides an overview of all available options for tractorun, explaining their purpose and usage. Options can be defined by:

  • CLI parameters
  • YAML config
  • Python options

Development

Install local environment

  1. Install pyenv
  2. Create and activate a new environment: pyenv virtualenv 3.10 tractorun && pyenv activate tractorun
  3. Install all dependencies: pip install ".[all]"

Build new image for tests

./run_build.sh generic
./run_build.sh tractorch_tests
./run_build.sh tractorax_tests
./run_build.sh tensorproxy_tests

and update images in ./run_tests and tests/utils.py

Build and push a new image for examples

./run_build.sh examples_runtime --push

and update the image in ./examples/run_example

Update current image tag for tests and examples

./run_update_tag.sh new_tag

Run tests

To run the pytest suite on a local YT instance:

./run_tests.sh all . -s

To run tests on a remote cluster:

./run_tests.sh general . -s
./run_tests.sh tensorproxy . -s

Extra pytest options, such as a test file or a single test, can be passed as additional arguments:

./run_tests.sh generic test_sidecars.py
./run_tests.sh generic test_sidecars.py::test_sidecar_run

Build and upload

  1. Run Create release
  2. Run Build and upload to external PyPI. Specify the latest tag from the list to upload the latest version.
