Run distributed training in TractoAI

🚜 Tractorun

tractorun is a tool for distributed ML operations on the Tracto.ai platform. It helps manage and run workflows across multiple nodes with minimal changes to the user's code.

Besides machine learning, tractorun can also run arbitrary gang operations on the Tracto.ai platform.

Core features

  • Simple distributed training setup on JAX and PyTorch with minimal code changes
  • Convenient ways to run and configure: CLI, YAML config, and Python SDK
  • Integration with the Tracto.ai platform

Getting started

Install tractorun into your Python 3 environment:

pip install --upgrade tractorun

Configure the client to work with your cluster:

mkdir -p ~/.yt
cat <<EOF > ~/.yt/config
"proxy"={
  "url"="$YT_PROXY";
};
"token"="$YT_TOKEN";
EOF

Set $YT_PROXY to your Tracto.ai cluster address and $YT_TOKEN to your token before running the command above; the shell expands both variables when it writes the file.
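The same configuration step can be sketched in Python, which may be handy in provisioning scripts. This is a minimal sketch, not part of tractorun: the helper names render_yt_config and write_yt_config are hypothetical, and the target path ~/.yt/config is an assumption about where the client looks for its configuration.

```python
from pathlib import Path


def render_yt_config(proxy_url: str, token: str) -> str:
    """Render the client config in the same format as the shell snippet above."""
    return (
        '"proxy"={\n'
        f'  "url"="{proxy_url}";\n'
        "};\n"
        f'"token"="{token}";\n'
    )


def write_yt_config(path: Path, proxy_url: str, token: str) -> None:
    """Write the config, creating the parent directory if it is missing."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_yt_config(proxy_url, token))
    # The file holds a token, so restrict it to the current user.
    path.chmod(0o600)
```

A typical call would be write_yt_config(Path.home() / ".yt" / "config", os.environ["YT_PROXY"], os.environ["YT_TOKEN"]), with both environment variables set beforehand.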

How to try

Run an example script:

tractorun \
    --yt-path "//tmp/$USER/tractorun_getting_started" \
    --bind-local './examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py:/lightning_mnist_ddp_script.py' \
    --bind-local-lib ./tractorun \
    --docker-image cr.ai.nebius.cloud/crnf2coti090683j5ssi/tractorun/examples_runtime:2024-11-20-20-00-05 \
    python3 /lightning_mnist_ddp_script.py
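When the same launch has to be repeated with different paths or images, it can help to assemble the command in Python. The sketch below only builds an argument list from the flags used in the example above (--yt-path, --bind-local, --bind-local-lib, --docker-image); the function names build_tractorun_command and to_shell are hypothetical, and no other tractorun options are assumed.

```python
import shlex
from typing import List, Optional, Sequence


def build_tractorun_command(
    yt_path: str,
    docker_image: str,
    command: Sequence[str],
    bind_local: Optional[str] = None,
    bind_local_lib: Optional[str] = None,
) -> List[str]:
    """Assemble a tractorun invocation as an argv list (no shell quoting needed)."""
    argv = ["tractorun", "--yt-path", yt_path]
    if bind_local is not None:
        # Format taken from the example: "<local path>:<path inside the container>"
        argv += ["--bind-local", bind_local]
    if bind_local_lib is not None:
        argv += ["--bind-local-lib", bind_local_lib]
    argv += ["--docker-image", docker_image]
    # The user command goes last, exactly as in the CLI example.
    argv += list(command)
    return argv


def to_shell(argv: Sequence[str]) -> str:
    """Render the argv list as a single copy-pasteable shell line."""
    return shlex.join(argv)
```

The resulting list can be passed to subprocess.run(argv, check=True) on a machine where tractorun is installed and configured.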

How to run

CLI

tractorun --help

or with a YAML config:

tractorun --run-config-path config.yaml

You can find relevant examples in the repository.

Python SDK

The SDK is convenient to use from Jupyter notebooks during development.

You can find a relevant example in the repository.

WARNING: to use the SDK, the local environment must match the remote Docker image on the TractoAI platform.

  • This requirement is met in Jupyter Notebook on the Tracto.ai platform.
  • For local use, it is recommended to run the code in the same container as the one specified in the docker_image parameter of tractorun.

How to adapt code for tractorun

CLI

  1. Wrap all training/inference code in a function.
  2. Initialize the environment and the Toolbox with tractorun.run.prepare_and_get_toolbox.

An example of adapting the mnist training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/cli

SDK

  1. Wrap all training/inference code in a function that takes a toolbox: tractorun.toolbox.Toolbox parameter.
  2. Run this function with tractorun.run.run.

An example of adapting the mnist training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/sdk
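Step 1 can be illustrated without tractorun installed. The sketch below is an illustration of the code shape only, not the library's API: ToolboxLike is a hypothetical structural type that lists just the four attributes named in the Toolbox section of this README, the "training" is a toy gradient descent, and the stand-in object exists only for a local dry run.

```python
from types import SimpleNamespace
from typing import Any, Protocol


class ToolboxLike(Protocol):
    """Hypothetical stand-in for tractorun.toolbox.Toolbox (attribute names only)."""

    yt_client: Any
    checkpoint_manager: Any
    description_manager: Any
    coordinator: Any


def train(toolbox: ToolboxLike) -> float:
    """All training code lives in one function that receives the toolbox."""
    # The toolbox is unused in this placeholder; real code would reach the
    # platform through it, e.g. toolbox.yt_client or toolbox.checkpoint_manager.
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x
    weight = 0.0
    for _ in range(200):
        # Gradient of the mean squared error with respect to the weight.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= 0.05 * grad
    return weight


# Local dry run with a stub; on the platform the real Toolbox is supplied instead.
stub = SimpleNamespace(
    yt_client=None,
    checkpoint_manager=None,
    description_manager=None,
    coordinator=None,
)
result = train(stub)
```

For step 2, the same train function is handed to tractorun.run.run; its arguments are not reproduced here, so refer to the linked example for the actual call.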

Features

Toolbox

tractorun.toolbox.Toolbox provides extra integrations with the Tracto.ai platform:

  • A preconfigured client via toolbox.yt_client
  • Basic checkpoints via toolbox.checkpoint_manager
  • Control over the operation description in the UI via toolbox.description_manager
  • Access to coordination information via toolbox.coordinator

Backends

Backends configure tractorun to work with a specific ML framework.

Tractorun supports multiple backends, including PyTorch and JAX.

Options and settings

The options reference page provides an overview of all available tractorun options, explaining their purpose and usage. Options can be defined by:

  • CLI parameters
  • YAML config
  • Python options

Development

Install local environment

  1. Install pyenv
  2. Create and activate a new environment: pyenv virtualenv 3.10 tractorun && pyenv activate tractorun
  3. Install all dependencies: pip install ".[all]"

Build new image for tests

./run_build generic
./run_build tractorch_tests
./run_build tractorax_tests
./run_build tensorproxy_tests

Then update the images in ./run_tests and tests/utils.py.

Build and push a new image for examples

./run_build examples_runtime --push

Then update the image in ./examples/run_example.

Run tests

To run the tests on a local YT instance:

./run_tests all . -s

To run the tests on a remote cluster:

./run_tests generic . -s
./run_tests tensorproxy . -s

Extra pytest options can be passed, for example to select a single test file or a single test:

./run_tests generic test_sidecars.py
./run_tests generic test_sidecars.py::test_sidecar_run

Build and upload

  1. Run Create release
  2. Run Build and upload to external pypi. Specify the latest tag from the list to upload the latest version.
