Run distributed training in TractoAI
🚜 Tractorun
Tractorun is a tool for distributed ML operations on the Tracto.ai platform. It helps manage and run workflows across multiple nodes with minimal changes to the user's code:
- Training and fine-tuning models. Use Tractorun to train models across multiple compute nodes efficiently.
- Offline batch inference. Perform fast and scalable model inference.
- Running arbitrary GPU operations, ideal for any computational tasks that require distributed GPU resources.
How it works
Built on top of Tracto.ai, Tractorun coordinates distributed machine learning tasks. It has out-of-the-box integrations with PyTorch and Jax, and it can also be used with any other training or inference framework.
Key advantages:
- No need to manage your cloud infrastructure, such as configuring a Kubernetes cluster or managing GPU and InfiniBand drivers. Tracto.ai solves all these infrastructure problems for you.
- No need to coordinate distributed processes. Tractorun handles it based on the training configuration: the number of nodes and GPUs used.
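As a rough illustration of what that coordination involves, the sketch below enumerates process identities for a given number of nodes and processes per node. It assumes the common node-major numbering (global rank = node rank × processes per node + local rank). The names `ProcessIdentity` and `enumerate_ranks` are hypothetical and not part of tractorun, and tractorun's actual assignment may differ.

```python
from typing import NamedTuple


class ProcessIdentity(NamedTuple):
    node_rank: int   # index of the node
    local_rank: int  # index of the process within its node
    rank: int        # index of the process across all nodes


def enumerate_ranks(node_count: int, processes_per_node: int) -> list[ProcessIdentity]:
    """List every process identity under an assumed node-major layout."""
    if node_count < 1 or processes_per_node < 1:
        raise ValueError("node_count and processes_per_node must be positive")
    return [
        ProcessIdentity(node, local, node * processes_per_node + local)
        for node in range(node_count)
        for local in range(processes_per_node)
    ]


layout = enumerate_ranks(node_count=2, processes_per_node=4)
world_size = len(layout)  # 2 nodes * 4 processes per node
```

With two nodes and four processes per node, the layout contains eight identities, and every process would need to agree on this numbering before training starts. That agreement is the bookkeeping Tractorun takes over.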
Key features:
- Simple distributed task setup: just specify the number of nodes and GPUs.
- Convenient ways to run and configure: CLI, YAML config, and Python SDK.
- A range of powerful capabilities, including sidecars for auxiliary tasks and transparent mounting of local files directly into distributed operations.
- Integration with the Tracto.ai platform: use datasets and checkpoints stored in the Tracto.ai storage, and build pipelines with Tractorun, MapReduce, ClickHouse, Spark, and more.
Getting started
To use these examples, you'll need a Tracto account. If you don't have one yet, please sign up at tracto.ai.
Install tractorun into your Python 3 environment:
pip install --upgrade tractorun
Configure the client to work with your cluster:
mkdir -p ~/.yt
cat <<EOF > ~/.yt/config
"proxy"={
"url"="$YT_PROXY";
};
"token"="$YT_TOKEN";
EOF
Before running this command, set $YT_PROXY to your Tracto.ai cluster address and $YT_TOKEN to your token.
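The same file can also be produced from Python, which may be handy in setup scripts. This is a minimal sketch, assuming the client reads its configuration from `~/.yt/config`. The helper names `render_config` and `write_config` are hypothetical, and the demonstration writes placeholder values into a temporary directory rather than your real home directory.

```python
import tempfile
from pathlib import Path


def render_config(proxy_url: str, token: str) -> str:
    """Render the client config with the same structure as the shell snippet."""
    return f'"proxy"={{\n"url"="{proxy_url}";\n}};\n"token"="{token}";\n'


def write_config(home: Path, proxy_url: str, token: str) -> Path:
    """Create <home>/.yt/config and restrict it to the current user."""
    config_dir = home / ".yt"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config"
    config_path.write_text(render_config(proxy_url, token))
    config_path.chmod(0o600)  # the file holds a secret token
    return config_path


# Demonstration against a throwaway directory with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    path = write_config(Path(tmp), "example-cluster.invalid", "placeholder-token")
    written = path.read_text()
```

For real use, you would pass `Path.home()` together with the values of `YT_PROXY` and `YT_TOKEN` read from `os.environ`.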
How to try
Run an example script:
tractorun \
--yt-path "//tmp/$USER/tractorun_getting_started" \
--bind-local './examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py:/lightning_mnist_ddp_script.py' \
--bind-local-lib ./tractorun \
--docker-image ghcr.io/tractoai/tractorun-examples-runtime:2025-02-10-16-14-27 \
python3 /lightning_mnist_ddp_script.py
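When the command is launched from a script rather than typed by hand, it can help to assemble the arguments programmatically. The sketch below uses only the flags that appear in the command above. The function `build_tractorun_command` is a hypothetical helper, not part of tractorun, and `alice` stands in for `$USER`.

```python
import shlex
from collections.abc import Sequence


def build_tractorun_command(
    yt_path: str,
    docker_image: str,
    user_command: Sequence[str],
    bind_local: Sequence[str] = (),
    bind_local_lib: Sequence[str] = (),
) -> list[str]:
    """Assemble the argv list for a tractorun CLI call."""
    argv = ["tractorun", "--yt-path", yt_path]
    for mapping in bind_local:  # "local_path:remote_path" pairs, as in the example
        argv += ["--bind-local", mapping]
    for lib in bind_local_lib:  # local library directories, as in the example
        argv += ["--bind-local-lib", lib]
    argv += ["--docker-image", docker_image]
    argv += list(user_command)  # the command that runs remotely
    return argv


argv = build_tractorun_command(
    yt_path="//tmp/alice/tractorun_getting_started",
    docker_image="ghcr.io/tractoai/tractorun-examples-runtime:2025-02-10-16-14-27",
    user_command=["python3", "/lightning_mnist_ddp_script.py"],
    bind_local=[
        "./examples/pytorch/lightning_mnist_ddp_script/lightning_mnist_ddp_script.py"
        ":/lightning_mnist_ddp_script.py"
    ],
    bind_local_lib=["./tractorun"],
)
print(shlex.join(argv))  # shell-quoted form of the assembled command
```

Passing `argv` to `subprocess.run(argv, check=True)` would execute it, provided tractorun is installed and the client is configured.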
How to run
CLI
tractorun --help
or with a YAML config:
tractorun --run-config-path config.yaml
You can find relevant examples in the repository.
Python SDK
The SDK is convenient to use from Jupyter notebooks during development.
You can find a relevant example in the repository.
WARNING: to use the SDK, the local environment must match the remote Docker image on the TractoAI platform.
- This requirement is met in Jupyter Notebook on the Tracto.ai platform.
- For local use, run the code in the same container as the one specified in the docker_image parameter of tractorun.
How to adapt code for tractorun
CLI
- Wrap all training/inference code in a function.
- Initialize the environment and the Toolbox with tractorun.run.prepare_and_get_toolbox.
An example of adapting the mnist training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/cli
SDK
- Wrap all training/inference code in a function with a toolbox: tractorun.toolbox.Toolbox parameter.
- Run this function with tractorun.run.run.
An example of adapting the mnist training from the PyTorch repository: https://github.com/tractoai/tractorun/tree/main/examples/adoptation/mnist_simple/sdk
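The shape of such a function can be sketched without tractorun installed. In the example below, `train` is a hypothetical workload that splits a toy dataset by the RANK and WORLD_SIZE variables described in the Coordination section. The `toolbox` argument is typed as `Any` and left unused, so the function can be dry-run locally. On the platform, tractorun would supply a real `tractorun.toolbox.Toolbox`, and the function would be handed to `tractorun.run.run`, whose arguments are not shown here.

```python
import os
from typing import Any


def train(toolbox: Any) -> int:
    """Entire workload in one function; `toolbox` is supplied by the caller."""
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    samples = list(range(10))          # stand-in for a real dataset
    shard = samples[rank::world_size]  # each process takes every world_size-th item
    return sum(shard)                  # stand-in for a real training loop


# Local dry run as process 1 of 2, with no toolbox available.
os.environ.update({"RANK": "1", "WORLD_SIZE": "2"})
result = train(toolbox=None)
```

Keeping all state inside the function, with nothing at module import time, is what lets the same code run unchanged in every process.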
Features
Toolbox
tractorun.toolbox.Toolbox provides extra integrations with the Tracto.ai platform:
- Preconfigured client by toolbox.yt_client
- Basic checkpoints by toolbox.checkpoint_manager
- Control over the operation description in the UI by toolbox.description_manager
- Access to coordination information by toolbox.coordinator
The Toolbox page provides an overview of all available toolbox components.
Coordination
Tractorun always sets the following environment variables in each process:
- MASTER_ADDR - the address of the master node
- MASTER_PORT - the port of the master node
- WORLD_SIZE - the total number of processes
- NODE_RANK - the unique id of the current node (job in terms of Tracto.ai)
- LOCAL_RANK - the unique id of the current process on the current node
- RANK - the unique id of the current process across all nodes
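User code can read these variables with the standard library alone. The sketch below parses them into a small typed record. The `Coordination` class and `read_coordination` function are hypothetical helpers, not part of tractorun, and the example values are hand-written placeholders rather than real cluster output.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Coordination:
    master_addr: str
    master_port: int
    world_size: int
    node_rank: int
    local_rank: int
    rank: int

    @property
    def is_primary(self) -> bool:
        """True for the single process with global rank 0."""
        return self.rank == 0


def read_coordination(env: Mapping[str, str] = os.environ) -> Coordination:
    """Parse the coordination variables; raises KeyError if one is missing."""
    return Coordination(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        node_rank=int(env["NODE_RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        rank=int(env["RANK"]),
    )


# Example with hand-written placeholder values.
example = read_coordination({
    "MASTER_ADDR": "node-0.example.invalid",
    "MASTER_PORT": "29500",
    "WORLD_SIZE": "8",
    "NODE_RANK": "1",
    "LOCAL_RANK": "2",
    "RANK": "6",
})
```

A check such as `is_primary` is a common way to make sure that only one process writes logs or checkpoints.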
Backends
Backends configure tractorun to work with a specific ML framework.
Tractorun supports multiple backends:
- Tractorch for PyTorch
- Tractorax for Jax
- Generic: a non-specialized backend that can be used as a basis for other backends
The Backends page provides an overview of all available backends.
Options and settings
The options reference page provides an overview of all available options for tractorun, explaining their purpose and usage. Options can be defined by:
- CLI parameters
- YAML config
- Python options
More information