
A tool for submitting and managing distributed PyTorch jobs


Torch Submit

Introduction

Torch Submit is a lightweight, easy-to-use tool for running distributed PyTorch jobs across multiple machines. It's designed for researchers and developers who:

  • Have access to a bunch of machines with IP addresses
  • Want to run distributed PyTorch jobs without the hassle
  • Don't have the time, energy, or patience to set up complex cluster management systems like SLURM or Kubernetes

Under the hood, Torch Submit uses Fabric to copy your working directory to each remote machine and torchrun to execute the command there.
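To make the launch step concrete, here is a minimal sketch of how one torchrun command per machine could be assembled. It is an illustration under assumptions, not the code in torch_submit/executor.py: the function name `build_torchrun_commands`, the choice of the first host as rendezvous master, the default port, and the use of `--no_python` (so that an entrypoint such as `python train.py` is executed as-is) are all hypothetical.

```python
import shlex
from typing import List, Sequence


def build_torchrun_commands(
    hosts: Sequence[str],
    entrypoint: Sequence[str],
    nproc_per_node: int = 1,
    master_port: int = 29500,
) -> List[str]:
    """Return one torchrun command line per host (hypothetical helper).

    Assumption: the first host acts as the rendezvous master.
    """
    master_addr = hosts[0]
    commands = []
    for node_rank, _host in enumerate(hosts):
        args = [
            "torchrun",
            f"--nnodes={len(hosts)}",
            f"--nproc_per_node={nproc_per_node}",
            f"--node_rank={node_rank}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            "--no_python",  # run the entrypoint directly instead of prepending python
            *entrypoint,
        ]
        # Quote every argument so the line is safe to pass to a remote shell.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    for line in build_torchrun_commands(
        ["10.0.0.1", "10.0.0.2"], ["python", "train.py"], nproc_per_node=4
    ):
        print(line)
```

Each resulting line differs only in `--node_rank`, which is what lets the processes on different machines join the same job.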

To understand how jobs are created and scheduled, read torch_submit/executor.py.

Features

  • Simple cluster configuration: Just add your machines' IP addresses
  • Easy job submission: Run your PyTorch jobs with a single command
  • Job management: Submit, stop, restart, and monitor your jobs
  • Log tailing: Easily view the logs of your running jobs
  • Optuna integration: Run parallel hyperparameter optimization

Installation

pip install torch-submit

or from source:

pip install -e . --prefix ~/.local

Quick Start

  1. Set up a cluster:

    torch-submit cluster create
    

    Follow the interactive prompts to add your machines.

  2. Submit a job:

    torch-submit job submit --cluster my_cluster -- <entrypoint>
    # for example:
    # torch-submit job submit --cluster my_cluster -- python train.py
    # torch-submit job submit --cluster my_cluster -- python -m main.train
    
  3. List running jobs:

    torch-submit job list
    
  4. Tail logs:

    torch-submit logs tail <job_id>
    
  5. Stop a job:

    torch-submit job stop <job_id>
    
  6. Restart a stopped job:

    torch-submit job restart <job_id>
    

Usage

Cluster Management

  • Create a cluster: torch-submit cluster create
  • List clusters: torch-submit cluster list
  • Remove a cluster: torch-submit cluster remove <cluster_name>

Job Management

  • Submit a job: torch-submit job submit --cluster my_cluster -- <entrypoint>
  • List jobs: torch-submit job list
  • Stop a job: torch-submit job stop <job_id>
  • Restart a job: torch-submit job restart <job_id>

Log Management

  • Tail logs: torch-submit job logs <job_id>

Optuna

The Optuna executor requires a database connection, which you can set up with torch-submit db create. This creates a new database called torch_submit within the specified connection. The database must be accessible to all machines in the cluster. The study name and storage information are available to the job through the "OPTUNA_STUDY_NAME" and "OPTUNA_STORAGE" environment variables.
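Inside a job, those two environment variables can be handed to Optuna roughly as follows. This is a sketch under assumptions: it treats `OPTUNA_STORAGE` as a storage URL that Optuna accepts directly, uses `create_study(..., load_if_exists=True)` so that it works whether or not the study already exists, and the toy `objective` and the helper names are placeholders.

```python
import os


def optuna_settings_from_env() -> dict:
    """Collect the study name and storage exported to the job."""
    return {
        "study_name": os.environ["OPTUNA_STUDY_NAME"],
        "storage": os.environ["OPTUNA_STORAGE"],
    }


def objective(trial) -> float:
    """Toy objective (placeholder): minimized at x = 2."""
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2


def run_trials(n_trials: int = 10) -> dict:
    """Attach to the shared study and run a few trials in this process."""
    import optuna  # imported here so the helpers above work without Optuna

    settings = optuna_settings_from_env()
    # Every worker attaches to the same study through the shared database.
    study = optuna.create_study(load_if_exists=True, **settings)
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Because every worker points at the same storage, trials run in parallel across machines and contribute to a single study.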

Configuration

Torch Submit stores cluster configurations in ~/.cache/torch-submit/config.yaml. You can manually edit this file if needed, but it's recommended to use the CLI commands for cluster management.

Requirements

  • Python 3.7+
  • PyTorch (for your actual jobs)
  • SSH access to all machines in your cluster

Contributing

We welcome contributions! Please see our Contributing Guide for more details.

License

Torch Submit is released under the MIT License. See the LICENSE file for more details.

Support

If you encounter any issues or have questions, please file an issue on our GitHub Issues page.

Happy distributed training!
