
A command-line tool for distributed parallel execution across multiple GPUs

Project description

🐙 OctoRun

Distributed Parallel Execution Made Simple

A powerful command-line tool for running Python scripts across multiple GPUs with intelligent task management and monitoring



📋 Overview

OctoRun is designed to help you run computationally intensive Python scripts across multiple GPUs efficiently. It automatically manages GPU allocation, chunks your workload, handles failures with retry mechanisms, and provides comprehensive monitoring and logging.
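The workflow described above (split the work into chunks, hand each chunk to a free GPU, retry failures) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OctoRun's implementation: dispatch_chunks and run_chunk are hypothetical names, each GPU is assumed to run one chunk at a time, and run_chunk is assumed to return True on success. In a real tool, run_chunk would launch your script as a subprocess.

```python
# Hypothetical sketch of chunk dispatch with retries; not OctoRun's actual code.
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, List


def dispatch_chunks(
    gpu_ids: List[int],
    total_chunks: int,
    run_chunk: Callable[[int, int], bool],  # hypothetical: (gpu_id, chunk_id) -> success
    max_retries: int = 3,
) -> Dict[str, List[int]]:
    pending = deque(range(total_chunks))       # chunks waiting for a GPU
    retries = {c: 0 for c in range(total_chunks)}
    free_gpus = deque(gpu_ids)                 # GPUs with no chunk running
    running = {}                               # future -> (gpu_id, chunk_id)
    done: List[int] = []
    failed: List[int] = []

    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        while pending or running:
            # Hand a queued chunk to every idle GPU.
            while pending and free_gpus:
                gpu, chunk = free_gpus.popleft(), pending.popleft()
                running[pool.submit(run_chunk, gpu, chunk)] = (gpu, chunk)
            # Block until at least one chunk finishes, then recycle its GPU.
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in finished:
                gpu, chunk = running.pop(fut)
                free_gpus.append(gpu)
                try:
                    ok = bool(fut.result())
                except Exception:
                    ok = False  # a crash counts as a failure
                if ok:
                    done.append(chunk)
                elif retries[chunk] < max_retries:
                    retries[chunk] += 1
                    pending.append(chunk)  # re-queue for another attempt
                else:
                    failed.append(chunk)   # retry budget exhausted
    return {"done": sorted(done), "failed": sorted(failed)}
```

To allow several processes per GPU (the kind of setting the benchmark command is meant to help you choose), a sketch like this could simply list a GPU ID more than once in gpu_ids.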

✨ Key Features

  • 🔍 Automatic GPU Detection: Automatically detects and utilizes available GPUs
  • 🧩 Intelligent Chunk Management: Divides work into chunks and distributes across GPUs
  • 🔄 Failure Recovery: Automatic retry mechanism for failed chunks
  • 📊 Comprehensive Logging: Detailed logging for monitoring and debugging
  • ⚙️ Flexible Configuration: JSON-based configuration with CLI overrides
  • 🎯 Kwargs Support: Pass custom arguments to your scripts via config or CLI
  • 💾 Memory Monitoring: Monitor GPU memory usage and thresholds
  • 🔒 Lock Management: Prevent duplicate processing of chunks
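One common way to prevent duplicate processing is a lock file per chunk, created atomically so that only one process can claim it. The sketch below illustrates that general technique; it is an assumption, not OctoRun's documented mechanism, and try_lock_chunk, release_chunk, and the chunk_N.lock naming are hypothetical.

```python
# Hypothetical sketch of per-chunk lock files; not OctoRun's actual code.
import os
from pathlib import Path


def _lock_path(lock_dir: str, chunk_id: int) -> str:
    # Hypothetical naming scheme for lock files.
    return os.path.join(lock_dir, f"chunk_{chunk_id}.lock")


def try_lock_chunk(lock_dir: str, chunk_id: int) -> bool:
    """Atomically claim a chunk; return False if it is already locked."""
    Path(lock_dir).mkdir(parents=True, exist_ok=True)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one caller wins.
        fd = os.open(_lock_path(lock_dir, chunk_id), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owning process for debugging
    return True


def release_chunk(lock_dir: str, chunk_id: int) -> None:
    """Delete the lock so the chunk can be claimed again, e.g. for a retry."""
    try:
        os.remove(_lock_path(lock_dir, chunk_id))
    except FileNotFoundError:
        pass
```

A lock left behind by a crashed process blocks its chunk until the file is removed, so a real tool also needs some policy for stale locks.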

🚀 Installation

You can install OctoRun using pip or uv.

Via pip

pip install octorun

Via uv

# Install globally
uv tool install octorun

# Install in your project
uv add octorun

Optional extras

  • Benchmark tooling: pip install "octorun[benchmark]" (installs PyTorch with CUDA support)

⚡ Quick Start

  1. Create Configuration:

    octorun save_config --script ./your_script.py
    
  2. Run Your Script:

    octorun run
    
  3. Monitor GPUs:

    octorun list_gpus -d
    

🎮 Commands

run (r)

Run your script with the specified configuration.

octorun run --config config.json [--kwargs '{"key": "value"}']

save_config (s)

Generate a default configuration file.

octorun save_config --script ./your_script.py

list_gpus (l)

List available GPUs and their current usage.

octorun list_gpus [--detailed]

The --detailed (-d) flag adds more comprehensive GPU stats, including memory usage, temperature, and running processes.
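For a rough idea of how per-GPU stats and a memory threshold can be gathered, the sketch below queries nvidia-smi in CSV mode and keeps only GPUs whose memory usage is under a percentage. It is not how OctoRun implements list_gpus or memory_threshold: the helper names are hypothetical, and reading memory_threshold as "skip GPUs above this percentage of used memory" is an assumption.

```python
# Hypothetical sketch: query GPUs via nvidia-smi and apply a memory threshold.
import subprocess
from dataclasses import dataclass
from typing import List

QUERY = "index,memory.used,memory.total,temperature.gpu"


@dataclass
class GpuInfo:
    index: int
    mem_used_mib: int
    mem_total_mib: int
    temp_c: int

    @property
    def mem_percent(self) -> float:
        return 100.0 * self.mem_used_mib / self.mem_total_mib


def parse_gpu_csv(text: str) -> List[GpuInfo]:
    """Parse '--format=csv,noheader,nounits' output, one GPU per line."""
    gpus = []
    for line in text.strip().splitlines():
        idx, used, total, temp = (int(v) for v in line.split(","))
        gpus.append(GpuInfo(idx, used, total, temp))
    return gpus


def query_gpus() -> List[GpuInfo]:
    """Run nvidia-smi (must be on PATH) and parse its output."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


def usable_gpus(gpus: List[GpuInfo], memory_threshold: float = 90) -> List[int]:
    """Return IDs of GPUs whose memory usage is below the threshold (percent)."""
    return [g.index for g in gpus if g.mem_percent < memory_threshold]
```

Some GPUs report fields such as temperature as [N/A], which this minimal parser does not handle.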

benchmark (b)

Run a benchmark to determine the optimal number of parallel processes for your GPUs.

octorun benchmark

This command runs a series of tests to help you choose the gpus setting in your config.json for the best performance. It requires the optional benchmark extra (pip install "octorun[benchmark]"), which provides PyTorch.

⚙️ Configuration

OctoRun uses a config.json file for configuration. You can generate a default one with octorun save_config.

| Option | Description | Default |
|---|---|---|
| script_path | Path to your Python script | (none) |
| gpus | "auto" or a list of GPU IDs | "auto" |
| total_chunks | Number of chunks to divide the work into | 128 |
| log_dir | Directory for log files | "./logs" |
| chunk_lock_dir | Directory for chunk lock files | "./logs/locks" |
| monitor_interval | Monitoring interval (seconds) | 60 |
| restart_failed | Whether to restart failed processes | false |
| max_retries | Maximum retries for failed chunks | 3 |
| memory_threshold | Memory threshold (percent) | 90 |
| kwargs | Custom arguments to pass to your script | {} |
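To make the defaults concrete, here is a minimal sketch of loading a config.json and filling in any missing keys with the defaults from the configuration table above. The key names and default values come from that table; load_config, the merge strategy, and the check that script_path is present are assumptions for illustration, and OctoRun's own loader may behave differently.

```python
# Hypothetical sketch: load config.json and fill in documented defaults.
import copy
import json
from typing import Any, Dict

# Default values from the configuration table.
DEFAULTS: Dict[str, Any] = {
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": False,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {},
}


def load_config(path: str) -> Dict[str, Any]:
    with open(path) as f:
        user = json.load(f)
    if "script_path" not in user:
        raise ValueError("config must set script_path")  # assumed to be required
    merged = copy.deepcopy(DEFAULTS)  # deep copy so the shared kwargs dict is never mutated
    merged.update(user)               # values from the file win over defaults
    return merged
```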

🎯 Using Kwargs

You can pass custom arguments to your script via the kwargs object in your config.json or directly through the CLI.

CLI kwargs will override config file kwargs.

octorun run --kwargs '{"batch_size": 128, "learning_rate": 0.005}'
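The override rule and the hand-off to your script can be sketched as follows. This assumes, without confirmation from the docs, that CLI keys replace config keys one by one and that each kwarg reaches your script as a --key value pair, which matches the --batch_size and --learning_rate flags in the example script in the next section. merge_kwargs and kwargs_to_argv are hypothetical helpers.

```python
# Hypothetical sketch: merge config and CLI kwargs, then render them as script flags.
import json
from typing import Any, Dict, List, Optional


def merge_kwargs(config_kwargs: Dict[str, Any], cli_json: Optional[str]) -> Dict[str, Any]:
    """Start from the config kwargs; CLI values replace them key by key (assumed)."""
    merged = dict(config_kwargs)
    if cli_json:
        merged.update(json.loads(cli_json))
    return merged


def kwargs_to_argv(kwargs: Dict[str, Any]) -> List[str]:
    """Render {"batch_size": 128} as ["--batch_size", "128"] (assumed convention)."""
    argv: List[str] = []
    for key, value in kwargs.items():
        argv += [f"--{key}", str(value)]
    return argv
```

Because every value is passed through str(), the receiving script should declare a type for each flag in argparse, as the example script does.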

🔧 Script Implementation

Your script must accept the following arguments:

  • --gpu_id: GPU device ID (int)
  • --chunk_id: Current chunk number (int)
  • --total_chunks: Total number of chunks (int)

Here is an example of how to structure your script:

```python
import argparse
import torch

def main():
    parser = argparse.ArgumentParser()

    # Required OctoRun arguments
    parser.add_argument('--gpu_id', type=int, required=True)
    parser.add_argument('--chunk_id', type=int, required=True)
    parser.add_argument('--total_chunks', type=int, required=True)

    # Your custom arguments
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    parser.add_argument('--model_type', type=str, default='default')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--output_dir', type=str, default='./output')

    args = parser.parse_args()

    # Set the GPU device
    if torch.cuda.is_available():
        torch.cuda.set_device(args.gpu_id)
        print(f"Using GPU {args.gpu_id}")

    print(f"Processing chunk {args.chunk_id}/{args.total_chunks}")

    # Your logic here

if __name__ == "__main__":
    main()
```
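Inside "Your logic here", a script typically uses chunk_id and total_chunks to select its share of the data. The hypothetical helper below splits n_items into contiguous, nearly equal slices. It assumes chunk IDs run from 0 to total_chunks - 1; the docs do not state the numbering, so subtract 1 first if your IDs turn out to be 1-based.

```python
# Hypothetical helper: map (chunk_id, total_chunks) to a contiguous slice of the data.
from typing import Tuple


def chunk_bounds(n_items: int, chunk_id: int, total_chunks: int) -> Tuple[int, int]:
    """Return [start, end) indices for chunk_id, assuming 0-based chunk IDs.

    The first n_items % total_chunks chunks get one extra item, so chunk sizes
    differ by at most one and every item is covered exactly once.
    """
    if not 0 <= chunk_id < total_chunks:
        raise ValueError(f"chunk_id {chunk_id} out of range for {total_chunks} chunks")
    base, extra = divmod(n_items, total_chunks)
    start = chunk_id * base + min(chunk_id, extra)
    end = start + base + (1 if chunk_id < extra else 0)
    return start, end
```

For example, after parsing the arguments you might call start, end = chunk_bounds(len(items), args.chunk_id, args.total_chunks) and process items[start:end], where items stands for whatever list of work your script loads.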

🤝 Contributing

Contributions are welcome! Please fork the repository, create a feature branch, and submit a pull request.

📄 License

This project is licensed under the MIT License.



Download files


Source Distribution

octorun-0.3.0.tar.gz (44.4 kB)

Uploaded Source

Built Distribution


octorun-0.3.0-py3-none-any.whl (20.2 kB)

Uploaded Python 3

File details

Details for the file octorun-0.3.0.tar.gz.

File metadata

  • Download URL: octorun-0.3.0.tar.gz
  • Upload date:
  • Size: 44.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for octorun-0.3.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | c7c8298cb508ee2f81c06ca1e624277923ce24feb359cd49e8bd44146d29914f |
| MD5 | fce30debb1f872929d3eaefe03949143 |
| BLAKE2b-256 | beb9a5b20b0432b38032087c6352e1449ca6b5a67ac25ba80a7f5408db5577d8 |


Provenance

The following attestation bundles were made for octorun-0.3.0.tar.gz:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file octorun-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: octorun-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 20.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for octorun-0.3.0-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | f2a4f03bb50a6744369d03ae0c5394125697367ae00e0e24deec8ced236f08a9 |
| MD5 | 618b3737ca3cb094ea968df6ad1a9cc7 |
| BLAKE2b-256 | 7635198b46f8f5c0bcf0b95fafabdc125ddb5ec0ca5c363c8193d035461e59e7 |


Provenance

The following attestation bundles were made for octorun-0.3.0-py3-none-any.whl:

Publisher: publish.yml on HarborYuan/OctoRun

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
