A command-line tool for distributed parallel execution across multiple GPUs

🐙 OctoRun

Distributed Parallel Execution Made Simple

A powerful command-line tool for running Python scripts across multiple GPUs with intelligent task management and monitoring


📋 Overview

OctoRun runs computationally intensive Python scripts across multiple GPUs. It manages GPU allocation, splits your workload into chunks, retries failed chunks, and provides monitoring and logging.

✨ Key Features

  • 🔍 Automatic GPU Detection: Detects and uses the available GPUs
  • 🧩 Intelligent Chunk Management: Divides work into chunks and distributes them across GPUs
  • 🔄 Failure Recovery: Automatic retry mechanism for failed chunks
  • 📊 Comprehensive Logging: Detailed logging for monitoring and debugging
  • ⚙️ Flexible Configuration: JSON-based configuration with CLI overrides
  • 🎯 Kwargs Support: Pass custom arguments to your scripts via config or CLI
  • 💾 Memory Monitoring: Monitor GPU memory usage and thresholds
  • 🔒 Lock Management: Prevent duplicate processing of chunks
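To make the lock management idea concrete, here is a minimal sketch of one common way to keep two workers from processing the same chunk: atomic lock-file creation. This is an illustration only, not OctoRun's actual implementation. The function name `try_acquire_chunk_lock` and the `chunk_<id>.lock` file naming are hypothetical.

```python
import os
import tempfile


def try_acquire_chunk_lock(lock_dir, chunk_id):
    """Return True if this process claimed the chunk, False if it was already claimed.

    Hypothetical helper: the lock file naming scheme is an assumption for
    illustration, not OctoRun's on-disk format.
    """
    os.makedirs(lock_dir, exist_ok=True)
    lock_path = os.path.join(lock_dir, f"chunk_{chunk_id}.lock")
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists,
        # so only one process can succeed for a given chunk.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))  # record the owner for debugging
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lock_dir:
        print(try_acquire_chunk_lock(lock_dir, 0))  # True: first claim succeeds
        print(try_acquire_chunk_lock(lock_dir, 0))  # False: chunk already claimed
```

A stale lock left behind by a crashed worker would block its chunk in this sketch, so a real system needs some cleanup policy on top of it.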

🚀 Installation

You can install OctoRun using pip or uv.

Via pip

pip install octorun

Via uv

# Install globally
uv tool install octorun

# Install in your project
uv add octorun

⚡ Quick Start

  1. Create Configuration:

    octorun save_config --script ./your_script.py
    
  2. Run Your Script:

    octorun run
    
  3. Monitor GPUs:

    octorun list_gpus -d
    

🎮 Commands

run (r)

Run your script with the specified configuration.

octorun run --config config.json [--kwargs '{"key": "value"}']

save_config (s)

Generate a default configuration file.

octorun save_config --script ./your_script.py

list_gpus (l)

List available GPUs and their current usage.

octorun list_gpus [--detailed]

The --detailed flag (-d) adds memory usage, temperature, and running processes to the GPU stats.
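If you want similar per-GPU memory numbers inside your own tooling, the sketch below queries `nvidia-smi` and parses its CSV output. It is not necessarily how OctoRun collects its stats. The helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical, and `query_gpu_memory` assumes `nvidia-smi` is on your PATH.

```python
import subprocess

# Standard nvidia-smi query flags; values are reported in MiB with nounits.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines into a list of dicts (hypothetical helper)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({
            "index": index,
            "used_mib": used,
            "total_mib": total,
            "used_pct": 100 * used / total,
        })
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return parsed rows; raises if the command fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    sample = "0, 1024, 8192\n1, 4096, 8192\n"  # made-up sample output
    for row in parse_gpu_memory(sample):
        print(row["index"], row["used_pct"])
```

Comparing `used_pct` against a limit such as 90 gives a simple memory threshold check.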

benchmark (b)

Run a benchmark to determine the optimal number of parallel processes for your GPUs.

octorun benchmark

This command runs a series of tests to help you configure the gpus parameter in your config.json for the best performance.

⚙️ Configuration

OctoRun uses a config.json file for configuration. You can generate a default one with octorun save_config.

| Option | Description | Default |
|---|---|---|
| script_path | Path to your Python script | - |
| gpus | "auto" or list of GPU IDs | "auto" |
| total_chunks | Number of chunks to divide work into | 128 |
| log_dir | Directory for log files | "./logs" |
| chunk_lock_dir | Directory for chunk lock files | "./logs/locks" |
| monitor_interval | Monitoring interval in seconds | 60 |
| restart_failed | Whether to restart failed processes | false |
| max_retries | Maximum retries for failed chunks | 3 |
| memory_threshold | Memory threshold percentage | 90 |
| kwargs | Custom arguments to pass to your script | {} |
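As a rough illustration of the options above, the following sketch builds a config with the documented defaults and prints it as JSON. It assumes the file is a flat JSON object keyed by option name; the file that `octorun save_config` actually generates may be laid out differently, so treat the generated file as authoritative.

```python
import json

# Defaults taken from the options listed above. The flat layout is an assumption.
default_config = {
    "script_path": "./your_script.py",  # no default; you must supply a path
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": False,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {},
}

if __name__ == "__main__":
    # Redirect this output to config.json if you want to edit it by hand.
    print(json.dumps(default_config, indent=2))
```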

🎯 Using Kwargs

You can pass custom arguments to your script via the kwargs object in your config.json or directly through the CLI.

CLI kwargs will override config file kwargs.

octorun run --kwargs '{"batch_size": 128, "learning_rate": 0.005}'
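The precedence rule can be sketched as a dictionary merge. The sketch assumes the override is applied per key, so config keys that are not named on the CLI survive, and that the merged kwargs reach your script as `--key value` arguments, as the argparse example in Script Implementation suggests. Both are assumptions, not confirmed OctoRun behavior, and `merge_kwargs` and `kwargs_to_argv` are hypothetical helpers.

```python
import json


def merge_kwargs(config_kwargs, cli_kwargs_json=None):
    """Merge config kwargs with CLI kwargs; CLI values win on conflicts."""
    cli_kwargs = json.loads(cli_kwargs_json) if cli_kwargs_json else {}
    return {**config_kwargs, **cli_kwargs}


def kwargs_to_argv(kwargs):
    """Turn {'batch_size': 128} into ['--batch_size', '128'] (assumed convention)."""
    argv = []
    for key, value in kwargs.items():
        argv += [f"--{key}", str(value)]
    return argv


if __name__ == "__main__":
    config_kwargs = {"batch_size": 32, "epochs": 1}
    cli_json = '{"batch_size": 128, "learning_rate": 0.005}'
    merged = merge_kwargs(config_kwargs, cli_json)
    print(merged)  # {'batch_size': 128, 'epochs': 1, 'learning_rate': 0.005}
    print(kwargs_to_argv(merged))
```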

🔧 Script Implementation

Your script must accept the following arguments:

  • --gpu_id: GPU device ID (int)
  • --chunk_id: Current chunk number (int)
  • --total_chunks: Total number of chunks (int)

Here is an example of how to structure your script:

import argparse
import torch

def main():
    parser = argparse.ArgumentParser()
    
    # Required OctoRun arguments
    parser.add_argument('--gpu_id', type=int, required=True)
    parser.add_argument('--chunk_id', type=int, required=True)
    parser.add_argument('--total_chunks', type=int, required=True)
    
    # Your custom arguments
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    parser.add_argument('--model_type', type=str, default='default')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--output_dir', type=str, default='./output')
    
    args = parser.parse_args()
    
    # Set the GPU device
    if torch.cuda.is_available():
        torch.cuda.set_device(args.gpu_id)
        print(f"Using GPU {args.gpu_id}")
    
    print(f"Processing chunk {args.chunk_id}/{args.total_chunks}")
    
    # Your logic here

if __name__ == "__main__":
    main()
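Inside "Your logic here", a script typically uses `chunk_id` and `total_chunks` to pick its share of the work. The sketch below shows one way to do that, assuming chunk IDs are 0-based (check this against your logs; with 1-based IDs, subtract one first). `chunk_bounds` is a hypothetical helper, not part of OctoRun.

```python
def chunk_bounds(num_items, chunk_id, total_chunks):
    """Return the half-open range [start, end) of items owned by one chunk.

    Assumes 0-based chunk ids. Integer arithmetic keeps chunk sizes within
    one item of each other and covers every item exactly once.
    """
    if not 0 <= chunk_id < total_chunks:
        raise ValueError(f"chunk_id {chunk_id} outside [0, {total_chunks})")
    start = num_items * chunk_id // total_chunks
    end = num_items * (chunk_id + 1) // total_chunks
    return start, end


if __name__ == "__main__":
    items = list(range(10))  # stand-in for your dataset
    for chunk_id in range(4):
        start, end = chunk_bounds(len(items), chunk_id, total_chunks=4)
        print(chunk_id, items[start:end])
```

In a real script you would call `chunk_bounds(len(dataset), args.chunk_id, args.total_chunks)` once and process only that slice.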

🤝 Contributing

Contributions are welcome! Please fork the repository, create a feature branch, and submit a pull request.

📄 License

This project is licensed under the MIT License.
