A command-line tool for distributed parallel execution across multiple GPUs

OctoRun

Distributed Parallel Execution Made Simple

A powerful command-line tool for running Python scripts across multiple GPUs with intelligent task management and monitoring


Overview

OctoRun is designed to help you run computationally intensive Python scripts across multiple GPUs efficiently. It automatically manages GPU allocation, chunks your workload, handles failures with retry mechanisms, and provides comprehensive monitoring and logging.

Key Features

  • Automatic GPU Detection: Automatically detects and utilizes available GPUs
  • Intelligent Chunk Management: Divides work into chunks and distributes across GPUs
  • Failure Recovery: Automatic retry mechanism for failed chunks
  • Comprehensive Logging: Detailed logging for monitoring and debugging
  • Flexible Configuration: JSON-based configuration with CLI overrides
  • Kwargs Support: Pass custom arguments to your scripts via config or CLI
  • Memory Monitoring: Monitor GPU memory usage and thresholds
  • Lock Management: Prevent duplicate processing of chunks

Installation

Via pip

pip install octorun

From source

git clone https://github.com/HarborYuan/OctoRun.git
cd OctoRun
pip install -e .

Quick Start

1. Create Configuration

octorun save_config --script ./your_script.py

2. Run Your Script

octorun run [--config config.json]

3. Monitor GPU Usage

octorun list_gpus [--detailed]

4. View Logs

tail -f logs/session_*.log

Configuration

Basic Configuration

The configuration file (config.json) contains the following options:

{
    "script_path": "./your_script.py",
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": false,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {
        "batch_size": 32,
        "learning_rate": 0.001
    }
}

Configuration Options

| Option | Description | Default |
|---|---|---|
| script_path | Path to your Python script | - |
| gpus | GPU configuration ("auto" or list of GPU IDs) | "auto" |
| total_chunks | Number of chunks to divide work into | 128 |
| log_dir | Directory for log files | "./logs" |
| chunk_lock_dir | Directory for chunk lock files | "./logs/locks" |
| monitor_interval | Monitoring interval in seconds | 60 |
| restart_failed | Whether to restart failed processes | false |
| max_retries | Maximum retries for failed chunks | 3 |
| memory_threshold | Memory threshold percentage | 90 |
| kwargs | Custom arguments to pass to script | {} |
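The defaults listed above can be applied to a partial config.json along the following lines. This is a minimal sketch, not OctoRun's own loader: `DEFAULTS`, `apply_defaults`, and `load_config` are hypothetical names, and treating a missing `script_path` as an error is an assumption based on that option having no default.

```python
import json
from pathlib import Path

# Default values copied from the options listed above.
# script_path has no default, so it is treated as required (assumption).
DEFAULTS = {
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": False,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {},
}


def apply_defaults(user_config: dict) -> dict:
    """Return a complete config: user values win, missing keys use DEFAULTS."""
    if "script_path" not in user_config:
        raise ValueError("script_path is required")
    merged = {**DEFAULTS, **user_config}
    # Copy kwargs so later edits never mutate the shared DEFAULTS dict.
    merged["kwargs"] = dict(merged["kwargs"])
    return merged


def load_config(path: str) -> dict:
    """Read a JSON config file (hypothetical helper) and fill in defaults."""
    return apply_defaults(json.loads(Path(path).read_text()))
```

With this sketch, a file containing only `script_path` and `total_chunks` yields a full configuration in which every other option takes its documented default.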

Using Kwargs

OctoRun supports passing additional keyword arguments to your scripts through both the configuration file and the command-line interface.

Configuration File

Add kwargs to your config.json:

{
    "script_path": "./train_model.py",
    "gpus": "auto",
    "total_chunks": 128,
    "kwargs": {
        "batch_size": 64,
        "learning_rate": 0.01,
        "model_type": "transformer",
        "epochs": 10,
        "output_dir": "./results"
    }
}

Command Line Interface

Override or add kwargs via command line:

# Override config kwargs
octorun run --config config.json --kwargs '{"batch_size": 128, "learning_rate": 0.005}'

# Add new kwargs
octorun run --config config.json --kwargs '{"model_type": "bert", "max_length": 512}'

Priority

CLI kwargs > Config file kwargs

CLI kwargs override config file kwargs that share the same key; all other config kwargs are preserved.

Script Implementation
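The priority rule amounts to a shallow dictionary merge in which CLI values are applied last. The sketch below illustrates that rule only; `merge_kwargs` is a hypothetical helper, not part of OctoRun's API.

```python
import json
from typing import Optional


def merge_kwargs(config_kwargs: dict, cli_kwargs_json: Optional[str]) -> dict:
    """Merge kwargs so that CLI keys override config keys of the same name."""
    cli_kwargs = json.loads(cli_kwargs_json) if cli_kwargs_json else {}
    # Later entries win in a dict merge, so CLI values replace config values.
    return {**config_kwargs, **cli_kwargs}


config_kwargs = {"batch_size": 64, "learning_rate": 0.01, "epochs": 10}
merged = merge_kwargs(config_kwargs, '{"batch_size": 128, "max_length": 512}')
print(merged)
# {'batch_size': 128, 'learning_rate': 0.01, 'epochs': 10, 'max_length': 512}
```

Here `batch_size` is overridden, `max_length` is added, and `learning_rate` and `epochs` keep their config values.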

Your script must accept the required OctoRun arguments plus any custom kwargs:

import argparse

def main():
    parser = argparse.ArgumentParser()
    
    # Required OctoRun arguments
    parser.add_argument('--gpu_id', type=int, required=True)
    parser.add_argument('--chunk_id', type=int, required=True)
    parser.add_argument('--total_chunks', type=int, required=True)
    
    # Your custom arguments
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    parser.add_argument('--model_type', type=str, default='default')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--output_dir', type=str, default='./output')
    
    args = parser.parse_args()
    
    # Use the arguments in your script
    print(f"Processing chunk {args.chunk_id}/{args.total_chunks} on GPU {args.gpu_id}")
    print(f"Training with batch_size={args.batch_size}, lr={args.learning_rate}")
    
    # Your processing logic here
    ...

if __name__ == "__main__":
    main()
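Inside the processing logic, a script typically uses `--chunk_id` and `--total_chunks` to pick its own share of the work. The sketch below shows one common way to do that with strided slicing. It assumes chunk IDs are zero-based and run from 0 to total_chunks - 1, which this README does not state; `select_chunk` is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_chunk(items: Sequence[T], chunk_id: int, total_chunks: int) -> List[T]:
    """Return the items belonging to one chunk, using a stride of total_chunks.

    Assumes zero-based chunk IDs (an assumption, not documented behavior).
    """
    if not 0 <= chunk_id < total_chunks:
        raise ValueError(f"chunk_id {chunk_id} outside [0, {total_chunks})")
    # Item i goes to chunk i % total_chunks, so chunks differ in size by at most one.
    return list(items[chunk_id::total_chunks])


files = [f"sample_{i}.json" for i in range(10)]
print(select_chunk(files, chunk_id=1, total_chunks=4))
# ['sample_1.json', 'sample_5.json', 'sample_9.json']
```

Strided slicing keeps chunks balanced even when the number of items is not a multiple of `total_chunks`, and every item lands in exactly one chunk.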

Commands

run

Run your script with the specified configuration:

octorun run --config config.json [--kwargs '{"key": "value"}']

save_config

Generate a default configuration file:

octorun save_config [--script ./your_script.py]

list_gpus

List available GPUs:

octorun list_gpus [--detailed]

Examples

Example 1: Machine Learning Training

Config file (ml_config.json):

{
    "script_path": "./train_model.py",
    "total_chunks": 64,
    "kwargs": {
        "batch_size": 32,
        "learning_rate": 0.001,
        "model_type": "resnet50",
        "epochs": 100,
        "dataset_path": "/data/imagenet"
    }
}

Command:

octorun run --config ml_config.json --kwargs '{"batch_size": 64, "learning_rate": 0.01}'

Example 2: Data Processing

Command:

octorun run --config config.json --kwargs '{"input_dir": "/data/raw", "output_dir": "/data/processed", "compression": "gzip"}'

Monitoring and Logging

OctoRun provides comprehensive logging:

| Log Type | Location | Description |
|---|---|---|
| Session logs | logs/session_TIMESTAMP.log | Overall session information |
| Chunk logs | logs/chunk_N.log | Individual chunk processing logs |
| Lock files | logs/locks/ | Chunk completion tracking |
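Lock files of this kind are usually created with an atomic "create only if absent" operation, so that two workers cannot claim the same chunk. The sketch below shows that general technique. The file name `chunk_<id>.lock` and the function `try_claim_chunk` are hypothetical; OctoRun's actual lock file naming and contents are not documented here.

```python
import os
import tempfile


def try_claim_chunk(lock_dir: str, chunk_id: int) -> bool:
    """Atomically create a lock file; return False if the chunk is already claimed."""
    os.makedirs(lock_dir, exist_ok=True)
    # Hypothetical naming scheme, chosen only for this illustration.
    lock_path = os.path.join(lock_dir, f"chunk_{chunk_id}.lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, which makes the claim atomic.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))
    return True


with tempfile.TemporaryDirectory() as lock_dir:
    first = try_claim_chunk(lock_dir, 42)
    second = try_claim_chunk(lock_dir, 42)
    print(first, second)  # True False
```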

Real-time Monitoring

# Monitor session progress
tail -f logs/session_*.log

# Monitor specific chunk
tail -f logs/chunk_42.log

# Monitor GPU usage
watch -n 1 'octorun list_gpus --detailed'
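GPU memory can also be checked from Python by querying nvidia-smi directly. The sketch below uses nvidia-smi's `--query-gpu` CSV output and flags GPUs whose used memory exceeds a percentage threshold. It illustrates the idea only: whether OctoRun computes `memory_threshold` this way is an assumption, and the function names are hypothetical.

```python
import subprocess
from typing import Dict, List

# nvidia-smi query that prints one "index, used, total" line per GPU (values in MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_usage(csv_text: str) -> Dict[int, float]:
    """Map GPU index to used memory as a percentage of total memory."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = 100.0 * float(used) / float(total)
    return usage


def gpus_over_threshold(csv_text: str, threshold: float = 90.0) -> List[int]:
    """Return the indices of GPUs whose memory usage exceeds the threshold."""
    return [gpu for gpu, pct in parse_memory_usage(csv_text).items() if pct > threshold]


def query_nvidia_smi() -> str:
    """Run nvidia-smi and return its raw CSV output (requires nvidia-smi in PATH)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


# Sample output for two 24 GiB GPUs, used here instead of a live query.
sample = "0, 23000, 24576\n1, 1024, 24576\n"
print(gpus_over_threshold(sample))  # [0]
```

On a machine with GPUs, `gpus_over_threshold(query_nvidia_smi())` applies the same check to live data.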

Error Handling

  • Automatic retry mechanism for failed chunks
  • Configurable maximum retry attempts (max_retries)
  • GPU memory threshold monitoring (memory_threshold)
  • Comprehensive error logging

Together, these mechanisms help jobs run to completion despite transient failures.
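The retry behavior can be pictured as a loop around a subprocess call. The following is a minimal sketch of that pattern, not OctoRun's implementation: `run_with_retries` is a hypothetical name, and it assumes `max_retries` counts the retries allowed after the first failed attempt.

```python
import subprocess
import sys
from typing import List


def run_with_retries(command: List[str], max_retries: int = 3) -> bool:
    """Run a command, retrying after failures; return True once it exits with code 0.

    Assumes max_retries means retries after the first attempt (an assumption),
    so the command runs at most max_retries + 1 times.
    """
    for attempt in range(max_retries + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return True
        print(
            f"attempt {attempt + 1} failed with exit code {result.returncode}",
            file=sys.stderr,
        )
    return False
```

A caller would pass the full command for one chunk, for example the Python interpreter, the script path, and the chunk arguments, and mark the chunk as failed only when the function returns False.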

Requirements

  • Python ≥ 3.10
  • NVIDIA GPUs with CUDA support
  • nvidia-smi tool available in PATH
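These requirements can be verified before launching a job. The sketch below checks the Python version and looks for nvidia-smi in PATH using only the standard library; `check_requirements` is a hypothetical helper and is not an OctoRun command.

```python
import shutil
import sys
from typing import List


def check_requirements() -> List[str]:
    """Return a list of unmet requirements; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python >= 3.10 required, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found in PATH")
    return problems


for problem in check_requirements():
    print(f"Requirement not met: {problem}")
```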

Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request


License

This project is licensed under the MIT License.

Author

Haobo Yuan - haoboyuan@ucmerced.edu

Acknowledgements

This project relies heavily on AI tools for code generation and documentation, which helped improve productivity and code quality.


Made with ❤️ and 🤖 AI assistance

Star ⭐ this repo if you find it useful!
