A command-line tool for distributed parallel execution across multiple GPUs

🐙 OctoRun

Distributed Parallel Execution Made Simple

A powerful command-line tool for running Python scripts across multiple GPUs with intelligent task management and monitoring


📋 Overview

OctoRun is designed to help you run computationally intensive Python scripts across multiple GPUs efficiently. It automatically manages GPU allocation, chunks your workload, handles failures with retry mechanisms, and provides comprehensive monitoring and logging.

✨ Key Features

  • 🔍 Automatic GPU Detection: Automatically detects and utilizes available GPUs
  • 🧩 Intelligent Chunk Management: Divides work into chunks and distributes them across GPUs
  • 🔄 Failure Recovery: Automatic retry mechanism for failed chunks
  • 📊 Comprehensive Logging: Detailed logging for monitoring and debugging
  • ⚙️ Flexible Configuration: JSON-based configuration with CLI overrides
  • 🎯 Kwargs Support: Pass custom arguments to your scripts via config or CLI
  • 💾 Memory Monitoring: Monitor GPU memory usage and thresholds
  • 🔒 Lock Management: Prevent duplicate processing of chunks
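
To make the chunk distribution idea above concrete, here is a minimal sketch of a static round-robin plan. This is an illustration only: `assign_chunks` is a hypothetical helper, and OctoRun's actual scheduling strategy is not documented here and may well be dynamic rather than static.

```python
def assign_chunks(total_chunks: int, gpu_ids: list[int]) -> dict[int, list[int]]:
    """Hypothetical static plan: chunk i goes to GPU number i modulo the GPU count."""
    if not gpu_ids:
        raise ValueError("at least one GPU ID is required")
    plan = {gpu: [] for gpu in gpu_ids}
    for chunk_id in range(total_chunks):
        plan[gpu_ids[chunk_id % len(gpu_ids)]].append(chunk_id)
    return plan


plan = assign_chunks(total_chunks=8, gpu_ids=[0, 1, 2])
print(plan)  # {0: [0, 3, 6], 1: [1, 4, 7], 2: [2, 5]}
```

A static plan like this is easy to reason about, but a work queue that hands the next chunk to whichever GPU frees up first usually balances uneven chunk runtimes better.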

🚀 Installation

Quick Run via uv (Without Installation)

uvx octorun [run, save_config, list_gpus]

Via uv (Installation, Globally)

uv tool install octorun

Via uv (Install in Your Own Project)

uv add octorun

Via pip

pip install octorun

⚡ Quick Start

1️⃣ Create Configuration

octorun save_config --script ./your_script.py

or

octorun s --script ./your_script.py

2️⃣ Run Your Script

octorun run [--config config.json]

or

octorun r

3️⃣ Monitor GPU Usage

octorun list_gpus [--detailed]

or

octorun l -d

4️⃣ View Logs

tail -f logs/session_*.log

and

tail -f logs/chunk_*.log

⚙️ Configuration

📄 Basic Configuration

The configuration file (config.json) contains the following options:

{
    "script_path": "./your_script.py",
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": false,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {
        "batch_size": 32,
        "learning_rate": 0.001
    }
}

🔧 Configuration Options

| Option | Description | Default |
| --- | --- | --- |
| script_path | Path to your Python script | - |
| gpus | GPU configuration ("auto" or list of GPU IDs) | "auto" |
| total_chunks | Number of chunks to divide work into | 128 |
| log_dir | Directory for log files | "./logs" |
| chunk_lock_dir | Directory for chunk lock files | "./logs/locks" |
| monitor_interval | Monitoring interval in seconds | 60 |
| restart_failed | Whether to restart failed processes | false |
| max_retries | Maximum retries for failed chunks | 3 |
| memory_threshold | Memory threshold percentage | 90 |
| kwargs | Custom arguments to pass to script | {} |
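
The defaults listed above mean a config file only needs to set what differs. The sketch below shows one way such defaults could be layered under a user config; `resolve_config` and `DEFAULTS` are hypothetical names for illustration, not part of OctoRun's API.

```python
import json

# Default values copied from the options listed above.
DEFAULTS = {
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": False,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {},
}


def resolve_config(text: str) -> dict:
    """Parse a JSON config string and fill in every option the user left unset."""
    user = json.loads(text)
    if "script_path" not in user:
        raise KeyError("script_path has no default and must be set")
    merged = {**DEFAULTS, **user}
    merged["kwargs"] = dict(merged["kwargs"])  # avoid sharing the default dict
    return merged


config = resolve_config('{"script_path": "./your_script.py", "total_chunks": 64}')
print(config["total_chunks"], config["max_retries"])  # 64 3
```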

🎯 Using Kwargs

OctoRun supports passing additional keyword arguments to your scripts through both the configuration file and command line interface.

📋 Configuration File

Add kwargs to your config.json:

{
    "script_path": "./train_model.py",
    "gpus": "auto",
    "total_chunks": 128,
    "kwargs": {
        "batch_size": 64,
        "learning_rate": 0.01,
        "model_type": "transformer",
        "epochs": 10,
        "output_dir": "./results"
    }
}
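
Because the script side reads these values with argparse (flags such as `--batch_size`), a reasonable mental model is that each kwargs entry becomes a `--key value` pair on the script's command line. The sketch below illustrates that mapping under this assumption; `kwargs_to_argv` is a hypothetical helper and OctoRun's exact serialization is not confirmed here.

```python
import argparse


def kwargs_to_argv(kwargs: dict) -> list[str]:
    """Hypothetical mapping: {"batch_size": 64} -> ["--batch_size", "64"]."""
    argv = []
    for key, value in kwargs.items():
        argv.extend([f"--{key}", str(value)])
    return argv


# A parser shaped like the one a user script would define.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--learning_rate", type=float, default=0.001)
parser.add_argument("--model_type", type=str, default="default")

argv = kwargs_to_argv({"batch_size": 64, "learning_rate": 0.01, "model_type": "transformer"})
args = parser.parse_args(argv)
print(args.batch_size, args.learning_rate, args.model_type)  # 64 0.01 transformer
```

Under this model, scalar values (numbers and strings) round-trip cleanly. Booleans, lists, and nested objects would need a convention of their own, so it is safest to keep kwargs flat and scalar.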

🖥️ Command Line Interface

Override or add kwargs via command line:

# Override config kwargs
octorun run --config config.json --kwargs '{"batch_size": 128, "learning_rate": 0.005}'

# Add new kwargs
octorun run --config config.json --kwargs '{"model_type": "bert", "max_length": 512}'

🎯 Priority

CLI kwargs > Config file kwargs

CLI kwargs override config file kwargs for the same keys; config kwargs that the CLI does not mention are preserved.
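
The priority rule above amounts to a shallow dictionary merge in which the CLI values are applied last. A minimal sketch, where `merge_kwargs` is a hypothetical helper used only to illustrate the rule:

```python
import json
from typing import Optional


def merge_kwargs(config_kwargs: dict, cli_json: Optional[str]) -> dict:
    """Return config kwargs updated with the JSON object passed via --kwargs."""
    cli_kwargs = json.loads(cli_json) if cli_json else {}
    # Later entries win, so CLI keys override config keys of the same name.
    return {**config_kwargs, **cli_kwargs}


config_kwargs = {"batch_size": 64, "learning_rate": 0.01, "epochs": 10}
merged = merge_kwargs(config_kwargs, '{"batch_size": 128, "max_length": 512}')
print(merged)
# {'batch_size': 128, 'learning_rate': 0.01, 'epochs': 10, 'max_length': 512}
```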

🔧 Script Implementation

Your script must accept the required OctoRun arguments plus any custom kwargs:

import argparse

def main():
    parser = argparse.ArgumentParser()

    # 🔧 Required OctoRun arguments
    parser.add_argument('--gpu_id', type=int, required=True)
    parser.add_argument('--chunk_id', type=int, required=True)
    parser.add_argument('--total_chunks', type=int, required=True)

    # 🎯 Your custom arguments (optional)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    parser.add_argument('--model_type', type=str, default='default')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--output_dir', type=str, default='./output')

    args = parser.parse_args()

    # 🎮 Device handling - set the GPU device
    # This is an example using PyTorch
    import torch
    if torch.cuda.is_available():
        torch.cuda.set_device(args.gpu_id)
        print(f"🎮 Using GPU {args.gpu_id}: {torch.cuda.get_device_name(args.gpu_id)}")
    else:
        print("⚠️  CUDA not available, using CPU")

    # ✨ Use the arguments in your script
    print(f"🚀 Processing chunk {args.chunk_id}/{args.total_chunks} on GPU {args.gpu_id}")
    print(f"🎯 Training with batch_size={args.batch_size}, lr={args.learning_rate}")

    # Your processing logic here
    ...

if __name__ == "__main__":
    main()
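
The template above leaves open how a script turns `--chunk_id` and `--total_chunks` into an actual share of the work. One common pattern is a strided slice over the full work list, sketched below. It assumes chunk IDs are zero-based (0 to total_chunks - 1), which is not confirmed here; `items_for_chunk` is a hypothetical helper.

```python
def items_for_chunk(items: list, chunk_id: int, total_chunks: int) -> list:
    """Return every total_chunks-th item, starting at index chunk_id."""
    if not 0 <= chunk_id < total_chunks:
        raise ValueError("chunk_id must be in [0, total_chunks)")
    return items[chunk_id::total_chunks]


files = [f"file_{i}.txt" for i in range(10)]
print(items_for_chunk(files, chunk_id=1, total_chunks=4))
# ['file_1.txt', 'file_5.txt', 'file_9.txt']
```

With this scheme every item belongs to exactly one chunk, and chunk sizes differ by at most one item, provided every process builds the work list in the same order (for example, by sorting file names first).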

🎮 Commands

🚀 run (r)

Run your script with the specified configuration:

octorun run --config config.json [--kwargs '{"key": "value"}']

💾 save_config (s)

Generate a default configuration file:

octorun save_config [--script ./your_script.py]

🔍 list_gpus (l)

List available GPUs:

octorun list_gpus [--detailed]
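
If you want similar information inside your own tooling, the `nvidia-smi` query interface can be called directly. The sketch below is not OctoRun's implementation; it only shows how per-GPU memory usage could be read and compared against a percentage such as `memory_threshold`, assuming that threshold refers to used memory. The sample text is made up for illustration.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'index, name, used_MiB, total_MiB' lines into dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
            "used_pct": 100.0 * int(used) / int(total),
        })
    return gpus


def query_gpus() -> list[dict]:
    """Run nvidia-smi (must be in PATH) and parse its output."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Made-up sample output, so the parsing logic can be tried without a GPU.
sample = "0, NVIDIA A100, 1024, 40960\n1, NVIDIA A100, 38912, 40960\n"
busy = [gpu["index"] for gpu in parse_gpu_csv(sample) if gpu["used_pct"] > 90]
print(busy)  # [1]
```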

📚 Examples

🤖 Example 1: Machine Learning Training

Config file (ml_config.json):

{
    "script_path": "./train_model.py",
    "total_chunks": 64,
    "kwargs": {
        "batch_size": 32,
        "learning_rate": 0.001,
        "model_type": "resnet50",
        "epochs": 100,
        "dataset_path": "/data/imagenet"
    }
}

Command:

octorun run --config ml_config.json --kwargs '{"batch_size": 64, "learning_rate": 0.01}'

📊 Example 2: Data Processing

octorun run --config config.json --kwargs '{"input_dir": "/data/raw", "output_dir": "/data/processed", "compression": "gzip"}'

📊 Monitoring and Logging

OctoRun provides comprehensive logging:

| Log Type | Location | Description |
| --- | --- | --- |
| 📋 Session logs | logs/session_TIMESTAMP.log | Overall session information |
| 🧩 Chunk logs | logs/chunk_N.log | Individual chunk processing logs |
| 🔒 Lock files | logs/locks/ | Chunk completion tracking |
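
Lock files prevent two workers from processing the same chunk. The usual technique is to create the file atomically and treat "already exists" as "someone else has it". The sketch below illustrates that technique only; the file name `chunk_<id>.lock` and the helper `try_claim_chunk` are hypothetical, and OctoRun's real lock format may differ.

```python
import os


def try_claim_chunk(lock_dir: str, chunk_id: int) -> bool:
    """Atomically create a lock file; return False if the chunk is already claimed."""
    os.makedirs(lock_dir, exist_ok=True)
    lock_path = os.path.join(lock_dir, f"chunk_{chunk_id}.lock")  # hypothetical name
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one caller wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True
```

Calling `try_claim_chunk("./logs/locks", 42)` twice would return `True` the first time and `False` the second, because the second call finds the file already present.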

📊 Real-time Monitoring

# Monitor session progress
tail -f logs/session_*.log

# Monitor specific chunk
tail -f logs/chunk_42.log

# Monitor GPU usage
watch -n 1 'octorun list_gpus --detailed'
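
For a quick progress overview, the chunk log naming pattern shown above (`chunk_N.log`) can also be scanned from Python. The sketch below lists which chunk IDs have produced a log so far; it assumes zero-based chunk IDs, and both helper names are hypothetical. Note that a log file only shows that a chunk was started, not that it finished.

```python
import re
from pathlib import Path

CHUNK_LOG = re.compile(r"^chunk_(\d+)\.log$")


def chunks_with_logs(log_dir: str) -> list[int]:
    """Return the sorted chunk IDs that have a chunk_N.log file in log_dir."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    ids = []
    for path in root.glob("chunk_*.log"):
        match = CHUNK_LOG.match(path.name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def chunks_without_logs(log_dir: str, total_chunks: int) -> list[int]:
    """Return chunk IDs in [0, total_chunks) that have no log file yet."""
    seen = set(chunks_with_logs(log_dir))
    return [chunk_id for chunk_id in range(total_chunks) if chunk_id not in seen]


print(chunks_without_logs("./logs", total_chunks=128))
```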

🛠️ Error Handling

  • 🔄 Automatic retry mechanism for failed chunks
  • 📊 Configurable maximum retry attempts
  • 💾 Memory threshold monitoring
  • 📝 Comprehensive error logging

Together, these mechanisms help jobs run to completion despite transient chunk failures.
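
The retry behavior can be pictured as a bounded loop around each chunk. The sketch below is a generic illustration, not OctoRun's code: `run_with_retries` is a hypothetical helper, and it assumes `max_retries` counts the extra attempts allowed after the first failure.

```python
from typing import Callable


def run_with_retries(task: Callable[[], None], max_retries: int = 3) -> int:
    """Run task until it succeeds; return the number of attempts used.

    Raises RuntimeError once the first attempt plus max_retries retries all fail.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return attempts
        except Exception as exc:
            if attempts > max_retries:
                raise RuntimeError(f"gave up after {attempts} attempts") from exc


calls = {"n": 0}


def flaky() -> None:
    """Simulated chunk that fails twice and then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated chunk failure")


print(run_with_retries(flaky, max_retries=3))  # 3
```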

📋 Requirements

  • 🐍 Python ≥ 3.10
  • 🎮 NVIDIA GPUs with CUDA support
  • 🔧 nvidia-smi tool available in PATH
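
The requirements above can be checked up front with a few lines of standard-library Python. This is a convenience sketch (the helper name `check_requirements` is hypothetical); it verifies the interpreter version and that `nvidia-smi` is discoverable, but not that a usable GPU is actually present.

```python
import shutil
import sys


def check_requirements(min_python: tuple = (3, 10)) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or newer is required")
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi was not found in PATH")
    return problems


for problem in check_requirements():
    print(problem)
```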

🤝 Contributing

We welcome contributions! Here's how to get started:

  1. 🍴 Fork the repository
  2. 🌿 Create a feature branch
  3. ✨ Make your changes
  4. 🧪 Add tests
  5. 📤 Submit a pull request

📄 License

This project is licensed under the MIT License.

👨‍💻 Author

Haobo Yuan - haoboyuan@ucmerced.edu

🙏 Acknowledgements

This project relies heavily on AI tools for code generation and documentation, which improved productivity and code quality.


Made with ❤️ and 🤖 AI assistance

Star ⭐ this repo if you find it useful!
