A command-line tool for distributed parallel execution across multiple GPUs

🐙 OctoRun

Distributed Parallel Execution Made Simple

A powerful command-line tool for running Python scripts across multiple GPUs with intelligent task management and monitoring


📋 Overview

OctoRun is designed to help you run computationally intensive Python scripts across multiple GPUs efficiently. It automatically manages GPU allocation, chunks your workload, handles failures with retry mechanisms, and provides comprehensive monitoring and logging.

✨ Key Features

  • 🔍 Automatic GPU Detection: Automatically detects and utilizes available GPUs
  • 🧩 Intelligent Chunk Management: Divides work into chunks and distributes them across GPUs
  • 🔄 Failure Recovery: Automatic retry mechanism for failed chunks
  • 📊 Comprehensive Logging: Detailed logging for monitoring and debugging
  • ⚙️ Flexible Configuration: JSON-based configuration with CLI overrides
  • 🎯 Kwargs Support: Pass custom arguments to your scripts via config or CLI
  • 💾 Memory Monitoring: Monitor GPU memory usage and thresholds
  • 🔒 Lock Management: Prevent duplicate processing of chunks
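
To make the chunk-distribution idea concrete, here is a minimal sketch of one way chunk IDs could be spread over GPUs. It assumes a simple round-robin assignment and 0-based chunk IDs; `assign_chunks` is a hypothetical helper, not part of OctoRun's API, and OctoRun's actual scheduler may work differently (for example, by handing out chunks dynamically as GPUs become free).

```python
from collections import defaultdict


def assign_chunks(total_chunks: int, gpu_ids: list[int]) -> dict[int, list[int]]:
    """Hypothetical helper: distribute chunk IDs over GPUs in round-robin order."""
    if not gpu_ids:
        raise ValueError("at least one GPU ID is required")
    plan: dict[int, list[int]] = defaultdict(list)
    for chunk_id in range(total_chunks):
        # Chunk 0 goes to the first GPU, chunk 1 to the second, and so on.
        gpu = gpu_ids[chunk_id % len(gpu_ids)]
        plan[gpu].append(chunk_id)
    return dict(plan)


if __name__ == "__main__":
    print(assign_chunks(8, [0, 1, 2]))
```

With 8 chunks and 3 GPUs, the first two GPUs receive three chunks each and the last one receives two.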

🚀 Installation

Quick Run via uv (Without Installation)

uvx octorun [run, save_config, list_gpus]

Via uv (Installation, Globally)

uv tool install octorun

Via uv (Install in Your Own Project)

uv add octorun

Via pip

pip install octorun

⚡ Quick Start

1️⃣ Create Configuration

octorun save_config --script ./your_script.py

or

octorun s --script ./your_script.py

2️⃣ Run Your Script

octorun run [--config config.json]

or

octorun r

3️⃣ Monitor GPU Usage

octorun list_gpus [--detailed]

or

octorun l -d

4️⃣ View Logs

tail -f logs/session_*.log

and

tail -f logs/chunk_*.log

⚙️ Configuration

📄 Basic Configuration

The configuration file (config.json) contains the following options:

{
    "script_path": "./your_script.py",
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": false,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {
        "batch_size": 32,
        "learning_rate": 0.001
    }
}

🔧 Configuration Options

| Option | Description | Default |
|---|---|---|
| script_path | Path to your Python script | - |
| gpus | GPU configuration ("auto" or list of GPU IDs) | "auto" |
| total_chunks | Number of chunks to divide work into | 128 |
| log_dir | Directory for log files | "./logs" |
| chunk_lock_dir | Directory for chunk lock files | "./logs/locks" |
| monitor_interval | Monitoring interval in seconds | 60 |
| restart_failed | Whether to restart failed processes | false |
| max_retries | Maximum retries for failed chunks | 3 |
| memory_threshold | Memory threshold percentage | 90 |
| kwargs | Custom arguments to pass to script | {} |
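
As an illustration of how these defaults combine with a user-supplied file, the sketch below fills in missing options from the values listed above. `apply_defaults` is a hypothetical helper written for this example; it is not OctoRun's own loader, and the validation it performs is an assumption.

```python
import json

# Defaults copied from the configuration options listed above.
DEFAULTS = {
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": False,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {},
}


def apply_defaults(user_config: dict) -> dict:
    """Hypothetical helper: merge user options over the documented defaults."""
    if "script_path" not in user_config:
        # script_path has no default, so it must be provided.
        raise ValueError("script_path must be set")
    config = {**DEFAULTS, **user_config}
    config["kwargs"] = dict(config["kwargs"])  # avoid sharing the default dict
    gpus = config["gpus"]
    if gpus != "auto" and not (
        isinstance(gpus, list) and all(isinstance(g, int) for g in gpus)
    ):
        raise ValueError('gpus must be "auto" or a list of GPU IDs')
    return config


if __name__ == "__main__":
    raw = '{"script_path": "./your_script.py", "gpus": [0, 1]}'
    print(apply_defaults(json.loads(raw)))
```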

🎯 Using Kwargs

OctoRun supports passing additional keyword arguments to your scripts through both the configuration file and command line interface.

📋 Configuration File

Add kwargs to your config.json:

{
    "script_path": "./train_model.py",
    "gpus": "auto",
    "total_chunks": 128,
    "kwargs": {
        "batch_size": 64,
        "learning_rate": 0.01,
        "model_type": "transformer",
        "epochs": 10,
        "output_dir": "./results"
    }
}

🖥️ Command Line Interface

Override or add kwargs via command line:

# Override config kwargs
octorun run --config config.json --kwargs '{"batch_size": 128, "learning_rate": 0.005}'

# Add new kwargs
octorun run --config config.json --kwargs '{"model_type": "bert", "max_length": 512}'

🎯 Priority

CLI kwargs > Config file kwargs

When the same key appears in both places, the CLI value wins. Keys that appear only in the config file are preserved.
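
The priority rule corresponds to an ordinary dictionary merge in which the CLI values are applied last. The sketch below shows that behavior; `merge_kwargs` is a hypothetical name used for illustration and is not taken from OctoRun's source.

```python
import json
from typing import Optional


def merge_kwargs(config_kwargs: dict, cli_kwargs_json: Optional[str]) -> dict:
    """Hypothetical helper: CLI kwargs override config kwargs key by key."""
    cli_kwargs = json.loads(cli_kwargs_json) if cli_kwargs_json else {}
    if not isinstance(cli_kwargs, dict):
        raise ValueError("--kwargs must be a JSON object")
    # Later entries win, so CLI values replace config values for shared keys.
    return {**config_kwargs, **cli_kwargs}


if __name__ == "__main__":
    config_kwargs = {"batch_size": 64, "learning_rate": 0.01, "epochs": 10}
    print(merge_kwargs(config_kwargs, '{"batch_size": 128, "max_length": 512}'))
```

In this example `batch_size` takes the CLI value, `max_length` is added, and `learning_rate` and `epochs` keep their config-file values.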

🔧 Script Implementation

Your script must accept the required OctoRun arguments plus any custom kwargs:

import argparse

def main():
    parser = argparse.ArgumentParser()
    
    # 🔧 Required OctoRun arguments
    parser.add_argument('--gpu_id', type=int, required=True)
    parser.add_argument('--chunk_id', type=int, required=True)
    parser.add_argument('--total_chunks', type=int, required=True)

    # 🎯 Your custom arguments (optional)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    parser.add_argument('--model_type', type=str, default='default')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--output_dir', type=str, default='./output')

    args = parser.parse_args()

    # 🎮 Device handling: select the GPU assigned to this process
    # This is an example using PyTorch
    import torch
    if torch.cuda.is_available():
        torch.cuda.set_device(args.gpu_id)
        print(f"🎮 Using GPU {args.gpu_id}: {torch.cuda.get_device_name(args.gpu_id)}")
    else:
        print("⚠️  CUDA not available, using CPU")

    # ✨ Use the arguments in your script
    print(f"🚀 Processing chunk {args.chunk_id}/{args.total_chunks} on GPU {args.gpu_id}")
    print(f"🎯 Training with batch_size={args.batch_size}, lr={args.learning_rate}")
    
    # Your processing logic here
    ...

if __name__ == "__main__":
    main()
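
The template leaves open how a script turns `chunk_id` and `total_chunks` into its share of the work. One common approach, sketched below, is strided slicing over a list of work items. This assumes chunk IDs are 0-based (the source does not state this); `select_chunk` is a hypothetical helper for illustration.

```python
def select_chunk(items: list, chunk_id: int, total_chunks: int) -> list:
    """Hypothetical helper: return the items belonging to one chunk.

    Assumes 0-based chunk IDs. Item i belongs to chunk i % total_chunks,
    so every item is processed by exactly one chunk.
    """
    if not 0 <= chunk_id < total_chunks:
        raise ValueError("chunk_id must be in [0, total_chunks)")
    return items[chunk_id::total_chunks]


if __name__ == "__main__":
    files = [f"sample_{i}.txt" for i in range(10)]
    print(select_chunk(files, chunk_id=1, total_chunks=4))
```

Strided slicing keeps chunk sizes within one item of each other even when the number of items is not a multiple of `total_chunks`.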

🎮 Commands

🚀 run (r)

Run your script with the specified configuration:

octorun run --config config.json [--kwargs '{"key": "value"}']
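
Since the script template above parses `--gpu_id`, `--chunk_id`, `--total_chunks`, and one flag per kwarg, each chunk is presumably launched as a separate Python process with those flags. The sketch below builds such a command line. The `--key value` flag format and the `build_command` helper are assumptions for illustration, not OctoRun's confirmed behavior.

```python
import sys


def build_command(
    script_path: str, gpu_id: int, chunk_id: int, total_chunks: int, kwargs: dict
) -> list[str]:
    """Hypothetical helper: build the argv for one chunk process."""
    cmd = [
        sys.executable, script_path,
        "--gpu_id", str(gpu_id),
        "--chunk_id", str(chunk_id),
        "--total_chunks", str(total_chunks),
    ]
    for key, value in kwargs.items():
        # Assumed format: each kwarg becomes "--key value".
        cmd += [f"--{key}", str(value)]
    return cmd


if __name__ == "__main__":
    print(build_command("./train_model.py", 0, 5, 128, {"batch_size": 64}))
```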

💾 save_config (s)

Generate a default configuration file:

octorun save_config [--script ./your_script.py]

🔍 list_gpus (l)

List available GPUs:

octorun list_gpus [--detailed]
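
Because OctoRun requires `nvidia-smi`, GPU memory figures can also be read directly from it. The sketch below parses the CSV output of a standard `nvidia-smi` query and applies a percentage threshold like `memory_threshold`. How OctoRun itself uses the threshold is not documented here, so treating it as "skip GPUs above this usage" is an assumption, and both helper names are hypothetical.

```python
import subprocess

# Standard nvidia-smi query: one CSV row per GPU, values in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text: str) -> dict[int, float]:
    """Map GPU index to memory usage in percent."""
    usage = {}
    for line in text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = 100.0 * float(used) / float(total)
    return usage


def gpus_below_threshold(usage: dict[int, float], threshold: float = 90) -> list[int]:
    """Hypothetical policy: keep GPUs whose usage is under the threshold."""
    return sorted(gpu for gpu, percent in usage.items() if percent < threshold)


def query_memory_usage() -> dict[int, float]:
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)
```

Keeping the parsing separate from the subprocess call makes the logic easy to check on machines without a GPU.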

📚 Examples

🤖 Example 1: Machine Learning Training

Config file (ml_config.json):

{
    "script_path": "./train_model.py",
    "total_chunks": 64,
    "kwargs": {
        "batch_size": 32,
        "learning_rate": 0.001,
        "model_type": "resnet50",
        "epochs": 100,
        "dataset_path": "/data/imagenet"
    }
}

Command:

octorun run --config ml_config.json --kwargs '{"batch_size": 64, "learning_rate": 0.01}'

📊 Example 2: Data Processing

octorun run --config config.json --kwargs '{"input_dir": "/data/raw", "output_dir": "/data/processed", "compression": "gzip"}'

📊 Monitoring and Logging

OctoRun provides comprehensive logging:

| Log Type | Location | Description |
|---|---|---|
| 📋 Session logs | logs/session_TIMESTAMP.log | Overall session information |
| 🧩 Chunk logs | logs/chunk_N.log | Individual chunk processing logs |
| 🔒 Lock files | logs/locks/ | Chunk completion tracking |
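
The format of the lock files is not described here. As background on how file-based locks can prevent duplicate processing, the sketch below uses atomic exclusive file creation, a common technique. The `chunk_N.lock` file name and the `try_claim_chunk` helper are hypothetical; OctoRun's own lock handling may differ.

```python
import os
from pathlib import Path


def try_claim_chunk(lock_dir: str, chunk_id: int) -> bool:
    """Hypothetical helper: claim a chunk by creating its lock file.

    O_CREAT | O_EXCL makes creation fail if the file already exists, so
    only one process can claim a given chunk.
    """
    Path(lock_dir).mkdir(parents=True, exist_ok=True)
    lock_path = Path(lock_dir) / f"chunk_{chunk_id}.lock"  # assumed file name
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another process already holds this chunk
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True
```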

📊 Real-time Monitoring

# Monitor session progress
tail -f logs/session_*.log

# Monitor specific chunk
tail -f logs/chunk_42.log

# Monitor GPU usage
watch -n 1 'octorun list_gpus --detailed'
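
For a quick overview beyond `tail`, the chunk log naming pattern (`logs/chunk_N.log`) can be scanned to see which chunks have not produced a log yet. This is a minimal sketch that assumes chunk numbers run from 0 to `total_chunks - 1`; `chunks_without_logs` is a hypothetical helper. The presence of a log only shows that a chunk was started, not that it finished.

```python
import re
from pathlib import Path

CHUNK_LOG = re.compile(r"chunk_(\d+)\.log")


def chunks_without_logs(log_dir: str, total_chunks: int) -> list[int]:
    """Hypothetical helper: list chunk IDs that have no chunk_N.log file."""
    seen = set()
    for path in Path(log_dir).glob("chunk_*.log"):
        match = CHUNK_LOG.fullmatch(path.name)
        if match:
            seen.add(int(match.group(1)))
    return [chunk for chunk in range(total_chunks) if chunk not in seen]


if __name__ == "__main__":
    print(chunks_without_logs("./logs", 128))
```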

🛠️ Error Handling

  • 🔄 Automatic retry mechanism for failed chunks
  • 📊 Configurable maximum retry attempts
  • 💾 Memory threshold monitoring
  • 📝 Comprehensive error logging

Failed chunks are retried automatically, up to the configured max_retries, and errors are recorded in the logs.
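
The retry behavior can be pictured as a bounded loop around each chunk. The sketch below is an illustration only: it assumes `max_retries` counts retries after the first attempt (so up to `max_retries + 1` attempts in total) and that a zero exit code means success. `run_with_retries` is a hypothetical helper, not OctoRun's implementation.

```python
from typing import Callable


def run_with_retries(task: Callable[[], int], max_retries: int = 3) -> int:
    """Hypothetical helper: run task until it returns 0 or retries run out.

    Returns the number of attempts used. Raises RuntimeError if the first
    attempt and all max_retries retries fail.
    """
    attempts = 0
    while True:
        attempts += 1
        if task() == 0:
            return attempts
        if attempts > max_retries:
            raise RuntimeError(f"task failed after {attempts} attempts")


if __name__ == "__main__":
    exit_codes = iter([1, 1, 0])  # fail twice, then succeed
    print(run_with_retries(lambda: next(exit_codes)))
```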

📋 Requirements

  • 🐍 Python ≥ 3.10
  • 🎮 NVIDIA GPUs with CUDA support
  • 🔧 nvidia-smi tool available in PATH
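
These requirements can be checked before a long run. The sketch below uses only the standard library to verify the Python version and the presence of `nvidia-smi` in PATH; `check_requirements` is a hypothetical helper and does not verify that a CUDA-capable GPU is actually usable.

```python
import shutil
import sys


def check_requirements(min_python: tuple = (3, 10)) -> list[str]:
    """Hypothetical helper: return a list of unmet requirements (empty if OK)."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python >= {wanted} is required")
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found in PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(f"Missing requirement: {problem}")
```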

🤝 Contributing

We welcome contributions! Here's how to get started:

  1. 🍴 Fork the repository
  2. 🌿 Create a feature branch
  3. ✨ Make your changes
  4. 🧪 Add tests
  5. 📤 Submit a pull request

📄 License

This project is licensed under the MIT License.

👨‍💻 Author

Haobo Yuan - haoboyuan@ucmerced.edu

🙏 Acknowledgements

This project relies heavily on AI tools for code generation and documentation, which improve productivity and code quality.


Made with ❤️ and 🤖 AI assistance

Star ⭐ this repo if you find it useful!
