A command-line tool for distributed parallel execution across multiple GPUs
🐙 OctoRun
Distributed Parallel Execution Made Simple
A powerful command-line tool for running Python scripts across multiple GPUs with intelligent task management and monitoring
📋 Overview
OctoRun is designed to help you run computationally intensive Python scripts across multiple GPUs efficiently. It automatically manages GPU allocation, chunks your workload, handles failures with retry mechanisms, and provides comprehensive monitoring and logging.
✨ Key Features
- 🔍 Automatic GPU Detection: Detects and uses the available GPUs without manual setup
- 🧩 Intelligent Chunk Management: Divides work into chunks and distributes across GPUs
- 🔄 Failure Recovery: Automatic retry mechanism for failed chunks
- 📊 Comprehensive Logging: Detailed logging for monitoring and debugging
- ⚙️ Flexible Configuration: JSON-based configuration with CLI overrides
- 🎯 Kwargs Support: Pass custom arguments to your scripts via config or CLI
- 💾 Memory Monitoring: Monitor GPU memory usage and thresholds
- 🔒 Lock Management: Prevent duplicate processing of chunks
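To illustrate the idea behind lock management, here is a minimal sketch of one common way to keep two workers from claiming the same chunk: atomic lock-file creation. This is not taken from OctoRun's source; the function name `try_claim_chunk` and the `chunk_<id>.lock` naming scheme are hypothetical, and OctoRun's actual locking may work differently.

```python
import os
import tempfile
from pathlib import Path


def try_claim_chunk(lock_dir: Path, chunk_id: int) -> bool:
    """Return True if this process claimed the chunk, False if it was already claimed."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme, chosen only for this illustration.
    lock_path = lock_dir / f"chunk_{chunk_id}.lock"
    try:
        # O_CREAT | O_EXCL makes creation fail atomically if the file exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        # Record the owner so a stale lock can be diagnosed later.
        handle.write(str(os.getpid()))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        locks = Path(tmp) / "locks"
        print(try_claim_chunk(locks, 0))  # first attempt to claim chunk 0
        print(try_claim_chunk(locks, 0))  # second attempt on the same chunk
```

The exclusive-create flag is what makes the check and the claim a single step, so two processes racing for the same chunk cannot both succeed.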
🚀 Installation
You can install OctoRun using pip or uv.
Via pip
pip install octorun
Via uv
# Install globally
uv tool install octorun
# Install in your project
uv add octorun
Optional extras
- Benchmark tooling: pip install "octorun[benchmark]" (installs PyTorch with CUDA support)
⚡ Quick Start
- Create Configuration:
  octorun save_config --script ./your_script.py
- Run Your Script:
  octorun run
- Monitor GPUs:
  octorun list_gpus -d
🎮 Commands
run (r)
Run your script with the specified configuration.
octorun run --config config.json [--kwargs '{"key": "value"}']
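If you launch OctoRun from another Python program, quoting the JSON passed to --kwargs by hand is error-prone. The sketch below builds the argument list with `json.dumps` instead. It uses only the `run`, `--config`, and `--kwargs` options shown above; the helper name `build_run_command` is hypothetical and not part of OctoRun.

```python
import json
import shlex
from typing import List, Optional


def build_run_command(config_path: str, kwargs: Optional[dict] = None) -> List[str]:
    """Build an argv list for `octorun run` without manual JSON quoting."""
    cmd = ["octorun", "run", "--config", config_path]
    if kwargs:
        # json.dumps produces valid JSON (double quotes, lowercase true/false).
        cmd += ["--kwargs", json.dumps(kwargs)]
    return cmd


if __name__ == "__main__":
    cmd = build_run_command("config.json", {"batch_size": 128})
    # shlex.join shows the shell-safe form of the command for logging.
    print(shlex.join(cmd))
    # To actually launch it, pass the list to subprocess.run(cmd, check=True).
```

Passing the list form to `subprocess.run` avoids a shell entirely, so the JSON string reaches OctoRun unchanged.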
save_config (s)
Generate a default configuration file.
octorun save_config --script ./your_script.py
list_gpus (l)
List available GPUs and their current usage.
octorun list_gpus [--detailed]
The --detailed flag adds memory usage, temperature, and running processes to the GPU listing.
benchmark (b)
Run a benchmark to determine the optimal number of parallel processes for your GPUs.
octorun benchmark
This command runs a series of tests to help you configure the gpus parameter in your config.json for the best performance.
Requires the optional benchmark extra (pip install "octorun[benchmark]") so PyTorch is available.
⚙️ Configuration
OctoRun uses a config.json file for configuration. You can generate a default one with octorun save_config.
| Option | Description | Default |
|---|---|---|
| script_path | Path to your Python script | - |
| gpus | "auto" or list of GPU IDs | "auto" |
| total_chunks | Number of chunks to divide work into | 128 |
| log_dir | Directory for log files | "./logs" |
| chunk_lock_dir | Directory for chunk lock files | "./logs/locks" |
| monitor_interval | Monitoring interval in seconds | 60 |
| restart_failed | Whether to restart failed processes | false |
| max_retries | Maximum retries for failed chunks | 3 |
| memory_threshold | Memory threshold percentage | 90 |
| kwargs | Custom arguments to pass to your script | {} |
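As a rough illustration, the defaults in the table can be assembled into a config.json like this. The keys and default values come from the table; the `script_path` value, the helper name `render_config`, and the override values are placeholders, and the file that octorun save_config actually writes may differ in ordering or contain additional fields.

```python
import json

# Defaults copied from the options table; script_path has no default,
# so a placeholder path is used here.
DEFAULT_CONFIG = {
    "script_path": "./your_script.py",
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": False,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {},
}


def render_config(overrides=None):
    """Return config JSON text with any overrides applied on top of the defaults."""
    config = {**DEFAULT_CONFIG, **(overrides or {})}
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Example: pin the run to GPUs 0 and 1 and use fewer chunks.
    print(render_config({"gpus": [0, 1], "total_chunks": 64}))
```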
🎯 Using Kwargs
You can pass custom arguments to your script via the kwargs object in your config.json or directly through the CLI.
CLI kwargs will override config file kwargs.
octorun run --kwargs '{"batch_size": 128, "learning_rate": 0.005}'
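The override rule can be pictured as a dictionary merge in which CLI values win. The sketch below only illustrates that precedence; `merge_kwargs` is a hypothetical helper, not OctoRun's internal code, and it assumes a shallow (top-level) merge.

```python
import json
from typing import Optional


def merge_kwargs(config_kwargs: dict, cli_kwargs_json: Optional[str] = None) -> dict:
    """Merge kwargs so that keys given on the CLI override the config file."""
    cli_kwargs = json.loads(cli_kwargs_json) if cli_kwargs_json else {}
    # Later entries win in a dict merge, so CLI values replace config values.
    return {**config_kwargs, **cli_kwargs}


if __name__ == "__main__":
    from_config = {"batch_size": 32, "epochs": 1}
    from_cli = '{"batch_size": 128, "learning_rate": 0.005}'
    print(merge_kwargs(from_config, from_cli))
```

In this example batch_size comes from the CLI, epochs is kept from the config file, and learning_rate is added by the CLI.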
🔧 Script Implementation
Your script must accept the following arguments:
- --gpu_id: GPU device ID (int)
- --chunk_id: Current chunk number (int)
- --total_chunks: Total number of chunks (int)
Here is an example of how to structure your script:
import argparse

import torch


def main():
    parser = argparse.ArgumentParser()

    # Required OctoRun arguments
    parser.add_argument('--gpu_id', type=int, required=True)
    parser.add_argument('--chunk_id', type=int, required=True)
    parser.add_argument('--total_chunks', type=int, required=True)

    # Your custom arguments
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    parser.add_argument('--model_type', type=str, default='default')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--output_dir', type=str, default='./output')

    args = parser.parse_args()

    # Set the GPU device
    if torch.cuda.is_available():
        torch.cuda.set_device(args.gpu_id)
        print(f"Using GPU {args.gpu_id}")

    print(f"Processing chunk {args.chunk_id}/{args.total_chunks}")

    # Your logic here


if __name__ == "__main__":
    main()
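The "Your logic here" step usually starts by working out which part of the dataset belongs to the current chunk. A minimal sketch of one way to do that follows. It assumes chunk IDs are 0-based (this README does not say whether they start at 0 or 1, so check before relying on it), and `chunk_bounds` is a hypothetical helper, not something OctoRun provides.

```python
from typing import Tuple


def chunk_bounds(num_items: int, chunk_id: int, total_chunks: int) -> Tuple[int, int]:
    """Return the half-open [start, end) item range for one chunk.

    Assumes chunk_id is 0-based. Items are spread as evenly as possible:
    the first (num_items % total_chunks) chunks get one extra item.
    """
    if not 0 <= chunk_id < total_chunks:
        raise ValueError(f"chunk_id {chunk_id} outside 0..{total_chunks - 1}")
    base, extra = divmod(num_items, total_chunks)
    start = chunk_id * base + min(chunk_id, extra)
    end = start + base + (1 if chunk_id < extra else 0)
    return start, end


if __name__ == "__main__":
    items = [f"sample_{i}" for i in range(10)]
    for cid in range(4):
        start, end = chunk_bounds(len(items), cid, 4)
        print(cid, items[start:end])
```

Inside main(), the same call would take args.chunk_id and args.total_chunks, so each process handles a disjoint slice and the slices together cover every item exactly once.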
🤝 Contributing
Contributions are welcome! Please fork the repository, create a feature branch, and submit a pull request.
📄 License
This project is licensed under the MIT License.