A command-line tool for distributed parallel execution across multiple GPUs
OctoRun
Distributed Parallel Execution Made Simple
A powerful command-line tool for running Python scripts across multiple GPUs with intelligent task management and monitoring
Overview
OctoRun is designed to help you run computationally intensive Python scripts across multiple GPUs efficiently. It automatically manages GPU allocation, chunks your workload, handles failures with retry mechanisms, and provides comprehensive monitoring and logging.
Key Features
- Automatic GPU Detection: Automatically detects and utilizes available GPUs
- Intelligent Chunk Management: Divides work into chunks and distributes them across GPUs
- Failure Recovery: Automatic retry mechanism for failed chunks
- Comprehensive Logging: Detailed logging for monitoring and debugging
- Flexible Configuration: JSON-based configuration with CLI overrides
- Kwargs Support: Pass custom arguments to your scripts via config or CLI
- Memory Monitoring: Monitor GPU memory usage and thresholds
- Lock Management: Prevent duplicate processing of chunks
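To make the lock management idea concrete, here is a minimal sketch of how a per-chunk lock file can prevent two workers from processing the same chunk. It is not taken from OctoRun's source: the function name `try_claim_chunk` and the `chunk_N.lock` file naming are assumptions made for illustration. It relies only on the standard `os.open` flags `O_CREAT | O_EXCL`, which fail atomically when the file already exists.

```python
import os
import tempfile


def try_claim_chunk(lock_dir, chunk_id):
    """Return True if this call created the lock file, False if it already existed."""
    os.makedirs(lock_dir, exist_ok=True)
    # Hypothetical naming scheme; OctoRun's real lock file names may differ.
    lock_path = os.path.join(lock_dir, f"chunk_{chunk_id}.lock")
    try:
        # O_CREAT | O_EXCL makes creation fail if the file is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))  # record which process owns the chunk
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(try_claim_chunk(tmp, 0))  # True: first claim succeeds
        print(try_claim_chunk(tmp, 0))  # False: chunk 0 is already claimed
```

Only the first caller for a given chunk gets `True`, so a second worker can skip that chunk instead of repeating the work.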
Installation
Via pip
pip install octorun
From source
git clone https://github.com/HarborYuan/OctoRun.git
cd OctoRun
pip install -e .
Quick Start
1. Create Configuration
octorun save_config --script ./your_script.py
2. Run Your Script
octorun run [--config config.json]
3. Monitor GPU Usage
octorun list_gpus [--detailed]
4. View Logs
tail -f logs/session_*.log
Configuration
Basic Configuration
The configuration file (config.json) contains the following options:
{
"script_path": "./your_script.py",
"gpus": "auto",
"total_chunks": 128,
"log_dir": "./logs",
"chunk_lock_dir": "./logs/locks",
"monitor_interval": 60,
"restart_failed": false,
"max_retries": 3,
"memory_threshold": 90,
"kwargs": {
"batch_size": 32,
"learning_rate": 0.001
}
}
Configuration Options
| Option | Description | Default |
|---|---|---|
| script_path | Path to your Python script | - |
| gpus | GPU configuration ("auto" or list of GPU IDs) | "auto" |
| total_chunks | Number of chunks to divide work into | 128 |
| log_dir | Directory for log files | "./logs" |
| chunk_lock_dir | Directory for chunk lock files | "./logs/locks" |
| monitor_interval | Monitoring interval in seconds | 60 |
| restart_failed | Whether to restart failed processes | false |
| max_retries | Maximum retries for failed chunks | 3 |
| memory_threshold | Memory threshold percentage | 90 |
| kwargs | Custom arguments to pass to script | {} |
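The sketch below shows one way a config file could be combined with the defaults from the options table. It is an illustration only, not OctoRun's loader: `DEFAULTS` and `load_config` are hypothetical names, and treating `script_path` as mandatory is an assumption based on the table listing no default for it.

```python
import copy
import json

# Defaults copied from the options table; script_path has no default there.
DEFAULTS = {
    "gpus": "auto",
    "total_chunks": 128,
    "log_dir": "./logs",
    "chunk_lock_dir": "./logs/locks",
    "monitor_interval": 60,
    "restart_failed": False,
    "max_retries": 3,
    "memory_threshold": 90,
    "kwargs": {},
}


def load_config(text):
    """Parse a JSON config string and fill in any option the user left out."""
    user = json.loads(text)
    if "script_path" not in user:
        raise ValueError("config must set script_path")  # assumed to be required
    config = copy.deepcopy(DEFAULTS)  # avoid sharing the mutable kwargs dict
    config.update(user)
    return config


config = load_config('{"script_path": "./your_script.py", "total_chunks": 64}')
print(config["total_chunks"], config["max_retries"])  # 64 3
```

Options given in the file win over the defaults, so a short config that sets only `script_path` still yields a complete set of options.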
Using Kwargs
OctoRun supports passing additional keyword arguments to your scripts through both the configuration file and command line interface.
Configuration File
Add kwargs to your config.json:
{
"script_path": "./train_model.py",
"gpus": "auto",
"total_chunks": 128,
"kwargs": {
"batch_size": 64,
"learning_rate": 0.01,
"model_type": "transformer",
"epochs": 10,
"output_dir": "./results"
}
}
Command Line Interface
Override or add kwargs via command line:
# Override config kwargs
octorun run --config config.json --kwargs '{"batch_size": 128, "learning_rate": 0.005}'
# Add new kwargs
octorun run --config config.json --kwargs '{"model_type": "bert", "max_length": 512}'
Priority
CLI kwargs > Config file kwargs
CLI kwargs override config file kwargs that share the same key; all other config kwargs are preserved.
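The priority rule behaves like a plain dictionary merge in which the CLI values are applied last. The following sketch illustrates that behaviour; `merge_kwargs` is a hypothetical helper, not a function exposed by OctoRun.

```python
import json


def merge_kwargs(config_kwargs, cli_kwargs_json=None):
    """Merge config kwargs with the JSON string passed to --kwargs; CLI wins."""
    cli_kwargs = json.loads(cli_kwargs_json) if cli_kwargs_json else {}
    # Later entries in a dict display overwrite earlier ones with the same key.
    return {**config_kwargs, **cli_kwargs}


config_kwargs = {"batch_size": 64, "learning_rate": 0.01, "epochs": 10}
merged = merge_kwargs(config_kwargs, '{"batch_size": 128, "max_length": 512}')
print(merged)
# {'batch_size': 128, 'learning_rate': 0.01, 'epochs': 10, 'max_length': 512}
```

Here `batch_size` takes the CLI value, `learning_rate` and `epochs` keep their config values, and `max_length` is added.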
Script Implementation
Your script must accept the required OctoRun arguments plus any custom kwargs:
import argparse


def main():
    parser = argparse.ArgumentParser()

    # Required OctoRun arguments
    parser.add_argument('--gpu_id', type=int, required=True)
    parser.add_argument('--chunk_id', type=int, required=True)
    parser.add_argument('--total_chunks', type=int, required=True)

    # Your custom arguments
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    parser.add_argument('--model_type', type=str, default='default')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--output_dir', type=str, default='./output')

    args = parser.parse_args()

    # Use the arguments in your script
    print(f"Processing chunk {args.chunk_id}/{args.total_chunks} on GPU {args.gpu_id}")
    print(f"Training with batch_size={args.batch_size}, lr={args.learning_rate}")

    # Your processing logic here
    ...


if __name__ == "__main__":
    main()
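Inside the processing logic, `chunk_id` and `total_chunks` can be used to pick the share of the workload that belongs to the current process. The sketch below uses a simple strided split. It assumes that `chunk_id` is zero-based and ranges from 0 to `total_chunks - 1`, which this README does not state, and `select_chunk` and the file names are made up for illustration.

```python
def select_chunk(items, chunk_id, total_chunks):
    """Return every total_chunks-th item starting at chunk_id (assumed zero-based)."""
    if not 0 <= chunk_id < total_chunks:
        raise ValueError(f"chunk_id {chunk_id} is outside 0..{total_chunks - 1}")
    return items[chunk_id::total_chunks]


files = [f"sample_{i}.txt" for i in range(10)]  # hypothetical work items
print(select_chunk(files, 1, 4))
# ['sample_1.txt', 'sample_5.txt', 'sample_9.txt']
```

A strided split keeps chunk sizes within one item of each other, and every item lands in exactly one chunk, so the chunks together cover the whole workload without overlap.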
Commands
run
Run your script with the specified configuration:
octorun run --config config.json [--kwargs '{"key": "value"}']
save_config
Generate a default configuration file:
octorun save_config [--script ./your_script.py]
list_gpus
List available GPUs:
octorun list_gpus [--detailed]
Examples
Example 1: Machine Learning Training
Config file (ml_config.json):
{
"script_path": "./train_model.py",
"total_chunks": 64,
"kwargs": {
"batch_size": 32,
"learning_rate": 0.001,
"model_type": "resnet50",
"epochs": 100,
"dataset_path": "/data/imagenet"
}
}
Command:
octorun run --config ml_config.json --kwargs '{"batch_size": 64, "learning_rate": 0.01}'
Example 2: Data Processing
octorun run --config config.json --kwargs '{"input_dir": "/data/raw", "output_dir": "/data/processed", "compression": "gzip"}'
Monitoring and Logging
OctoRun provides comprehensive logging:
| Log Type | Location | Description |
|---|---|---|
| Session logs | logs/session_TIMESTAMP.log | Overall session information |
| Chunk logs | logs/chunk_N.log | Individual chunk processing logs |
| Lock files | logs/locks/ | Chunk completion tracking |
Real-time Monitoring
# Monitor session progress
tail -f logs/session_*.log
# Monitor specific chunk
tail -f logs/chunk_42.log
# Monitor GPU usage
watch -n 1 'octorun list_gpus --detailed'
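For custom monitoring outside OctoRun, GPU memory usage can also be read directly from `nvidia-smi` and compared with a threshold such as `memory_threshold`. The sketch below is an assumption about how such a check could look, not a description of OctoRun's internals. The `--query-gpu` and `--format` flags are standard `nvidia-smi` options; the helper names and the sample values are invented.

```python
import subprocess

# Standard nvidia-smi query flags; the parsing and threshold logic are illustrative.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def read_gpu_memory_csv():
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_memory_usage(csv_text):
    """Map GPU index to percent of memory in use, from 'index, used, total' lines."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage[int(index)] = 100.0 * float(used) / float(total)
    return usage


def gpus_over_threshold(usage, threshold=90):
    """Return the sorted GPU indices whose memory usage exceeds the threshold."""
    return sorted(gpu for gpu, percent in usage.items() if percent > threshold)


sample = "0, 1024, 8192\n1, 7800, 8192\n"  # made-up output for two GPUs
usage = parse_memory_usage(sample)
print(usage[0], gpus_over_threshold(usage))  # 12.5 [1]
```

On a machine with NVIDIA drivers installed, passing the result of `read_gpu_memory_csv()` to `parse_memory_usage` gives live values instead of the sample.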
Error Handling
- Automatic retry mechanism for failed chunks
- Configurable maximum retry attempts
- Memory threshold monitoring
- Comprehensive error logging
Together, these mechanisms help jobs run to completion despite transient failures.
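As an illustration of the retry idea, the sketch below reruns a command until it exits with status 0 or the retry budget is used up. It is not OctoRun's implementation: `run_with_retries` is a hypothetical helper, and it assumes that `max_retries` counts the extra attempts made after the first failure.

```python
import subprocess
import sys


def run_with_retries(cmd, max_retries=3):
    """Run cmd until it succeeds; return the number of attempts that were needed."""
    total_attempts = max_retries + 1  # assumption: retries come after a first try
    for attempt in range(1, total_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}/{total_attempts} failed with code {result.returncode}")
    raise RuntimeError(f"command failed after {total_attempts} attempts: {cmd}")


if __name__ == "__main__":
    # A trivial command that always succeeds, standing in for one chunk's process.
    attempts = run_with_retries([sys.executable, "-c", "print('chunk done')"])
    print(f"succeeded after {attempts} attempt(s)")
```

Raising once the budget is exhausted keeps a persistently failing chunk visible instead of retrying it forever.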
Requirements
- Python ≥ 3.10
- NVIDIA GPUs with CUDA support
- nvidia-smi tool available in PATH
Contributing
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
This project is licensed under the MIT License.
Author
Haobo Yuan - haoboyuan@ucmerced.edu
Acknowledgements
This project relies heavily on AI tools for code generation and documentation.
Made with ❤️ and AI assistance
Star this repo if you find it useful!