ClusterOps
ClusterOps is an enterprise-grade Python library developed and maintained by the Swarms Team to help you manage and execute agents on specific CPUs and GPUs across clusters. This tool enables advanced CPU and GPU selection, dynamic task allocation, and resource monitoring, making it ideal for high-performance distributed computing environments.
Features
- CPU Execution: Dynamically assign tasks to specific CPU cores.
- GPU Execution: Execute tasks on specific GPUs or dynamically select the best available GPU based on memory usage.
- Fault Tolerance: Built-in retry logic with exponential backoff for handling transient errors.
- Resource Monitoring: Real-time CPU and GPU resource monitoring (e.g., free memory on GPUs).
- Logging: Advanced logging configuration with customizable log levels (DEBUG, INFO, ERROR).
- Scalability: Supports multi-GPU task execution with Ray for distributed computation.
Installation
Prerequisites
- Python 3.8 or higher
- psutil: CPU monitoring and task management
- GPUtil: GPU resource monitoring
- Ray: Distributed execution framework
You can install the required dependencies using pip:
pip install -r requirements.txt
The requirements.txt file includes:
loguru>=0.6.0
psutil>=5.8.0
gputil>=1.4.0
ray>=2.0.0
Installing ClusterOps
Clone the repository:
git clone https://github.com/swarms-team/clusterops.git
cd clusterops
Then, install the package locally:
pip install .
Quick Start
The following example demonstrates how to use ClusterOps to run tasks on specific CPUs and GPUs.
from clusterops import execute_with_cpu_cores, execute_on_gpu, retry_with_backoff

# Sample function to execute
def sample_task(n: int) -> int:
    return n * n

# Run the task on 4 CPU cores
result_cpu = execute_with_cpu_cores(4, sample_task, 10)
print(f"Result on CPU cores: {result_cpu}")

# Run the task on the best available GPU with retries
result_gpu = retry_with_backoff(execute_on_gpu, None, sample_task, 10)
print(f"Result on GPU: {result_gpu}")
Usage
Executing on Specific CPUs
You can execute a task on a specific number of CPU cores using the execute_with_cpu_cores() function. It automatically adjusts CPU affinity on systems where this feature is supported.
from clusterops import execute_with_cpu_cores

def sample_task(n: int) -> int:
    return n * n

# Execute the task using 4 CPU cores
result = execute_with_cpu_cores(4, sample_task, 10)
print(f"Result on 4 CPU cores: {result}")
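To illustrate what "adjusting CPU affinity where supported" can look like, here is a minimal sketch using only the standard library. It is not ClusterOps' implementation: run_with_core_limit is a hypothetical helper, and it assumes the Linux-only os.sched_setaffinity call, falling back to a plain function call on platforms that lack it.

```python
import os
from typing import Any, Callable


def run_with_core_limit(core_count: int, func: Callable[..., Any], *args: Any) -> Any:
    """Hypothetical helper (not part of ClusterOps): run func pinned to at most core_count cores."""
    if core_count < 1:
        raise ValueError("core_count must be at least 1")

    # os.sched_setaffinity is only available on some platforms (e.g. Linux).
    if not hasattr(os, "sched_setaffinity"):
        return func(*args)

    original = os.sched_getaffinity(0)  # cores this process may currently use
    allowed = sorted(original)[:core_count]  # never request cores outside that set
    try:
        os.sched_setaffinity(0, allowed)
    except OSError:
        # Pinning was refused (e.g. by a sandbox); run unpinned instead.
        return func(*args)

    try:
        return func(*args)
    finally:
        # Restore the original affinity so later work is not restricted.
        os.sched_setaffinity(0, original)


def sample_task(n: int) -> int:
    return n * n


result = run_with_core_limit(2, sample_task, 10)
print(f"Result with a 2-core limit: {result}")
```

Restoring the original affinity in a finally block keeps the restriction scoped to the single call, even if the task raises.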
Executing on Specific GPUs
ClusterOps supports running tasks on specific GPUs or dynamically selecting the best available GPU (based on free memory).
from clusterops import execute_on_gpu

def sample_task(n: int) -> int:
    return n * n

# Execute the task on GPU with ID 1
result = execute_on_gpu(1, sample_task, 10)
print(f"Result on GPU 1: {result}")

# Execute the task on the best available GPU
result_best_gpu = execute_on_gpu(None, sample_task, 10)
print(f"Result on best available GPU: {result_best_gpu}")
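The "best available GPU" rule described above (most free memory wins) can be sketched as a small pure function. GpuInfo and pick_best_gpu are hypothetical names introduced for illustration, not ClusterOps types. On a real machine the snapshots could be filled from GPUtil's getGPUs(), whose GPU objects expose id and memoryFree, but that wiring is an assumption and is left out so the example runs without a GPU.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GpuInfo:
    """Hypothetical snapshot of one GPU (illustration only)."""
    gpu_id: int
    memory_free_mb: float


def pick_best_gpu(gpus: Sequence[GpuInfo]) -> Optional[int]:
    """Return the ID of the GPU with the most free memory, or None if there are no GPUs."""
    if not gpus:
        return None
    return max(gpus, key=lambda gpu: gpu.memory_free_mb).gpu_id


# Made-up values for demonstration; these are not measurements.
snapshot = [
    GpuInfo(gpu_id=0, memory_free_mb=1024.0),
    GpuInfo(gpu_id=1, memory_free_mb=8192.0),
    GpuInfo(gpu_id=2, memory_free_mb=4096.0),
]
best_gpu = pick_best_gpu(snapshot)
print(f"Best GPU by free memory: {best_gpu}")
```

Returning None for an empty list lets the caller decide whether to fall back to CPU execution or raise an error.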
Retry Logic and Fault Tolerance
For production environments, ClusterOps includes retry logic with exponential backoff, which retries a task in case of failures.
from clusterops import retry_with_backoff, execute_on_gpu

def sample_task(n: int) -> int:
    return n * n

# Run task on the best GPU with retry logic
result = retry_with_backoff(execute_on_gpu, None, sample_task, 10)
print(f"Result with retry: {result}")
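For readers unfamiliar with the pattern, the following is a minimal sketch of retrying with exponential backoff. It is not the source of retry_with_backoff. The helper name retry_sketch, its keyword parameters, and the doubling schedule (delay, 2*delay, 4*delay, ...) are assumptions made for illustration. Here retries counts the extra attempts after the first call.

```python
import time
from typing import Any, Callable, List


def retry_sketch(
    func: Callable[..., Any],
    *args: Any,
    retries: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs: Any,
) -> Any:
    """Hypothetical helper: call func, retrying up to `retries` times with doubling waits."""
    attempt = 0
    while True:
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            sleep(delay * (2 ** attempt))  # wait delay, 2*delay, 4*delay, ...
            attempt += 1


# Demo: a task that fails twice, then succeeds. A fake sleep records the waits
# so the example finishes instantly.
calls = {"count": 0}
waits: List[float] = []


def flaky_square(n: int) -> int:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return n * n


result = retry_sketch(flaky_square, 10, retries=3, delay=0.5, sleep=waits.append)
print(f"Result: {result}, waits: {waits}")
```

Passing the sleep function as a parameter keeps the backoff schedule testable without real delays.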
Configuration
ClusterOps provides configuration through environment variables, making it adaptable for different environments (development, staging, production).
Environment Variables
- LOG_LEVEL: Configures logging verbosity. Options: DEBUG, INFO, ERROR. Default is INFO.
- RETRY_COUNT: Number of times to retry a task in case of failure. Default is 3.
- RETRY_DELAY: Initial delay in seconds before retrying. Default is 1 second.
Set these variables in your environment:
export LOG_LEVEL=DEBUG
export RETRY_COUNT=5
export RETRY_DELAY=2.0
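As an illustration of how these variables and their documented defaults might be read from Python, here is a small sketch. The Settings class and load_settings function are hypothetical and not part of ClusterOps. Only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three documented variables."""
    log_level: str
    retry_count: int
    retry_delay: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read LOG_LEVEL, RETRY_COUNT and RETRY_DELAY, applying the documented defaults."""
    return Settings(
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        retry_count=int(env.get("RETRY_COUNT", "3")),
        retry_delay=float(env.get("RETRY_DELAY", "1.0")),
    )


# A plain dict stands in for the process environment in this demo.
defaults = load_settings({})
custom = load_settings({"LOG_LEVEL": "debug", "RETRY_COUNT": "5", "RETRY_DELAY": "2.0"})
print(defaults)
print(custom)
```

Accepting any mapping instead of reading os.environ directly makes the parsing easy to exercise in tests.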
Contributing
We welcome contributions to ClusterOps! If you'd like to contribute, please follow these steps:
- Fork the repository on GitHub.
- Clone your fork locally:
git clone https://github.com/your-username/clusterops.git
cd clusterops
- Create a feature branch for your changes:
git checkout -b feature/new-feature
- Install the development dependencies:
pip install -r dev-requirements.txt
- Make your changes, and be sure to include tests.
- Run tests to ensure everything works:
pytest
- Commit your changes and push them to GitHub:
git commit -m "Add new feature"
git push origin feature/new-feature
- Submit a pull request on GitHub, and we’ll review it as soon as possible.
Reporting Issues
If you encounter any issues, please create a GitHub issue.
License
ClusterOps is licensed under the MIT License. See the LICENSE file for more details.
Contact
For any questions, feedback, or contributions, please contact the Swarms Team at contact@swarms.world.