
ClusterOps


ClusterOps is an enterprise-grade Python library developed and maintained by the Swarms Team to help you manage and execute agents on specific CPUs and GPUs across clusters. This tool enables advanced CPU and GPU selection, dynamic task allocation, and resource monitoring, making it ideal for high-performance distributed computing environments.



Features

  • CPU Execution: Dynamically assign tasks to specific CPU cores.
  • GPU Execution: Execute tasks on specific GPUs or dynamically select the best available GPU based on memory usage.
  • Fault Tolerance: Built-in retry logic with exponential backoff for handling transient errors.
  • Resource Monitoring: Real-time CPU and GPU resource monitoring, such as free memory on GPUs (see the monitoring sketch after this list).
  • Logging: Advanced logging configuration with customizable log levels (DEBUG, INFO, ERROR).
  • Scalability: Supports multi-GPU task execution with Ray for distributed computation.
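
As a rough illustration of the kind of data the resource-monitoring feature works with, the psutil and GPUtil dependencies listed under Prerequisites can report per-core CPU load and free GPU memory directly. The following is a minimal sketch using those libraries, not the ClusterOps monitoring API itself:

import psutil
import GPUtil

# Per-core CPU utilization, sampled over one second
cpu_loads = psutil.cpu_percent(interval=1, percpu=True)
print(f"CPU load per core: {cpu_loads}")

# Free memory on each visible GPU (GPUtil reports values in MB)
for gpu in GPUtil.getGPUs():
    print(f"GPU {gpu.id} ({gpu.name}): {gpu.memoryFree} MB free of {gpu.memoryTotal} MB")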


Installation

Prerequisites

  • Python 3.8 or higher
  • psutil: CPU monitoring and task management
  • GPUtil: GPU resource monitoring
  • Ray: Distributed execution framework

You can install the required dependencies using pip:

pip install -r requirements.txt

The requirements.txt file includes:

loguru>=0.6.0
psutil>=5.8.0
gputil>=1.4.0
ray>=2.0.0

Installing ClusterOps

Clone the repository:

git clone https://github.com/swarms-team/clusterops.git
cd clusterops

Then, install the package locally:

pip install .

Quick Start

The following example demonstrates how to use ClusterOps to run tasks on specific CPUs and GPUs.

from clusterops import execute_with_cpu_cores, execute_on_gpu, retry_with_backoff

# Sample function to execute
def sample_task(n: int) -> int:
    return n * n

# Run the task on 4 CPU cores
result_cpu = execute_with_cpu_cores(4, sample_task, 10)
print(f"Result on CPU cores: {result_cpu}")

# Run the task on the best available GPU with retries
result_gpu = retry_with_backoff(execute_on_gpu, None, sample_task, 10)
print(f"Result on GPU: {result_gpu}")

Usage

Executing on Specific CPUs

You can execute a task on a specific number of CPU cores using the execute_with_cpu_cores() function. It automatically adjusts CPU affinity on systems where this feature is supported.

from clusterops import execute_with_cpu_cores

def sample_task(n: int) -> int:
    return n * n

# Execute the task using 4 CPU cores
result = execute_with_cpu_cores(4, sample_task, 10)
print(f"Result on 4 CPU cores: {result}")
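
On platforms that support it, the affinity adjustment described above is typically done through psutil. The snippet below is a simplified sketch of that mechanism (not the ClusterOps internals), pinning the current process to cores 0-3:

import psutil

proc = psutil.Process()

# cpu_affinity() is only exposed on some platforms (e.g., Linux and Windows)
if hasattr(proc, "cpu_affinity"):
    proc.cpu_affinity([0, 1, 2, 3])  # pin the current process to cores 0-3
    print(f"Process pinned to cores: {proc.cpu_affinity()}")
else:
    print("CPU affinity is not supported on this platform")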

Executing on Specific GPUs

ClusterOps supports running tasks on specific GPUs or dynamically selecting the best available GPU (based on free memory).

from clusterops import execute_on_gpu

def sample_task(n: int) -> int:
    return n * n

# Execute the task on GPU with ID 1
result = execute_on_gpu(1, sample_task, 10)
print(f"Result on GPU 1: {result}")

# Execute the task on the best available GPU
result_best_gpu = execute_on_gpu(None, sample_task, 10)
print(f"Result on best available GPU: {result_best_gpu}")
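
For reference, picking a GPU by free memory can be done with GPUtil, which ClusterOps lists as a dependency. The sketch below shows that selection logic in isolation; it is an illustration, not necessarily how execute_on_gpu implements it:

import GPUtil

# Choose the GPU with the most free memory
gpus = GPUtil.getGPUs()
if gpus:
    best_gpu = max(gpus, key=lambda gpu: gpu.memoryFree)
    print(f"Best GPU: {best_gpu.id} with {best_gpu.memoryFree} MB free")
else:
    print("No GPUs detected")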

Retry Logic and Fault Tolerance

For production environments, ClusterOps includes retry logic with exponential backoff, which automatically retries a failed task with progressively longer delays between attempts.

from clusterops import retry_with_backoff, execute_on_gpu

def sample_task(n: int) -> int:
    return n * n

# Run the task on the best available GPU with retry logic
result = retry_with_backoff(execute_on_gpu, None, sample_task, 10)
print(f"Result with retry: {result}")
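
The exponential-backoff pattern itself is simple: double the delay after each failed attempt until the retry budget is exhausted. The generic helper below illustrates the idea only; it is not the retry_with_backoff implementation:

import time

def retry(func, *args, retries=3, delay=1.0):
    """Call func(*args), doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff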

Configuration

ClusterOps provides configuration through environment variables, making it adaptable for different environments (development, staging, production).

Environment Variables

  • LOG_LEVEL: Configures logging verbosity. Options: DEBUG, INFO, ERROR. Default is INFO.
  • RETRY_COUNT: Number of times to retry a task in case of failure. Default is 3.
  • RETRY_DELAY: Initial delay in seconds before retrying. Default is 1 second.

Set these variables in your environment:

export LOG_LEVEL=DEBUG
export RETRY_COUNT=5
export RETRY_DELAY=2.0
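
If you prefer to set these from Python, for example in a test harness, the standard-library os.environ works just as well. This is ordinary environment handling rather than a ClusterOps-specific API, and the variables should be set before tasks are executed:

import os

# Equivalent to the export commands above
os.environ["LOG_LEVEL"] = "DEBUG"
os.environ["RETRY_COUNT"] = "5"
os.environ["RETRY_DELAY"] = "2.0"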

Contributing

We welcome contributions to ClusterOps! If you'd like to contribute, please follow these steps:

  1. Fork the repository on GitHub.
  2. Clone your fork locally:
    git clone https://github.com/your-username/clusterops.git
    cd clusterops
    
  3. Create a feature branch for your changes:
    git checkout -b feature/new-feature
    
  4. Install the development dependencies:
    pip install -r dev-requirements.txt
    
  5. Make your changes, and be sure to include tests.
  6. Run tests to ensure everything works:
    pytest
    
  7. Commit your changes and push them to GitHub:
    git commit -m "Add new feature"
    git push origin feature/new-feature
    
  8. Submit a pull request on GitHub, and we’ll review it as soon as possible.

Reporting Issues

If you encounter any issues, please create a GitHub issue.


License

ClusterOps is licensed under the MIT License. See the LICENSE file for more details.


Contact

For any questions, feedback, or contributions, please contact the Swarms Team at contact@swarms.world.

