Distributed GPU cluster management system with profiling and model orchestration

gswarm

A comprehensive distributed GPU cluster management system combining profiling, model storage, and orchestration capabilities.

Overview

gswarm is an integrated platform for managing GPU clusters, providing:

  • GPU Profiling: Multi-node GPU monitoring and performance analysis
  • Model Management: Distributed model storage, deployment, and serving
  • Data Pooling: Efficient data management across nodes
  • Task Orchestration: Queue-based asynchronous task execution

The system uses a host-client architecture where a central host node coordinates operations across multiple client nodes, enabling unified management of your entire GPU infrastructure.

Key Features

Profiling Capabilities

  • Monitor GPU utilization and memory usage across multiple machines
  • Track PCIe bandwidth (GPU-DRAM) and NVLink (GPU-GPU) connections
  • Configurable sampling frequency with JSON output
  • Built on nvitop for accurate GPU metrics
  • Fault tolerance with automatic reconnection
  • Session recovery after crashes

Model Management

  • Distributed model storage across disk, DRAM, and GPU memory
  • Automatic model deployment and serving
  • Cross-node model transfer and replication
  • Support for multiple model frameworks (vLLM, Transformers, TGI)
  • Real-time model status tracking

Data Pool System

  • Distributed data chunk management
  • Automatic data migration between devices
  • Reference counting and garbage collection
  • Transparent cross-node data access
  • Support for model inputs/outputs chaining
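Reference counting and garbage collection for data chunks can be pictured with the minimal sketch below. This is not gswarm's implementation: the `ChunkPool` class and its method names are hypothetical and only illustrate the idea that a chunk is freed once its last reference is released.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ChunkPool:
    """Hypothetical in-memory pool; not part of the gswarm API."""
    _data: Dict[str, bytes] = field(default_factory=dict)
    _refs: Dict[str, int] = field(default_factory=dict)

    def create(self, chunk_id: str, payload: bytes) -> None:
        # A new chunk starts with one reference, held by its creator.
        self._data[chunk_id] = payload
        self._refs[chunk_id] = 1

    def acquire(self, chunk_id: str) -> bytes:
        # Each consumer (for example a downstream model) takes a reference.
        self._refs[chunk_id] += 1
        return self._data[chunk_id]

    def release(self, chunk_id: str) -> None:
        # Dropping the last reference garbage-collects the chunk.
        self._refs[chunk_id] -= 1
        if self._refs[chunk_id] == 0:
            del self._data[chunk_id]
            del self._refs[chunk_id]

    def __contains__(self, chunk_id: str) -> bool:
        return chunk_id in self._data
```

A chunk that feeds a second model stays alive after its creator releases it, and disappears only when the second consumer releases it too.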

Task Queue System

  • Asynchronous task execution with priorities
  • Dependency management and resource conflict detection
  • Parallel execution of independent tasks
  • Automatic retry with exponential backoff
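As an illustration of retry with exponential backoff, here is a minimal sketch. The function names, the base delay of one second, and the doubling factor are assumptions made for the example; gswarm's actual retry policy may differ.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor ** i for i in range(retries)]


def run_with_retry(task: Callable[[], T], retries: int = 3, base: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run task, retrying after failures with exponentially growing waits."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the waiting behavior can be checked without actually pausing.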

Installation

Prerequisites

  • Python 3.8 or higher
  • NVIDIA GPUs with installed drivers
  • Network connectivity between cluster nodes

Installing gswarm

# Clone the repository
git clone https://github.com/Chivier/gswarm.git
cd gswarm

# Install the package
pip install .

Quick Start

1. Start the Host Node

# Start host with both profiling and model management
gswarm host start --port 8090 --http-port 8091 --model-port 9010

2. Connect Client Nodes

On each GPU machine:

# Connect client with resilient mode
gswarm client connect <host-ip>:8090 --resilient

3. Profile GPU Usage

# Start profiling
gswarm profiler start --name training_run

# Check status
gswarm profiler status

# Stop profiling
gswarm profiler stop --name training_run

4. Manage Models

# List available models
gswarm model list

# Download a model (on host node)
gswarm model download llama-7b --source huggingface --url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct --node node1 --type llm
# or use hf:// format
gswarm model download llama-7b --source hf://meta-llama/Llama-3.1-8B-Instruct --node node1 --type llm

# Download a model (on a client node; if --node is omitted, the model is downloaded locally)
gswarm model download llama-7b --source huggingface --url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct --type llm
# or use hf:// format
gswarm model download llama-7b --source hf://meta-llama/Llama-3.1-8B-Instruct --type llm

# Deploy model to GPU (on client node)
gswarm model move llama-7b --from disk --to gpu0
# On the host node, the node ID must be specified
gswarm model move llama-7b --from disk --to gpu0 --node node1

# Start model serving (on client node)
# Each model type needs its own serving implementation, provided in model/instance/xxx.py,
# where xxx is the model type; the type selects which inference method is used
gswarm model serve llama-7b --device gpu0 --port 8080
# On the host node, the node ID must be specified
gswarm model serve llama-7b --device gpu0 --port 8080 --node node1

# Check model status
gswarm model status llama-7b
gswarm model status llama-7b --node node1

5. Manage Data

# Create data chunk
gswarm data create --source s3://bucket/data --device dram

# List data chunks
gswarm data list

# Transfer data to another node
gswarm data transfer chunk-123 --to node2:dram

Architecture

System Components

  1. Host Node: Central coordinator

    • Model registry management
    • Task orchestration
    • Global resource tracking
    • API gateway
  2. Client Nodes: Worker nodes

    • Local model storage
    • Model serving
    • GPU profiling
    • Task execution
    • Data pool management
  3. Communication:

    • gRPC for high-performance metric streaming
    • HTTP REST API for control and management
    • WebSocket for real-time updates

Port Configuration

Default ports used by gswarm:

  • gRPC Server: 8090 (profiling metrics)
  • HTTP API: 8091 (control panel)
  • Model API: 9010 (model management)
  • Model Services: 8080+ (dynamic allocation)
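The "8080+ (dynamic allocation)" entry suggests that each model service takes the next available port. How gswarm actually picks ports is not documented here; the sketch below shows one common approach, with a hypothetical helper `next_free_port` that probes ports by trying to bind them.

```python
import socket


def next_free_port(start: int = 8080, host: str = "127.0.0.1", limit: int = 100) -> int:
    """Return the first port in [start, start + limit) that can be bound on host."""
    stop = min(start + limit, 65536)  # never probe past the valid port range
    for port in range(start, stop):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted); try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{stop - 1}")
```

A probe like this is inherently racy (another process can grab the port between the check and its use), so a real allocator would typically bind once and keep the socket.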

CLI Reference

Host Commands

# Host management
gswarm host start [--port PORT] [--http-port HTTP_PORT] [--model-port MODEL_PORT]
gswarm host stop
gswarm host status

# System overview
gswarm status              # Overall system status
gswarm nodes               # List all nodes
gswarm health              # Health check

Profiler Commands

# Profiling operations
gswarm profiler start [--name NAME] [--freq FREQ]
gswarm profiler stop [--name NAME]
gswarm profiler status
gswarm profiler sessions   # List all sessions
gswarm profiler recover    # Recover crashed sessions

# Analysis
gswarm profiler analyze --data <file.json> --plot <output.pdf>

Model Commands

# Model management
gswarm model list [--location LOCATION]
gswarm model info <model_name>
gswarm model register <model_name> --type TYPE --source URL

# Model operations
gswarm model download <model_name> [--device DEVICE]
gswarm model move <model_name> --from SOURCE --to DEST [--keep-source]
gswarm model copy <model_name> --from SOURCE --to DEST
gswarm model delete <model_name> --device DEVICE

# Model serving
gswarm model serve <model_name> --device DEVICE [--port PORT]
gswarm model stop <model_name>
gswarm model services      # List all running services

Data Commands

# Data pool management
gswarm data list [--device DEVICE]
gswarm data create --source SOURCE --device DEVICE
gswarm data info <chunk_id>
gswarm data move <chunk_id> --to DEVICE
gswarm data transfer <chunk_id> --to NODE:DEVICE
gswarm data delete <chunk_id>

Queue Commands

# Task queue management
gswarm queue status
gswarm queue tasks [--status STATUS]
gswarm queue cancel <task_id>
gswarm queue history [--limit N]

API Reference

Model Management APIs

# List models
GET /api/v1/models

# Get model info
GET /api/v1/models/{model_name}

# Register model
POST /api/v1/models

# Download model
POST /api/v1/models/{model_name}/download

# Move model
POST /api/v1/models/{model_name}/move

# Start serving
POST /api/v1/services

# Get service status
GET /api/v1/services/{service_id}/status

Data Pool APIs

# List data chunks
GET /api/v1/data

# Create data chunk
POST /api/v1/data

# Get chunk info
GET /api/v1/data/{chunk_id}

# Move data
POST /api/v1/data/{chunk_id}/move

# Transfer data
POST /api/v1/data/{chunk_id}/transfer

Queue APIs

# Get queue status
GET /api/v1/queue

# Get task details
GET /api/v1/queue/tasks/{task_id}

# Cancel task
POST /api/v1/queue/tasks/{task_id}/cancel

# Get history
GET /api/v1/queue/history
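The endpoints above can be called from any HTTP client. The sketch below builds requests with Python's standard library only. It assumes the API is reachable on the Model API port (9010) of a local host; which port serves which endpoint is not stated here, so `base_url` is a parameter. Request bodies are not documented in this README, so the example sends none, and the task ID is made up.

```python
import json
from typing import Optional
from urllib.request import Request

# Assumption: the REST API is served on the Model API port (9010) of the host.
BASE_URL = "http://localhost:9010"


def api_request(method: str, path: str, body: Optional[dict] = None,
                base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a request for an /api/v1 endpoint."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(f"{base_url}/api/v1/{path.lstrip('/')}", data=data,
                   headers=headers, method=method)


list_models = api_request("GET", "models")
model_info = api_request("GET", "models/llama-7b")
# "task-42" is a hypothetical task ID used only for illustration.
cancel_task = api_request("POST", "queue/tasks/task-42/cancel")
```

Once a host is running, a request can be sent with `urllib.request.urlopen(list_models)`.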

Configuration

Config File Location

~/.gswarm/config.yaml

Example Configuration

cluster:
  host: "master.cluster.local"
  port: 8090
  
profiling:
  default_frequency: 1000
  enable_bandwidth: true
  enable_nvlink: false
  
models:
  storage_path: "/data/models"
  cache_size: "100GB"
  
queue:
  max_concurrent_tasks: 4
  task_timeout: 3600
  retry_count: 3
  
nodes:
  - name: "node1"
    address: "192.168.1.101"
    capabilities:
      gpus: ["gpu0", "gpu1"]
      storage:
        disk: 1000000000000
        dram: 64000000000
        
  - name: "node2"
    address: "192.168.1.102"
    capabilities:
      gpus: ["gpu0"]
      storage:
        disk: 500000000000
        dram: 32000000000
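The example mixes a human-readable size (`cache_size: "100GB"`) with raw byte counts (`disk`, `dram`). To compare them, a size string has to be converted to bytes. The helper below is a sketch, not gswarm's parser: `parse_size` and `fits_in_cache` are hypothetical names, and decimal (SI) units are assumed because the byte counts in the example are round decimal values.

```python
import re

# Assumption: decimal (SI) multipliers, matching the byte counts in the example
# (disk: 1000000000000 is 1 TB in decimal units).
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text: str) -> int:
    """Convert a size string such as "100GB" to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def fits_in_cache(model_bytes: int, cache_size: str) -> bool:
    """True if a model of model_bytes fits within the configured cache size."""
    return model_bytes <= parse_size(cache_size)
```

With these assumptions, `parse_size("100GB")` is one tenth of node1's `disk` value in the example.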

Example Workflows

Distributed Model Deployment

name: "distributed-deployment"
description: "Deploy model across multiple nodes"

actions:
  # Download model to primary node
  - action_id: "download"
    action_type: "download"
    model_name: "llama-7b"
    target_device: "node1:disk"
    
  # Replicate to other nodes
  - action_id: "replicate_node2"
    action_type: "copy"
    model_name: "llama-7b"
    source_device: "node1:disk"
    target_device: "node2:disk"
    dependencies: ["download"]
    
  # Load models to GPUs
  - action_id: "load_gpu_node1"
    action_type: "move"
    model_name: "llama-7b"
    source_device: "node1:disk"
    target_device: "node1:gpu0"
    dependencies: ["download"]
    
  - action_id: "load_gpu_node2"
    action_type: "move"
    model_name: "llama-7b"
    source_device: "node2:disk"
    target_device: "node2:gpu0"
    dependencies: ["replicate_node2"]
    
  # Start services
  - action_id: "serve_node1"
    action_type: "serve"
    model_name: "llama-7b"
    device: "node1:gpu0"
    port: 8080
    dependencies: ["load_gpu_node1"]
    
  - action_id: "serve_node2"
    action_type: "serve"
    model_name: "llama-7b"
    device: "node2:gpu0"
    port: 8081
    dependencies: ["load_gpu_node2"]
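The `dependencies` fields in this workflow form a graph, and actions with no unmet dependencies can run in parallel. The sketch below groups the actions into such batches. It only illustrates the ordering that the dependencies imply: `execution_batches` is a hypothetical helper, and it ignores the resource conflict detection that the task queue also performs.

```python
from typing import Dict, List


def execution_batches(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group actions into batches; each batch depends only on earlier ones."""
    remaining = {action: set(d) for action, d in deps.items()}
    done: set = set()
    batches: List[List[str]] = []
    while remaining:
        ready = sorted(a for a, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError(f"dependency cycle or unknown action among {sorted(remaining)}")
        batches.append(ready)
        done.update(ready)
        for action in ready:
            del remaining[action]
    return batches


# Dependencies copied from the "distributed-deployment" workflow.
DEPLOYMENT = {
    "download": [],
    "replicate_node2": ["download"],
    "load_gpu_node1": ["download"],
    "load_gpu_node2": ["replicate_node2"],
    "serve_node1": ["load_gpu_node1"],
    "serve_node2": ["load_gpu_node2"],
}

for step, batch in enumerate(execution_batches(DEPLOYMENT), start=1):
    print(step, batch)
```

For this workflow the result is four batches: the download, then replication to node2 alongside the GPU load on node1, then the GPU load on node2 alongside serving on node1, and finally serving on node2.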

Data Pipeline with Model Chaining

name: "ml-pipeline"
description: "Process data through multiple models"

actions:
  # Prepare input data
  - action_id: "load_data"
    action_type: "data_create"
    source: "s3://bucket/input"
    target_device: "node1:dram"
    
  # First model processing
  - action_id: "model1_process"
    action_type: "inference"
    model_name: "preprocessor"
    input_data: "${load_data.chunk_id}"
    output_device: "node1:dram"
    dependencies: ["load_data"]
    
  # Transfer intermediate data
  - action_id: "transfer_data"
    action_type: "data_transfer"
    data_id: "${model1_process.output}"
    target_device: "node2:dram"
    dependencies: ["model1_process"]
    
  # Second model processing
  - action_id: "model2_process"
    action_type: "inference"
    model_name: "classifier"
    input_data: "${transfer_data.chunk_id}"
    output_device: "node2:dram"
    dependencies: ["transfer_data"]
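Values such as `${load_data.chunk_id}` refer to the result of an earlier action. One way to picture the substitution is the sketch below. The `resolve` helper and the shape of the `results` dictionary are assumptions for illustration; the README does not describe how gswarm stores or resolves action outputs.

```python
import re
from typing import Dict

# Matches ${action_id.field}, the placeholder form used in the workflow file.
_PLACEHOLDER = re.compile(r"\$\{(\w+)\.(\w+)\}")


def resolve(value: str, results: Dict[str, Dict[str, str]]) -> str:
    """Replace each ${action_id.field} with that action's recorded result."""
    def lookup(match: "re.Match[str]") -> str:
        action_id, field_name = match.groups()
        return results[action_id][field_name]
    return _PLACEHOLDER.sub(lookup, value)


# Hypothetical results recorded after the "load_data" action finished.
results = {"load_data": {"chunk_id": "chunk-123"}}
print(resolve("${load_data.chunk_id}", results))
```

A reference to an action that has not produced a result raises `KeyError`, which is one reason each consumer lists its producer under `dependencies`.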

Monitoring and Troubleshooting

Health Checks

# System health
gswarm health

# Node-specific health
gswarm node status node1

# Service health
gswarm model service-health llama-7b

Logs

Logs are stored in ~/.gswarm/logs/:

  • host.log: Host node logs
  • client-<node>.log: Client node logs
  • profiler.log: Profiling session logs
  • model.log: Model operation logs

Common Issues

  1. Connection Issues

    • Check firewall rules for ports 8090-8091, 9010-9011
    • Verify network connectivity between nodes
    • Use --resilient flag for automatic reconnection
  2. Model Download Failures

    • Check internet connectivity
    • Verify HuggingFace token if needed
    • Check disk space on target device
  3. GPU Memory Issues

    • Monitor GPU memory with gswarm profiler
    • Use model quantization for large models
    • Distribute model across multiple GPUs
  4. Task Queue Blockage

    • Check task dependencies with gswarm queue tasks
    • Look for resource conflicts
    • Cancel stuck tasks with gswarm queue cancel
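For connection issues, a quick way to tell a firewall or routing problem from a gswarm problem is to test whether the ports accept TCP connections at all. The following sketch uses only the standard library; `check_ports` is a hypothetical helper, not a gswarm command.

```python
import socket
from typing import Dict, Iterable


def check_ports(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Return {port: True/False} for whether a TCP connection succeeds."""
    status = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[port] = True
        except OSError:
            # Covers refused connections, timeouts, and unreachable hosts.
            status[port] = False
    return status
```

Run from a client node, `check_ports("<host-ip>", [8090, 8091, 9010])` shows which of the default ports are reachable on the host.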

Migration from Legacy Components

If you're migrating from separate gswarm-profiler and gswarm-model:

  1. Backup existing data:

    cp -r ~/.gswarm_profiler_data ~/.gswarm_profiler_data.backup
    cp -r ~/.gswarm_model_data ~/.gswarm_model_data.backup
    
  2. Update CLI commands:

    • gsprof → gswarm profiler
    • gsmodel → gswarm model
  3. Update API endpoints:

    • Model APIs now use /api/v1/ prefix
    • Same ports are maintained for compatibility

See the Migration Guide for detailed instructions.

Development

Running Tests

# Run all tests
pytest

# Run specific test suite
pytest tests/test_profiler.py
pytest tests/test_model.py
pytest tests/test_queue.py

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

MIT License - see LICENSE file for details

Acknowledgments

  • Built on nvitop for GPU monitoring
  • Inspired by distributed computing frameworks
  • Thanks to all contributors

Roadmap

  • Kubernetes operator for cluster deployment
  • Web UI for cluster management
  • Advanced scheduling algorithms
  • Model optimization toolkit
  • Integration with popular ML frameworks
  • Multi-cloud support

For more information, see the documentation.
