gswarm
Distributed GPU cluster management system with profiling and model orchestration.
A comprehensive distributed GPU cluster management system combining profiling, model storage, and orchestration capabilities.
Overview
gswarm is an integrated platform for managing GPU clusters, providing:
- GPU Profiling: Multi-node GPU monitoring and performance analysis
- Model Management: Distributed model storage, deployment, and serving
- Data Pooling: Efficient data management across nodes
- Task Orchestration: Queue-based asynchronous task execution
The system uses a host-client architecture where a central host node coordinates operations across multiple client nodes, enabling unified management of your entire GPU infrastructure.
Key Features
Profiling Capabilities
- Monitor GPU utilization and memory usage across multiple machines
- Track PCIe bandwidth (GPU-DRAM) and NVLink (GPU-GPU) connections
- Configurable sampling frequency with JSON output
- Built on nvitop for accurate GPU metrics
- Fault tolerance with automatic reconnection
- Session recovery after crashes
Model Management
- Distributed model storage across disk, DRAM, and GPU memory
- Automatic model deployment and serving
- Cross-node model transfer and replication
- Support for multiple model frameworks (vLLM, Transformers, TGI)
- Real-time model status tracking
Data Pool System
- Distributed data chunk management
- Automatic data migration between devices
- Reference counting and garbage collection
- Transparent cross-node data access
- Support for model inputs/outputs chaining
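The reference-counting and garbage-collection behavior described above can be sketched in a few lines. This is an illustrative model only, not gswarm's actual implementation; the `DataPool` and `Chunk` names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    device: str
    refcount: int = 0


class DataPool:
    """Minimal reference-counted chunk registry (illustrative only)."""

    def __init__(self) -> None:
        self._chunks: dict[str, Chunk] = {}

    def create(self, chunk_id: str, device: str) -> Chunk:
        # A freshly created chunk starts with one reference (its creator)
        chunk = Chunk(chunk_id, device, refcount=1)
        self._chunks[chunk_id] = chunk
        return chunk

    def acquire(self, chunk_id: str) -> None:
        self._chunks[chunk_id].refcount += 1

    def release(self, chunk_id: str) -> None:
        chunk = self._chunks[chunk_id]
        chunk.refcount -= 1
        if chunk.refcount <= 0:
            # Garbage-collect chunks nobody references anymore
            del self._chunks[chunk_id]

    def live_chunks(self) -> list[str]:
        return sorted(self._chunks)
```

In a distributed setting the real system would also have to track remote references before collecting a chunk; this sketch only shows the local bookkeeping.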
Task Queue System
- Asynchronous task execution with priorities
- Dependency management and resource conflict detection
- Parallel execution of independent tasks
- Automatic retry with exponential backoff
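The retry behavior above can be sketched as a generic exponential-backoff helper. This is not gswarm's scheduler code; `retry_with_backoff` and its parameters are illustrative:

```python
import random
import time


def retry_with_backoff(task, max_retries=3, base_delay=1.0, max_delay=60.0):
    """Run task(), retrying failures with exponential backoff plus jitter.

    Raises the last exception once max_retries is exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Jitter avoids synchronized retry storms across many tasks
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

The jitter term matters in a cluster: without it, many failed tasks would retry in lockstep and hammer the same resource again.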
Installation
Prerequisites
- Python 3.8 or higher
- NVIDIA GPUs with installed drivers
- Network connectivity between cluster nodes
Installing gswarm
# Clone the repository
git clone https://github.com/Chivier/gswarm.git
cd gswarm
# Install the package
pip install .
Quick Start
1. Start the Host Node
# Start host with both profiling and model management
gswarm host start --port 8090 --http-port 8091 --model-port 9010
2. Connect Client Nodes
On each GPU machine:
# Connect client with resilient mode
gswarm client connect <host-ip>:8090 --resilient
3. Profile GPU Usage
# Start profiling
gswarm profiler start --name training_run
# Check status
gswarm profiler status
# Stop profiling
gswarm profiler stop --name training_run
4. Manage Models
# List available models
gswarm model list
# Download a model (on host node)
gswarm model download llama-7b --source huggingface --url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct --node node1 --type llm
# or use hf:// format
gswarm model download llama-7b --source hf://meta-llama/Llama-3.1-8B-Instruct --node node1 --type llm
# Download a model (on a client node; if --node is not specified, the model is downloaded locally)
gswarm model download llama-7b --source huggingface --url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct --type llm
# or use hf:// format
gswarm model download llama-7b --source hf://meta-llama/Llama-3.1-8B-Instruct --type llm
# Deploy model to GPU (on client node)
gswarm model move llama-7b --from disk --to gpu0
# When running on the host, the node ID must be specified
gswarm model move llama-7b --from disk --to gpu0 --node node1
# Start model serving (on client node)
# Each model type requires its own serving implementation, provided in model/instance/<type>.py
# The model type selects which inference method is used
gswarm model serve llama-7b --device gpu0 --port 8080
# When running on the host, the node ID must be specified
gswarm model serve llama-7b --device gpu0 --port 8080 --node node1
# Check model status
gswarm model status llama-7b
gswarm model status llama-7b --node node1
5. Manage Data
# Create data chunk
gswarm data create --source s3://bucket/data --device dram
# List data chunks
gswarm data list
# Transfer data to another node
gswarm data transfer chunk-123 --to node2:dram
Architecture
System Components
- Host Node: Central coordinator
  - Model registry management
  - Task orchestration
  - Global resource tracking
  - API gateway
- Client Nodes: Worker nodes
  - Local model storage
  - Model serving
  - GPU profiling
  - Task execution
  - Data pool management
- Communication:
  - gRPC for high-performance metric streaming
  - HTTP REST API for control and management
  - WebSocket for real-time updates
Port Configuration
Default ports used by gswarm:
- gRPC Server: 8090 (profiling metrics)
- HTTP API: 8091 (control panel)
- Model API: 9010 (model management)
- Model Services: 8080+ (dynamic allocation)
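To verify that a host node is reachable on these ports, a plain TCP check is enough. The sketch below uses only the Python standard library and is not part of gswarm itself; it assumes the default ports listed above:

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Default gswarm ports on the host node
    for name, port in [("gRPC", 8090), ("HTTP API", 8091), ("Model API", 9010)]:
        state = "open" if port_open("localhost", port) else "closed"
        print(f"{name:10s} {port}: {state}")
```

If a port shows as closed, check the firewall rules mentioned in the troubleshooting section before debugging gswarm itself.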
CLI Reference
Host Commands
# Host management
gswarm host start [--port PORT] [--http-port HTTP_PORT] [--model-port MODEL_PORT]
gswarm host stop
gswarm host status
# System overview
gswarm status # Overall system status
gswarm nodes # List all nodes
gswarm health # Health check
Profiler Commands
# Profiling operations
gswarm profiler start [--name NAME] [--freq FREQ]
gswarm profiler stop [--name NAME]
gswarm profiler status
gswarm profiler sessions # List all sessions
gswarm profiler recover # Recover crashed sessions
# Analysis
gswarm profiler analyze --data <file.json> --plot <output.pdf>
Model Commands
# Model management
gswarm model list [--location LOCATION]
gswarm model info <model_name>
gswarm model register <model_name> --type TYPE --source URL
# Model operations
gswarm model download <model_name> [--device DEVICE]
gswarm model move <model_name> --from SOURCE --to DEST [--keep-source]
gswarm model copy <model_name> --from SOURCE --to DEST
gswarm model delete <model_name> --device DEVICE
# Model serving
gswarm model serve <model_name> --device DEVICE [--port PORT]
gswarm model stop <model_name>
gswarm model services # List all running services
Data Commands
# Data pool management
gswarm data list [--device DEVICE]
gswarm data create --source SOURCE --device DEVICE
gswarm data info <chunk_id>
gswarm data move <chunk_id> --to DEVICE
gswarm data transfer <chunk_id> --to NODE:DEVICE
gswarm data delete <chunk_id>
Queue Commands
# Task queue management
gswarm queue status
gswarm queue tasks [--status STATUS]
gswarm queue cancel <task_id>
gswarm queue history [--limit N]
API Reference
Model Management APIs
# List models
GET /api/v1/models
# Get model info
GET /api/v1/models/{model_name}
# Register model
POST /api/v1/models
# Download model
POST /api/v1/models/{model_name}/download
# Move model
POST /api/v1/models/{model_name}/move
# Start serving
POST /api/v1/services
# Get service status
GET /api/v1/services/{service_id}/status
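A minimal client for these endpoints might look like the following. The request payload fields are assumptions inferred from the CLI flags above, not a documented schema; check the server's actual API before relying on them:

```python
import json
from urllib import request


class ModelAPI:
    """Tiny client sketch for the model-management REST endpoints."""

    def __init__(self, host: str = "localhost", port: int = 9010):
        self.base = f"http://{host}:{port}/api/v1"

    def _url(self, path: str) -> str:
        return f"{self.base}/{path.lstrip('/')}"

    def list_models(self) -> dict:
        # GET /api/v1/models
        with request.urlopen(self._url("models")) as resp:
            return json.load(resp)

    def download(self, model_name: str, source: str) -> dict:
        # POST /api/v1/models/{model_name}/download
        # The "source" field mirrors the CLI's --source flag (assumed, not documented)
        body = json.dumps({"source": source}).encode()
        req = request.Request(
            self._url(f"models/{model_name}/download"),
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req) as resp:
            return json.load(resp)
```

Usage would be `ModelAPI("master.cluster.local").download("llama-7b", "hf://meta-llama/Llama-3.1-8B-Instruct")`, matching the CLI examples earlier.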
Data Pool APIs
# List data chunks
GET /api/v1/data
# Create data chunk
POST /api/v1/data
# Get chunk info
GET /api/v1/data/{chunk_id}
# Move data
POST /api/v1/data/{chunk_id}/move
# Transfer data
POST /api/v1/data/{chunk_id}/transfer
Queue APIs
# Get queue status
GET /api/v1/queue
# Get task details
GET /api/v1/queue/tasks/{task_id}
# Cancel task
POST /api/v1/queue/tasks/{task_id}/cancel
# Get history
GET /api/v1/queue/history
Configuration
Config File Location
~/.gswarm/config.yaml
Example Configuration
cluster:
  host: "master.cluster.local"
  port: 8090

profiling:
  default_frequency: 1000
  enable_bandwidth: true
  enable_nvlink: false

models:
  storage_path: "/data/models"
  cache_size: "100GB"

queue:
  max_concurrent_tasks: 4
  task_timeout: 3600
  retry_count: 3

nodes:
  - name: "node1"
    address: "192.168.1.101"
    capabilities:
      gpus: ["gpu0", "gpu1"]
      storage:
        disk: 1000000000000  # bytes (1 TB)
        dram: 64000000000    # bytes (64 GB)
  - name: "node2"
    address: "192.168.1.102"
    capabilities:
      gpus: ["gpu0"]
      storage:
        disk: 500000000000   # bytes (500 GB)
        dram: 32000000000    # bytes (32 GB)
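A config file in this shape can be loaded and sanity-checked with PyYAML. The validation rules below are illustrative, derived only from the example's structure, and are not gswarm's actual config loader:

```python
import yaml  # pip install pyyaml

# Top-level sections present in the example config above
REQUIRED_SECTIONS = {"cluster", "profiling", "models", "queue", "nodes"}


def load_config(text: str) -> dict:
    """Parse a gswarm-style config and run a few structural sanity checks."""
    cfg = yaml.safe_load(text)
    missing = REQUIRED_SECTIONS - cfg.keys()
    if missing:
        raise ValueError(f"missing config sections: {sorted(missing)}")
    for node in cfg["nodes"]:
        if not {"name", "address"} <= node.keys():
            raise ValueError(f"node entry incomplete: {node}")
    return cfg
```

In practice you would read the text from `~/.gswarm/config.yaml` with `pathlib.Path.read_text()` before passing it in.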
Example Workflows
Distributed Model Deployment
name: "distributed-deployment"
description: "Deploy model across multiple nodes"
actions:
  # Download model to primary node
  - action_id: "download"
    action_type: "download"
    model_name: "llama-7b"
    target_device: "node1:disk"

  # Replicate to other nodes
  - action_id: "replicate_node2"
    action_type: "copy"
    model_name: "llama-7b"
    source_device: "node1:disk"
    target_device: "node2:disk"
    dependencies: ["download"]

  # Load models to GPUs
  - action_id: "load_gpu_node1"
    action_type: "move"
    model_name: "llama-7b"
    source_device: "node1:disk"
    target_device: "node1:gpu0"
    dependencies: ["download"]

  - action_id: "load_gpu_node2"
    action_type: "move"
    model_name: "llama-7b"
    source_device: "node2:disk"
    target_device: "node2:gpu0"
    dependencies: ["replicate_node2"]

  # Start services
  - action_id: "serve_node1"
    action_type: "serve"
    model_name: "llama-7b"
    device: "node1:gpu0"
    port: 8080
    dependencies: ["load_gpu_node1"]

  - action_id: "serve_node2"
    action_type: "serve"
    model_name: "llama-7b"
    device: "node2:gpu0"
    port: 8081
    dependencies: ["load_gpu_node2"]
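The `dependencies` fields in this workflow define a directed acyclic graph, so actions with no unmet dependencies can run in parallel. The batching logic can be sketched with the standard library's `graphlib`; the action graph below mirrors the workflow above, while `execution_batches` itself is illustrative, not gswarm's scheduler:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# action_id -> list of action_ids it depends on (from the workflow above)
actions = {
    "download": [],
    "replicate_node2": ["download"],
    "load_gpu_node1": ["download"],
    "load_gpu_node2": ["replicate_node2"],
    "serve_node1": ["load_gpu_node1"],
    "serve_node2": ["load_gpu_node2"],
}


def execution_batches(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group actions into batches; every action in a batch can run in parallel."""
    ts = TopologicalSorter(deps)
    ts.prepare()  # also raises CycleError on circular dependencies
    batches = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        batches.append(ready)
        ts.done(*ready)
    return batches
```

For this workflow the first batch is just the download, after which the replication and the node1 GPU load can proceed concurrently.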
Data Pipeline with Model Chaining
name: "ml-pipeline"
description: "Process data through multiple models"
actions:
  # Prepare input data
  - action_id: "load_data"
    action_type: "data_create"
    source: "s3://bucket/input"
    target_device: "node1:dram"

  # First model processing
  - action_id: "model1_process"
    action_type: "inference"
    model_name: "preprocessor"
    input_data: "${load_data.chunk_id}"
    output_device: "node1:dram"
    dependencies: ["load_data"]

  # Transfer intermediate data
  - action_id: "transfer_data"
    action_type: "data_transfer"
    data_id: "${model1_process.output}"
    target_device: "node2:dram"
    dependencies: ["model1_process"]

  # Second model processing
  - action_id: "model2_process"
    action_type: "inference"
    model_name: "classifier"
    input_data: "${transfer_data.chunk_id}"
    output_device: "node2:dram"
    dependencies: ["transfer_data"]
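The `${action_id.field}` references in this pipeline suggest a simple placeholder-substitution mechanism: once an action completes, its outputs become available to downstream actions by name. A sketch of how such references could be resolved (the resolver is hypothetical, not gswarm's actual code):

```python
import re

# Matches references like ${load_data.chunk_id}
PLACEHOLDER = re.compile(r"\$\{(\w+)\.(\w+)\}")


def resolve(value: str, results: dict[str, dict[str, str]]) -> str:
    """Replace ${action_id.field} references with values from completed actions.

    `results` maps each completed action_id to its output fields.
    Raises KeyError if a reference points at an action or field that
    has not produced a result yet.
    """
    def substitute(match: re.Match) -> str:
        action_id, field = match.group(1), match.group(2)
        return results[action_id][field]

    return PLACEHOLDER.sub(substitute, value)
```

Combined with dependency ordering, this guarantees a placeholder is only resolved after the action that produces it has finished.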
Monitoring and Troubleshooting
Health Checks
# System health
gswarm health
# Node-specific health
gswarm node status node1
# Service health
gswarm model service-health llama-7b
Logs
Logs are stored in ~/.gswarm/logs/:
- host.log: Host node logs
- client-<node>.log: Client node logs
- profiler.log: Profiling session logs
- model.log: Model operation logs
Common Issues
- Connection Issues
  - Check firewall rules for ports 8090-8091, 9010-9011
  - Verify network connectivity between nodes
  - Use the --resilient flag for automatic reconnection
- Model Download Failures
  - Check internet connectivity
  - Verify your HuggingFace token if needed
  - Check disk space on the target device
- GPU Memory Issues
  - Monitor GPU memory with gswarm profiler
  - Use model quantization for large models
  - Distribute the model across multiple GPUs
- Task Queue Blockage
  - Check task dependencies with gswarm queue tasks
  - Look for resource conflicts
  - Cancel stuck tasks with gswarm queue cancel
Migration from Legacy Components
If you're migrating from separate gswarm-profiler and gswarm-model:
1. Backup existing data:
   cp -r ~/.gswarm_profiler_data ~/.gswarm_profiler_data.backup
   cp -r ~/.gswarm_model_data ~/.gswarm_model_data.backup
2. Update CLI commands:
   - gsprof → gswarm profiler
   - gsmodel → gswarm model
3. Update API endpoints:
   - Model APIs now use the /api/v1/ prefix
   - The same ports are maintained for compatibility
See the Migration Guide for detailed instructions.
Development
Running Tests
# Run all tests
pytest
# Run specific test suite
pytest tests/test_profiler.py
pytest tests/test_model.py
pytest tests/test_queue.py
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details
Acknowledgments
- Built on nvitop for GPU monitoring
- Inspired by distributed computing frameworks
- Thanks to all contributors
Roadmap
- Kubernetes operator for cluster deployment
- Web UI for cluster management
- Advanced scheduling algorithms
- Model optimization toolkit
- Integration with popular ML frameworks
- Multi-cloud support
For more information, see the documentation.
Project details
File details
Details for the file gswarm-0.4.0.tar.gz.
File metadata
- Download URL: gswarm-0.4.0.tar.gz
- Size: 413.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c9b7f6de408e0e49f2f66efb52ee425b254f5f353bb268c0a8bef0ee7c36ae51 |
| MD5 | ef525f6bba2763b9ab571b33dadb8c21 |
| BLAKE2b-256 | a952c9c4fd604c5c9bbcd625acec838de9a686cf8ed16d80c3523479dc47e83e |
Provenance
The following attestation bundles were made for gswarm-0.4.0.tar.gz:
Publisher: publish-to-pypi.yml on Chivier/gswarm
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gswarm-0.4.0.tar.gz
- Subject digest: c9b7f6de408e0e49f2f66efb52ee425b254f5f353bb268c0a8bef0ee7c36ae51
- Sigstore transparency entry: 241031619
- Permalink: Chivier/gswarm@d54332b9bfa19bacf9fe7777fd7747df1ddddd17
- Branch / Tag: refs/tags/v0.4.0-alpha
- Owner: https://github.com/Chivier
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@d54332b9bfa19bacf9fe7777fd7747df1ddddd17
- Trigger Event: release
File details
Details for the file gswarm-0.4.0-py3-none-any.whl.
File metadata
- Download URL: gswarm-0.4.0-py3-none-any.whl
- Size: 108.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7b1fdb72408092e407a2ec7b1a094dec8fa35ea642536ab79e1abf570c9ed8a3 |
| MD5 | 5d6d09dd287e6ad3f80e9b8c49899a96 |
| BLAKE2b-256 | e5eee22a66115d774c9e442fa2406de23d566bef622c903e56206b3bbe911b0c |
Provenance
The following attestation bundles were made for gswarm-0.4.0-py3-none-any.whl:
Publisher: publish-to-pypi.yml on Chivier/gswarm
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gswarm-0.4.0-py3-none-any.whl
- Subject digest: 7b1fdb72408092e407a2ec7b1a094dec8fa35ea642536ab79e1abf570c9ed8a3
- Sigstore transparency entry: 241031620
- Permalink: Chivier/gswarm@d54332b9bfa19bacf9fe7777fd7747df1ddddd17
- Branch / Tag: refs/tags/v0.4.0-alpha
- Owner: https://github.com/Chivier
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@d54332b9bfa19bacf9fe7777fd7747df1ddddd17
- Trigger Event: release