Decentralized P2P orchestration layer for fine-tuning ML models across commodity hardware
Overview
Swarm-Tune lets a group of people pool their gaming GPUs over the internet to collaboratively fine-tune a large language model — with no data center, no cloud bill, and no central authority.
Each participant runs one node. Every node holds a shard of the model and a shard of the dataset, trains locally, extracts raw gradients, and broadcasts them to peers over libp2p. The swarm runs Federated Averaging every round, so all nodes stay in sync.
20 participants × RTX 3090 (24 GB VRAM) = 480 GB pooled VRAM
LLaMA 3 70B requires ~140 GB in bf16 (70B params × 2 bytes) → fits in the swarm, not on any single machine
The system is fully decentralized — no master node, no tracker, no coordinator. Nodes discover each other via Kademlia DHT, tolerate stragglers and failures with timeout-based partial aggregation, and defend against adversarial participants with gradient validation and Sybil resistance. Competing swarms can race on the same model and dataset; winner is determined by perplexity, publicly verifiable by anyone.
Quick Start
Join an existing training run (requires Docker):
# Step 1 — Generate your .env and startup command
python scripts/join.py --run-id gpt2-wikitrain-001 --node-index <N>
# Step 2 — Start your node
docker run --rm --env-file my.env \
-p 9000:9000 \
-v ./checkpoints:/app/checkpoints \
yashasviudayan/swarm-tune:latest
Run the 6-node local simulation (no internet required):
git clone https://github.com/yashasviudayan-py/Swarm-Tune
cd Swarm-Tune
make sim-up # starts 5 honest + 1 adversarial node in Docker
make sim-logs # stream structured JSON logs
make sim-kill-node NODE=swarm_node_2 # chaos: kill a node mid-training
Install the Python package:
pip install swarm-tune
swarm-tune --help
How It Works
Each training round (see the condensed code sketch after this list):
Each node independently:
1. Sample a mini-batch from its local data shard
2. Forward pass → compute loss → loss.backward()
3. Extract param.grad tensors, validate (NaN/Inf/norm bounds)
4. Compress → serialize (SWRM wire format) → chunk into ≤60 KB frames
5. Broadcast chunks over libp2p FloodSub
Simultaneously, receive from peers:
6. Reassemble chunks → deserialize (weights_only=True) → decompress
7. Validate each peer gradient (reject NaN/Inf/outliers/wrong shape)
8. Submit to TimeoutAggregator (hard 30s window)
After timeout or quorum reached:
9. Weighted FedAvg (weighted by dataset_size per peer)
10. Apply averaged gradients → optimizer.step()
Straggler handling:
- ≥ min_peers respond → commit round
- < min_peers respond → fall back to local gradient, no round wasted
- Dead nodes evicted via heartbeat after 60s, welcomed back on rejoin
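A condensed sketch of the loop above. The gossip and aggregator objects, their method names, and the validation threshold are illustrative stand-ins for the components described in this README, not the actual swarm_tune API:

```python
import torch

def is_valid(grads: dict[str, torch.Tensor], max_rms: float = 1e3) -> bool:
    # Steps 3 and 7: reject NaN/Inf and implausibly large RMS norms.
    return all(
        torch.isfinite(g).all() and g.pow(2).mean().sqrt() <= max_rms
        for g in grads.values()
    )

def training_round(model, optimizer, batch, gossip, aggregator, dataset_size):
    # Steps 1-2: forward pass on the local shard, compute loss, backprop.
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()

    # Step 3: pull raw gradients out of param.grad and validate them.
    local = {n: p.grad.detach().clone()
             for n, p in model.named_parameters() if p.grad is not None}
    if not is_valid(local):
        raise RuntimeError("local gradients failed NaN/Inf/RMS validation")

    # Steps 4-5: compress -> serialize -> chunk -> broadcast over FloodSub.
    gossip.broadcast(local)

    # Steps 6-8: peer gradients arrive asynchronously; the aggregator
    # validates each one and returns at quorum or after the 30 s window.
    peers = aggregator.collect(timeout=30.0)  # list of (grads, dataset_size)

    # Step 9: weighted FedAvg, g = sum_i(w_i * g_i) / sum_i(w_i), with
    # w_i = peer i's dataset size; fall back to the purely local gradient
    # when too few peers responded (straggler handling).
    contributions = [(local, dataset_size), *peers] if peers else [(local, dataset_size)]
    total = sum(w for _, w in contributions)
    averaged = {n: sum(w * g[n] for g, w in contributions) / total for n in local}

    # Step 10: write the averaged gradients back and step the optimizer.
    for n, p in model.named_parameters():
        if n in averaged:
            p.grad = averaged[n]
    optimizer.step()
    return loss.item()
```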
Why Not PyTorch DDP?
DistributedDataParallel assumes microsecond data-center latency, homogeneous always-on nodes, and NCCL/Gloo transport. The internet is none of those things. This project builds the gradient exchange layer from scratch: custom wire protocol, chunked framing (Noise protocol 65 KB frame limit), timeout aggregation, and federated averaging — all of which DDP abstracts away, but which you need to understand to do distributed training in the wild.
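Because Noise caps frames at 65 KB, serialized gradients must be split into ≤60 KB chunks and reassembled on the receiving side. A minimal sketch of that framing, assuming a hypothetical 8-byte header (message id, sequence number, chunk count); the actual SWRM wire format lives in trainer/serializer.py and p2p/gossip.py and may differ:

```python
CHUNK = 60 * 1024  # stay safely under the Noise protocol's 65 KB frame limit

def chunk(payload: bytes, msg_id: int) -> list[bytes]:
    # Header layout (illustrative): 4-byte message id, 2-byte sequence
    # number, 2-byte total chunk count, then the payload slice.
    total = max(1, -(-len(payload) // CHUNK))  # ceil division
    return [
        msg_id.to_bytes(4, "big") + seq.to_bytes(2, "big")
        + total.to_bytes(2, "big") + payload[seq * CHUNK:(seq + 1) * CHUNK]
        for seq in range(total)
    ]

def reassemble(frames: list[bytes]) -> bytes:
    # Sort by sequence number, then strip the 8-byte header from each frame.
    frames = sorted(frames, key=lambda f: int.from_bytes(f[4:6], "big"))
    return b"".join(f[8:] for f in frames)
```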
Features
| Capability | Detail |
|---|---|
| Any HuggingFace model | SWARM_MODEL_NAME=gpt2 or llama3 — config-only, no code change |
| Any HuggingFace dataset | Deterministic sharding via dataset.shard() (sketch after this table) |
| NAT traversal | libp2p circuit-relay + dcutr hole-punching for home internet nodes |
| Straggler tolerance | Timeout-based partial aggregation — one dead node never blocks the swarm |
| Gradient poisoning defense | NaN/Inf/RMS-norm validation before any peer gradient reaches the averager |
| Sybil resistance | Subnet contribution cap (/24) + per-peer rejection rate ban |
| Transparent chunking | Noise protocol 65 KB frame limit handled automatically |
| Live dashboard | Static HTML, no build step — force-directed peer graph, loss curves, bytes throughput |
| Zero-config join | python scripts/join.py --run-id X --node-index N generates your .env |
| Competition mode | Two swarms race on the same model/dataset; winner by perplexity, publicly verifiable |
| Relay node | SWARM_RELAY_MODE=true — run a bootstrap VPS with no training, stable peer ID |
| Checkpoint tools | Reconstruct full model from shards; publish to HuggingFace Hub with model card |
| 110 tests | Unit, integration, chaos (fault injection, adversarial rejection, node drop/rejoin) |
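The deterministic-sharding row above uses the real HuggingFace datasets API; here is a minimal sketch with illustrative run parameters (the actual shard count and index come from the run manifest):

```python
from datasets import load_dataset

NUM_NODES, NODE_INDEX = 4, 2  # illustrative values; supplied by join.py in practice

ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
# Every node computes the same partition independently; contiguous=True
# keeps the slices stable, so node 2 always sees the same rows.
shard = ds.shard(num_shards=NUM_NODES, index=NODE_INDEX, contiguous=True)
```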
Architecture
src/swarm_tune/
├── config/settings.py NodeSettings — all SWARM_ env vars, pydantic validators
├── runs/manifest.py RunManifest — training campaign definition, .env generator
└── node/
├── main.py SwarmNode training loop + RelayNode + CLI entrypoint
├── metrics.py MetricsStore + anyio TCP /metrics sidecar
├── p2p/
│ ├── discovery.py libp2p host, Ed25519 keys, mDNS, Kademlia DHT, relay dialing
│ ├── gossip.py GossipProtocol — FloodSub, chunked framing, eviction loop
│ ├── heartbeat.py Liveness signals + stale peer eviction
│ └── peer_selector.py AllPeersSelector + BanList (rejection rate tracking)
├── trainer/
│ ├── model.py ModelShard — HF AutoModelForCausalLM or toy MLP + layer sharding
│ ├── data.py DataShardLoader (.pt) + HFDataShardLoader (HF datasets)
│ ├── gradient.py GradientExtractor — extract + validate param.grad tensors
│ ├── serializer.py GradientSerializer — SWRM wire format, weights_only=True
│ └── compressor.py Compressor protocol: Identity (now) → TopK (bandwidth scale-up)
└── aggregator/
├── averaging.py GradientAverager — weighted FedAvg + Sybil subnet cap
├── timeout.py TimeoutAggregator — partial aggregation + rate limiting
└── strategy.py AggregationStrategy: Flat (now) → Hierarchical (100+ nodes)
runs/
├── gpt2-wikitrain-001.json 4-node data-parallel GPT-2/WikiText-103 run
├── gpt2-competition-001.json 50-round competition manifest
└── gpt2-competition-2v2.json 2-node-per-team competition manifest
scripts/
├── join.py Zero-config participant onboarding
├── run_competition.py Orchestrate two-team competition, write JSON result
├── set_bootstrap.py Bake relay VPS multiaddr into all manifests
├── reconstruct_checkpoint.py Merge/average shard checkpoints → full model
├── publish_checkpoint.py Push checkpoint + model card to HuggingFace Hub
├── benchmark.py Perplexity evaluation on WikiText-103 test split
└── generate_shards.py Generate synthetic .pt shards for local simulation
docker/
├── Dockerfile Multi-stage build, non-root user, dynamic health check
├── docker-compose.yml 6-node simulation (5 honest + 1 adversarial)
└── docker-compose.relay.yml Single relay/bootstrap node for VPS deployment
dashboard/index.html Static vanilla-JS dashboard — no build step
Extensibility Abstractions
Three Protocol interfaces designed for zero-friction scale-up. The training loop never changes — only the implementation behind the protocol (a sketch of the Compressor protocol follows the table).
| Protocol | Now (≤30 nodes) | Scale-up (100+ nodes) | How to swap |
|---|---|---|---|
| Compressor | IdentityCompressor (no-op) | TopKCompressor (~50× bandwidth reduction at 1%) | SWARM_COMPRESSION=topk |
| PeerSelector | AllPeersSelector + BanList | ClusterPeerSelector | SWARM_AGGREGATION_STRATEGY=hierarchical |
| AggregationStrategy | FlatAggregation | HierarchicalAggregation | SWARM_AGGREGATION_STRATEGY=hierarchical |
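For orientation, a minimal sketch of what the Compressor protocol and the Top-K swap could look like. The class names mirror the table, but the method signatures and dict-of-tensors representation are assumptions, not the actual swarm_tune interfaces:

```python
from typing import Protocol
import torch

class Compressor(Protocol):
    def compress(self, grads: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]: ...
    def decompress(self, grads: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]: ...

class IdentityCompressor:
    # No-op: gradients pass through untouched (fine for small swarms).
    def compress(self, grads): return grads
    def decompress(self, grads): return grads

class TopKCompressor:
    """Keep the top k fraction of entries by magnitude, zero the rest.
    A real implementation would transmit only the surviving indices and
    values (~2% of the payload at k=0.01, hence the ~50x figure)."""
    def __init__(self, k: float = 0.01):
        self.k = k

    def compress(self, grads):
        out = {}
        for name, g in grads.items():
            flat = g.flatten()
            keep = max(1, int(flat.numel() * self.k))
            idx = flat.abs().topk(keep).indices
            sparse = torch.zeros_like(flat)
            sparse[idx] = flat[idx]
            out[name] = sparse.view_as(g)
        return out

    def decompress(self, grads):
        return grads  # dense-with-zeros sketch; nothing to undo here
```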
Public Deployment
Hosting a relay/bootstrap node (VPS — $5/month)
A relay node gives participants a stable public multiaddr to connect to. It runs P2P only — no model loading, no training.
# On your VPS
echo "SWARM_NODE_KEY_SEED=your_long_secret_here" > relay.env
make relay-up # docker compose -f docker/docker-compose.relay.yml up -d
make relay-logs | grep Multiaddr # copy /ip4/<vps-ip>/tcp/9000/p2p/12D3KooW...
Once you have the multiaddr, bake it into all run manifests so join.py works with zero extra flags:
make set-bootstrap PEER="/ip4/<vps-ip>/tcp/9000/p2p/12D3KooW..."
git commit -am "chore: set bootstrap peer" && git push
After that, any participant worldwide can join with a single command — no --bootstrap-peer flag needed.
Participant onboarding (their machine)
git clone https://github.com/yashasviudayan-py/Swarm-Tune
python scripts/join.py --run-id gpt2-wikitrain-001 --node-index 2 --device cuda
# → writes my.env
# → prints: docker run --env-file my.env -p 9000:9000 yashasviudayan/swarm-tune:latest
Running a competition
make competition \
COMPETITION_ID=gpt2-comp-001 \
TEAM_A_ID=team-alpha TEAM_A_CHECKPOINT=ckpts/alpha.pt \
TEAM_B_ID=team-beta TEAM_B_CHECKPOINT=ckpts/beta.pt
Results are written to results/competition_result.json. Anyone can independently verify by running make benchmark CHECKPOINT=<downloaded-checkpoint> against a published HuggingFace Hub checkpoint.
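Perplexity is the exponentiated mean per-token negative log-likelihood, so anyone holding the checkpoint can recompute it. A minimal evaluation sketch using the standard formula (not necessarily how benchmark.py computes it; the HF causal-LM loss handles the label shift internally):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, batches) -> float:
    # ppl = exp( total NLL / total tokens ) over the evaluation split.
    total_nll, total_tokens = 0.0, 0
    for batch in batches:  # batches of tokenized text, e.g. WikiText-103 test
        out = model(**batch, labels=batch["input_ids"])
        n = batch["input_ids"].numel()
        total_nll += out.loss.item() * n  # out.loss is mean NLL per token
        total_tokens += n
    return math.exp(total_nll / total_tokens)
```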
Development
Prerequisites: Python 3.12+, Docker, brew install gmp (macOS) or apt install libgmp-dev (Linux)
git clone https://github.com/yashasviudayan-py/Swarm-Tune
cd Swarm-Tune
make bootstrap # venv + deps + pre-commit hooks
make check # ruff lint + format check + mypy --strict
make test # 110 unit + integration tests
make test-chaos # 10 fault injection tests (node drop, adversarial, etc.)
make sim-up # 6-node Docker simulation
All Makefile targets
# Code quality
make check # lint + format + types
make format # auto-format with ruff
# Testing
make test # unit + integration (fast)
make test-chaos # fault injection (slow, real timeouts)
make test-all # everything
make coverage # HTML coverage report → htmlcov/
# Simulation
make sim-up # 6-node Docker swarm
make sim-down # stop and remove containers
make sim-logs # tail all container logs
make sim-kill-node NODE=swarm_node_2
# Training runs
make join RUN_ID=gpt2-wikitrain-001 NODE_INDEX=0
make reconstruct CHECKPOINT_DIR=checkpoints/ MODEL=gpt2
make publish CHECKPOINT=checkpoints/full_model.pt REPO_ID=user/model
make benchmark CHECKPOINT=checkpoints/node_0_final.pt
# Competition
make competition COMPETITION_ID=... TEAM_A_ID=... TEAM_A_CHECKPOINT=... \
TEAM_B_ID=... TEAM_B_CHECKPOINT=...
# Public deployment
make relay-up # start relay node (requires relay.env)
make relay-logs # tail relay logs, copy the multiaddr
make set-bootstrap PEER="/ip4/.../tcp/9000/p2p/12D3KooW..."
Project Status
All 8 phases complete. Production-audited. v1.0.0 released.
| Phase | Description | Status |
|---|---|---|
| 1 | P2P Network — libp2p, Ed25519, mDNS, Kademlia, heartbeat, peer eviction | ✅ |
| 2 | Gradient Extraction — SWRM protocol, weights_only=True, FedAvg | ✅ |
| 3 | Gradient Sync — FloodSub, chunked framing, TimeoutAggregator | ✅ |
| 4 | Docker Simulation — 6-node sim, chaos tests, adversarial rejection | ✅ |
| 5 | Internet Deployment — HF models/datasets, NAT traversal, Sybil resistance, /metrics | ✅ |
| 6 | Live Dashboard — force-directed graph, persistent loss curves, bytes tracking | ✅ |
| 7 | Distribution — RunManifest, join.py, checkpoint reconstruction, HF Hub publish | ✅ |
| 8 | Competition — run_competition.py, 2v2 manifest, make competition, relay node | ✅ |
Test coverage: 110 tests (unit + integration + chaos). mypy --strict clean. ruff clean.
Docker Hub: yashasviudayan/swarm-tune — latest and 1.0.0 tags available for linux/amd64 and linux/arm64.
Security Properties
| Property | Mechanism |
|---|---|
| Cryptographic peer identity | Ed25519 key pairs via libp2p — spoofing requires breaking the key |
| No pickle deserialization | torch.load(..., weights_only=True) on all peer data |
| Wire format validation | SWRM magic bytes checked before any deserialization |
| Gradient poisoning defense | NaN/Inf/RMS-norm bounds enforced; poisoned gradients rejected before the averager |
| Sybil resistance | /24 subnet contribution cap in FedAvg; configurable prefix |
| Reputation system | Per-peer rejection rate tracking; temporary ban on threshold exceeded |
| Rate limiting | One gradient submission per peer per round |
| Path traversal prevention | node_id sanitized via regex; checkpoint_dir rejects system paths at startup |
| Atomic checkpoints | .tmp + os.replace() — crash cannot corrupt an existing checkpoint (sketch after this table) |
| SIGTERM-safe shutdown | Final checkpoint wrapped in CancelScope(shield=True) |
| Bootstrap dial timeout | 10s per-peer via anyio.move_on_after() — unresponsive relay can't block startup |
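The atomic-checkpoint row above is a standard pattern worth spelling out; a minimal sketch (the function name is illustrative):

```python
import os
import torch

def save_checkpoint_atomic(state: dict, path: str) -> None:
    tmp = path + ".tmp"
    torch.save(state, tmp)   # a crash here leaves the original untouched
    os.replace(tmp, path)    # atomic rename: readers see old or new, never partial
```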
Tech Stack
| Layer | Technology |
|---|---|
| Networking | libp2p 0.6.0 — Ed25519 peer IDs, FloodSub, mDNS, Kademlia DHT |
| Deep learning | PyTorch ≥2.3 — direct param.grad access, MPS on Apple Silicon |
| Models | HuggingFace transformers ≥4.40 — any AutoModelForCausalLM |
| Datasets | HuggingFace datasets ≥2.20 — deterministic sharding, streaming |
| Async runtime | anyio + trio — libp2p requires trio; anyio keeps the rest backend-agnostic |
| Config | pydantic-settings — fail loudly at startup, not at runtime |
| Logging | structlog — JSON in Docker, human-readable console locally |
| Orchestration | Docker + docker-compose — reproducible multi-node simulation |
| Language | Python 3.12 — mypy --strict throughout |
Contributing
Read CLAUDE.md before contributing — it is the source of truth for every architecture and security decision in this codebase.
Three rules that are non-negotiable:
- No central server. Nodes coordinate only via libp2p gossip.
- No standard DDP. Gradients are extracted from param.grad and exchanged manually.
- Never pickle.loads() peer data. Always weights_only=True.
License
MIT. See LICENSE.