
Decentralized P2P orchestration layer for fine-tuning ML models across commodity hardware


Swarm-Tune

Badges: CI, Chaos Tests, Docker Hub, v1.0.0, Python 3.12+, PyTorch, libp2p, 110 tests, mypy strict, ruff, MIT license


Overview

Swarm-Tune lets a group of people pool their gaming GPUs over the internet to collaboratively fine-tune a large language model — with no data center, no cloud bill, and no central authority.

Each participant runs one node. Every node holds a shard of the model and a shard of the dataset, trains locally, extracts raw gradients, and broadcasts them to peers over a libp2p peer-to-peer network. The swarm runs Federated Averaging every round, so all nodes stay in sync.

20 participants × RTX 3090 (24 GB VRAM) = 480 GB pooled VRAM
LLaMA 3 70B requires ~140 GB → fits in the swarm, not on any single machine

The system is fully decentralized — no master node, no tracker, no coordinator. Nodes discover each other via Kademlia DHT, tolerate stragglers and failures with timeout-based partial aggregation, and defend against adversarial participants with gradient validation and Sybil resistance. Competing swarms can race on the same model and dataset; the winner is decided by perplexity, which anyone can verify independently.


Quick Start

Join an existing training run (requires Docker):

# Step 1 — Generate your .env and startup command
python scripts/join.py --run-id gpt2-wikitrain-001 --node-index <N>

# Step 2 — Start your node
docker run --rm --env-file my.env \
  -p 9000:9000 \
  -v "$(pwd)/checkpoints:/app/checkpoints" \
  yashasviudayan/swarm-tune:latest

Run the 6-node local simulation (no internet required):

git clone https://github.com/yashasviudayan-py/Swarm-Tune
cd Swarm-Tune
make sim-up        # starts 5 honest + 1 adversarial node in Docker
make sim-logs      # stream structured JSON logs
make sim-kill-node NODE=swarm_node_2   # chaos: kill a node mid-training

Install the Python package:

pip install swarm-tune
swarm-tune --help

How It Works

Each training round:

Each node independently:
  1. Sample a mini-batch from its local data shard
  2. Forward pass → compute loss → loss.backward()
  3. Extract param.grad tensors, validate (NaN/Inf/norm bounds)
  4. Compress → serialize (SWRM wire format) → chunk into ≤60 KB frames
  5. Broadcast chunks over libp2p FloodSub

Simultaneously, receive from peers:
  6. Reassemble chunks → deserialize (weights_only=True) → decompress
  7. Validate each peer gradient (reject NaN/Inf/outliers/wrong shape)
  8. Submit to TimeoutAggregator (hard 30s window)

After timeout or quorum reached:
  9. Weighted FedAvg (weighted by dataset_size per peer)
  10. Apply averaged gradients → optimizer.step()

Straggler handling:
  - ≥ min_peers respond  → commit round
  - < min_peers respond  → fall back to local gradient, no round wasted
  - Dead nodes evicted via heartbeat after 60s, welcomed back on rejoin
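
Concretely, the FedAvg step above reduces to a few lines of PyTorch. The sketch below is illustrative only; names like local_round, peer_grads, and min_peers are placeholders, and validation, compression, and networking are omitted:

def local_round(model, batch, optimizer, peer_grads, my_examples, peer_examples, min_peers=2):
    """One simplified round: local backward pass, then weighted FedAvg over validated peer gradients.

    peer_grads:    list of {param_name: grad_tensor} dicts that already passed validation
    my_examples:   size of this node's data shard (its FedAvg weight)
    peer_examples: dataset sizes reported by the peers, in the same order as peer_grads
    """
    optimizer.zero_grad()
    loss = model(**batch).loss              # assumes an HF-style batch that includes labels
    loss.backward()                         # populates param.grad on every trainable parameter

    if len(peer_grads) >= min_peers:        # quorum reached: average; otherwise keep the local gradient
        weights = [my_examples, *peer_examples]
        total = float(sum(weights))
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            avg = p.grad * (weights[0] / total)
            for grads, w in zip(peer_grads, weights[1:]):
                avg = avg + grads[name] * (w / total)
            p.grad.copy_(avg)

    optimizer.step()                        # apply the (averaged) gradients
    return loss.item()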

Why Not PyTorch DDP?

DistributedDataParallel assumes microsecond data-center latency, homogeneous always-on nodes, and NCCL/Gloo transport. The internet is none of those things. This project builds the gradient exchange layer from scratch: custom wire protocol, chunked framing (Noise protocol 65 KB frame limit), timeout aggregation, and federated averaging — all of which DDP abstracts away, but which you need to understand to do distributed training in the wild.
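
To make the chunked-framing point concrete, here is a minimal sketch. The field names (msg_id, index, total, data) are assumptions for illustration; the project's actual GossipProtocol framing likely carries more metadata:

CHUNK_SIZE = 60 * 1024  # stay safely under the 65 KB Noise frame limit

def chunk(payload: bytes, msg_id: str) -> list[dict]:
    """Split one serialized gradient payload into pieces that each fit in a single Noise frame."""
    pieces = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]
    return [
        {"msg_id": msg_id, "index": i, "total": len(pieces), "data": piece}
        for i, piece in enumerate(pieces)
    ]

def reassemble(frames: list[dict]) -> bytes:
    """Rebuild the payload once every chunk for a given msg_id has arrived."""
    frames = sorted(frames, key=lambda f: f["index"])
    if len(frames) != frames[0]["total"]:
        raise ValueError("missing chunks")
    return b"".join(f["data"] for f in frames)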


Features

Capability                    Detail
Any HuggingFace model         SWARM_MODEL_NAME=gpt2 or llama3 — config-only, no code change
Any HuggingFace dataset       Deterministic sharding via dataset.shard()
NAT traversal                 libp2p circuit-relay + dcutr hole-punching for home internet nodes
Straggler tolerance           Timeout-based partial aggregation — one dead node never blocks the swarm
Gradient poisoning defense    NaN/Inf/RMS-norm validation before any peer gradient reaches the averager
Sybil resistance              Subnet contribution cap (/24) + per-peer rejection rate ban
Transparent chunking          Noise protocol 65 KB frame limit handled automatically
Live dashboard                Static HTML, no build step — force-directed peer graph, loss curves, bytes throughput
Zero-config join              python scripts/join.py --run-id X --node-index N generates your .env
Competition mode              Two swarms race on the same model/dataset; winner by perplexity, publicly verifiable
Relay node                    SWARM_RELAY_MODE=true — run a bootstrap VPS with no training, stable peer ID
Checkpoint tools              Reconstruct full model from shards; publish to HuggingFace Hub with model card
110 tests                     Unit, integration, chaos (fault injection, adversarial rejection, node drop/rejoin)
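
Deterministic sharding is what lets every node take a disjoint slice of the same dataset with no coordination. A minimal sketch using the HuggingFace datasets API (the project's actual loader is HFDataShardLoader; the constants below are placeholders):

from datasets import load_dataset

NUM_NODES = 4    # hypothetical swarm size
NODE_INDEX = 2   # this node's position, e.g. taken from --node-index

# Every node loads the same dataset and takes a deterministic, non-overlapping slice.
full = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
shard = full.shard(num_shards=NUM_NODES, index=NODE_INDEX)
print(f"node {NODE_INDEX}: {len(shard)} of {len(full)} examples")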

Architecture

src/swarm_tune/
├── config/settings.py          NodeSettings — all SWARM_ env vars, pydantic validators
├── runs/manifest.py            RunManifest — training campaign definition, .env generator
└── node/
    ├── main.py                 SwarmNode training loop + RelayNode + CLI entrypoint
    ├── metrics.py              MetricsStore + anyio TCP /metrics sidecar
    ├── p2p/
    │   ├── discovery.py        libp2p host, Ed25519 keys, mDNS, Kademlia DHT, relay dialing
    │   ├── gossip.py           GossipProtocol — FloodSub, chunked framing, eviction loop
    │   ├── heartbeat.py        Liveness signals + stale peer eviction
    │   └── peer_selector.py    AllPeersSelector + BanList (rejection rate tracking)
    ├── trainer/
    │   ├── model.py            ModelShard — HF AutoModelForCausalLM or toy MLP + layer sharding
    │   ├── data.py             DataShardLoader (.pt) + HFDataShardLoader (HF datasets)
    │   ├── gradient.py         GradientExtractor — extract + validate param.grad tensors
    │   ├── serializer.py       GradientSerializer — SWRM wire format, weights_only=True
    │   └── compressor.py       Compressor protocol: Identity (now) → TopK (bandwidth scale-up)
    └── aggregator/
        ├── averaging.py        GradientAverager — weighted FedAvg + Sybil subnet cap
        ├── timeout.py          TimeoutAggregator — partial aggregation + rate limiting
        └── strategy.py         AggregationStrategy: Flat (now) → Hierarchical (100+ nodes)

runs/
├── gpt2-wikitrain-001.json     4-node data-parallel GPT-2/WikiText-103 run
├── gpt2-competition-001.json   50-round competition manifest
└── gpt2-competition-2v2.json   2-node-per-team competition manifest

scripts/
├── join.py                     Zero-config participant onboarding
├── run_competition.py          Orchestrate two-team competition, write JSON result
├── set_bootstrap.py            Bake relay VPS multiaddr into all manifests
├── reconstruct_checkpoint.py   Merge/average shard checkpoints → full model
├── publish_checkpoint.py       Push checkpoint + model card to HuggingFace Hub
├── benchmark.py                Perplexity evaluation on WikiText-103 test split
└── generate_shards.py          Generate synthetic .pt shards for local simulation

docker/
├── Dockerfile                  Multi-stage build, non-root user, dynamic health check
├── docker-compose.yml          6-node simulation (5 honest + 1 adversarial)
└── docker-compose.relay.yml    Single relay/bootstrap node for VPS deployment

dashboard/index.html            Static vanilla-JS dashboard — no build step

Extensibility Abstractions

Three Protocol interfaces designed for zero-friction scale-up. The training loop never changes — only the implementation behind the protocol.

Protocol             Now (≤30 nodes)              Scale-up (100+ nodes)                             How to swap
Compressor           IdentityCompressor (no-op)   TopKCompressor (~50× bandwidth reduction at 1%)   SWARM_COMPRESSION=topk
PeerSelector         AllPeersSelector + BanList   ClusterPeerSelector                               SWARM_AGGREGATION_STRATEGY=hierarchical
AggregationStrategy  FlatAggregation              HierarchicalAggregation                           SWARM_AGGREGATION_STRATEGY=hierarchical
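
As an illustration of what one of these Protocol interfaces looks like, here is a hedged sketch of a Compressor. The method names and the dense zero-filled top-k representation are assumptions for readability, not the project's actual interface (which may transmit sparse indices and values instead):

from typing import Protocol

import torch

class Compressor(Protocol):
    def compress(self, grads: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]: ...
    def decompress(self, grads: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]: ...

class IdentityCompressor:
    """No-op: ship full gradients (fine for small swarms)."""
    def compress(self, grads):
        return grads
    def decompress(self, grads):
        return grads

class TopKCompressor:
    """Keep only the largest-magnitude fraction of entries per tensor; zero out the rest."""
    def __init__(self, ratio: float = 0.01):
        self.ratio = ratio

    def compress(self, grads):
        out = {}
        for name, g in grads.items():
            flat = g.flatten()
            k = max(1, int(flat.numel() * self.ratio))
            idx = flat.abs().topk(k).indices
            sparse = torch.zeros_like(flat)
            sparse[idx] = flat[idx]
            out[name] = sparse.view_as(g)
        return out

    def decompress(self, grads):
        return grads

The training loop only ever talks to the Compressor protocol, so moving from IdentityCompressor to TopKCompressor is a configuration change (SWARM_COMPRESSION=topk), not a code change.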

Public Deployment

Hosting a relay/bootstrap node (VPS — $5/month)

A relay node gives participants a stable public multiaddr to connect to. It runs P2P only — no model loading, no training.

# On your VPS
echo "SWARM_NODE_KEY_SEED=your_long_secret_here" > relay.env
make relay-up                                  # docker compose -f docker/docker-compose.relay.yml up -d
make relay-logs | grep Multiaddr               # copy /ip4/<vps-ip>/tcp/9000/p2p/12D3KooW...

Once you have the multiaddr, bake it into all run manifests so join.py works with zero extra flags:

make set-bootstrap PEER="/ip4/<vps-ip>/tcp/9000/p2p/12D3KooW..."
git commit -am "chore: set bootstrap peer" && git push

After that, any participant worldwide can join with a single command — no --bootstrap-peer flag needed.

Participant onboarding (their machine)

git clone https://github.com/yashasviudayan-py/Swarm-Tune
python scripts/join.py --run-id gpt2-wikitrain-001 --node-index 2 --device cuda
# → writes my.env
# → prints: docker run --env-file my.env -p 9000:9000 yashasviudayan/swarm-tune:latest

Running a competition

make competition \
  COMPETITION_ID=gpt2-comp-001 \
  TEAM_A_ID=team-alpha  TEAM_A_CHECKPOINT=ckpts/alpha.pt \
  TEAM_B_ID=team-beta   TEAM_B_CHECKPOINT=ckpts/beta.pt

Results are written to results/competition_result.json. Anyone can independently verify by running make benchmark CHECKPOINT=<downloaded-checkpoint> against a published HuggingFace Hub checkpoint.
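
Verification is just perplexity: the exponential of the mean per-token cross-entropy on the held-out split. A rough sketch of what scripts/benchmark.py measures; this version scores the stock gpt2 weights on a small slice, whereas the real script loads the checkpoint under evaluation and the full WikiText-103 test split:

import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")
test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in test["text"][:200]:                    # small slice for illustration
        ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
        if ids.size(1) < 2:
            continue                                   # skip empty or one-token lines
        loss = model(ids, labels=ids).loss             # mean cross-entropy per predicted token
        n = ids.size(1) - 1                            # number of next-token predictions
        total_nll += loss.item() * n
        total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))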


Development

Prerequisites: Python 3.12+, Docker, brew install gmp (macOS) or apt install libgmp-dev (Linux)

git clone https://github.com/yashasviudayan-py/Swarm-Tune
cd Swarm-Tune
make bootstrap          # venv + deps + pre-commit hooks

make check              # ruff lint + format check + mypy --strict
make test               # 110 unit + integration tests
make test-chaos         # 10 fault injection tests (node drop, adversarial, etc.)
make sim-up             # 6-node Docker simulation

All Makefile targets

# Code quality
make check              # lint + format + types
make format             # auto-format with ruff

# Testing
make test               # unit + integration (fast)
make test-chaos         # fault injection (slow, real timeouts)
make test-all           # everything
make coverage           # HTML coverage report → htmlcov/

# Simulation
make sim-up             # 6-node Docker swarm
make sim-down           # stop and remove containers
make sim-logs           # tail all container logs
make sim-kill-node NODE=swarm_node_2

# Training runs
make join RUN_ID=gpt2-wikitrain-001 NODE_INDEX=0
make reconstruct CHECKPOINT_DIR=checkpoints/ MODEL=gpt2
make publish CHECKPOINT=checkpoints/full_model.pt REPO_ID=user/model
make benchmark CHECKPOINT=checkpoints/node_0_final.pt

# Competition
make competition COMPETITION_ID=... TEAM_A_ID=... TEAM_A_CHECKPOINT=... \
                 TEAM_B_ID=... TEAM_B_CHECKPOINT=...

# Public deployment
make relay-up           # start relay node (requires relay.env)
make relay-logs         # tail relay logs, copy the multiaddr
make set-bootstrap PEER="/ip4/.../tcp/9000/p2p/12D3KooW..."

Project Status

All 8 phases complete. Production-audited. v1.0.0 released.

Phase  Description                                                                            Status
1      P2P Network — libp2p, Ed25519, mDNS, Kademlia, heartbeat, peer eviction                Complete
2      Gradient Extraction — SWRM protocol, weights_only=True, FedAvg                         Complete
3      Gradient Sync — FloodSub, chunked framing, TimeoutAggregator                           Complete
4      Docker Simulation — 6-node sim, chaos tests, adversarial rejection                     Complete
5      Internet Deployment — HF models/datasets, NAT traversal, Sybil resistance, /metrics    Complete
6      Live Dashboard — force-directed graph, persistent loss curves, bytes tracking          Complete
7      Distribution — RunManifest, join.py, checkpoint reconstruction, HF Hub publish         Complete
8      Competition — run_competition.py, 2v2 manifest, make competition, relay node           Complete

Test coverage: 110 tests (unit + integration + chaos). mypy --strict clean. ruff clean.

Docker Hub: yashasviudayan/swarm-tune, with latest and 1.0.0 tags available for linux/amd64 and linux/arm64.


Security Properties

Property                       Mechanism
Cryptographic peer identity    Ed25519 key pairs via libp2p — spoofing requires breaking the key
No pickle deserialization      torch.load(..., weights_only=True) on all peer data
Wire format validation         SWRM magic bytes checked before any deserialization
Gradient poisoning defense     NaN/Inf/RMS-norm bounds enforced; poisoned gradients rejected before the averager
Sybil resistance               /24 subnet contribution cap in FedAvg; configurable prefix
Reputation system              Per-peer rejection rate tracking; temporary ban on threshold exceeded
Rate limiting                  One gradient submission per peer per round
Path traversal prevention      node_id sanitized via regex; checkpoint_dir rejects system paths at startup
Atomic checkpoints             .tmp + os.replace() — crash cannot corrupt an existing checkpoint
SIGTERM-safe shutdown          Final checkpoint wrapped in CancelScope(shield=True)
Bootstrap dial timeout         10s per-peer via anyio.move_on_after() — unresponsive relay can't block startup
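
To make the poisoning-defense and Sybil rows concrete, here is an illustrative sketch. The function names, the RMS bound, and the per-subnet weight cap are assumptions; the project's actual thresholds and data structures may differ:

import ipaddress

import torch

MAX_RMS = 100.0  # illustrative bound; a real deployment would tune this

def validate_gradients(grads: dict[str, torch.Tensor], expected: dict[str, torch.Size]) -> bool:
    """Reject NaN/Inf values, wrong shapes, and implausibly large norms before averaging."""
    for name, shape in expected.items():
        g = grads.get(name)
        if g is None or g.shape != shape:
            return False
        if not torch.isfinite(g).all():
            return False
        if g.pow(2).mean().sqrt() > MAX_RMS:
            return False
    return True

def cap_by_subnet(weights: dict[str, float], cap: float = 0.5) -> dict[str, float]:
    """Limit the total FedAvg weight any single /24 subnet can contribute (keys are peer IPs)."""
    by_subnet: dict[str, list[str]] = {}
    for ip in weights:
        net = str(ipaddress.ip_network(f"{ip}/24", strict=False))
        by_subnet.setdefault(net, []).append(ip)
    total = sum(weights.values())
    capped = dict(weights)
    for ips in by_subnet.values():
        subnet_sum = sum(weights[ip] for ip in ips)
        if subnet_sum / total > cap:                 # scale the whole subnet down to the cap
            scale = cap * total / subnet_sum
            for ip in ips:
                capped[ip] = weights[ip] * scale
    return capped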

Tech Stack

Layer            Technology
Networking       libp2p 0.6.0 — Ed25519 peer IDs, FloodSub, mDNS, Kademlia DHT
Deep learning    PyTorch ≥2.3 — direct param.grad access, MPS on Apple Silicon
Models           HuggingFace transformers ≥4.40 — any AutoModelForCausalLM
Datasets         HuggingFace datasets ≥2.20 — deterministic sharding, streaming
Async runtime    anyio + trio — libp2p requires trio; anyio keeps the rest backend-agnostic
Config           pydantic-settings — fail loudly at startup, not at runtime
Logging          structlog — JSON in Docker, human-readable console locally
Orchestration    Docker + docker-compose — reproducible multi-node simulation
Language         Python 3.12 — mypy --strict throughout

Contributing

Read CLAUDE.md before contributing — it is the source of truth for every architecture and security decision in this codebase.

Three rules that are non-negotiable:

  1. No central server. Nodes coordinate only via libp2p gossip.
  2. No standard DDP. Gradients are extracted from param.grad and exchanged manually.
  3. Never pickle.loads() peer data. Always weights_only=True.
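
Rule 3 in practice: peer payloads are only ever decoded through a magic-byte check followed by torch.load(weights_only=True), never raw pickle. A minimal sketch, assuming the real SWRM header carries additional fields such as a version and length:

import io

import torch

MAGIC = b"SWRM"

def serialize(grads: dict[str, torch.Tensor]) -> bytes:
    buf = io.BytesIO()
    torch.save(grads, buf)                      # tensors and builtin containers only
    return MAGIC + buf.getvalue()

def deserialize(payload: bytes) -> dict[str, torch.Tensor]:
    if payload[:4] != MAGIC:
        raise ValueError("not a SWRM payload")  # reject before touching the body
    return torch.load(io.BytesIO(payload[4:]), weights_only=True)  # refuses arbitrary pickled objects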

License

MIT. See LICENSE.

