Agent-safe fleet management for independent Solana validators and RPC nodes
solfleet
Agent-safe fleet management for independent Solana validators and RPC nodes. One config file describes your fleet across devnet, testnet, and mainnet. An MCP server (and a CLI) exposes Solana-aware status, safe in-place upgrades, and health-driven DNS failover to Claude or any MCP client. Every operation that changes a node is dry-run by default, policy-gated, and audited. solfleet never reads or moves your keypairs.
See PLAN.md for the roadmap and design notes.
Architecture
solfleet runs on the operator's machine (or a small VM). It talks to the fleet over JSON-RPC (read) and SSH/scp (act), builds artifacts on a separate build host, computes slot lag against each cluster's reference RPC, and manages failover records at the DNS provider. Every mutation flows through one gate and is written to a SQLite audit log.
flowchart TB
claude["Claude / any MCP client"]
subgraph operator["operator machine"]
mcp["solfleet-mcp (stdio)"]
cli["solfleet CLI"]
core["core: probe · safety gate · executor · dns"]
audit[("audit log (SQLite)")]
claude -->|MCP| mcp
mcp --> core
cli --> core
core --> audit
end
builder["build host (agave + geyser from source)"]
ref["cluster reference RPC"]
dns["DNS provider (Cloudflare / Route53)"]
subgraph fleet["fleet: devnet / testnet / mainnet"]
rpc["RPC nodes"]
val["voting validators"]
end
core -->|JSON-RPC :8899| rpc
core -->|JSON-RPC :8899| val
core -->|SSH / scp| rpc
core -->|SSH / scp| val
core -->|SSH build, fetch artifacts| builder
builder -. "artifact set + sha256" .-> core
core -->|slot lag / delinquency| ref
core -->|eject / restore A records| dns
How an in-place upgrade runs
sequenceDiagram
actor Op as Claude / operator
participant SF as solfleet
participant B as build host
participant N as node
participant R as reference RPC
Op->>SF: upgrade node to version (confirm)
SF->>SF: gate, policy + preflight (else stop)
SF->>B: build agave + geyser (or reuse cache)
B-->>SF: artifact set + sha256
SF->>N: scp artifacts as dest.solfleet-new
SF->>N: sha256 on node matches builder (else abort)
alt RPC node
SF->>N: systemctl stop
SF->>N: atomic swap (binary + geyser + marker)
SF->>N: systemctl start
else voting validator
SF->>N: atomic swap (binary + geyser + marker)
SF->>N: agave-validator exit (leader-aware), systemd relaunches
end
loop until healthy and caught up
SF->>R: getSlot
SF->>N: getHealth / getSlot
end
SF->>SF: verify reported version, write audit entry
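The verify-then-swap step in the sequence above can be sketched locally as follows. This is a minimal illustration, not solfleet's actual code: the function names `sha256_of` and `verify_and_swap` are hypothetical, and it assumes the staged file sits next to the destination (the `.solfleet-new` suffix is taken from the diagram) so that the rename stays on one filesystem.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through sha256 so large binaries are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_swap(staged: Path, dest: Path, expected_sha256: str) -> None:
    """Abort on checksum mismatch; otherwise rename the staged file over dest.

    Hypothetical helper: `expected_sha256` stands in for the digest reported
    by the build host.
    """
    actual = sha256_of(staged)
    if actual != expected_sha256:
        staged.unlink()  # drop the bad artifact; dest is left untouched
        raise ValueError(f"sha256 mismatch for {staged.name}: {actual}")
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(staged, dest)
```

Staging under a sibling name and renaming means a reader of `dest` sees either the old file or the new one, never a partially copied binary.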
How failover runs
sequenceDiagram
participant SF as solfleet watch
participant N as pool members
participant R as reference RPC
participant D as DNS provider
loop every interval
SF->>N: getHealth / getSlot
SF->>R: getSlot (cluster head)
SF->>SF: per member: unhealthy, lag over limit, or delinquent
alt every member failing
SF->>SF: keep current records (never empty the pool)
else at least one healthy
SF->>D: ensure TXT ownership marker
SF->>D: remove A record of each failing member
SF->>D: add A record of each recovered member
SF->>SF: write audit entry
end
end
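The decision step of the loop above might look roughly like this. It is a sketch under assumptions: `MemberState`, `Decision`, and `decide` are hypothetical names, and the real watch loop may weigh its inputs differently. What it shows is the never-empty-the-pool rule from the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberState:
    name: str
    healthy: bool      # result of getHealth
    slot_lag: int      # reference slot minus node slot
    delinquent: bool
    in_dns: bool       # True if the member currently has an A record


@dataclass(frozen=True)
class Decision:
    eject: tuple[str, ...]
    restore: tuple[str, ...]
    reason: str


def decide(members: list[MemberState], max_lag: int) -> Decision:
    """Pick which A records to remove and which to add back."""
    failing = [
        m for m in members
        if not m.healthy or m.slot_lag > max_lag or m.delinquent
    ]
    if len(failing) == len(members):
        # Never empty the pool: stale records beat no records.
        return Decision((), (), "every member failing; keeping current records")
    failing_names = {m.name for m in failing}
    eject = tuple(m.name for m in failing if m.in_dns)
    restore = tuple(
        m.name for m in members
        if m.name not in failing_names and not m.in_dns
    )
    return Decision(eject, restore, "ok")
```

Keeping the decision a pure function of the probe results is what makes it easy to unit-test without touching a DNS provider.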
Why
- Solana-aware health. A generic health check sees HTTP 200; a Solana node can be 500 slots behind and still return 200. solfleet checks slot lag against the cluster, delinquency, and version drift.
- Build-and-distribute. Agave v3.0 dropped prebuilt validator binaries, so every operator now has to build from source. solfleet builds once on a dedicated builder node (with the ABI-matched Yellowstone geyser .so), caches it, and distributes the artifact set to the fleet.
- Leader-aware restarts. Restarting a voting validator during its own leader slots skips blocks. solfleet restarts validators via a leader-aware safe-exit; RPC nodes cycle via systemctl.
- Safe failover. The watch loop pulls lagging/unhealthy nodes out of DNS and restores them on recovery, and refuses to ever empty a pool.
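To make the first point concrete, here is a minimal sketch of a slot-lag check built on the standard Solana JSON-RPC methods getHealth and getSlot. The helper names (`rpc_call`, `node_status`), the returned dictionary shape, and any `max_lag` threshold are illustrative assumptions, not solfleet's API.

```python
import json
import urllib.request
from typing import Any, Callable

RpcCall = Callable[[str, str], Any]  # (url, method) -> JSON-RPC result


def rpc_call(url: str, method: str) -> Any:
    """POST a parameterless JSON-RPC request and return its result."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 1, "method": method}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = json.load(resp)
    if "error" in body:
        raise RuntimeError(body["error"].get("message", "rpc error"))
    return body["result"]


def node_status(
    node_url: str, reference_url: str, max_lag: int, call: RpcCall = rpc_call
) -> dict[str, Any]:
    """Classify a node by health and by slot lag against a reference RPC."""
    try:
        healthy = call(node_url, "getHealth") == "ok"
    except RuntimeError:
        healthy = False  # an unhealthy node answers getHealth with an error
    lag = max(0, call(reference_url, "getSlot") - call(node_url, "getSlot"))
    if not healthy:
        health = "unhealthy"
    elif lag > max_lag:
        health = "lagging"
    else:
        health = "ok"
    return {"health": health, "lag": lag}
```

The `call` parameter is injectable so the classification can be exercised with a stub instead of a live cluster.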
Status
v1. Built and unit-tested (91 tests, CI on Python 3.11-3.13). Most paths are also proven live against a disposable devnet node and a real Cloudflare zone.
Proven live:
- read path: status, validate, vote-status, inspect
- restart (RPC via systemctl; validator via leader-aware safe-exit)
- in-place upgrade end to end (build agave from source on a builder, distribute, sha256-verify on the target, atomic swap, catch-up) for both RPC and voting-validator nodes
- bootstrap-builder (toolchain + deps on a bare builder)
- provision a voting validator from bare disks (format NVMe, install, render the voting unit, start, catch up, vote)
- DNS driver plus dns status/eject/restore and last-member protection, against a live Cloudflare zone
Unit-tested but not yet run live:
- the autonomous watch loop (probe -> decide -> act); its decision logic is unit-tested and it reuses the now-proven Cloudflare driver
- the Route53 driver (no AWS zone to point at yet)
Not built yet: HTTP transport (MCP is stdio-only today). See PLAN.md (M6).
Install
pipx install solfleet # not yet published; for now:
pipx install git+https://github.com/sanjeevkkansal/solfleet
pipx install 'solfleet[route53]' # if you use Route53 for DNS
Quick start
cp fleet.example.yaml fleet.yaml # edit with your nodes
cp policy.example.yaml policy.yaml # optional; sane defaults if absent
solfleet status # probe the fleet
solfleet status --watch # refreshing live table
solfleet validate # structural + live readiness check
solfleet vote-status mn-val-1 # voting health: credits, balance, delinquency, leader
solfleet inspect mn-val-1 # read-only SSH detail for one node
solfleet bootstrap-builder b1 # install build toolchain on a builder; --confirm
solfleet provision rpc-1 4.1.0 # dry-run bring-up plan; --confirm to run
solfleet plan-upgrade mn-val-1 4.1.0 # dry-run upgrade plan
solfleet upgrade mn-val-1 4.1.0 # dry-run; add --confirm to execute
solfleet watch --dry-run # DNS failover loop, decide-only
MCP (Claude Code):
claude mcp add solfleet -- solfleet-mcp
Example session
Pointed at a small devnet fleet. With no flags, commands are read-only or dry-run.
Fleet health is Solana-aware, not just an HTTP 200:
$ solfleet status
CLUSTER NODE ROLE HEALTH VERSION SLOT LAG VOTE
devnet rpc-1 rpc ok 4.1.0-rc.1 0 -
devnet rpc-2 rpc ok 4.1.0-rc.1 0 -
An upgrade is dry-run by default. It returns the ordered plan and the gate
decision and changes nothing until you pass --confirm:
$ solfleet plan-upgrade rpc-1 4.1.0
{
"decision": {
"operation": "upgrade",
"cluster": "devnet",
"node": "rpc-1",
"mode": "dry-run",
"allowed": true,
"plan": [
"on builder 'build-1': build agave 4.1.0 from source",
"distribute artifact set to rpc-1; checksum-verify each (abort on mismatch)",
"stop solana-validator, swap, start",
"swap /usr/local/bin/agave-validator + geyser .so + version marker atomically",
"wait until healthy + caught up to https://api.devnet.solana.com",
"verify reported version == 4.1.0; record before/after"
],
"reasons": [
"dry-run: preflight checks pass; pass confirm=true to execute"
]
},
"target_version": "4.1.0"
}
Over MCP, the same operations are tools (fleet_status, plan_node_upgrade,
upgrade, ...). Claude gets that same plan back and has to pass confirm=true
to execute, so an agent cannot mutate a node by accident.
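The confirm gate described above can be pictured as a small wrapper. This is only a sketch: `GateResult` and `gated` are hypothetical names, the field set loosely mirrors the JSON shown earlier, and the "execute" mode string is an assumption (the sample output only shows "dry-run").

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateResult:
    operation: str
    node: str
    mode: str                 # "dry-run", or "execute" (assumed label)
    allowed: bool
    plan: list[str]
    reasons: list[str] = field(default_factory=list)


def gated(
    operation: str,
    node: str,
    plan: list[str],
    preflight: Callable[[], list[str]],
    execute: Callable[[], None],
    confirm: bool = False,
) -> GateResult:
    """Run preflight first; only call execute() when it passes and confirm is set."""
    mode = "execute" if confirm else "dry-run"
    failures = preflight()
    if failures:
        return GateResult(operation, node, mode, False, plan, failures)
    if not confirm:
        hint = "dry-run: preflight checks pass; pass confirm=true to execute"
        return GateResult(operation, node, mode, True, plan, [hint])
    execute()
    return GateResult(operation, node, mode, True, plan, ["executed"])
```

Because `confirm` defaults to False, a caller that forgets the flag gets the plan back and nothing else happens.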
Tools
Read-only: fleet_status, node_detail, version_drift, vote_status,
leader_schedule, validate, plan_node_upgrade, dns_pool_status,
audit_log.
Gated (dry-run by default; confirm=true to execute):
bootstrap_builder_host, provision, restart, upgrade,
dns_pool_eject, dns_pool_restore.
Every mutation is dry-run by default, checked against policy.yaml
(allowed versions, disk floor, leader-window minimum), and written to a
SQLite audit log. The watch loop is the one autonomous mutator; it is
bounded by the same audit log and the never-empty-a-pool rule.
Safety model
- Dry-run by default. Mutations return their ordered plan and preflight unless called with confirm=true.
- Policy gate. Per-cluster policy.yaml: allowed version globs, disk floor, and require_leader_window_minutes for validators.
- Checksum-verified distribution. Upgrade artifacts are sha256-checked on the target against the builder before any swap.
- No keys, ever. solfleet does not read, move, or generate identity/vote keypairs. Voting-validator identity failover is out of scope by design (double-signing risk).
- Audit log. Every dry-run and execute is recorded in SQLite.
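As an illustration of the audit-log idea, the sketch below records one row per gate decision in SQLite. The table layout and the `open_audit` and `record` helpers are assumptions; solfleet's real schema is not shown in this README.

```python
import json
import sqlite3
import time

# Hypothetical schema: one row per dry-run or execute.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    ts        REAL    NOT NULL,
    operation TEXT    NOT NULL,
    node      TEXT    NOT NULL,
    mode      TEXT    NOT NULL,
    allowed   INTEGER NOT NULL,
    detail    TEXT    NOT NULL
)
"""


def open_audit(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the audit database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(
    conn: sqlite3.Connection,
    operation: str,
    node: str,
    mode: str,
    allowed: bool,
    detail: dict,
) -> int:
    """Append one entry and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO audit (ts, operation, node, mode, allowed, detail) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (time.time(), operation, node, mode, int(allowed),
             json.dumps(detail, sort_keys=True)),
        )
    return cur.lastrowid
```

Writing the dry-run as well as the execute gives a reviewer the plan an agent was shown before it confirmed.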
Development
uv venv && uv pip install -e '.[dev]'
uv run pytest