AI compute arbitrage CLI — move GPU training jobs between clouds automatically
VaultLayer
Run AI training jobs across 11 GPU cloud providers — 60–70% cheaper than AWS on-demand, with 99.9% job completion SLA, using your existing commands unchanged. 93% GPU cloud market coverage (~$39B addressable).
pip install vaultlayer
vaultlayer run python train.py --model llama-3-7b --epochs 10
✓ Job completed in 4h 32m
✓ 1 interruption recovered automatically (AWS Spot → Lambda H100)
✓ Saved $142.40 vs AWS On-Demand
→ View full report: https://vaultlayer.pages.dev/jobs/j-0042
What It Does
VaultLayer sits between your training script and the cloud. It:
- Checkpoints automatically — syncs model weights + optimizer state to a zero-egress R2 Vault on every save
- Detects interruptions — intercepts AWS/GCP/Azure termination signals before your job dies
- Migrates instantly — provisions a replacement node on the cheapest available provider and resumes from last checkpoint
- Tracks savings — shows real-time cost vs what you would have paid on AWS On-Demand
No changes to your PyTorch or JAX code. No YAML configs. No PhD-level infra knowledge required.
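The checkpoint-and-resume cycle this automates can be sketched in plain Python. Everything below is an illustrative stand-in, not VaultLayer's actual API: the checkpoint is a local pickle file rather than an R2 Vault object, and the "optimizer step" is a dummy increment.

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")

def save_checkpoint(state):
    # In VaultLayer this sync targets the R2 Vault; here it is a local file.
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "weights": 0.0}  # fresh start

def train(epochs, interrupt_at=None):
    state = load_checkpoint()  # resume from last checkpoint if one exists
    for epoch in range(state["epoch"], epochs):
        if epoch == interrupt_at:
            return state  # simulated spot termination mid-run
        state["weights"] += 0.1  # stand-in for a real optimizer step
        state["epoch"] = epoch + 1
        save_checkpoint(state)  # checkpoint after every epoch
    return state

# First run is killed at epoch 3; the rerun resumes from the checkpoint
# instead of epoch 0, which is the property the migration relies on.
train(epochs=5, interrupt_at=3)
final = train(epochs=5)
print(final["epoch"])  # 5
```

The point of the sketch: because state (weights plus progress counter) lives outside the node, the replacement node only repeats work done since the last save.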
Commands
# Training
vaultlayer run python train.py # run with full protection
vaultlayer run --data s3://bucket/prefix python train.py # mirror S3→R2 then run
vaultlayer run --data r2://my-dataset python train.py # use dataset already in R2
vaultlayer run --regions eu-central-1,eu-west-1 python train.py # GDPR-only regions
vaultlayer run --excluded-regions cn-north-1 python train.py # never use China region
vaultlayer stop <job-id> # graceful stop + checkpoint
vaultlayer logs <job-id> [--tail N] [--follow] # stream logs from R2
# Dataset storage (no S3 required)
vaultlayer sync ./data --dataset-id my-dataset # upload local data → R2
vaultlayer sync s3://bucket/prefix --dataset-id my-dataset # mirror S3 → R2 (one-time egress)
vaultlayer datasets # list datasets + storage costs
vaultlayer datasets --delete my-dataset # delete + stop billing
# Region discovery
vaultlayer regions list-all # list all valid regions + compliance notes
vaultlayer regions current # show current provisioning region
Supported Providers
| Provider | Type | Status |
|---|---|---|
| AWS EC2 Spot | Hyperscaler | ✅ Live |
| Lambda Labs | Neocloud | ✅ Live |
| CoreWeave | Neocloud | ✅ Live |
| RunPod | Neocloud | ✅ Live |
| Vast.ai | Neocloud | ✅ Live |
| Voltage Park | Neocloud | ✅ Live |
| Crusoe | Neocloud | ✅ Live |
| Nebius | Neocloud | ✅ Live |
| Hyperstack | Neocloud | ✅ Live |
| GCP | Hyperscaler | ✅ Live |
| Azure | Hyperscaler | ✅ Live |
| AWS On-Demand | Hyperscaler | ✅ Last-resort fallback |
11 providers live — 93% GPU cloud market coverage (~$39B addressable, ~$21B migratable training).
Model Size Support
| Model Size | Method | Checkpoint Size | Status |
|---|---|---|---|
| 7B | Full fine-tune | ~69 GB | ✅ MVP |
| 13B | Full fine-tune | ~125 GB | ✅ MVP |
| 30B | Full fine-tune | ~288 GB | ✅ MVP |
| 70B | QLoRA (4-bit) | ~46 GB | ✅ MVP |
| 70B | Full fine-tune | ~782 GB | 🔜 Phase 2 |
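The full fine-tune figures are roughly consistent with a back-of-envelope rule of about 10 bytes per parameter (fp16 weights plus two fp32 Adam moments). That byte breakdown is an assumption for illustration, not VaultLayer's documented accounting, and the table's exact figures will vary with precision and optimizer choices:

```python
def ckpt_size_gb(params_billion, bytes_weights=2, bytes_optimizer=8):
    """Rough checkpoint size in decimal GB: weights + optimizer state."""
    params = params_billion * 1e9
    return params * (bytes_weights + bytes_optimizer) / 1e9

print(ckpt_size_gb(7))   # 70.0, in the same ballpark as the ~69 GB listed for 7B
```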
Tech Stack
| Layer | Technology | Cost |
|---|---|---|
| Code + Docs | GitHub (this repo) | Free |
| CI/CD | GitHub Actions | Free (2k min/mo) |
| Vault / Storage | Cloudflare R2 | Free up to 10GB |
| Agent Runtime | Railway | Free $5/mo credit |
| Webhooks | Cloudflare Workers | Free 100k req/day |
| Agent Message Queue | Upstash Redis | Free 10k cmd/day |
Repository Structure
vaultlayer/
├── README.md
├── docs/
│ ├── PRD.md # Full product requirements
│ ├── ARCHITECTURE.md # System design + agent topology
│ └── AGENTS.md # Agent specs + build order
├── dashboard/
│ └── index.html # Savings dashboard prototype
└── src/
├── cli/
│ ├── main.py
│ ├── run.py
│ ├── checkpoint_template.py
│ └── init.py
├── vaultlayer/
│ └── _resume_hook.py
├── agents/
│ ├── orchestration/
│ ├── pricing/
│ ├── watchdog/
│ │ └── signals.py
│ ├── vault/
│ ├── broker/
│ ├── finops/
│ └── namespace/
└── shared/
SLA
99.9% job completion rate — not node uptime. Jobs survive infrastructure failures. Recovery SLA: interrupted job resumes within 10 minutes from last checkpoint.
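Given the 10-minute recovery SLA, the worst-case wall-clock cost of one interruption is bounded by the checkpoint interval plus the recovery window. A quick calculation (the 30-minute interval is an assumed example, not a stated VaultLayer default):

```python
def max_lost_minutes(checkpoint_interval_min, recovery_min=10):
    # Worst case: the node dies just before the next checkpoint lands,
    # so you lose almost a full interval plus the recovery window.
    return checkpoint_interval_min + recovery_min

print(max_lost_minutes(30))  # 40 minutes per interruption, worst case
```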
Dataset Storage (No S3 Required)
VaultLayer's Neutral Zone (Cloudflare R2) is a first-class storage provider. Users with no AWS or cloud storage account can upload training data directly and train from it on any provider.
# Upload from your laptop / on-prem server
vaultlayer sync ./training-data --dataset-id my-dataset
# Train — data is mounted at /mnt/vaultlayer on every provisioned node
vaultlayer run --data r2://my-dataset python train.py
# See what you're storing and the monthly cost
vaultlayer datasets
Pricing:
| Action | Cost |
|---|---|
| Upload (local → R2) | Free |
| Storage | $0.0195 / GB / month (30% markup over Cloudflare R2's $0.015/GB base rate) |
| Read (R2 → training node) | $0.00 (zero egress within Cloudflare network) |
| S3 mirror (one-time) | AWS egress charge (~$0.09/GB, first 100 GB/month free) |
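Worked example of the costs above for a 500 GB dataset, using the $0.0195/GB/month storage rate and the ~$0.09/GB AWS egress estimate from the table (the 100 GB free tier is AWS's, applied here as a simple one-time deduction):

```python
STORAGE_PER_GB_MONTH = 0.0195  # VaultLayer R2 storage rate from the table
S3_EGRESS_PER_GB = 0.09        # approximate AWS egress charge

def monthly_storage(gb):
    return gb * STORAGE_PER_GB_MONTH

def one_time_mirror(gb, free_tier_gb=100):
    # One-time S3 -> R2 mirror: only the portion above the free tier is billed.
    return max(gb - free_tier_gb, 0) * S3_EGRESS_PER_GB

print(round(monthly_storage(500), 2))  # 9.75  ($/month to keep it in R2)
print(round(one_time_mirror(500), 2))  # 36.0  (one-time AWS egress)
```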
Storage quotas by plan:
| Plan | Storage limit |
|---|---|
| Free | 10 GB |
| Pro | 500 GB |
| Enterprise | Unlimited |
Datasets are soft-deleted with vaultlayer datasets --delete <id>: billing stops immediately, and R2 objects are purged within 24 hours.
Region Control
VaultLayer can provision nodes in any AWS region that has GPU capacity. By default it uses
the region from your vaultlayer init configuration.
Restrict to specific regions (e.g. GDPR compliance — EU data stays in EU):
vaultlayer run --regions eu-central-1,eu-west-1 python train.py
Exclude regions (e.g. avoid China, GovCloud, sanctioned territories):
vaultlayer run --excluded-regions cn-north-1,cn-northwest-1 python train.py
If both flags are given, --excluded-regions takes priority. If neither is given, any region is allowed.
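The precedence rule amounts to a two-stage filter: restrict to the include list if one is given, then drop anything excluded. A sketch (the function name and region list are illustrative, not VaultLayer internals):

```python
def allowed_regions(all_regions, include=None, exclude=None):
    # --excluded-regions wins over --regions; no flags means anything goes.
    candidates = [r for r in all_regions if not include or r in include]
    return [r for r in candidates if not exclude or r not in exclude]

regions = ["eu-central-1", "eu-west-1", "us-east-1", "cn-north-1"]
print(allowed_regions(regions, include=["eu-central-1", "cn-north-1"],
                      exclude=["cn-north-1"]))  # ['eu-central-1']
```

Note that a region appearing in both flags is excluded, matching the stated priority of --excluded-regions.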
Discover regions:
vaultlayer regions list-all # all GPU-capable regions with compliance notes
vaultlayer regions current # show which region your credentials point to
Compliance note: H100/A100 exports to certain regions (China, Russia, some Middle East countries) may require a US Bureau of Industry and Security (BIS) export license. VaultLayer blocks cn-north-1, cn-northwest-1, and ru-central-1 by default via OFAC screening. Use --regions to limit jobs to GDPR-compliant EU regions.
Getting Started
pip install vaultlayer
vaultlayer init
vaultlayer run python train.py
License
Private — © 2026 VaultLayer