AI compute arbitrage CLI — move GPU training jobs between clouds automatically
VaultLayer
Run AI training jobs on managed GPU capacity with checkpointing, log streaming, and provider failover.
pip install -U vaultlayer
vl init
vl run train.py
Job submitted
Training on vast_ai
Training output:
...
Training completed successfully.
What It Does
VaultLayer sits between your training script and the cloud. It:
- Checkpoints automatically — syncs model weights + optimizer state to a zero-egress R2 Vault on every save
- Detects interruptions — intercepts AWS/GCP/Azure termination signals before your job dies
- Migrates instantly — provisions a replacement node on the cheapest available provider and resumes from last checkpoint
- Tracks savings — shows real-time cost vs what you would have paid on AWS On-Demand
No changes to your PyTorch or JAX code. No YAML configs. No PhD-level infra knowledge required.
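The checkpoint-and-resume cycle described above can be illustrated with a small, framework-free sketch. It is not VaultLayer's implementation: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON state format, and the use of SIGTERM as the interruption notice are all assumptions, chosen so the example runs with only the Python standard library.

```python
import json
import os
import signal
import tempfile

# Hypothetical names throughout; this is an illustration, not VaultLayer's code.
_interrupted = False


def _on_sigterm(signum, frame):
    """Record the termination notice so the loop can stop at a safe point."""
    global _interrupted
    _interrupted = True


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # readers never see a half-written file


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, save_every=10):
    """Toy loop: resume from the last checkpoint, save periodically and on exit."""
    state = load_checkpoint(path)
    while state["step"] < total_steps and not _interrupted:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save, also covers the interrupted case
    return state


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, _on_sigterm)
    with tempfile.TemporaryDirectory() as workdir:
        print(train(os.path.join(workdir, "checkpoint.json"), total_steps=100))
```

Calling `train` a second time with the same path continues from the stored step rather than from zero, which is the behavior a replacement node relies on after a migration.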
Commands
# Training
vl run train.py
vl ps
vl logs <job-id> --follow
vl stop <job-id>
# Dataset storage (no S3 required)
vl sync ./data --dataset-id my-dataset
vl run --data r2://my-dataset train.py
vl datasets
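For scripting, the `vl run` forms listed above can be assembled programmatically. The sketch below only builds argument lists from the flags shown in this README; the helper name `build_run_command` is hypothetical and is not part of the VaultLayer package.

```python
import shlex


def build_run_command(script, dataset_id=None):
    """Return the argv list for `vl run`, optionally with an R2 dataset."""
    argv = ["vl", "run"]
    if dataset_id is not None:
        argv += ["--data", f"r2://{dataset_id}"]
    argv.append(script)
    return argv


if __name__ == "__main__":
    # Print the commands instead of executing them, so this runs without `vl` installed.
    print(shlex.join(build_run_command("train.py")))
    print(shlex.join(build_run_command("train.py", dataset_id="my-dataset")))
```

To actually launch a job, the returned list could be passed to `subprocess.run(argv, check=True)` on a machine where `vl` is installed and initialized.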
Supported Providers
| Provider | Type | Status |
|---|---|---|
| Vast.ai | Marketplace | Production-included |
| RunPod | Neocloud | Production-included |
| Lambda Labs | Neocloud | Production-included |
| AWS Spot | Hyperscaler | Production-included for validated failover paths |
| AWS On-Demand | Hyperscaler | Internal testing |
| GCP, CoreWeave, Crusoe, Nebius, Voltage Park, Hyperstack, Azure | Mixed | Pending validation |
Current provider status lives in docs/provider_test_matrix.md and docs/provider_testing_matrix.md.
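Failing over to the cheapest available provider implies a selection step over providers that are both validated and currently offering capacity. A minimal sketch of that step follows, assuming hypothetical `Offer` records. The hourly prices are placeholders invented for the example, not real quotes, and VaultLayer's actual selection logic may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    provider: str
    usd_per_hour: float     # placeholder value, not a real price
    production_ready: bool  # mirrors the Status column above
    has_capacity: bool


def pick_replacement(offers: Iterable[Offer], exclude: str = "") -> Optional[Offer]:
    """Cheapest production-ready offer with capacity, skipping the failed provider."""
    candidates = [
        o for o in offers
        if o.production_ready and o.has_capacity and o.provider != exclude
    ]
    return min(candidates, key=lambda o: o.usd_per_hour, default=None)


if __name__ == "__main__":
    offers = [
        Offer("Vast.ai", 1.10, True, True),
        Offer("RunPod", 1.40, True, True),
        Offer("Lambda Labs", 1.25, True, False),  # no capacity right now
        Offer("GCP", 0.90, False, True),          # pending validation, never selected
    ]
    print(pick_replacement(offers, exclude="Vast.ai"))
```

Returning `None` when no candidate qualifies leaves the caller free to retry or wait, rather than falling back to an unvalidated provider.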
Model Size Support
| Model Size | Method | Checkpoint Size | Status |
|---|---|---|---|
| 1B | QLoRA | small | Validated smoke path |
| 3B | QLoRA | small | Validated matrix path |
| 7B | QLoRA | medium | Validated matrix path |
| 72B | QLoRA | large | Validated on RunPod H100-class capacity |
| Full fine-tune / multi-GPU | varies | varies | Future work |
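The Checkpoint Size column is qualitative. As a rough guide to why adapter checkpoints stay small, the sketch below estimates the size of LoRA adapter weights plus Adam optimizer state. It assumes only the adapters and their optimizer state are saved, with the frozen base model reloaded separately. The layer count, matrix shapes, rank, and dtypes are hypothetical inputs, not measurements from VaultLayer runs.

```python
def lora_params(d_in, d_out, rank):
    """A LoRA adapter adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    return rank * (d_in + d_out)


def checkpoint_bytes(trainable_params, weight_bytes=2, adam_moment_bytes=4):
    """Adapter weights plus Adam's two moment tensors per trainable parameter."""
    return trainable_params * (weight_bytes + 2 * adam_moment_bytes)


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 4 adapted 4096x4096 projections each, rank 16.
    params = 32 * 4 * lora_params(4096, 4096, rank=16)
    size = checkpoint_bytes(params)
    print(f"{params:,} trainable params, about {size / 2**20:.0f} MiB per checkpoint")
```

Because the estimate scales with the adapter rank and the number of adapted matrices rather than with the full parameter count, larger base models do not inflate the checkpoint proportionally.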
Tech Stack
| Layer | Technology | Cost |
|---|---|---|
| Code + Docs | GitHub (this repo) | Free |
| CI/CD | GitHub Actions | Free (2k min/mo) |
| Vault / Storage | Cloudflare R2 | Free up to 10GB |
| Agent Runtime | Railway | Free $5/mo credit |
| Webhooks | Cloudflare Workers | Free 100k req/day |
| Agent Message Queue | Upstash Redis | Free 10k cmd/day |
Repository Structure
vaultlayer/
├── README.md
├── docs/
│   ├── PRD.md            # Full product requirements
│   ├── ARCHITECTURE.md   # System design + agent topology
│   └── AGENTS.md         # Agent specs + build order
├── dashboard/
│   └── index.html        # Savings dashboard prototype
└── src/
    ├── cli/
    │   ├── main.py
    │   ├── run.py
    │   ├── checkpoint_template.py
    │   └── init.py
    ├── vaultlayer/
    │   └── _resume_hook.py
    ├── agents/
    │   ├── orchestration/
    │   ├── pricing/
    │   ├── watchdog/
    │   │   └── signals.py
    │   ├── vault/
    │   ├── broker/
    │   ├── finops/
    │   └── namespace/
    └── shared/
SLA
VaultLayer tracks job completion, checkpoint persistence, and resume behavior. Public SLA numbers are not committed during beta; see docs/SLA_SLI.md for definitions.
Dataset Storage (No S3 Required)
VaultLayer's Neutral Zone (Cloudflare R2) is a first-class storage provider. Users with no AWS or cloud storage account can upload training data directly and train from it on any provider.
# Upload from your laptop / on-prem server
vl sync ./training-data --dataset-id my-dataset
# Train — data is mounted at /mnt/vaultlayer on every provisioned node
vl run --data r2://my-dataset train.py
# See what you're storing and the monthly cost
vl datasets
Pricing:
| Action | Cost |
|---|---|
| Upload (local → R2) | Free |
| Storage | $0.020 / GB / month (rounded up from $0.0195, a 30% markup over the $0.015 Cloudflare R2 base rate) |
| Read (R2 → training node) | $0.00 (zero egress within Cloudflare network) |
| S3 mirror (one-time) | AWS egress charge (~$0.09/GB, first 100 GB/month free) |
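To make the table concrete, the following sketch works through the arithmetic. The $0.015 base rate is the value implied by the 30% markup ($0.0195 / 1.30). The function names are hypothetical, and the mirror estimate simply applies the approximate $0.09/GB figure after the first 100 GB, assuming no other AWS egress in the same month.

```python
from decimal import Decimal

R2_BASE_RATE = Decimal("0.015")      # USD per GB-month, implied by $0.0195 / 1.30
MARKUP = Decimal("1.30")
BILLED_RATE = Decimal("0.020")       # USD per GB-month, as listed in the table
AWS_EGRESS_RATE = Decimal("0.09")    # approximate USD per GB
AWS_FREE_EGRESS_GB = Decimal("100")  # free allowance per month


def monthly_storage_cost(gb):
    """Storage charge in USD for one month at the listed rate."""
    return (Decimal(str(gb)) * BILLED_RATE).quantize(Decimal("0.01"))


def s3_mirror_cost(gb):
    """One-time AWS egress estimate for mirroring a dataset out of S3."""
    billable = max(Decimal(str(gb)) - AWS_FREE_EGRESS_GB, Decimal("0"))
    return (billable * AWS_EGRESS_RATE).quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(R2_BASE_RATE * MARKUP)      # unrounded marked-up rate
    print(monthly_storage_cost(250))  # 250 GB stored for a month
    print(s3_mirror_cost(250))        # 250 GB mirrored from S3
```

`Decimal` is used instead of floats so that currency values round predictably to cents.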
Storage quotas by plan:
| Plan | Storage limit |
|---|---|
| Free | 10 GB |
| Pro | 500 GB |
| Enterprise | Unlimited |
Datasets are soft-deleted with vl datasets --delete <id>. Billing stops immediately, and R2 objects are purged within 24 hours.
Getting Started
pip install -U vaultlayer
vl init
vl run train.py
License
Private — © 2026 VaultLayer