AI compute arbitrage CLI — move GPU training jobs between clouds automatically

Project description

VaultLayer

Run AI training jobs on managed GPU capacity with checkpointing, log streaming, and provider failover.

pip install -U vaultlayer
vl init
vl run train.py
Job submitted
Training on vast_ai
Training output:
...
Training completed successfully.

What It Does

VaultLayer sits between your training script and the cloud. It:

  • Checkpoints automatically — syncs model weights + optimizer state to a zero-egress R2 Vault on every save
  • Detects interruptions — intercepts AWS/GCP/Azure termination signals before your job dies
  • Migrates instantly — provisions a replacement node on the cheapest available provider and resumes from last checkpoint
  • Tracks savings — shows real-time cost vs what you would have paid on AWS On-Demand

No changes to your PyTorch or JAX code. No YAML configs. No PhD-level infra knowledge required.
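
The checkpoint, detect, and resume cycle listed above can be pictured with a small standard-library sketch. It is not VaultLayer's implementation, which by the README's own account needs no code changes. The names `InterruptFlag`, `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical, a JSON file stands in for model weights and optimizer state, and a local path stands in for the R2 Vault.

```python
import json
import os
import signal
import tempfile


class InterruptFlag:
    """Records that a termination signal arrived (hypothetical helper)."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler tiny: only note that shutdown was requested.
        self.requested = True

    def install(self):
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle)


def save_checkpoint(path, state):
    # Write to a temp file in the same directory, then rename, so a crash
    # mid-write never leaves a truncated checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # A missing checkpoint means a fresh run starting at step 0.
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as handle:
        return json.load(handle)


def train(path, total_steps, flag, after_step=None):
    """Run until total_steps or until the flag is set; return the last step."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["step"] += 1  # stand-in for one optimizer step
        if after_step is not None:
            after_step(state["step"])
        if flag.requested:
            break
    save_checkpoint(path, state)
    return state["step"]
```

Calling `train` a second time with the same checkpoint path picks up at the saved step, which is the behavior a replacement node would rely on after a migration.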

Commands

# Training
vl run train.py
vl ps
vl logs <job-id> --follow
vl stop <job-id>

# Dataset storage (no S3 required)
vl sync ./data --dataset-id my-dataset
vl run --data r2://my-dataset train.py
vl datasets
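
For scripting, the commands above can be assembled from Python and handed to `subprocess`. This is a minimal sketch that uses only the subcommands and flags shown in this README; the helper names (`vl_run_argv`, `vl_logs_argv`, `run_vl`) are hypothetical and not part of the vaultlayer package, and the format of the CLI's output is not assumed.

```python
import subprocess


def vl_run_argv(script, dataset_id=None):
    """Build the argument list for `vl run [--data r2://<id>] <script>`."""
    argv = ["vl", "run"]
    if dataset_id is not None:
        argv += ["--data", f"r2://{dataset_id}"]
    argv.append(script)
    return argv


def vl_logs_argv(job_id, follow=False):
    """Build the argument list for `vl logs <job-id> [--follow]`."""
    argv = ["vl", "logs", job_id]
    if follow:
        argv.append("--follow")
    return argv


def run_vl(argv):
    # Raises FileNotFoundError if `vl` is not on PATH and
    # CalledProcessError if the command exits with a non-zero status.
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing an argument list rather than a shell string avoids quoting problems when a script path or dataset ID contains spaces.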

Supported Providers

| Provider | Type | Status |
|---|---|---|
| Vast.ai | Marketplace | Production-included |
| RunPod | Neocloud | Production-included |
| Lambda Labs | Neocloud | Production-included |
| AWS Spot | Hyperscaler | Production-included for validated failover paths |
| AWS On-Demand | Hyperscaler | Internal testing |
| GCP, CoreWeave, Crusoe, Nebius, Voltage Park, Hyperstack, Azure | Mixed | Pending validation |

Current provider status lives in docs/provider_test_matrix.md and docs/provider_testing_matrix.md.
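
Choosing "the cheapest available provider" among production-ready options might look like the sketch below. The `Quote` type, the `cheapest_provider` function, and every price are placeholders for illustration; the README does not describe how VaultLayer's pricing or broker agents actually rank providers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quote:
    provider: str
    usd_per_hour: float   # placeholder price, not a real quote
    available: bool       # capacity can be provisioned right now
    production: bool      # status is "Production-included"


def cheapest_provider(quotes, exclude=frozenset()) -> Optional[Quote]:
    """Return the lowest-priced usable quote, or None if nothing qualifies."""
    candidates = [
        q for q in quotes
        if q.available and q.production and q.provider not in exclude
    ]
    return min(candidates, key=lambda q: q.usd_per_hour, default=None)


# Made-up numbers purely to exercise the function.
EXAMPLE_QUOTES = [
    Quote("vast_ai", 0.40, available=True, production=True),
    Quote("runpod", 0.55, available=True, production=True),
    Quote("lambda_labs", 0.35, available=False, production=True),
    Quote("gcp", 0.30, available=True, production=False),
]
```

On failover, passing the interrupted provider in `exclude` keeps the replacement node from landing on the capacity that was just reclaimed.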


Model Size Support

| Model Size | Method | Checkpoint Size | Status |
|---|---|---|---|
| 1B | QLoRA | small | Validated smoke path |
| 3B | QLoRA | small | Validated matrix path |
| 7B | QLoRA | medium | Validated matrix path |
| 72B | QLoRA | large | Routed to 96GB+ high-VRAM capacity |
| Full fine-tune / multi-GPU | varies | varies | Future work |

Tech Stack

| Layer | Technology | Cost |
|---|---|---|
| Code + Docs | GitHub (this repo) | Free |
| CI/CD | GitHub Actions | Free (2k min/mo) |
| Vault / Storage | Cloudflare R2 | Free (up to 10 GB) |
| Agent Runtime | Railway | Free ($5/mo credit) |
| Webhooks | Cloudflare Workers | Free (100k req/day) |
| Agent Message Queue | Upstash Redis | Free (10k cmd/day) |

Repository Structure

vaultlayer/
├── README.md
├── docs/
│   ├── PRD.md              # Full product requirements
│   ├── ARCHITECTURE.md     # System design + agent topology
│   └── AGENTS.md           # Agent specs + build order
├── dashboard/
│   └── index.html          # Savings dashboard prototype
└── src/
    ├── cli/
    │   ├── main.py
    │   ├── run.py
    │   ├── checkpoint_template.py
    │   └── init.py
    ├── vaultlayer/
    │   └── _resume_hook.py
    ├── agents/
    │   ├── orchestration/
    │   ├── pricing/
    │   ├── watchdog/
    │   │   └── signals.py
    │   ├── vault/
    │   ├── broker/
    │   ├── finops/
    │   └── namespace/
    └── shared/

SLA

VaultLayer tracks job completion, checkpoint persistence, and resume behavior. Public SLA numbers are not committed during beta; see docs/SLA_SLI.md for definitions.


Dataset Storage (No S3 Required)

VaultLayer's Neutral Zone (Cloudflare R2) is a first-class storage provider. Users with no AWS or cloud storage account can upload training data directly and train from it on any provider.

# Upload from your laptop / on-prem server
vl sync ./training-data --dataset-id my-dataset

# Train — data is mounted at /mnt/vaultlayer on every provisioned node
vl run --data r2://my-dataset train.py

# See what you're storing and the monthly cost
vl datasets
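
Inside `train.py`, a script might locate its data as sketched below so that the same file runs on a provisioned node and on a laptop. Only the `/mnt/vaultlayer` mount path comes from this README; the function names, the local fallback, and the file suffixes are assumptions for illustration.

```python
from pathlib import Path

# Mount point named in this README for provisioned nodes.
VAULTLAYER_MOUNT = Path("/mnt/vaultlayer")


def resolve_data_dir(local_fallback, mount=VAULTLAYER_MOUNT):
    """Prefer the mounted dataset; fall back to a local directory."""
    mount = Path(mount)
    if mount.is_dir() and any(mount.iterdir()):
        return mount
    return Path(local_fallback)


def list_training_files(data_dir, suffixes=(".jsonl", ".parquet")):
    """Return data files under data_dir in a stable, sorted order."""
    return sorted(
        p for p in Path(data_dir).rglob("*")
        if p.is_file() and p.suffix in suffixes
    )
```

Sorting the file list keeps the iteration order identical across nodes, which helps when a job resumes partway through an epoch on a different machine.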

Pricing:

| Action | Cost |
|---|---|
| Upload (local → R2) | Free |
| Storage | $0.020 / GB / month (the Cloudflare R2 base rate plus a 30% markup comes to $0.0195) |
| Read (R2 → training node) | $0.00 (no R2 egress fee) |
| S3 mirror (one-time) | AWS egress charge (~$0.09/GB, first 100 GB/month free) |
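
The figures in the pricing table can be checked with a short calculation. The sketch below uses `Decimal` to avoid floating-point drift. The $0.015 base rate is inferred from the table ($0.0195 divided by 1.30) rather than stated in it, the function names are hypothetical, and the S3 estimate assumes none of the free 100 GB has been used elsewhere that month.

```python
from decimal import Decimal

R2_BASE_USD_PER_GB_MONTH = Decimal("0.015")     # inferred: 0.0195 / 1.30
MARKUP = Decimal("1.30")                        # 30% markup
LIST_PRICE_USD_PER_GB_MONTH = Decimal("0.020")  # price shown in the table
S3_EGRESS_USD_PER_GB = Decimal("0.09")          # approximate AWS rate
S3_FREE_EGRESS_GB = Decimal("100")              # free allowance per month


def marked_up_rate():
    """Base rate plus markup, before any rounding to the list price."""
    return R2_BASE_USD_PER_GB_MONTH * MARKUP


def monthly_storage_cost(stored_gb):
    """Monthly storage charge at the list price."""
    return Decimal(str(stored_gb)) * LIST_PRICE_USD_PER_GB_MONTH


def s3_mirror_cost(dataset_gb):
    """One-time AWS egress estimate for mirroring a dataset out of S3."""
    billable = max(Decimal(str(dataset_gb)) - S3_FREE_EGRESS_GB, Decimal("0"))
    return billable * S3_EGRESS_USD_PER_GB
```

For a 250 GB dataset this gives $5.00 per month for storage and a one-time mirror estimate of (250 − 100) × $0.09 = $13.50.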

Storage quotas by plan:

| Plan | Storage limit |
|---|---|
| Free | 10 GB |
| Pro | 500 GB |
| Enterprise | Unlimited |
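
A pre-upload check against these limits could be as simple as the sketch below. It assumes the quota applies to total stored gigabytes and that landing exactly on the limit is allowed; neither detail is stated here, and `fits_quota` is a hypothetical helper, not a VaultLayer API.

```python
# Limits copied from the plan table; None means unlimited.
PLAN_LIMITS_GB = {"free": 10, "pro": 500, "enterprise": None}


def fits_quota(plan, used_gb, upload_gb):
    """Return True if an upload would stay within the plan's storage limit."""
    limit = PLAN_LIMITS_GB[plan.lower()]
    if limit is None:
        return True
    return used_gb + upload_gb <= limit
```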

Delete a dataset with vl datasets --delete <id>. Deletion is soft: billing stops immediately, and the R2 objects are purged within 24 hours.

Getting Started

pip install -U vaultlayer
vl init
vl run train.py

License

Private — © 2026 VaultLayer

Download files

Source Distribution

vaultlayer-0.1.33.tar.gz (944.6 kB)

Uploaded Source

Built Distribution

vaultlayer-0.1.33-py3-none-any.whl (628.1 kB)

Uploaded Python 3

File details

Details for the file vaultlayer-0.1.33.tar.gz.

File metadata

  • Download URL: vaultlayer-0.1.33.tar.gz
  • Upload date:
  • Size: 944.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vaultlayer-0.1.33.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad18a53703ede31f19553f0804426898bc05684e32f8264329ad49f5f93940c3 |
| MD5 | 61e8b9014e989ac0bbf57e942f5c5737 |
| BLAKE2b-256 | 203ea8f07efd808e3028ea16dcb9335d26c2da1fb849d6dbc46c344a7a28cad9 |
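
A downloaded file can be checked against its published SHA256 digest with the standard library. This is a minimal sketch; the function names are illustrative, and the local file path is whatever location the archive was saved to.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha256(path, expected_hex):
    """Compare the file's digest with the published hex string."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For the source distribution, `expected_hex` would be the SHA256 value listed in the table above.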

Provenance

The following attestation bundles were made for vaultlayer-0.1.33.tar.gz:

Publisher: publish.yml on hector25/vaultlayer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file vaultlayer-0.1.33-py3-none-any.whl.

File metadata

  • Download URL: vaultlayer-0.1.33-py3-none-any.whl
  • Upload date:
  • Size: 628.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vaultlayer-0.1.33-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a58ba599eb4a55d172e6877d3542c48d38724149fca36ca1dd1dd6372a9a2b0 |
| MD5 | d93bd624281117274de2f0bc12fa55e6 |
| BLAKE2b-256 | 00b4e62de1fea08ed10b2b79e4fe28f4d0f48814a727e2bb33f0cec0fa5d985d |

Provenance

The following attestation bundles were made for vaultlayer-0.1.33-py3-none-any.whl:

Publisher: publish.yml on hector25/vaultlayer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
