Manage GCP TPU VMs from your terminal — create, run, recover, teardown
tpuz
Manage GCP TPU & GPU VMs from your terminal.
Create, train, debug, recover, teardown — one command.
Getting Started · Docs · GPU Guide · Security
Why?
Training on GCP TPUs/GPUs means juggling gcloud commands, SSH sessions, preemption recovery, cost tracking, and secrets. tpuz handles all of it:
from tpuz import TPU
tpu = TPU("my-tpu", accelerator="v4-8")
tpu.up()
tpu.setup()
tpu.run("python train.py", secrets=["WANDB_API_KEY"], sync="./src")
tpu.logs()
tpu.cost_summary() # $4.12 (2.0h × $2.06/hr)
tpu.down()
Or one command:
tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown
Install
pip install tpuz
Zero Python dependencies. Requires the gcloud CLI.
Features
Core
tpu.up() # Create TPU VM (idempotent)
tpu.up_queued() # Queued Resources (reliable spot)
tpu.setup() # Install JAX[TPU] + deps
tpu.verify() # Check JAX on all workers
tpu.run("cmd", sync=".") # Upload code + launch
tpu.logs() # Stream training logs
tpu.wait() # Poll for completion
tpu.collect(["model.pkl"]) # Download artifacts
tpu.down() # Delete VM
GPU VMs
from tpuz import GCE
vm = GCE.gpu("my-vm", gpu="a100") # A100 40GB
vm = GCE.gpu("my-vm", gpu="h100x8") # 8x H100
vm = GCE.gpu("my-vm", gpu="t4") # T4 (cheapest)
vm.up() # Same API as TPU
Secrets (Cloud Secret Manager)
from tpuz import SecretManager
sm = SecretManager()
sm.create("WANDB_API_KEY", "your-key")
sm.grant_tpu_access_all()
tpu.run("python train.py", secrets=["WANDB_API_KEY"])
# Secrets never leave GCP — loaded server-side via IAM
Checkpoints (GCS)
from tpuz import GCS
gcs = GCS("gs://my-bucket")
tpu.run_with_resume("python train.py", gcs=gcs, run_name="run-01")
# Auto-detects latest checkpoint → appends --resume-from-step=5000
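The resume-detection step can be illustrated with a small sketch (not tpuz's actual implementation): scan the checkpoint filenames, take the highest embedded step number, and build the resume flag from it. `latest_resume_flag` and the path layout are illustrative assumptions.

```python
import re

def latest_resume_flag(checkpoint_paths):
    """Pick the highest step number among checkpoint files and build a
    --resume-from-step flag; returns None when no checkpoint exists."""
    steps = []
    for path in checkpoint_paths:
        m = re.search(r"step[-_]?(\d+)", path)
        if m:
            steps.append(int(m.group(1)))
    if not steps:
        return None
    return f"--resume-from-step={max(steps)}"

# Example against hypothetical GCS listings:
paths = [
    "gs://my-bucket/run-01/ckpt_step-1000.pkl",
    "gs://my-bucket/run-01/ckpt_step-5000.pkl",
]
print(latest_resume_flag(paths))  # --resume-from-step=5000
```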
Preemption Recovery
tpu.watch_notify("python train.py",
                 notify_url="https://hooks.slack.com/...",
                 max_retries=5)
# Auto: delete → recreate → setup → restart from checkpoint → Slack notify
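The recovery loop amounts to retrying the job and recreating the VM between attempts. A minimal sketch, using `InterruptedError` as a stand-in for a preemption signal and hypothetical `run_job`/`recreate` callbacks (not tpuz's internals):

```python
import time

def run_with_recovery(run_job, recreate, max_retries=5, backoff_s=0):
    """Generic preemption-recovery loop: if the job dies because the VM
    was preempted, recreate the VM and restart (training resumes from
    the latest checkpoint). Returns the number of attempts used."""
    for attempt in range(1, max_retries + 1):
        try:
            run_job()
            return attempt            # job finished cleanly
        except InterruptedError:      # stand-in for "VM was preempted"
            if attempt == max_retries:
                raise
            time.sleep(backoff_s)
            recreate()                # delete + recreate + setup

# Simulate one preemption followed by success:
state = {"calls": 0}
def flaky_job():
    state["calls"] += 1
    if state["calls"] == 1:
        raise InterruptedError("preempted")

attempts = run_with_recovery(flaky_job, recreate=lambda: None)
print(attempts)  # 2
```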
Debugging
tpu.repl() # Interactive Python REPL
tpu.debug("python train.py") # VS Code debugger attach
tpu.tunnel(6006) # TensorBoard
tpu.health_check() # Full health dashboard:
# Process: running
# Heartbeat: fresh (12s ago)
# Disk: 45% (90/200 GB)
# GPU: 85% utilization
# Training: step 1234/5000 | loss 2.31 | 56,000 tok/s
# ETA: ~35m
Cost Control
tpu.cost_summary() # $4.12 (2.0h × $2.06/hr)
tpu.set_budget(50, notify_url=slack) # Alert at $40, kill at $50
tpu.schedule("python train.py",
             start_after="22:00", max_cost=10)  # Train overnight, budget $10
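The budget logic is plain accrual arithmetic: cost equals elapsed hours times the hourly rate, with an alert threshold below the hard cap. A sketch assuming the documented alert-at-80% behavior; `budget_status` is an illustrative helper, not part of tpuz:

```python
def budget_status(elapsed_hours, hourly_rate, budget, alert_fraction=0.8):
    """Compute accrued cost and decide whether to alert or kill."""
    cost = elapsed_hours * hourly_rate
    if cost >= budget:
        return cost, "kill"
    if cost >= alert_fraction * budget:
        return cost, "alert"
    return cost, "ok"

# A v4-8 spot VM at $2.06/hr against a $50 budget:
print(budget_status(2.0, 2.06, 50))       # (4.12, 'ok')
print(budget_status(20.0, 2.06, 50)[1])   # alert (about $41.20, past 80% of $50)
print(budget_status(25.0, 2.06, 50)[1])   # kill  (about $51.50, over budget)
```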
Scaling & Failover
tpu.scale("v4-32") # Upgrade: v4-8 → v4-32
TPU.create_multi_zone("tpu", "v4-8",
                      zones=["us-central2-b", "europe-west4-a"])  # Try each zone
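Multi-zone failover is a first-success loop over candidate zones. A sketch with a hypothetical `try_create` callback standing in for the actual create call; the exception type is an assumed stand-in for a capacity stockout:

```python
def create_in_first_available_zone(zones, try_create):
    """Try each zone in order; return the first zone where creation
    succeeds, raising only after all zones are exhausted."""
    errors = {}
    for zone in zones:
        try:
            try_create(zone)
            return zone
        except RuntimeError as e:   # stand-in for "no capacity in zone"
            errors[zone] = e
    raise RuntimeError(f"all zones exhausted: {errors}")

# Simulate capacity only in the second zone:
def fake_create(zone):
    if zone != "europe-west4-a":
        raise RuntimeError("stockout")

zone = create_in_first_available_zone(
    ["us-central2-b", "europe-west4-a"], fake_create)
print(zone)  # europe-west4-a
```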
Run-Once (Docker-like)
tpu.run_once("python train.py",
             sync="./src", collect_files=["model.pkl"],
             gcs=gcs, notify_url=slack)
# up → setup → resume → run → wait → collect → notify → down
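The run-once lifecycle is essentially a try/finally around the pipeline, so teardown happens even when the run fails partway through. A minimal sketch with the stages modeled as plain callables (illustrative, not tpuz's internals):

```python
def run_once(steps_before, run, steps_after, teardown):
    """Docker-style lifecycle: always tear the VM down, even if the
    run fails partway through."""
    log = []
    try:
        for step in steps_before:
            log.append(step())
        log.append(run())
        for step in steps_after:
            log.append(step())
    finally:
        log.append(teardown())
    return log

order = run_once(
    [lambda: "up", lambda: "setup", lambda: "resume"],
    lambda: "run",
    [lambda: "wait", lambda: "collect", lambda: "notify"],
    lambda: "down")
print(order)  # ['up', 'setup', 'resume', 'run', 'wait', 'collect', 'notify', 'down']
```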
Profiles & Audit
tpu.save_profile("big-run") # Save config for reuse
tpu = TPU.from_profile("big-run", "new-tpu") # Reuse later
tpu.dry_run("python train.py") # Preview commands without executing
Multi-Host (TPU Pods)
Pod slices are auto-detected; every SSH command runs on all workers in parallel, with per-worker retries:
| Accelerator | Chips | Workers | Spot $/hr |
|---|---|---|---|
| v4-8 | 4 | 1 | $2.06 |
| v4-32 | 16 | 4 | $8.24 |
| v5litepod-8 | 8 | 1 | $9.60 |
| v5litepod-64 | 64 | 8 | $76.80 |
| v6e-8 | 8 | 1 | $9.60 |
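The parallel per-worker execution mentioned above (one SSH command fanned out to every worker, each retried independently) can be sketched with a thread pool; `run_on_all_workers` and `fake_ssh` are illustrative stand-ins, not the tpuz API:

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_all_workers(workers, ssh_cmd, retries=3):
    """Run a command on every pod worker in parallel, retrying each
    worker independently; returns {worker: result}."""
    def run_one(worker):
        last = None
        for _ in range(retries):
            try:
                return ssh_cmd(worker)
            except ConnectionError as e:   # stand-in for a flaky SSH hop
                last = e
        raise last
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        return dict(zip(workers, pool.map(run_one, workers)))

# Simulate worker 2 failing once before succeeding:
attempts = {}
def fake_ssh(w):
    attempts[w] = attempts.get(w, 0) + 1
    if w == 2 and attempts[w] == 1:
        raise ConnectionError("dropped")
    return f"worker-{w}: ok"

result = run_on_all_workers([0, 1, 2, 3], fake_ssh)
print(result)  # {0: 'worker-0: ok', 1: 'worker-1: ok', 2: 'worker-2: ok', 3: 'worker-3: ok'}
```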
CLI
tpuz up NAME -a v4-8 tpuz logs NAME
tpuz down NAME tpuz logs-all NAME
tpuz status NAME tpuz health NAME
tpuz setup NAME tpuz tunnel NAME 6006
tpuz verify NAME tpuz repl NAME
tpuz run NAME "cmd" --sync=. tpuz debug NAME "cmd"
tpuz wait NAME tpuz scale NAME v4-32
tpuz kill NAME tpuz cost NAME
tpuz collect NAME files... tpuz avail v4-8
tpuz watch NAME "cmd" tpuz preflight
tpuz train NAME "cmd" -a v4-8 --recover --teardown
tpuz run-once NAME "cmd" --sync=. --collect model.pkl
Documentation
- Getting Started — Zero to training in 9 steps
- Usage Guide — Every feature explained
- GPU VMs — A100/H100/T4 management
- Secrets & Security — Cloud Secret Manager setup
- Best Practices — Production workflows
Pair with kgz
pip install kgz # Kaggle free GPUs — execute code remotely
pip install tpuz # GCP TPU/GPU pods — manage VM lifecycle
Claude Code
mkdir -p ~/.claude/skills/tpuz-guide
cp SKILL.md ~/.claude/skills/tpuz-guide/skill.md
Acknowledgments
Cloud TPU resources provided by Google's TPU Research Cloud (TRC) program.
License
MIT
Download files
File details
Details for the file tpuz-0.1.7.tar.gz.
File metadata
- Download URL: tpuz-0.1.7.tar.gz
- Upload date:
- Size: 47.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e29e92da40573c09bc3b82d635042d1cdfa7d31cdc9dc5d598094261dd42b8bb |
| MD5 | 6ef31efe908e081a7717f4800a6cd378 |
| BLAKE2b-256 | 7269dcf87eaca9813e306bff9bceaea18e49c0da65b584062ac66a79aeefd947 |
Provenance
The following attestation bundles were made for tpuz-0.1.7.tar.gz:
Publisher: publish.yaml on mlnomadpy/tpuz
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tpuz-0.1.7.tar.gz
- Subject digest: e29e92da40573c09bc3b82d635042d1cdfa7d31cdc9dc5d598094261dd42b8bb
- Sigstore transparency entry: 1237925300
- Permalink: mlnomadpy/tpuz@3f18779dc8dffb8bd4f7b6c3e5363730e5182ad2
- Branch / Tag: refs/tags/v0.1.7
- Owner: https://github.com/mlnomadpy
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@3f18779dc8dffb8bd4f7b6c3e5363730e5182ad2
- Trigger Event: push
File details
Details for the file tpuz-0.1.7-py3-none-any.whl.
File metadata
- Download URL: tpuz-0.1.7-py3-none-any.whl
- Upload date:
- Size: 34.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 107a1a867f3b5e1963dc67907aa60e92b9a57923c2b29ffc0f20e070dc5c35ae |
| MD5 | 28bdac7f6b681c6f576f4f8197659fce |
| BLAKE2b-256 | 2326acda79fa53520b0d46bc4c93ee7a4cde7a6524b6116cc73e6fcdda4095a7 |
Provenance
The following attestation bundles were made for tpuz-0.1.7-py3-none-any.whl:
Publisher: publish.yaml on mlnomadpy/tpuz
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: tpuz-0.1.7-py3-none-any.whl
- Subject digest: 107a1a867f3b5e1963dc67907aa60e92b9a57923c2b29ffc0f20e070dc5c35ae
- Sigstore transparency entry: 1237925356
- Permalink: mlnomadpy/tpuz@3f18779dc8dffb8bd4f7b6c3e5363730e5182ad2
- Branch / Tag: refs/tags/v0.1.7
- Owner: https://github.com/mlnomadpy
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yaml@3f18779dc8dffb8bd4f7b6c3e5363730e5182ad2
- Trigger Event: push