tpuz
Manage GCP TPU VMs from your terminal. Create, train, debug, recover, teardown — one command.
pip install tpuz
Why?
Training on TPU pods normally means 10+ gcloud commands and manual SSH into each worker, with no preemption handling, no cost visibility, and painful debugging. tpuz wraps it all:
from tpuz import TPU
tpu = TPU("my-tpu", accelerator="v4-8")
tpu.up()
tpu.setup()
tpu.run("python train.py", sync="./src")
tpu.logs()
tpu.down()
Or in one command:
tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown
Features
Lifecycle
tpu.preflight() # Verify gcloud config
tpu.up() # Create VM (idempotent)
tpu.up_queued() # Queued Resources API (reliable spot)
tpu.down() # Delete VM
tpu.info() # State, IPs, accelerator
tpu.setup(extra_pip="jax") # Install JAX[TPU] + deps
tpu.verify() # Verify JAX on all workers
Training
tpu.run("python train.py", sync="./src", env={"KEY": "val"})
tpu.logs() # Stream logs (Ctrl-C to detach)
tpu.logs_all() # Color-coded logs from ALL workers
tpu.is_running() # Check if alive
tpu.kill() # Stop training
tpu.wait() # Poll for COMPLETE/FAILED
tpu.collect(["model.pkl"]) # Download artifacts
Cost Tracking
tpu.cost_summary() # "$4.12 (2.0h x $2.06/hr v4-8 spot)"
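The summary string is just elapsed hours times the hourly rate. A minimal sketch of the arithmetic, using a hypothetical format_cost helper (not part of the tpuz API):

```python
def format_cost(hours: float, rate: float, accelerator: str, tier: str) -> str:
    """Render an elapsed-cost summary in the same shape cost_summary() prints."""
    return f"${hours * rate:.2f} ({hours:.1f}h x ${rate:.2f}/hr {accelerator} {tier})"

print(format_cost(2.0, 2.06, "v4-8", "spot"))  # $4.12 (2.0h x $2.06/hr v4-8 spot)
```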
GCS Checkpoint Sync
from tpuz import GCS
gcs = GCS("gs://my-bucket")
gcs.upload_checkpoint("./ckpt", "run-01", step=1000)
gcs.latest_step("run-01") # 5000
gcs.list_runs() # ["run-01", "run-02"]
# Auto-resume from latest checkpoint
tpu.run_with_resume("python train.py", gcs=gcs, run_name="run-01")
# Finds step 5000 -> appends --resume-from-step=5000
Preemption Recovery
tpu.watch("python train.py", max_retries=5)
# Polls every 60s -> on PREEMPTED: delete -> recreate -> setup -> restart
# With Slack notifications
tpu.watch_notify("python train.py",
notify_url="https://hooks.slack.com/services/...",
max_retries=5)
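The recovery loop above can be sketched generically. Here the state probe and the recovery/restart actions are injected callables so the control flow is testable locally; none of these helpers are tpuz internals:

```python
import time

def watch(get_state, recover, restart, max_retries=5, poll_s=60, sleep=time.sleep):
    """Poll TPU state; on PREEMPTED, recreate and restart up to max_retries times."""
    retries = 0
    while True:
        state = get_state()
        if state == "COMPLETE":
            return "COMPLETE"
        if state == "PREEMPTED":
            if retries >= max_retries:
                return "GAVE_UP"
            retries += 1
            recover()   # delete -> recreate -> setup
            restart()   # relaunch the training command
        sleep(poll_s)
```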
Debugging
tpu.repl() # Interactive Python on worker 0
tpu.debug("python train.py", port=5678) # VS Code debugger attach
tpu.logs_all(lines=20) # All workers side by side
tpu.health_pretty() # Worker dashboard:
# Worker Status Last Log
# -------------------------------------------
# worker 0 running step 1234 | loss 2.31
# worker 1 running step 1234 | loss 2.31
# worker 2 stopped (no log)
SSH Tunnel
tpu.tunnel(6006) # TensorBoard: localhost:6006
tpu.tunnel(8888, 9999) # Jupyter: localhost:9999 -> TPU:8888
Scaling
tpu.scale("v4-32") # Delete -> recreate with v4-32 -> re-setup
Multi-Zone Failover
tpu = TPU.create_multi_zone("my-tpu", "v4-8",
zones=["us-central2-b", "us-central1-a", "europe-west4-a"])
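A plausible reading of create_multi_zone: walk the zone list in order until a create succeeds. Sketched with an injected provision callable (hypothetical, not the tpuz implementation):

```python
def create_in_first_available(name, accelerator, zones, provision):
    """Try each zone in order; return the first zone that accepts the create request."""
    errors = {}
    for zone in zones:
        try:
            provision(name, accelerator, zone)
            return zone
        except RuntimeError as e:  # e.g. out-of-capacity in that zone
            errors[zone] = e
    raise RuntimeError(f"no capacity in any zone: {list(errors)}")
```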
Availability Check
TPU.availability("v4-8", zone="us-central2-b")
# {"available": True, "spot_rate": 2.06, "on_demand_rate": 6.18}
Run-Once (Docker-like)
tpu.run_once("python train.py",
sync="./src",
collect_files=["model.pkl", "results.json"],
gcs=gcs,
notify_url="https://hooks.slack.com/...")
# up -> setup -> resume -> run -> wait -> collect -> notify -> down
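The run-once pipeline is a linear sequence with guaranteed teardown. A sketch with injected step callables (hypothetical names, not tpuz internals) that captures the key property, namely that down() runs even if a step fails:

```python
def run_once(up, setup, run, wait, collect, down, notify=None):
    """Execute the full pipeline; always tear the VM down, even on failure."""
    try:
        up()
        setup()
        run()
        status = wait()
        if status == "COMPLETE":
            collect()
        if notify:
            notify(status)
        return status
    finally:
        down()  # teardown is unconditional, like a Docker --rm container
```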
Scheduled Training
tpu.schedule("python train.py",
start_after="22:00", # Wait until 10 PM
max_cost=10.0) # Kill if exceeds $10
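start_after="22:00" boils down to computing the delay until the next occurrence of that wall-clock time. A sketch of that computation (assumed, not taken from the tpuz source):

```python
from datetime import datetime, timedelta

def seconds_until(hhmm: str, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of HH:MM (today or tomorrow)."""
    hour, minute = map(int, hhmm.split(":"))
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # already past today -> wait for tomorrow
    return (target - now).total_seconds()

print(seconds_until("22:00", datetime(2024, 1, 1, 21, 0)))  # 3600.0
```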
Environment Snapshot/Restore
tpu.snapshot_env(gcs=gcs) # pip freeze -> GCS
tpu.restore_env(gcs=gcs) # Restore after preemption
Secrets (Cloud Secret Manager)
Recommended: Use Google Cloud Secret Manager. Secrets never leave GCP:
from tpuz import SecretManager
# One-time setup
sm = SecretManager()
sm.create("WANDB_API_KEY", "your-key")
sm.grant_tpu_access_all()
# Training: VM reads secrets directly from GCP
tpu.run("python train.py", secrets=["WANDB_API_KEY", "HF_TOKEN"])
Fallback: env={} writes a .env file to the VM via SCP (the transfer is encrypted, but the secrets pass through your machine).
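That fallback amounts to serializing the env mapping into .env lines before the SCP copy. Roughly (hypothetical helper; tpuz's actual quoting rules may differ):

```python
def render_dotenv(env: dict[str, str]) -> str:
    """Serialize an env mapping to .env file contents, one quoted KEY="value" per line."""
    return "".join(f'{key}="{value}"\n' for key, value in env.items())

print(render_dotenv({"KEY": "val"}), end="")  # KEY="val"
```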
See docs/secrets.md for full setup guide and security comparison.
Multi-Host (TPU Pods)
Worker count auto-detected. All SSH commands run in parallel with per-worker retries:
| Accelerator | Chips | Workers | Spot $/hr |
|---|---|---|---|
| v4-8 | 4 | 1 | $2.06 |
| v4-32 | 16 | 4 | $8.24 |
| v5litepod-8 | 8 | 1 | $9.60 |
| v5litepod-64 | 64 | 8 | $76.80 |
| v6e-8 | 8 | 1 | $9.60 |
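The parallel fan-out with per-worker retries maps naturally onto a thread pool plus a small retry wrapper. A sketch with an injected ssh callable (not tpuz internals):

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_all_workers(ssh, n_workers: int, cmd: str, retries: int = 3):
    """Run cmd on every worker in parallel, retrying each worker independently."""
    def with_retries(worker: int):
        last = None
        for _ in range(retries):
            try:
                return ssh(worker, cmd)
            except ConnectionError as e:  # transient SSH failure on that worker only
                last = e
        raise last

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(with_retries, range(n_workers)))  # ordered by worker index
```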
CLI
# Lifecycle
tpuz up my-tpu -a v4-8
tpuz down my-tpu
tpuz status my-tpu
tpuz list
tpuz preflight
tpuz avail v4-8
# Training
tpuz setup my-tpu --pip="flaxchat"
tpuz verify my-tpu
tpuz run my-tpu "python train.py" --sync=./src
tpuz logs my-tpu
tpuz logs-all my-tpu
tpuz kill my-tpu
tpuz wait my-tpu
tpuz collect my-tpu model.pkl results.json
# Debugging
tpuz repl my-tpu
tpuz debug my-tpu "python train.py"
tpuz health my-tpu
tpuz tunnel my-tpu 6006
tpuz scale my-tpu v4-32
tpuz cost my-tpu
# Recovery
tpuz watch my-tpu "python train.py"
# All-in-one
tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown
tpuz run-once my-tpu "python train.py" --sync=. --collect model.pkl
Development Workflow
from tpuz import TPU
# 1. Develop on single host
dev = TPU("dev", accelerator="v4-8")
dev.up()
dev.setup()
dev.repl() # Interactive development
# 2. Test training
dev.run("python train.py --steps=10", sync="./src")
dev.logs()
# 3. Scale up
dev.scale("v4-32") # 4 workers now
dev.run("python train.py --steps=50000", sync="./src")
dev.watch("python train.py --steps=50000")
# 4. Collect and cleanup
dev.collect(["model.pkl", "results.json"])
dev.cost_summary() # $12.36
dev.down()
Documentation
- docs/secrets.md — Secrets & security guide (Cloud Secret Manager setup)
- docs/best-practices.md — Training workflow, cost optimization, multi-host tips
- SKILL.md — Claude Code skill reference
- CLAUDE.md — Quick reference for AI agents
Requirements
- gcloud CLI installed and authenticated
- GCP project with TPU quota
- Python 3.10+
- Zero Python dependencies
Pair with kgz
pip install kgz # Kaggle free GPUs
pip install tpuz # GCP TPU pods
Claude Code Integration
mkdir -p ~/.claude/skills/tpuz-guide
cp SKILL.md ~/.claude/skills/tpuz-guide/skill.md
License
MIT
Acknowledgments
Cloud TPU resources for developing and testing tpuz were provided by Google's TPU Research Cloud (TRC) program. We gratefully acknowledge their support in making TPU access available for open-source research.