
tpuz

Manage GCP TPU VMs from your terminal. Create, train, debug, recover, teardown — one command.

pip install tpuz

Why?

Training on TPU pods normally means 10+ gcloud commands, manual SSH into every worker, no preemption handling, no cost visibility, and painful debugging. tpuz wraps it all:

from tpuz import TPU

tpu = TPU("my-tpu", accelerator="v4-8")
tpu.up()
tpu.setup()
tpu.run("python train.py", sync="./src")
tpu.logs()
tpu.down()

Or in one command:

tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown

Features

Lifecycle

tpu.preflight()            # Verify gcloud config
tpu.up()                   # Create VM (idempotent)
tpu.up_queued()            # Queued Resources API (reliable spot)
tpu.down()                 # Delete VM
tpu.info()                 # State, IPs, accelerator
tpu.setup(extra_pip="jax") # Install JAX[TPU] + deps
tpu.verify()               # Verify JAX on all workers

Training

tpu.run("python train.py", sync="./src", env={"KEY": "val"})
tpu.logs()                 # Stream logs (Ctrl-C to detach)
tpu.logs_all()             # Color-coded logs from ALL workers
tpu.is_running()           # Check if alive
tpu.kill()                 # Stop training
tpu.wait()                 # Poll for COMPLETE/FAILED
tpu.collect(["model.pkl"]) # Download artifacts

Cost Tracking

tpu.cost_summary()  # "$4.12 (2.0h x $2.06/hr v4-8 spot)"
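
The summary string is just elapsed hours times the per-accelerator spot rate. A rough sketch of the arithmetic (the rate table and function here are illustrative, not tpuz internals):

```python
# Illustrative only: spot $/hr rates copied from the pricing table below.
SPOT_RATES = {"v4-8": 2.06, "v4-32": 8.24}

def cost_summary(accelerator: str, hours: float) -> str:
    """Format a cost summary like the one tpu.cost_summary() reports."""
    rate = SPOT_RATES[accelerator]
    return f"${hours * rate:.2f} ({hours:.1f}h x ${rate:.2f}/hr {accelerator} spot)"

print(cost_summary("v4-8", 2.0))  # $4.12 (2.0h x $2.06/hr v4-8 spot)
```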

GCS Checkpoint Sync

from tpuz import GCS

gcs = GCS("gs://my-bucket")
gcs.upload_checkpoint("./ckpt", "run-01", step=1000)
gcs.latest_step("run-01")   # 5000
gcs.list_runs()              # ["run-01", "run-02"]

# Auto-resume from latest checkpoint
tpu.run_with_resume("python train.py", gcs=gcs, run_name="run-01")
# Finds step 5000 -> appends --resume-from-step=5000
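
The resume step reduces to: ask GCS for the latest checkpoint step and, if one exists, append the flag. A minimal sketch, assuming latest_step is None when the run has no checkpoints (the helper name with_resume is hypothetical):

```python
from typing import Optional

def with_resume(cmd: str, latest_step: Optional[int]) -> str:
    """Append a resume flag when a checkpoint exists, else run from scratch."""
    if latest_step is None:
        return cmd
    return f"{cmd} --resume-from-step={latest_step}"

print(with_resume("python train.py", 5000))  # python train.py --resume-from-step=5000
```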

Preemption Recovery

tpu.watch("python train.py", max_retries=5)
# Polls every 60s -> on PREEMPTED: delete -> recreate -> setup -> restart

# With Slack notifications
tpu.watch_notify("python train.py",
    notify_url="https://hooks.slack.com/services/...",
    max_retries=5)
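
The watch loop amounts to a poll-and-retry state machine. A simplified sketch, with state strings and callbacks standing in for the real gcloud calls:

```python
import time

def watch(get_state, recreate, restart, max_retries=5, poll_s=60, sleep=time.sleep):
    """Poll the VM state; on preemption, recreate the VM and relaunch training."""
    retries = 0
    while True:
        state = get_state()
        if state == "COMPLETE":
            return retries
        if state == "PREEMPTED":
            if retries >= max_retries:
                raise RuntimeError("gave up after max_retries preemptions")
            recreate()   # delete -> recreate -> setup
            restart()    # relaunch the training command
            retries += 1
        sleep(poll_s)
```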

Debugging

tpu.repl()                                # Interactive Python on worker 0
tpu.debug("python train.py", port=5678)   # VS Code debugger attach
tpu.logs_all(lines=20)                    # All workers side by side
tpu.health_pretty()                       # Worker dashboard:
#   Worker     Status          Last Log
#   -------------------------------------------
#   worker 0   running         step 1234 | loss 2.31
#   worker 1   running         step 1234 | loss 2.31
#   worker 2   stopped         (no log)

SSH Tunnel

tpu.tunnel(6006)           # TensorBoard: localhost:6006
tpu.tunnel(8888, 9999)     # Jupyter: localhost:9999 -> TPU:8888

Scaling

tpu.scale("v4-32")  # Delete -> recreate with v4-32 -> re-setup

Multi-Zone Failover

tpu = TPU.create_multi_zone("my-tpu", "v4-8",
    zones=["us-central2-b", "us-central1-a", "europe-west4-a"])
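
Failover of this kind boils down to walking the zone list until a create succeeds. A hypothetical sketch, with create_in_zone standing in for the real per-zone create call:

```python
def create_multi_zone(create_in_zone, zones):
    """Try each zone in order; return the first successful creation."""
    errors = {}
    for zone in zones:
        try:
            return create_in_zone(zone)
        except Exception as exc:  # e.g. no capacity in this zone
            errors[zone] = exc
    raise RuntimeError(f"all zones failed: {sorted(errors)}")
```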

Availability Check

TPU.availability("v4-8", zone="us-central2-b")
# {"available": True, "spot_rate": 2.06, "on_demand_rate": 6.18}

Run-Once (Docker-like)

tpu.run_once("python train.py",
    sync="./src",
    collect_files=["model.pkl", "results.json"],
    gcs=gcs,
    notify_url="https://hooks.slack.com/...")
# up -> setup -> resume -> run -> wait -> collect -> notify -> down
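
The arrow pipeline above suggests a try/finally shape, so the VM is deleted even when a middle step fails. A sketch against the methods shown earlier (the resume and notify steps are elided):

```python
def run_once(tpu, cmd, collect_files=()):
    """up -> setup -> run -> wait -> collect -> down, tearing down on failure."""
    tpu.up()
    try:
        tpu.setup()
        tpu.run(cmd)
        status = tpu.wait()
        if collect_files:
            tpu.collect(list(collect_files))
        return status
    finally:
        tpu.down()  # always delete the VM, even if a step raised
```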

Scheduled Training

tpu.schedule("python train.py",
    start_after="22:00",   # Wait until 10 PM
    max_cost=10.0)         # Kill if exceeds $10
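
The start_after part is a sleep until the next occurrence of a wall-clock time. A sketch of that calculation (the helper name seconds_until is hypothetical):

```python
import datetime

def seconds_until(start_after: str, now: datetime.datetime) -> float:
    """Seconds to sleep until the next occurrence of an HH:MM wall-clock time."""
    hh, mm = map(int, start_after.split(":"))
    target = now.replace(hour=hh, minute=mm, second=0, microsecond=0)
    if target <= now:
        target += datetime.timedelta(days=1)  # already past today -> tomorrow
    return (target - now).total_seconds()
```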

Environment Snapshot/Restore

tpu.snapshot_env(gcs=gcs)   # pip freeze -> GCS
tpu.restore_env(gcs=gcs)    # Restore after preemption

Secrets (Cloud Secret Manager)

Recommended: Use Google Cloud Secret Manager. Secrets never leave GCP:

from tpuz import SecretManager

# One-time setup
sm = SecretManager()
sm.create("WANDB_API_KEY", "your-key")
sm.grant_tpu_access_all()

# Training: VM reads secrets directly from GCP
tpu.run("python train.py", secrets=["WANDB_API_KEY", "HF_TOKEN"])

Fallback: env={} writes a .env file via SCP (the transfer is encrypted, but the secrets transit your machine).

See docs/secrets.md for full setup guide and security comparison.

Multi-Host (TPU Pods)

The worker count is auto-detected. All SSH commands run in parallel with per-worker retries:

Accelerator    Chips  Workers  Spot $/hr
v4-8               4        1      $2.06
v4-32             16        4      $8.24
v5litepod-8        8        1      $9.60
v5litepod-64      64        8     $76.80
v6e-8              8        1      $9.60
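
A parallel fan-out with per-worker retries can be sketched with a thread pool; this illustrates the pattern, not tpuz's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def with_retries(fn, worker, attempts=3):
    """Run fn(worker), retrying transient failures independently per worker."""
    last = None
    for _ in range(attempts):
        try:
            return fn(worker)
        except OSError as exc:  # e.g. a dropped SSH connection
            last = exc
    raise last

def run_on_all(fn, workers):
    """Fan a command out to every worker in parallel; results keep worker order."""
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        return list(pool.map(lambda w: with_retries(fn, w), workers))
```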

CLI

# Lifecycle
tpuz up my-tpu -a v4-8
tpuz down my-tpu
tpuz status my-tpu
tpuz list
tpuz preflight
tpuz avail v4-8

# Training
tpuz setup my-tpu --pip="flaxchat"
tpuz verify my-tpu
tpuz run my-tpu "python train.py" --sync=./src
tpuz logs my-tpu
tpuz logs-all my-tpu
tpuz kill my-tpu
tpuz wait my-tpu
tpuz collect my-tpu model.pkl results.json

# Debugging
tpuz repl my-tpu
tpuz debug my-tpu "python train.py"
tpuz health my-tpu
tpuz tunnel my-tpu 6006
tpuz scale my-tpu v4-32
tpuz cost my-tpu

# Recovery
tpuz watch my-tpu "python train.py"

# All-in-one
tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown
tpuz run-once my-tpu "python train.py" --sync=. --collect model.pkl

Development Workflow

from tpuz import TPU

# 1. Develop on single host
dev = TPU("dev", accelerator="v4-8")
dev.up()
dev.setup()
dev.repl()  # Interactive development

# 2. Test training
dev.run("python train.py --steps=10", sync="./src")
dev.logs()

# 3. Scale up
dev.scale("v4-32")  # 4 workers now
dev.run("python train.py --steps=50000", sync="./src")
dev.watch("python train.py --steps=50000")

# 4. Collect and cleanup
dev.collect(["model.pkl", "results.json"])
dev.cost_summary()  # $12.36
dev.down()

Documentation

Requirements

  • gcloud CLI installed and authenticated
  • GCP project with TPU quota
  • Python 3.10+
  • Zero Python dependencies

Pair with kgz

pip install kgz     # Kaggle free GPUs
pip install tpuz    # GCP TPU pods

Claude Code Integration

mkdir -p ~/.claude/skills/tpuz-guide
cp SKILL.md ~/.claude/skills/tpuz-guide/skill.md

License

MIT

Acknowledgments

Cloud TPU resources for developing and testing tpuz were provided by Google's TPU Research Cloud (TRC) program. We gratefully acknowledge their support in making TPU access available for open-source research.


Download files

Download the file for your platform.

Source Distribution

tpuz-0.1.0.tar.gz (36.8 kB view details)

Uploaded Source

Built Distribution

tpuz-0.1.0-py3-none-any.whl (28.9 kB view details)

Uploaded Python 3

File details

Details for the file tpuz-0.1.0.tar.gz.

File metadata

  • Download URL: tpuz-0.1.0.tar.gz
  • Upload date:
  • Size: 36.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tpuz-0.1.0.tar.gz
Algorithm Hash digest
SHA256 3dc01c8c25d917dcdd27d458e73bab51ec74ee432f0c593e8f424055979861d7
MD5 cb5c0a7fffd894311a269e90b10f5529
BLAKE2b-256 2b8a511dff71729ee88ec6eebe4a84eabafe113f64be2d13f86c44d94b627368

Provenance

The following attestation bundles were made for tpuz-0.1.0.tar.gz:

Publisher: publish.yaml on mlnomadpy/tpuz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tpuz-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tpuz-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 28.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tpuz-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d267c7f4822eb33e74e23ba1558e59aa7b487950b2664ebc975000eace0e8052
MD5 c1836b67a19fba04bf35e14cef0bd5a7
BLAKE2b-256 2c515c28f34d4496bd6d3f330a6a6bc86ede4c38f6ea5c9e63794ef247228409

Provenance

The following attestation bundles were made for tpuz-0.1.0-py3-none-any.whl:

Publisher: publish.yaml on mlnomadpy/tpuz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
