Manage GCP TPU VMs from your terminal — create, run, recover, teardown

tpuz

Manage GCP TPU & GPU VMs from your terminal.

Create, train, debug, recover, teardown — one command.


Why?

Training on GCP TPUs/GPUs means juggling gcloud commands, SSH sessions, preemption recovery, cost tracking, and secrets. tpuz handles all of it:

from tpuz import TPU

tpu = TPU("my-tpu", accelerator="v4-8")
tpu.up()
tpu.setup()
tpu.run("python train.py", secrets=["WANDB_API_KEY"], sync="./src")
tpu.logs()
tpu.cost_summary()  # $4.12 (2.0h × $2.06/hr)
tpu.down()

Or one command:

tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown

Install

pip install tpuz

Zero Python dependencies. Requires the gcloud CLI to be installed and on your PATH.
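
Because everything depends on the gcloud CLI, it can help to check for it before the first run. The snippet below is a minimal sketch using only the standard library; `cli_available` is a hypothetical helper, not part of tpuz.

```python
import shutil

def cli_available(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None

if not cli_available("gcloud"):
    print("gcloud not found on PATH; install the Google Cloud CLI first")
```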

Features

Core

tpu.up()                    # Create TPU VM (idempotent)
tpu.up_queued()             # Queued Resources (reliable spot)
tpu.setup()                 # Install JAX[TPU] + deps
tpu.verify()                # Check JAX on all workers
tpu.run("cmd", sync=".")    # Upload code + launch
tpu.logs()                  # Stream training logs
tpu.wait()                  # Poll for completion
tpu.collect(["model.pkl"])  # Download artifacts
tpu.down()                  # Delete VM

GPU VMs

from tpuz import GCE

vm = GCE.gpu("my-vm", gpu="a100")   # A100 40GB
vm = GCE.gpu("my-vm", gpu="h100x8") # 8x H100
vm = GCE.gpu("my-vm", gpu="t4")     # T4 (cheapest)
vm.up()                              # Same API as TPU

Secrets (Cloud Secret Manager)

from tpuz import SecretManager

sm = SecretManager()
sm.create("WANDB_API_KEY", "your-key")
sm.grant_tpu_access_all()

tpu.run("python train.py", secrets=["WANDB_API_KEY"])
# Secrets never leave GCP — loaded server-side via IAM
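
One way server-side loading could work is for the VM itself to read each secret just before the job starts. The sketch below only builds the shell line; `build_secret_exports` is a hypothetical helper and tpuz's internals may differ. It assumes the VM's service account is allowed to run `gcloud secrets versions access`.

```python
import shlex

def build_secret_exports(secret_names, command):
    """Return a shell line that fetches each secret on the VM, then runs command.

    Hypothetical helper for illustration only.
    """
    exports = []
    for name in secret_names:
        if not name.isidentifier():
            raise ValueError(f"not a valid environment variable name: {name!r}")
        # The $(...) substitution is evaluated on the VM, so the secret
        # value never passes through the local machine.
        secret = shlex.quote(name)
        exports.append(
            f'export {name}="$(gcloud secrets versions access latest --secret={secret})"'
        )
    return " && ".join(exports + [command])

line = build_secret_exports(["WANDB_API_KEY"], "python train.py")
print(line)
```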

Checkpoints (GCS)

from tpuz import GCS

gcs = GCS("gs://my-bucket")
tpu.run_with_resume("python train.py", gcs=gcs, run_name="run-01")
# Auto-detects latest checkpoint → appends --resume-from-step=5000
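
The resume step can be pictured as "find the highest saved step, then append the flag". This is a rough sketch under an assumed naming scheme (`step_<N>`); the layout tpuz actually expects in the bucket is not stated here, and `resume_command` is a hypothetical name.

```python
import re

def resume_command(command, checkpoint_names):
    """Append --resume-from-step=N for the highest step found, if any.

    Assumes checkpoints are named like 'step_5000' (illustrative only).
    """
    steps = []
    for name in checkpoint_names:
        match = re.fullmatch(r"step_(\d+)", name)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        return command  # nothing to resume from, so start fresh
    return f"{command} --resume-from-step={max(steps)}"

print(resume_command("python train.py", ["step_1000", "step_5000", "tmp"]))
# python train.py --resume-from-step=5000
```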

Preemption Recovery

tpu.watch_notify("python train.py",
    notify_url="https://hooks.slack.com/...",
    max_retries=5)
# Auto: delete → recreate → setup → restart from checkpoint → Slack notify
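
The recovery loop above boils down to "launch, and if preempted, rebuild and try again up to max_retries times". The sketch below shows that control flow with placeholder callables; it is not tpuz's implementation, and `run_with_recovery` is a hypothetical name.

```python
def run_with_recovery(launch, recreate, max_retries=5):
    """Call launch() until it succeeds, recreating the VM after each failure.

    launch() returns True when training finished and False when the VM was
    preempted. Both callables are placeholders supplied by the caller.
    Returns the number of recoveries performed.
    """
    recoveries = 0
    while True:
        if launch():
            return recoveries
        if recoveries >= max_retries:
            raise RuntimeError(f"gave up after {max_retries} recoveries")
        recreate()  # stands in for: delete, recreate, setup
        recoveries += 1

# Simulated run: preempted twice, then completes.
outcomes = iter([False, False, True])
events = []
count = run_with_recovery(lambda: next(outcomes), lambda: events.append("recreated"))
print(count, events)
```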

Debugging

tpu.repl()                             # Interactive Python REPL
tpu.debug("python train.py")           # VS Code debugger attach
tpu.tunnel(6006)                       # TensorBoard
tpu.health_check()                     # Full health dashboard:
#   Process:   running
#   Heartbeat: fresh (12s ago)
#   Disk:      45% (90/200 GB)
#   GPU:       85% utilization
#   Training:  step 1234/5000 | loss 2.31 | 56,000 tok/s
#   ETA:       ~35m

Cost Control

tpu.cost_summary()                     # $4.12 (2.0h × $2.06/hr)
tpu.set_budget(50, notify_url=slack)   # Alert at $40, kill at $50
tpu.schedule("python train.py",
    start_after="22:00", max_cost=10)  # Train overnight, budget $10

Scaling & Failover

tpu.scale("v4-32")                     # Upgrade: v4-8 → v4-32
TPU.create_multi_zone("tpu", "v4-8",
    zones=["us-central2-b", "europe-west4-a"])  # Try each zone
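
"Try each zone" can be read as a first-success loop over the zone list. This is a minimal sketch of that idea with a caller-supplied `create` callable; the function name and the use of `RuntimeError` for a capacity failure are assumptions for illustration.

```python
def create_in_first_available(zones, create):
    """Try create(zone) for each zone in order; return the first that works."""
    errors = {}
    for zone in zones:
        try:
            create(zone)
            return zone
        except RuntimeError as exc:  # assumed signal for "no capacity here"
            errors[zone] = str(exc)
    raise RuntimeError(f"no capacity in any zone: {errors}")

def fake_create(zone):
    # Stand-in for a real create call: the first zone is out of capacity.
    if zone == "us-central2-b":
        raise RuntimeError("no capacity")

chosen = create_in_first_available(["us-central2-b", "europe-west4-a"], fake_create)
print(chosen)  # europe-west4-a
```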

Run-Once (Docker-like)

tpu.run_once("python train.py",
    sync="./src", collect_files=["model.pkl"],
    gcs=gcs, notify_url=slack)
# up → setup → resume → run → wait → collect → notify → down
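
A pipeline like this is only safe to leave unattended if the VM is deleted even when a step fails. Whether tpuz guarantees that is not stated here; the sketch below simply shows the usual try/finally pattern with placeholder steps, and `run_pipeline` is a hypothetical name.

```python
def run_pipeline(steps, teardown):
    """Run (name, callable) steps in order; always call teardown at the end."""
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
    finally:
        teardown()  # runs on success and on failure
    return completed

log = []
done = run_pipeline(
    [("up", lambda: log.append("up")), ("run", lambda: log.append("run"))],
    teardown=lambda: log.append("down"),
)
print(done, log)
```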

Profiles & Audit

tpu.save_profile("big-run")            # Save config for reuse
tpu = TPU.from_profile("big-run", "new-tpu")  # Reuse later
tpu.dry_run("python train.py")         # Preview commands without executing
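
A dry run is typically implemented by routing every external command through one place that can either execute it or just record it. The class below is a sketch of that pattern, not tpuz's code; `CommandRunner` is hypothetical, and the zone in the example is only illustrative.

```python
import shlex
import subprocess

class CommandRunner:
    """Run commands, or only record them when dry_run is True."""

    def __init__(self, dry_run=False):
        self.dry_run = dry_run
        self.planned = []

    def run(self, argv):
        self.planned.append(shlex.join(argv))
        if self.dry_run:
            return None  # preview only, nothing is executed
        return subprocess.run(argv, check=True)

runner = CommandRunner(dry_run=True)
runner.run(["gcloud", "compute", "tpus", "tpu-vm", "delete", "my-tpu",
            "--zone", "us-central2-b"])
print("\n".join(runner.planned))
```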

Multi-Host (TPU Pods)

Multi-host pods are detected automatically. All SSH commands run in parallel, with per-worker retries:

Accelerator    Chips  Workers  Spot $/hr
v4-8           4      1        $2.06
v4-32          16     4        $8.24
v5litepod-8    8      1        $9.60
v5litepod-64   64     8        $76.80
v6e-8          8      1        $9.60
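
Running a command on every worker in parallel, with each worker retried independently, can be sketched with a thread pool. This is an illustration only: `run_on_workers` and the `run_one` callable are hypothetical, and treating `OSError` as the retryable failure is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_workers(num_workers, run_one, retries=3):
    """Call run_one(worker) for every worker in parallel.

    Each worker is retried on its own up to `retries` times (retries >= 1).
    Results come back in worker order.
    """
    def with_retries(worker):
        last_error = None
        for _ in range(retries):
            try:
                return run_one(worker)
            except OSError as exc:  # assumed to mean a transient SSH failure
                last_error = exc
        raise last_error

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(with_retries, range(num_workers)))

# Simulated v4-32 pod (4 workers): worker 2 fails once, then succeeds.
attempts = {}
def flaky(worker):
    attempts[worker] = attempts.get(worker, 0) + 1
    if worker == 2 and attempts[worker] == 1:
        raise OSError("ssh timeout")
    return f"worker-{worker}: ok"

results = run_on_workers(4, flaky)
print(results)
```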

CLI

tpuz up NAME -a v4-8          tpuz logs NAME
tpuz down NAME                tpuz logs-all NAME
tpuz status NAME              tpuz health NAME
tpuz setup NAME               tpuz tunnel NAME 6006
tpuz verify NAME              tpuz repl NAME
tpuz run NAME "cmd" --sync=.  tpuz debug NAME "cmd"
tpuz wait NAME                tpuz scale NAME v4-32
tpuz kill NAME                tpuz cost NAME
tpuz collect NAME files...    tpuz avail v4-8
tpuz watch NAME "cmd"         tpuz preflight
tpuz train NAME "cmd" -a v4-8 --recover --teardown
tpuz run-once NAME "cmd" --sync=. --collect model.pkl

Pair with kgz

pip install kgz     # Kaggle free GPUs — execute code remotely
pip install tpuz    # GCP TPU/GPU pods — manage VM lifecycle

Claude Code

mkdir -p ~/.claude/skills/tpuz-guide
cp SKILL.md ~/.claude/skills/tpuz-guide/skill.md

Acknowledgments

Cloud TPU resources provided by Google's TPU Research Cloud (TRC) program.

License

MIT
