
tpuz

Manage GCP TPU & GPU VMs from your terminal.

Create, train, debug, recover, teardown — one command.


Getting Started · Docs · GPU Guide · Security


Why?

Training on GCP TPUs/GPUs means juggling gcloud commands, SSH sessions, preemption recovery, cost tracking, and secrets. tpuz handles all of it:

from tpuz import TPU

tpu = TPU("my-tpu", accelerator="v4-8")
tpu.up()
tpu.setup()
tpu.run("python train.py", secrets=["WANDB_API_KEY"], sync="./src")
tpu.logs()
tpu.cost_summary()  # $4.12 (2.0h × $2.06/hr)
tpu.down()

Or one command:

tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown

Install

pip install tpuz

Zero Python dependencies. Requires the gcloud CLI to be installed and authenticated.
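Since everything shells out to gcloud, it is worth confirming the CLI is actually on PATH before the first run (tpuz preflight does a fuller check). A minimal sketch; gcloud_available is a hypothetical helper, not part of tpuz:

```python
import shutil

def gcloud_available() -> bool:
    """Rough preflight: is the gcloud CLI on PATH?"""
    return shutil.which("gcloud") is not None
```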

Features

Core

tpu.up()                    # Create TPU VM (idempotent)
tpu.up_queued()             # Queued Resources (reliable spot)
tpu.setup()                 # Install JAX[TPU] + deps
tpu.verify()                # Check JAX on all workers
tpu.run("cmd", sync=".")    # Upload code + launch
tpu.logs()                  # Stream training logs
tpu.wait()                  # Poll for completion
tpu.collect(["model.pkl"])  # Download artifacts
tpu.down()                  # Delete VM

GPU VMs

from tpuz import GCE

vm = GCE.gpu("my-vm", gpu="a100")   # A100 40GB
vm = GCE.gpu("my-vm", gpu="h100x8") # 8x H100
vm = GCE.gpu("my-vm", gpu="t4")     # T4 (cheapest)
vm.up()                              # Same API as TPU

Secrets (Cloud Secret Manager)

from tpuz import SecretManager

sm = SecretManager()
sm.create("WANDB_API_KEY", "your-key")
sm.grant_tpu_access_all()

tpu.run("python train.py", secrets=["WANDB_API_KEY"])
# Secrets never leave GCP — loaded server-side via IAM
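On the VM side, the training script then reads the injected secret from its environment. A sketch under the assumption that tpuz surfaces each granted secret as an environment variable of the same name; require_secret is a hypothetical helper, not part of tpuz:

```python
import os

def require_secret(name: str, env=os.environ) -> str:
    """Fetch an injected secret, failing loudly if the grant is missing."""
    value = env.get(name)
    if value is None:
        raise RuntimeError(f"{name} was not injected; check grant_tpu_access_all()")
    return value

# inside train.py:
# wandb.login(key=require_secret("WANDB_API_KEY"))
```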

Checkpoints (GCS)

from tpuz import GCS

gcs = GCS("gs://my-bucket")
tpu.run_with_resume("python train.py", gcs=gcs, run_name="run-01")
# Auto-detects latest checkpoint → appends --resume-from-step=5000
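For this to work, train.py has to accept the appended flag. A minimal argparse sketch of the script side; the flag name comes from the comment above, everything else is an assumption:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--resume-from-step", type=int, default=0,
                    help="appended by run_with_resume() when a checkpoint exists")

# simulate what the launcher would pass after finding checkpoint step 5000
args = parser.parse_args(["--resume-from-step=5000"])
start_step = args.resume_from_step
```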

Preemption Recovery

tpu.watch_notify("python train.py",
    notify_url="https://hooks.slack.com/...",
    max_retries=5)
# Auto: delete → recreate → setup → restart from checkpoint → Slack notify
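The loop behind watch_notify can be pictured as a bounded retry. This is a sketch of the idea, not the actual implementation; run_attempt stands in for the delete, recreate, setup, and restart sequence:

```python
def watch(run_attempt, max_retries: int = 5) -> bool:
    """Retry after each preemption, up to max_retries, then give up.

    run_attempt() returns True on completion and raises on preemption.
    """
    for _ in range(max_retries + 1):
        try:
            return run_attempt()
        except RuntimeError:   # stand-in for a detected preemption
            continue           # delete -> recreate -> setup -> restart
    return False
```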

Debugging

tpu.repl()                             # Interactive Python REPL
tpu.debug("python train.py")           # VS Code debugger attach
tpu.tunnel(6006)                       # TensorBoard
tpu.health_check()                     # Full health dashboard:
#   Process:   running
#   Heartbeat: fresh (12s ago)
#   Disk:      45% (90/200 GB)
#   GPU:       85% utilization
#   Training:  step 1234/5000 | loss 2.31 | 56,000 tok/s
#   ETA:       ~35m

Cost Control

tpu.cost_summary()                     # $4.12 (2.0h × $2.06/hr)
tpu.set_budget(50, notify_url=slack)   # Alert at $40, kill at $50
tpu.schedule("python train.py",
    start_after="22:00", max_cost=10)  # Train overnight, budget $10
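The numbers in cost_summary() are just elapsed hours times the hourly rate, which makes budgets easy to sanity-check by hand. A sketch of that arithmetic; estimated_cost is not part of tpuz:

```python
def estimated_cost(hours: float, rate_per_hr: float) -> float:
    """Elapsed hours x hourly spot rate, rounded to cents."""
    return round(hours * rate_per_hr, 2)

estimated_cost(2.0, 2.06)   # matches the $4.12 summary above
```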

Scaling & Failover

tpu.scale("v4-32")                     # Upgrade: v4-8 → v4-32
TPU.create_multi_zone("tpu", "v4-8",
    zones=["us-central2-b", "europe-west4-a"])  # Try each zone

Run-Once (Docker-like)

tpu.run_once("python train.py",
    sync="./src", collect_files=["model.pkl"],
    gcs=gcs, notify_url=slack)
# up → setup → resume → run → wait → collect → notify → down

Profiles & Audit

tpu.save_profile("big-run")            # Save config for reuse
tpu = TPU.from_profile("big-run", "new-tpu")  # Reuse later
tpu.dry_run("python train.py")         # Preview commands without executing

Multi-Host (TPU Pods)

Auto-detected. All SSH commands run in parallel across workers, with per-worker retries:

Accelerator    Chips  Workers  Spot $/hr
v4-8               4        1      $2.06
v4-32             16        4      $8.24
v5litepod-8        8        1      $9.60
v5litepod-64      64        8     $76.80
v6e-8              8        1      $9.60
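The worker counts in the table follow a simple pattern. The rule below is inferred from the table, not documented behavior: v4 names count TensorCores at 2 per chip with 4 chips per host, while v5e/v6e names count chips at 8 per host. Treat it as a mnemonic, not an API:

```python
def worker_count(accelerator: str) -> int:
    """Infer the worker (host) count from an accelerator name, per the table."""
    family, n = accelerator.rsplit("-", 1)
    cores = int(n)
    chips = cores // 2 if family == "v4" else cores  # v4 names count TensorCores
    chips_per_host = 4 if family == "v4" else 8
    return max(1, chips // chips_per_host)
```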

CLI

tpuz up NAME -a v4-8          tpuz logs NAME
tpuz down NAME                tpuz logs-all NAME
tpuz status NAME              tpuz health NAME
tpuz setup NAME               tpuz tunnel NAME 6006
tpuz verify NAME              tpuz repl NAME
tpuz run NAME "cmd" --sync=.  tpuz debug NAME "cmd"
tpuz wait NAME                tpuz scale NAME v4-32
tpuz kill NAME                tpuz cost NAME
tpuz collect NAME files...    tpuz avail v4-8
tpuz watch NAME "cmd"         tpuz preflight
tpuz train NAME "cmd" -a v4-8 --recover --teardown
tpuz run-once NAME "cmd" --sync=. --collect model.pkl

Documentation

Pair with kgz

pip install kgz     # Kaggle free GPUs — execute code remotely
pip install tpuz    # GCP TPU/GPU pods — manage VM lifecycle

Claude Code

To use the tpuz guide as a Claude Code skill, copy the repo's SKILL.md into your skills directory:

mkdir -p ~/.claude/skills/tpuz-guide
cp SKILL.md ~/.claude/skills/tpuz-guide/skill.md

Acknowledgments

Cloud TPU resources provided by Google's TPU Research Cloud (TRC) program.

License

MIT
