
tpuz

Manage GCP TPU & GPU VMs from your terminal.

Create, train, debug, recover, teardown — one command.


Getting Started · Docs · GPU Guide · Security


Why?

Training on GCP TPUs/GPUs means juggling gcloud commands, SSH sessions, preemption recovery, cost tracking, and secrets. tpuz handles all of it:

from tpuz import TPU

tpu = TPU("my-tpu", accelerator="v4-8")
tpu.up()
tpu.setup()
tpu.run("python train.py", secrets=["WANDB_API_KEY"], sync="./src")
tpu.logs()
tpu.cost_summary()  # $4.12 (2.0h × $2.06/hr)
tpu.down()

Or one command:

tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown

Install

pip install tpuz

Zero Python dependencies. Requires gcloud CLI (install).
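Since tpuz shells out to gcloud for every operation, it's worth confirming the CLI is actually on your PATH before anything else. A minimal sketch (the helper name is ours, not part of tpuz):

```python
import shutil

def gcloud_ready() -> bool:
    """True if the gcloud CLI is on PATH — tpuz shells out to it for every call."""
    return shutil.which("gcloud") is not None

if not gcloud_ready():
    print("gcloud not found — install the Google Cloud SDK first")
```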

Features

Core

tpu.up()                    # Create TPU VM (idempotent)
tpu.up_queued()             # Queued Resources (reliable spot)
tpu.setup()                 # Install JAX[TPU] + deps
tpu.verify()                # Check JAX on all workers
tpu.run("cmd", sync=".")    # Upload code + launch
tpu.logs()                  # Stream training logs
tpu.wait()                  # Poll for completion
tpu.collect(["model.pkl"])  # Download artifacts
tpu.down()                  # Delete VM

GPU VMs

from tpuz import GCE

vm = GCE.gpu("my-vm", gpu="a100")   # A100 40GB
vm = GCE.gpu("my-vm", gpu="h100x8") # 8x H100
vm = GCE.gpu("my-vm", gpu="t4")     # T4 (cheapest)
vm.up()                              # Same API as TPU

Secrets (Cloud Secret Manager)

from tpuz import SecretManager

sm = SecretManager()
sm.create("WANDB_API_KEY", "your-key")
sm.grant_tpu_access_all()

tpu.run("python train.py", secrets=["WANDB_API_KEY"])
# Secrets never leave GCP — loaded server-side via IAM
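On the VM side your script reads the granted secret; assuming it arrives as an environment variable of the same name (an assumption, not a confirmed tpuz contract), a defensive read looks like:

```python
import os

def require_secret(name: str) -> str:
    """Fail fast if an expected secret was not injected (hypothetical helper)."""
    value = os.environ.get(name)
    if not value:
        raise SystemExit(f"{name} missing — was grant_tpu_access_all() run?")
    return value

os.environ["WANDB_API_KEY"] = "dummy-for-illustration"  # simulate injection
key = require_secret("WANDB_API_KEY")
```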

Checkpoints (GCS)

from tpuz import GCS

gcs = GCS("gs://my-bucket")
tpu.run_with_resume("python train.py", gcs=gcs, run_name="run-01")
# Auto-detects latest checkpoint → appends --resume-from-step=5000
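Since run_with_resume appends --resume-from-step=N to your command, the training script must accept that flag. A minimal sketch of the script side (only the flag name comes from the example above; the rest is illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--resume-from-step", type=int, default=0,
                    help="set by tpuz when a checkpoint is found in GCS")
# simulate the flag tpuz would append after finding a checkpoint at step 5000
args = parser.parse_args(["--resume-from-step=5000"])

start_step = args.resume_from_step  # begin the training loop here, not at 0
```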

Preemption Recovery

tpu.watch_notify("python train.py",
    notify_url="https://hooks.slack.com/...",
    max_retries=5)
# Auto: delete → recreate → setup → restart from checkpoint → Slack notify

Debugging

tpu.repl()                             # Interactive Python REPL
tpu.debug("python train.py")           # VS Code debugger attach
tpu.tunnel(6006)                       # TensorBoard
tpu.health_check()                     # Full health dashboard:
#   Process:   running
#   Heartbeat: fresh (12s ago)
#   Disk:      45% (90/200 GB)
#   GPU:       85% utilization
#   Training:  step 1234/5000 | loss 2.31 | 56,000 tok/s
#   ETA:       ~35m

Cost Control

tpu.cost_summary()                     # $4.12 (2.0h × $2.06/hr)
tpu.set_budget(50, notify_url=slack)   # Alert at $40, kill at $50
tpu.schedule("python train.py",
    start_after="22:00", max_cost=10)  # Train overnight, budget $10
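The budget semantics in the comment above (alert at $40, kill at $50) amount to warning at 80% of the budget and stopping at 100%. A hypothetical sketch of that decision logic, not tpuz's actual implementation:

```python
def budget_action(spent: float, budget: float) -> str:
    """Hypothetical budget policy: alert at 80% of budget, kill at 100%."""
    if spent >= budget:
        return "kill"
    if spent >= 0.8 * budget:
        return "alert"
    return "ok"

assert budget_action(40.0, 50.0) == "alert"  # 'alert at $40'
assert budget_action(50.0, 50.0) == "kill"   # 'kill at $50'
```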

Scaling & Failover

tpu.scale("v4-32")                     # Upgrade: v4-8 → v4-32
TPU.create_multi_zone("tpu", "v4-8",
    zones=["us-central2-b", "europe-west4-a"])  # Try each zone
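create_multi_zone presumably walks the zone list until a creation succeeds. The failover pattern looks roughly like this (all names and the error type are illustrative stand-ins):

```python
def create_in_first_available(create, zones):
    """Try each zone in order; return the first successful creation."""
    last_err = None
    for zone in zones:
        try:
            return create(zone)
        except RuntimeError as err:  # e.g. capacity stockout in that zone
            last_err = err
    raise last_err

def fake_create(zone):
    # stand-in for the real gcloud create call; first zone is out of capacity
    if zone == "us-central2-b":
        raise RuntimeError("stockout")
    return f"tpu in {zone}"

vm = create_in_first_available(fake_create, ["us-central2-b", "europe-west4-a"])
```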

Run-Once (Docker-like)

tpu.run_once("python train.py",
    sync="./src", collect_files=["model.pkl"],
    gcs=gcs, notify_url=slack)
# up → setup → resume → run → wait → collect → notify → down

Profiles & Audit

tpu.save_profile("big-run")            # Save config for reuse
tpu = TPU.from_profile("big-run", "new-tpu")  # Reuse later
tpu.dry_run("python train.py")         # Preview commands without executing

Multi-Host (TPU Pods)

Pod slices are auto-detected. SSH commands run on all workers in parallel, with per-worker retries:

Accelerator    Chips  Workers  Spot $/hr
v4-8               4        1      $2.06
v4-32             16        4      $8.24
v5litepod-8        8        1      $9.60
v5litepod-64      64        8     $76.80
v6e-8              8        1      $9.60
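"Parallel with per-worker retries" can be pictured as a thread pool fanning one command out to every worker and retrying each independently. A sketch under that assumption — ssh_exec stands in for the real gcloud SSH call:

```python
from concurrent.futures import ThreadPoolExecutor

def ssh_exec(cmd: str, worker: int) -> str:
    # stand-in for `gcloud compute tpus tpu-vm ssh ... --worker=N`
    return f"worker {worker}: {cmd}"

def run_with_retries(cmd: str, worker: int, attempts: int = 3) -> str:
    for attempt in range(attempts):
        try:
            return ssh_exec(cmd, worker)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # all retries for this worker exhausted

def run_on_all(cmd: str, workers: list) -> list:
    # fan the command out to every pod worker in parallel
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        return list(pool.map(lambda w: run_with_retries(cmd, w), workers))

results = run_on_all("hostname", [0, 1, 2, 3])  # a v4-32 slice has 4 workers
```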

CLI

tpuz up NAME -a v4-8          tpuz logs NAME
tpuz down NAME                tpuz logs-all NAME
tpuz status NAME              tpuz health NAME
tpuz setup NAME               tpuz tunnel NAME 6006
tpuz verify NAME              tpuz repl NAME
tpuz run NAME "cmd" --sync=.  tpuz debug NAME "cmd"
tpuz wait NAME                tpuz scale NAME v4-32
tpuz kill NAME                tpuz cost NAME
tpuz collect NAME files...    tpuz avail v4-8
tpuz watch NAME "cmd"         tpuz preflight
tpuz train NAME "cmd" -a v4-8 --recover --teardown
tpuz run-once NAME "cmd" --sync=. --collect model.pkl

Documentation

Pair with kgz

pip install kgz     # Kaggle free GPUs — execute code remotely
pip install tpuz    # GCP TPU/GPU pods — manage VM lifecycle

Claude Code

mkdir -p ~/.claude/skills/tpuz-guide
cp SKILL.md ~/.claude/skills/tpuz-guide/skill.md

Acknowledgments

Cloud TPU resources provided by Google's TPU Research Cloud (TRC) program.

License

MIT
