
tpuz

Manage GCP TPU & GPU VMs from your terminal.

Create, train, debug, recover, teardown — one command.


Getting Started · Docs · GPU Guide · Security


Why?

Training on GCP TPUs/GPUs means juggling gcloud commands, SSH sessions, preemption recovery, cost tracking, and secrets. tpuz handles all of it:

from tpuz import TPU

tpu = TPU("my-tpu", accelerator="v4-8")
tpu.up()
tpu.setup()
tpu.run("python train.py", secrets=["WANDB_API_KEY"], sync="./src")
tpu.logs()
tpu.cost_summary()  # $4.12 (2.0h × $2.06/hr)
tpu.down()

Or one command:

tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown

Install

pip install tpuz

Zero Python dependencies. Requires gcloud CLI (install).
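Since tpuz shells out to gcloud, a quick sanity check before a run can save a failed launch. A hypothetical sketch (not tpuz's actual implementation) of the kind of check `tpuz preflight` could perform:

```python
# Hypothetical preflight helper: confirm a CLI is on PATH and runnable.
import shutil
import subprocess

def cli_available(name: str = "gcloud") -> bool:
    """Return True if the named CLI is installed and exits cleanly."""
    if shutil.which(name) is None:
        return False                       # not on PATH at all
    # `gcloud --version` exits 0 when the SDK is healthy.
    result = subprocess.run([name, "--version"], capture_output=True)
    return result.returncode == 0
```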

Features

Core

tpu.up()                    # Create TPU VM (idempotent)
tpu.up_queued()             # Queued Resources (reliable spot)
tpu.setup()                 # Install JAX[TPU] + deps
tpu.verify()                # Check JAX on all workers
tpu.run("cmd", sync=".")    # Upload code + launch
tpu.logs()                  # Stream training logs
tpu.wait()                  # Poll for completion
tpu.collect(["model.pkl"])  # Download artifacts
tpu.down()                  # Delete VM

GPU VMs

from tpuz import GCE

vm = GCE.gpu("my-vm", gpu="a100")   # A100 40GB
vm = GCE.gpu("my-vm", gpu="h100x8") # 8x H100
vm = GCE.gpu("my-vm", gpu="t4")     # T4 (cheapest)
vm.up()                              # Same API as TPU

Secrets (Cloud Secret Manager)

from tpuz import SecretManager

sm = SecretManager()
sm.create("WANDB_API_KEY", "your-key")
sm.grant_tpu_access_all()

tpu.run("python train.py", secrets=["WANDB_API_KEY"])
# Secrets never leave GCP — loaded server-side via IAM
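The mechanics behind "loaded server-side" are presumably along these lines (an assumption about tpuz internals, but the gcloud command itself is real): the VM reads the secret with `gcloud secrets versions access`, so the value never transits your laptop.

```python
# Sketch of a server-side secret fetch. Building the command as a pure
# function keeps it easy to inspect; on the VM it would be executed with
# subprocess.check_output(secret_access_cmd("WANDB_API_KEY"), text=True).
def secret_access_cmd(name: str, version: str = "latest") -> list[str]:
    """Build the gcloud command that reads a secret on the VM itself."""
    return ["gcloud", "secrets", "versions", "access", version,
            f"--secret={name}"]
```

Access is gated by the VM's service-account IAM bindings, which is what `grant_tpu_access_all()` above sets up.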

Checkpoints (GCS)

from tpuz import GCS

gcs = GCS("gs://my-bucket")
tpu.run_with_resume("python train.py", gcs=gcs, run_name="run-01")
# Auto-detects latest checkpoint → appends --resume-from-step=5000
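The resume logic described in the comment can be sketched as follows, assuming checkpoints are named like `ckpt-<step>` (the naming scheme is a guess, not tpuz's documented format):

```python
# Pick the highest-numbered checkpoint and build the resume flag.
import re

def resume_flag(checkpoint_names: list[str]) -> str:
    steps = [int(m.group(1))
             for name in checkpoint_names
             if (m := re.search(r"ckpt-(\d+)", name))]
    if not steps:
        return ""                      # fresh run: nothing to resume
    return f"--resume-from-step={max(steps)}"

print(resume_flag(["ckpt-1000", "ckpt-5000", "ckpt-2500"]))
# → --resume-from-step=5000
```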

Preemption Recovery

tpu.watch_notify("python train.py",
    notify_url="https://hooks.slack.com/...",
    max_retries=5)
# Auto: delete → recreate → setup → restart from checkpoint → Slack notify
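The recovery loop above amounts to control flow like this (an assumed sketch, not tpuz source): each preemption triggers a recreate and relaunch, up to `max_retries` times.

```python
# Minimal preemption-recovery loop: run_step() returns True on clean
# completion, False on preemption; recreate() stands in for
# delete -> recreate -> setup.
def watch_with_recovery(run_step, recreate, max_retries: int = 5) -> bool:
    """Return True once run_step() succeeds, False after exhausting retries."""
    for attempt in range(max_retries + 1):
        if run_step():
            return True                # training finished cleanly
        if attempt < max_retries:
            recreate()                 # rebuild the VM and try again
    return False

# Example: two simulated preemptions, then success.
outcomes = iter([False, False, True])
print(watch_with_recovery(lambda: next(outcomes), lambda: None))  # → True
```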

Debugging

tpu.repl()                             # Interactive Python REPL
tpu.debug("python train.py")           # VS Code debugger attach
tpu.tunnel(6006)                       # TensorBoard
tpu.health_check()                     # Full health dashboard:
#   Process:   running
#   Heartbeat: fresh (12s ago)
#   Disk:      45% (90/200 GB)
#   GPU:       85% utilization
#   Training:  step 1234/5000 | loss 2.31 | 56,000 tok/s
#   ETA:       ~35m

Cost Control

tpu.cost_summary()                     # $4.12 (2.0h × $2.06/hr)
tpu.set_budget(50, notify_url=slack)   # Alert at $40, kill at $50
tpu.schedule("python train.py",
    start_after="22:00", max_cost=10)  # Train overnight, budget $10
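A budget check consistent with the example above ("alert at $40, kill at $50") might look like this; the 80% alert threshold is inferred from those numbers, not a documented tpuz constant:

```python
# Classify a run against its budget from elapsed hours and hourly rate.
def budget_action(hours: float, rate: float, budget: float) -> str:
    cost = hours * rate
    if cost >= budget:
        return "kill"                  # hard stop at the budget
    if cost >= 0.8 * budget:
        return "alert"                 # warn at 80% of budget
    return "ok"

print(budget_action(19.5, 2.06, 50))   # 19.5h * $2.06/hr = $40.17 → alert
```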

Scaling & Failover

tpu.scale("v4-32")                     # Upgrade: v4-8 → v4-32
TPU.create_multi_zone("tpu", "v4-8",
    zones=["us-central2-b", "europe-west4-a"])  # Try each zone
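Multi-zone creation presumably tries each zone in order until one has capacity; a sketch of that fallback (assumed behavior):

```python
# Try each zone until try_create(zone) succeeds.
def create_in_first_available(zones, try_create):
    """Return the first zone where creation succeeds, else None."""
    for zone in zones:
        if try_create(zone):
            return zone                # stop at the first zone with capacity
    return None                        # every zone was out of capacity

# Example: us-central2-b is out of stock, europe-west4-a works.
zone = create_in_first_available(
    ["us-central2-b", "europe-west4-a"],
    lambda z: z == "europe-west4-a")
print(zone)  # → europe-west4-a
```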

Run-Once (Docker-like)

tpu.run_once("python train.py",
    sync="./src", collect_files=["model.pkl"],
    gcs=gcs, notify_url=slack)
# up → setup → resume → run → wait → collect → notify → down
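The Docker-like property of the pipeline above is that teardown runs even when the job fails. A sketch of that orchestration (assumed control flow, not tpuz source):

```python
# Run named steps in order; guarantee teardown via try/finally.
def run_once(steps, teardown):
    trace = []
    try:
        for name, step in steps:
            step()
            trace.append(name)
    finally:
        teardown()                     # the VM comes down no matter what
        trace.append("down")
    return trace

trace = run_once([("up", lambda: None), ("run", lambda: None)],
                 lambda: None)
print(trace)  # → ['up', 'run', 'down']
```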

Profiles & Audit

tpu.save_profile("big-run")            # Save config for reuse
tpu = TPU.from_profile("big-run", "new-tpu")  # Reuse later
tpu.dry_run("python train.py")         # Preview commands without executing

Multi-Host (TPU Pods)

Pods are auto-detected; all SSH commands run in parallel across workers, with per-worker retries:

Accelerator    Chips  Workers  Spot $/hr
v4-8               4        1      $2.06
v4-32             16        4      $8.24
v5litepod-8        8        1      $9.60
v5litepod-64      64        8     $76.80
v6e-8              8        1      $9.60
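The per-chip rates implied by the table can be checked with a few lines of arithmetic on the listed prices (no assumptions beyond the table itself):

```python
# Spot $/chip/hr from the table: v4 works out to $0.515/chip/hr at both
# sizes; v5litepod and v6e both list $1.20/chip/hr.
table = {"v4-8": (4, 2.06), "v4-32": (16, 8.24),
         "v5litepod-8": (8, 9.60), "v5litepod-64": (64, 76.80),
         "v6e-8": (8, 9.60)}
for name, (chips, price) in table.items():
    print(f"{name}: ${price / chips:.3f}/chip/hr")
```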

CLI

tpuz up NAME -a v4-8          tpuz logs NAME
tpuz down NAME                tpuz logs-all NAME
tpuz status NAME              tpuz health NAME
tpuz setup NAME               tpuz tunnel NAME 6006
tpuz verify NAME              tpuz repl NAME
tpuz run NAME "cmd" --sync=.  tpuz debug NAME "cmd"
tpuz wait NAME                tpuz scale NAME v4-32
tpuz kill NAME                tpuz cost NAME
tpuz collect NAME files...    tpuz avail v4-8
tpuz watch NAME "cmd"         tpuz preflight
tpuz train NAME "cmd" -a v4-8 --recover --teardown
tpuz run-once NAME "cmd" --sync=. --collect model.pkl

Documentation

Pair with kgz

pip install kgz     # Kaggle free GPUs — execute code remotely
pip install tpuz    # GCP TPU/GPU pods — manage VM lifecycle

Claude Code

Install the repo's SKILL.md as a Claude Code skill:

mkdir -p ~/.claude/skills/tpuz-guide
cp SKILL.md ~/.claude/skills/tpuz-guide/skill.md

Acknowledgments

Cloud TPU resources provided by Google's TPU Research Cloud (TRC) program.

License

MIT
