
tpuz

Manage GCP TPU VMs from your terminal. Create, train, debug, recover, teardown — one command.

pip install tpuz

Why?

Training on TPU pods normally means 10+ gcloud commands, manual SSH into every worker, no preemption handling, no cost visibility, and painful debugging. tpuz wraps it all:

from tpuz import TPU

tpu = TPU("my-tpu", accelerator="v4-8")
tpu.up()
tpu.setup()
tpu.run("python train.py", sync="./src")
tpu.logs()
tpu.down()

Or in one command:

tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown

Features

Lifecycle

tpu.preflight()            # Verify gcloud config
tpu.up()                   # Create VM (idempotent)
tpu.up_queued()            # Queued Resources API (reliable spot)
tpu.down()                 # Delete VM
tpu.info()                 # State, IPs, accelerator
tpu.setup(extra_pip="jax") # Install JAX[TPU] + deps
tpu.verify()               # Verify JAX on all workers

Training

tpu.run("python train.py", sync="./src", env={"KEY": "val"})
tpu.logs()                 # Stream logs (Ctrl-C to detach)
tpu.logs_all()             # Color-coded logs from ALL workers
tpu.is_running()           # Check if alive
tpu.kill()                 # Stop training
tpu.wait()                 # Poll for COMPLETE/FAILED
tpu.collect(["model.pkl"]) # Download artifacts

Cost Tracking

tpu.cost_summary()  # "$4.12 (2.0h x $2.06/hr v4-8 spot)"
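
The summary string is just elapsed hours times the per-accelerator spot rate. A rough sketch of the arithmetic (the rate table and function here are illustrative, not tpuz internals):

```python
# Illustrative only: spot $/hr rates copied from the pricing table below.
SPOT_RATES = {"v4-8": 2.06, "v4-32": 8.24}

def cost_summary(accelerator: str, hours: float) -> str:
    """Format a cost summary like the one tpu.cost_summary() reports."""
    rate = SPOT_RATES[accelerator]
    return f"${hours * rate:.2f} ({hours:.1f}h x ${rate:.2f}/hr {accelerator} spot)"

print(cost_summary("v4-8", 2.0))  # $4.12 (2.0h x $2.06/hr v4-8 spot)
```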

GCS Checkpoint Sync

from tpuz import GCS

gcs = GCS("gs://my-bucket")
gcs.upload_checkpoint("./ckpt", "run-01", step=1000)
gcs.latest_step("run-01")   # 5000
gcs.list_runs()              # ["run-01", "run-02"]

# Auto-resume from latest checkpoint
tpu.run_with_resume("python train.py", gcs=gcs, run_name="run-01")
# Finds step 5000 -> appends --resume-from-step=5000
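
The resume step reduces to: ask GCS for the latest checkpoint step and, if one exists, append the flag. A minimal sketch, assuming latest_step is None when the run has no checkpoints (the helper name with_resume is hypothetical):

```python
from typing import Optional

def with_resume(cmd: str, latest_step: Optional[int]) -> str:
    """Append a resume flag when a checkpoint exists, else run from scratch."""
    if latest_step is None:
        return cmd
    return f"{cmd} --resume-from-step={latest_step}"

print(with_resume("python train.py", 5000))  # python train.py --resume-from-step=5000
```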

Preemption Recovery

tpu.watch("python train.py", max_retries=5)
# Polls every 60s -> on PREEMPTED: delete -> recreate -> setup -> restart

# With Slack notifications
tpu.watch_notify("python train.py",
    notify_url="https://hooks.slack.com/services/...",
    max_retries=5)
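
The watch loop amounts to a poll-and-retry state machine. A simplified sketch, with state strings and callbacks standing in for the real gcloud calls:

```python
import time

def watch(get_state, recreate, restart, max_retries=5, poll_s=60, sleep=time.sleep):
    """Poll the VM state; on preemption, recreate the VM and relaunch training."""
    retries = 0
    while True:
        state = get_state()
        if state == "COMPLETE":
            return retries
        if state == "PREEMPTED":
            if retries >= max_retries:
                raise RuntimeError("gave up after max_retries preemptions")
            recreate()   # delete -> recreate -> setup
            restart()    # relaunch the training command
            retries += 1
        sleep(poll_s)
```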

Debugging

tpu.repl()                                # Interactive Python on worker 0
tpu.debug("python train.py", port=5678)   # VS Code debugger attach
tpu.logs_all(lines=20)                    # All workers side by side
tpu.health_pretty()                       # Worker dashboard:
#   Worker     Status          Last Log
#   -------------------------------------------
#   worker 0   running         step 1234 | loss 2.31
#   worker 1   running         step 1234 | loss 2.31
#   worker 2   stopped         (no log)

SSH Tunnel

tpu.tunnel(6006)           # TensorBoard: localhost:6006
tpu.tunnel(8888, 9999)     # Jupyter: localhost:9999 -> TPU:8888

Scaling

tpu.scale("v4-32")  # Delete -> recreate with v4-32 -> re-setup

Multi-Zone Failover

tpu = TPU.create_multi_zone("my-tpu", "v4-8",
    zones=["us-central2-b", "us-central1-a", "europe-west4-a"])
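
Failover of this kind boils down to walking the zone list until a create succeeds. A hypothetical sketch, with create_in_zone standing in for the real per-zone create call:

```python
def create_multi_zone(create_in_zone, zones):
    """Try each zone in order; return the first successful creation."""
    errors = {}
    for zone in zones:
        try:
            return create_in_zone(zone)
        except Exception as exc:  # e.g. no capacity in this zone
            errors[zone] = exc
    raise RuntimeError(f"all zones failed: {sorted(errors)}")
```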

Availability Check

TPU.availability("v4-8", zone="us-central2-b")
# {"available": True, "spot_rate": 2.06, "on_demand_rate": 6.18}

Run-Once (Docker-like)

tpu.run_once("python train.py",
    sync="./src",
    collect_files=["model.pkl", "results.json"],
    gcs=gcs,
    notify_url="https://hooks.slack.com/...")
# up -> setup -> resume -> run -> wait -> collect -> notify -> down
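
The arrow pipeline above suggests a try/finally shape, so the VM is deleted even when a middle step fails. A sketch against the methods shown earlier (the resume and notify steps are elided):

```python
def run_once(tpu, cmd, collect_files=()):
    """up -> setup -> run -> wait -> collect -> down, tearing down on failure."""
    tpu.up()
    try:
        tpu.setup()
        tpu.run(cmd)
        status = tpu.wait()
        if collect_files:
            tpu.collect(list(collect_files))
        return status
    finally:
        tpu.down()  # always delete the VM, even if a step raised
```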

Scheduled Training

tpu.schedule("python train.py",
    start_after="22:00",   # Wait until 10 PM
    max_cost=10.0)         # Kill if exceeds $10
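
The start_after part is a sleep until the next occurrence of a wall-clock time. A sketch of that calculation (the helper name seconds_until is hypothetical):

```python
import datetime

def seconds_until(start_after: str, now: datetime.datetime) -> float:
    """Seconds to sleep until the next occurrence of an HH:MM wall-clock time."""
    hh, mm = map(int, start_after.split(":"))
    target = now.replace(hour=hh, minute=mm, second=0, microsecond=0)
    if target <= now:
        target += datetime.timedelta(days=1)  # already past today -> tomorrow
    return (target - now).total_seconds()
```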

Environment Snapshot/Restore

tpu.snapshot_env(gcs=gcs)   # pip freeze -> GCS
tpu.restore_env(gcs=gcs)    # Restore after preemption

Secrets (Cloud Secret Manager)

Recommended: Use Google Cloud Secret Manager. Secrets never leave GCP:

from tpuz import SecretManager

# One-time setup
sm = SecretManager()
sm.create("WANDB_API_KEY", "your-key")
sm.grant_tpu_access_all()

# Training: VM reads secrets directly from GCP
tpu.run("python train.py", secrets=["WANDB_API_KEY", "HF_TOKEN"])

Fallback: env={} writes a .env file via SCP (the transfer is encrypted, but the secrets transit your machine).

See docs/secrets.md for full setup guide and security comparison.

Multi-Host (TPU Pods)

The worker count is auto-detected. All SSH commands run in parallel with per-worker retries:

Accelerator    Chips  Workers  Spot $/hr
v4-8               4        1      $2.06
v4-32             16        4      $8.24
v5litepod-8        8        1      $9.60
v5litepod-64      64        8     $76.80
v6e-8              8        1      $9.60
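
A parallel fan-out with per-worker retries can be sketched with a thread pool; this illustrates the pattern, not tpuz's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def with_retries(fn, worker, attempts=3):
    """Run fn(worker), retrying transient failures independently per worker."""
    last = None
    for _ in range(attempts):
        try:
            return fn(worker)
        except OSError as exc:  # e.g. a dropped SSH connection
            last = exc
    raise last

def run_on_all(fn, workers):
    """Fan a command out to every worker in parallel; results keep worker order."""
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        return list(pool.map(lambda w: with_retries(fn, w), workers))
```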

CLI

# Lifecycle
tpuz up my-tpu -a v4-8
tpuz down my-tpu
tpuz status my-tpu
tpuz list
tpuz preflight
tpuz avail v4-8

# Training
tpuz setup my-tpu --pip="flaxchat"
tpuz verify my-tpu
tpuz run my-tpu "python train.py" --sync=./src
tpuz logs my-tpu
tpuz logs-all my-tpu
tpuz kill my-tpu
tpuz wait my-tpu
tpuz collect my-tpu model.pkl results.json

# Debugging
tpuz repl my-tpu
tpuz debug my-tpu "python train.py"
tpuz health my-tpu
tpuz tunnel my-tpu 6006
tpuz scale my-tpu v4-32
tpuz cost my-tpu

# Recovery
tpuz watch my-tpu "python train.py"

# All-in-one
tpuz train my-tpu "python train.py" -a v4-8 --sync=. --recover --teardown
tpuz run-once my-tpu "python train.py" --sync=. --collect model.pkl

Development Workflow

from tpuz import TPU

# 1. Develop on single host
dev = TPU("dev", accelerator="v4-8")
dev.up()
dev.setup()
dev.repl()  # Interactive development

# 2. Test training
dev.run("python train.py --steps=10", sync="./src")
dev.logs()

# 3. Scale up
dev.scale("v4-32")  # 4 workers now
dev.run("python train.py --steps=50000", sync="./src")
dev.watch("python train.py --steps=50000")

# 4. Collect and cleanup
dev.collect(["model.pkl", "results.json"])
dev.cost_summary()  # $12.36
dev.down()

Documentation

Requirements

  • gcloud CLI installed and authenticated
  • GCP project with TPU quota
  • Python 3.10+
  • Zero Python dependencies

Pair with kgz

pip install kgz     # Kaggle free GPUs
pip install tpuz    # GCP TPU pods

Claude Code Integration

mkdir -p ~/.claude/skills/tpuz-guide
cp SKILL.md ~/.claude/skills/tpuz-guide/skill.md

License

MIT

Acknowledgments

Cloud TPU resources for developing and testing tpuz were provided by Google's TPU Research Cloud (TRC) program. We gratefully acknowledge their support in making TPU access available for open-source research.


Download files

Download the file for your platform.

Source Distribution

tpuz-0.1.0.tar.gz (36.8 kB view details)

Uploaded Source

Built Distribution

tpuz-0.1.0-py3-none-any.whl (28.9 kB view details)

Uploaded Python 3

File details

Details for the file tpuz-0.1.0.tar.gz.

File metadata

  • Download URL: tpuz-0.1.0.tar.gz
  • Upload date:
  • Size: 36.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tpuz-0.1.0.tar.gz
Algorithm Hash digest
SHA256 3dc01c8c25d917dcdd27d458e73bab51ec74ee432f0c593e8f424055979861d7
MD5 cb5c0a7fffd894311a269e90b10f5529
BLAKE2b-256 2b8a511dff71729ee88ec6eebe4a84eabafe113f64be2d13f86c44d94b627368

Provenance

The following attestation bundles were made for tpuz-0.1.0.tar.gz:

Publisher: publish.yaml on mlnomadpy/tpuz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tpuz-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tpuz-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 28.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for tpuz-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d267c7f4822eb33e74e23ba1558e59aa7b487950b2664ebc975000eace0e8052
MD5 c1836b67a19fba04bf35e14cef0bd5a7
BLAKE2b-256 2c515c28f34d4496bd6d3f330a6a6bc86ede4c38f6ea5c9e63794ef247228409

Provenance

The following attestation bundles were made for tpuz-0.1.0-py3-none-any.whl:

Publisher: publish.yaml on mlnomadpy/tpuz

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
