Catch silent fine-tuning bugs (LoRA/PEFT, DDP/FSDP, gradient & optimizer wiring) before you waste GPU-hours.

guardtower

Catch the silent fine-tuning bugs that waste GPU-hours, before you launch the run.

Most training bugs don't crash. The model runs, the loss goes down, and hours later the results are quietly wrong because the part you meant to train never updated. guardtower runs one instrumented training step on a single batch, reports what is broken, and links each finding to a catalog entry that describes the failure mode, its fix, and a reference.

A pre-flight check for PyTorch training that rules out the expensive, invisible mistakes in a few seconds, with no GPU required.

What it does

  • Instruments one real step (forward, backward, optimizer wiring) and reports what would silently go wrong over the next six hours.
  • Auto-detects LoRA / PEFT / QLoRA and multi-GPU (DDP / FSDP) and runs the relevant checks without config.
  • Treats LoRA's zero-init B matrix (which leaves A with a zero gradient on step 1) as expected rather than a false alarm.
  • Maps every finding to a catalog entry (what it is, why it's silent, how to fix, reference).
  • Fails fast: one call aborts a misconfigured run before training starts, in a script, in CI, or as a Hugging Face Trainer callback.
  • Core checks need only PyTorch. PEFT and transformers are never imported for LoRA detection, so it's safe to run on any model.
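
The zero-init point above can be checked by hand. The following is a minimal sketch in plain Python, not guardtower's implementation: it assumes an adapter output of `B @ A @ x` and a squared-error loss, and the helper names (`matvec`, `outer`, `transpose`, `lora_grads`) are made up for illustration. With `B` initialised to zero, the gradient for `A` is exactly zero on the first step while the gradient for `B` is not.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def transpose(M):
    return [list(col) for col in zip(*M)]

def lora_grads(A, B, x, target):
    """Gradients of 0.5 * ||B @ A @ x - target||^2 with respect to A and B."""
    h = matvec(A, x)                              # down-projection, length r
    y = matvec(B, h)                              # adapter output, length d_out
    g = [yi - ti for yi, ti in zip(y, target)]    # dLoss/dy
    grad_B = outer(g, h)                          # dLoss/dB = g h^T
    grad_A = outer(matvec(transpose(B), g), x)    # dLoss/dA = (B^T g) x^T
    return grad_A, grad_B

A = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]   # random-style init, r=2, d_in=3
B = [[0.0, 0.0], [0.0, 0.0]]                # LoRA zero init, d_out=2, r=2
x = [1.0, 2.0, 3.0]
target = [1.0, -1.0]

grad_A, grad_B = lora_grads(A, B, x, target)
print("grad_A:", grad_A)   # all zeros, because B^T g is zero
print("grad_B:", grad_B)   # nonzero, so B moves first and A follows later
```

This is why a naive "parameter received a zero gradient" check would fire on every healthy LoRA run, and why the first-step zero on `A` has to be treated as expected.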

What it catches

| Area | Examples |
| --- | --- |
| LoRA / PEFT | base model left unfrozen, adapter frozen so nothing trains, adapter missing from the optimizer, trainable-% sanity, QLoRA dtype mismatch |
| Distributed (DDP / FSDP) | an unused parameter that will hang DDP, plain BatchNorm that should be SyncBatchNorm, an initialized process group with an unwrapped model |
| Core wiring | a parameter that gets no gradient, a trainable param missing from the optimizer, a loss detached from the graph, a loss that doesn't depend on the input |
| Numerics | NaN/Inf in forward and backward pinpointed to the module, dead ReLUs, exploding or vanishing gradient norms |
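
To make the gradient-related rows concrete, here is a rough sketch of how per-parameter gradients could be sorted into those buckets after one backward pass. It is an illustration only, not guardtower's code: the function name `classify_gradients` and both thresholds are assumptions, and gradients are plain lists of floats so the example runs without PyTorch.

```python
import math

def classify_gradients(named_grads, vanish_below=1e-8, explode_above=1e3):
    """Map parameter name -> problem label; healthy parameters are omitted.

    named_grads maps a name to a flat list of gradient values, or to None
    when the parameter received no gradient at all. Thresholds are
    illustrative defaults, not values taken from guardtower.
    """
    findings = {}
    for name, grad in named_grads.items():
        if grad is None:
            findings[name] = "no gradient"
            continue
        if not all(math.isfinite(g) for g in grad):
            findings[name] = "non-finite gradient"
            continue
        norm = math.sqrt(sum(g * g for g in grad))   # L2 norm
        if norm > explode_above:
            findings[name] = "exploding"
        elif norm < vanish_below:
            findings[name] = "vanishing"
    return findings

grads = {
    "encoder.weight": [0.1, -0.2, 0.05],     # healthy
    "head.weight": None,                     # never reached by backward
    "norm.bias": [float("nan"), 0.0],        # NaN in backward
    "decoder.weight": [4e3, 1.0],            # norm of about 4000
    "embed.weight": [1e-12, 0.0],            # norm of 1e-12
}
findings = classify_gradients(grads)
print(findings)
```

A real check also needs an exemption for LoRA's first-step zero gradient on `A`, which this sketch would label "vanishing".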

The full referenced list is in CATALOG.md.

See it catch a real bug

A textbook silent LoRA mistake: the optimizer is built from the base model, and the adapter is added afterwards. The optimizer holds the base weights (now frozen) and never sees the adapter tensors, which are the only parameters meant to train. Nothing crashes; the loss even drifts a little. Six GPU-hours later the adapter is exactly where it started.

guardtower flags it on the first step, from two angles, while correctly treating the zero-init adapter gradient as expected:

[Screenshot: guardtower catching a silent LoRA optimizer bug on the first batch]

Reproduce it: examples/lora_optimizer_bug.py.
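
The idea behind this kind of check can be sketched in a few lines: collect the identity of every tensor the optimizer holds, then list the trainable parameters that are not among them. The function name `params_missing_from_optimizer` is hypothetical and this is not guardtower's implementation. The sketch uses small stand-in objects so it runs without PyTorch; with a real model you would pass `model.named_parameters()` and `optimizer.param_groups`, which have the same shape.

```python
from types import SimpleNamespace

def params_missing_from_optimizer(named_params, param_groups):
    """Names of trainable parameters that no optimizer param group holds."""
    in_optimizer = {id(p) for group in param_groups for p in group["params"]}
    return [
        name
        for name, p in named_params
        if p.requires_grad and id(p) not in in_optimizer
    ]

# Stand-ins for tensors: only identity and requires_grad matter here.
base = SimpleNamespace(requires_grad=True)

# Step 1: the optimizer is built from the base model.
param_groups = [{"params": [base]}]

# Step 2: the adapter is added afterwards; the base is frozen and new
# trainable tensors appear that the optimizer has never seen.
base.requires_grad = False
lora_A = SimpleNamespace(requires_grad=True)
lora_B = SimpleNamespace(requires_grad=True)
named = [
    ("base.weight", base),
    ("lora_A.weight", lora_A),
    ("lora_B.weight", lora_B),
]

missing = params_missing_from_optimizer(named, param_groups)
print(missing)   # ['lora_A.weight', 'lora_B.weight']
```

Comparing by `id` rather than by value matters: two tensors can hold equal numbers and still be different objects, and the optimizer only updates the objects it was handed.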

Install

pip install guardtower            # core (PyTorch only)
pip install "guardtower[hf]"      # + Hugging Face Trainer integration (quoted so zsh doesn't expand the brackets)

Use it

Drop it into the Hugging Face Trainer you already run

One line audits the model on the first real batch and can abort before the expensive part starts:

from transformers import Trainer

from guardtower.integrations.huggingface import GuardtowerCallback

# model, training_args and ds come from your existing training script
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds,
    callbacks=[GuardtowerCallback(raise_on_error=True)],
)
trainer.train()

If the adapter isn't wired up, the run stops immediately:

[guardtower] pre-flight audit on first batch:
────────────────────────────────────────────────────────────────
guardtower: FAIL  (1 fail · 0 warn · 2 info)
────────────────────────────────────────────────────────────────
  ✗ FAIL lora_optimizer  [LORA-003]
      4 trainable adapter tensor(s) are NOT in the optimizer and will
      never update. Build the optimizer from the PEFT model AFTER
      get_peft_model(...)
────────────────────────────────────────────────────────────────

Or call it directly

import guardtower

# Quick, training-free LoRA check:
print(guardtower.lora_summary(model, optimizer=optimizer))

# Full pre-flight (runs one step):
report = guardtower.audit(
    model,
    lambda: loss_fn(model(x), y),   # closure returning a scalar loss
    optimizer=optimizer,
    inputs=x,
)
report.raise_if_errors()            # fail fast in CI / before launching

Keep a cheap NaN watch on a real run

Leave a monitor on for the first steps so a blow-up is pinpointed to the exact module the instant it happens, instead of surfacing as a loss=nan several layers downstream:

with guardtower.monitor(model, on_nonfinite="raise"):
    for batch in loader:
        train_step(batch)

The Report object

| attribute / method | meaning |
| --- | --- |
| report.ok | True if there are no blocking (FAIL) findings |
| report.errors / warnings / infos | findings by severity |
| report.by_check(name) | findings from a specific check |
| report.raise_if_errors() | raise GuardtowerError on any FAIL (chains) |
| report.to_dict() | JSON-friendly dump for logging |

Each finding carries a catalog_id linking to its catalog entry.
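
As one possible way to use these attributes in a CI job, the sketch below logs the report as JSON and turns it into a process exit code. It relies only on `report.ok`, `report.errors`, `report.to_dict()` and `finding.catalog_id` as listed above; the helper name `preflight_exit_code` is hypothetical, and the stand-in report (including the shape of its dictionary) is invented so the example runs without guardtower installed.

```python
import json
from types import SimpleNamespace

def preflight_exit_code(report, log=print):
    """Log a report and return 0 when it is clean, 1 when it has FAIL findings."""
    log(json.dumps(report.to_dict()))
    for finding in report.errors:
        log(f"blocking finding: {finding.catalog_id}")
    return 0 if report.ok else 1

# Stand-in for a real report; the dict contents are an assumption.
fake_report = SimpleNamespace(
    ok=False,
    errors=[SimpleNamespace(catalog_id="LORA-003")],
    to_dict=lambda: {"ok": False, "errors": ["LORA-003"]},
)

lines = []
code = preflight_exit_code(fake_report, log=lines.append)
print(code, lines)
```

In a script, `sys.exit(preflight_exit_code(report))` would then stop the job before the training launch; `report.raise_if_errors()` covers the same need when an exception is more convenient than an exit code.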

The catalog

Hit a new failure? Add an entry to guardtower/catalog.py and a check that emits its id. guardtower.catalog() returns the list; guardtower.catalog_markdown() regenerates CATALOG.md.

Test

pytest -q                         # no GPU needed
# or, without pytest installed:
python tests/test_checks.py

License

Apache-2.0. See LICENSE.
