Catch silent fine-tuning bugs (LoRA/PEFT, DDP/FSDP, gradient & optimizer wiring) before you waste GPU-hours.
guardtower
Catch the silent fine-tuning bugs that waste GPU-hours, before you launch the run.
Most training bugs don't crash. The model runs, the loss goes down, and hours
later the results are quietly wrong because the part you meant to train never
did. guardtower runs one instrumented training step on a single batch and tells
you what's broken, then points each finding at a catalog of failure modes
with the fix and a reference.
A pre-flight check for PyTorch training that rules out the expensive, invisible mistakes in a few seconds, with no GPU required.
What it does
- Instruments one real step (forward, backward, optimizer wiring) and reports what would silently go wrong over the next six hours.
- Auto-detects LoRA / PEFT / QLoRA and multi-GPU (DDP / FSDP) and runs the relevant checks without config.
- Treats LoRA's zero-init B matrix (which leaves A with a zero gradient on step 1) as expected rather than a false alarm.
- Maps every finding to a catalog entry (what it is, why it's silent, how to fix, reference).
- Fails fast: one call aborts a misconfigured run before training starts, in a script, in CI, or as a Hugging Face Trainer callback.
- Core checks need only PyTorch. PEFT and transformers are never imported for LoRA detection, so it's safe to run on any model.
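The zero-init behavior mentioned in the list can be checked in plain PyTorch. The sketch below uses a hypothetical `TinyLoRALinear` layer (not part of guardtower or PEFT) with the usual LoRA layout, assuming a frozen base layer plus a low-rank update whose B factor starts at zero. After one backward pass, A has an exactly zero gradient while B does not, which is why a zero gradient on A at step 1 is not a bug.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyLoRALinear(nn.Module):
    """Hypothetical minimal LoRA layer: y = base(x) + x A^T B^T."""

    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        # The base layer is frozen; only the adapter factors train.
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Usual LoRA init: A is small and random, B is all zeros.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.t()) @ self.lora_B.t()


layer = TinyLoRALinear(8, 3)
loss = layer(torch.randn(5, 8)).pow(2).sum()
loss.backward()

# The gradient reaching A is multiplied by B, which is zero on step 1.
a_grad_is_zero = bool(torch.all(layer.lora_A.grad == 0))
# B's gradient depends on x A^T, which is nonzero.
b_grad_is_nonzero = bool(layer.lora_B.grad.abs().sum() > 0)
print(a_grad_is_zero, b_grad_is_nonzero)
```

Once B has moved away from zero after the first optimizer step, A starts receiving a nonzero gradient as well.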
What it catches
| Area | Examples |
|---|---|
| LoRA / PEFT | base model left unfrozen, adapter frozen so nothing trains, adapter missing from the optimizer, trainable-% sanity, QLoRA dtype mismatch |
| Distributed (DDP / FSDP) | an unused parameter that will hang DDP, plain BatchNorm that should be SyncBatchNorm, an initialized process group with an unwrapped model |
| Core wiring | a parameter that gets no gradient, a trainable param missing from the optimizer, a loss detached from the graph, a loss that doesn't depend on the input |
| Numerics | NaN/Inf in forward and backward pinpointed to the module, dead ReLUs, exploding or vanishing gradient norms |
The full referenced list is in CATALOG.md.
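Two of the core-wiring rows can be illustrated with plain PyTorch. This is a minimal sketch of the underlying symptoms, not guardtower's implementation; the `TwoHeads` model and its attribute names are hypothetical.

```python
import torch
import torch.nn as nn


class TwoHeads(nn.Module):
    """Hypothetical model with a branch that forward() never uses."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 1)
        self.unused = nn.Linear(4, 1)

    def forward(self, x):
        return self.used(x)


model = TwoHeads()
x = torch.randn(3, 4)

# Symptom 1: a trainable parameter that gets no gradient.
loss = model(x).sum()
loss.backward()
params_without_grad = [
    name
    for name, p in model.named_parameters()
    if p.requires_grad and p.grad is None
]

# Symptom 2: a loss detached from the graph cannot be backpropagated.
detached_loss = model(x).detach().sum()
loss_is_attached = loss.requires_grad
detached_is_attached = detached_loss.requires_grad

print(params_without_grad, loss_is_attached, detached_is_attached)
```

Parameters like `unused.weight` are also the kind the Distributed row refers to: DDP expects every trainable parameter to take part in the backward pass unless it is told otherwise via `find_unused_parameters=True`.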
See it catch a real bug
A textbook silent LoRA mistake: the optimizer is built from the model, and the adapter is added afterwards. The optimizer captured the base weights (now frozen) and never sees the adapter tensors, the only thing meant to train. Nothing crashes; the loss even drifts a little. Six GPU-hours later the adapter is exactly where it started.
guardtower flags it on the first step, from two angles, while correctly treating the zero-init adapter gradient as expected.
Reproduce it: examples/lora_optimizer_bug.py.
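The mistake described above can also be reproduced without PEFT or guardtower. The following is a minimal sketch under stated assumptions: `WithAdapter` is a hypothetical stand-in for wrapping a base model with an adapter, and `trainable_missing_from_optimizer` is an illustrative helper, not guardtower's check. It shows that the adapter gets a gradient but never moves, because the optimizer was built before the adapter existed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

base = nn.Linear(8, 2)
# BUG: the optimizer is built from the base model, before the adapter exists.
optimizer = torch.optim.SGD(base.parameters(), lr=0.1)


class WithAdapter(nn.Module):
    """Hypothetical wrapper: freezes the base and adds a trainable adapter."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.adapter = nn.Linear(8, 2, bias=False)

    def forward(self, x):
        return self.base(x) + self.adapter(x)


def trainable_missing_from_optimizer(model, optimizer):
    """Names of trainable parameters that no optimizer param group holds."""
    in_optimizer = {
        id(p) for group in optimizer.param_groups for p in group["params"]
    }
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and id(p) not in in_optimizer
    ]


model = WithAdapter(base)
missing = trainable_missing_from_optimizer(model, optimizer)

# One training step: nothing crashes, but the adapter does not move.
before = model.adapter.weight.detach().clone()
loss = model(torch.randn(16, 8)).pow(2).mean()
loss.backward()
optimizer.step()

adapter_had_gradient = bool(model.adapter.weight.grad.abs().sum() > 0)
adapter_unchanged = torch.equal(before, model.adapter.weight.detach())
print(missing, adapter_had_gradient, adapter_unchanged)
```

Building the optimizer from `model.parameters()` after the wrapper is created makes `missing` empty.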
Install
pip install guardtower # core (PyTorch only)
pip install guardtower[hf] # + Hugging Face Trainer integration
Use it
Drop it into the Hugging Face Trainer you already run
One line audits the model on the first real batch and can abort before the expensive part starts:
from transformers import Trainer
from guardtower.integrations.huggingface import GuardtowerCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds,
    callbacks=[GuardtowerCallback(raise_on_error=True)],
)
trainer.train()
If the adapter isn't wired up, the run stops immediately:
[guardtower] pre-flight audit on first batch:
────────────────────────────────────────────────────────────────
guardtower: FAIL (1 fail · 0 warn · 2 info)
────────────────────────────────────────────────────────────────
✗ FAIL lora_optimizer [LORA-003]
4 trainable adapter tensor(s) are NOT in the optimizer and will
never update. Build the optimizer from the PEFT model AFTER
get_peft_model(...)
────────────────────────────────────────────────────────────────
Or call it directly
import guardtower

# Quick, training-free LoRA check:
print(guardtower.lora_summary(model, optimizer=optimizer))

# Full pre-flight (runs one step):
report = guardtower.audit(
    model,
    lambda: loss_fn(model(x), y),  # closure returning a scalar loss
    optimizer=optimizer,
    inputs=x,
)
report.raise_if_errors()  # fail fast in CI / before launching
Keep a cheap NaN watch on a real run
Leave a monitor on for the first steps so a blow-up is pinpointed to the exact
module the instant it happens, instead of surfacing as a loss=nan several
layers downstream:
with guardtower.monitor(model, on_nonfinite="raise"):
    for batch in loader:
        train_step(batch)
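To make the idea of pinpointing a blow-up to a module concrete, here is one way it can be done with standard PyTorch forward hooks. This is a sketch of the general technique, not necessarily how guardtower.monitor works internally; `watch_nonfinite`, `NonFiniteError`, and `UnsafeSqrt` are hypothetical names introduced for illustration.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


class NonFiniteError(RuntimeError):
    """Hypothetical error type carrying the offending module's name."""


class UnsafeSqrt(nn.Module):
    """Toy module that produces NaN for negative inputs."""

    def forward(self, x):
        return torch.sqrt(x)


def watch_nonfinite(model):
    """Attach a forward hook to every submodule; return the hook handles."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise NonFiniteError(name)

        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is ""
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles


model = nn.Sequential(
    OrderedDict([("unsafe_sqrt", UnsafeSqrt()), ("head", nn.Linear(4, 1))])
)

handles = watch_nonfinite(model)
try:
    model(-torch.ones(2, 4))  # sqrt of a negative number yields NaN
    culprit = None
except NonFiniteError as err:
    culprit = str(err)
finally:
    for handle in handles:
        handle.remove()  # always detach the hooks again

print(culprit)
```

The hook fires at the first module whose output is non-finite, so the error names `unsafe_sqrt` rather than the downstream `head` layer where the NaN would otherwise surface.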
The Report object
| attribute / method | meaning |
|---|---|
| report.ok | True if there are no blocking (FAIL) findings |
| report.errors / warnings / infos | findings by severity |
| report.by_check(name) | findings from a specific check |
| report.raise_if_errors() | raise GuardtowerError on any FAIL (chains) |
| report.to_dict() | JSON-friendly dump for logging |
Each finding carries a catalog_id linking to its catalog entry.
The catalog
Hit a new failure? Add an entry to guardtower/catalog.py and a check that emits
its id. guardtower.catalog() returns the list; guardtower.catalog_markdown()
regenerates CATALOG.md.
Test
pytest -q # no GPU needed
# or, without pytest installed:
python tests/test_checks.py
License
Apache-2.0. See LICENSE.