Catch silent fine-tuning bugs (LoRA/PEFT, DDP/FSDP, gradient & optimizer wiring) before you waste GPU-hours.

Maintainers

tjark-neumann

Project description

guardtower

Catch the silent fine-tuning bugs that waste GPU-hours, before you launch the run.

Most training bugs don't crash. The model runs, the loss goes down, and hours later the results are quietly wrong because the part you meant to train never trained. guardtower runs one instrumented training step on a single batch and tells you what's broken, then links each finding to a catalog of failure modes with the fix and a reference.

A pre-flight check for PyTorch training that rules out the expensive, invisible mistakes in a few seconds, with no GPU required.

What it does

- Instruments one real step (forward, backward, optimizer wiring) and reports what would silently go wrong over the next six hours.
- Auto-detects LoRA / PEFT / QLoRA and multi-GPU (DDP / FSDP) and runs the relevant checks without config.
- Treats LoRA's zero-init B matrix (which leaves A with a zero gradient on step 1) as expected rather than a false alarm.
- Maps every finding to a catalog entry (what it is, why it's silent, how to fix, reference).
- Fails fast: one call aborts a misconfigured run before training starts, in a script, in CI, or as a Hugging Face Trainer callback.
- Core checks need only PyTorch. PEFT and transformers are never imported for LoRA detection, so it's safe to run on any model.
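Why the zero-init B matrix gives A a zero gradient on step 1 is easy to verify by hand. The sketch below is plain PyTorch, not guardtower code; the tensor names and shapes are illustrative assumptions. With the adapter output computed as `B @ (A @ x)`, the gradient with respect to A is multiplied by B, so it is exactly zero while B is still zero, whereas B itself receives a real gradient.

```python
import torch

torch.manual_seed(0)

# Hypothetical tiny LoRA-style adapter: delta(x) = B @ (A @ x).
in_features, out_features, rank = 8, 4, 2
A = torch.randn(rank, in_features, requires_grad=True)    # random init
B = torch.zeros(out_features, rank, requires_grad=True)   # zero init, as in LoRA

x = torch.randn(in_features)
target = torch.randn(out_features)

# One forward/backward pass through the adapter path only.
loss = ((B @ (A @ x)) - target).pow(2).sum()
loss.backward()

# dL/dA is scaled by B (all zeros), dL/dB is scaled by A @ x (non-zero).
a_grad_is_zero = bool(torch.all(A.grad == 0))
b_grad_is_nonzero = bool(torch.any(B.grad != 0))
print(a_grad_is_zero, b_grad_is_nonzero)
```

A check that flagged every all-zero gradient on the first step would therefore raise a false alarm on every healthy LoRA adapter.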

What it catches

| Area | Examples |
| --- | --- |
| LoRA / PEFT | base model left unfrozen, adapter frozen so nothing trains, adapter missing from the optimizer, trainable-% sanity, QLoRA dtype mismatch |
| Distributed (DDP / FSDP) | an unused parameter that will hang DDP, plain BatchNorm that should be SyncBatchNorm, an initialized process group with an unwrapped model |
| Core wiring | a parameter that gets no gradient, a trainable param missing from the optimizer, a loss detached from the graph, a loss that doesn't depend on the input |
| Numerics | NaN/Inf in forward and backward pinpointed to the module, dead ReLUs, exploding or vanishing gradient norms |
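To make one of the core wiring failures concrete, here is a minimal sketch of how "a parameter that gets no gradient" can be detected after a single backward pass. It is plain PyTorch and only illustrates the idea; it is not guardtower's implementation, and `TwoHeads` and `params_without_grad` are hypothetical names. The same kind of unused parameter is what the table refers to as hanging DDP.

```python
import torch
from torch import nn


class TwoHeads(nn.Module):
    """Toy model with a head that forward() never uses."""

    def __init__(self):
        super().__init__()
        self.body = nn.Linear(4, 4)
        self.used_head = nn.Linear(4, 1)
        self.unused_head = nn.Linear(4, 1)  # forgotten in forward()

    def forward(self, x):
        return self.used_head(torch.relu(self.body(x)))


def params_without_grad(model, loss):
    """Run backward once; return names of trainable params whose grad is still None."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]


torch.manual_seed(0)
model = TwoHeads()
x, y = torch.randn(16, 4), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)

missing = params_without_grad(model, loss)
print(missing)
```

Nothing here crashes or warns on its own: the optimizer simply skips parameters whose `.grad` is `None`, which is why the problem stays silent in a single-process run.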

The full referenced list is in CATALOG.md.

See it catch a real bug

A textbook silent LoRA mistake: the optimizer is built from the model, and the adapter is added afterwards. The optimizer captured the base weights (now frozen) and never sees the adapter tensors, the only thing meant to train. Nothing crashes; the loss even drifts a little. Six GPU-hours later the adapter is exactly where it started.

guardtower flags it on the first step, from two angles, while correctly treating the zero-init adapter gradient as expected:

[Figure: guardtower catching a silent LoRA optimizer bug on the first batch]

Reproduce it: examples/lora_optimizer_bug.py.
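The mistake described in this section can also be reproduced without PEFT. The sketch below uses a hand-rolled, hypothetical `LoRALinear` wrapper and a hypothetical helper `trainable_not_in_optimizer`; neither comes from guardtower or PEFT, and the repository's own script may look different. It compares the identity of the tensors held by the optimizer with the model's trainable parameters.

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Hypothetical minimal LoRA wrapper: frozen base layer plus trainable A and B."""

    def __init__(self, base: nn.Linear, rank: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the base weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T


def trainable_not_in_optimizer(model, optimizer):
    """Names of trainable parameters that no optimizer param group holds."""
    in_opt = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return [name for name, p in model.named_parameters()
            if p.requires_grad and id(p) not in in_opt]


model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

# BUG: the optimizer is created before the adapter exists.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model[0] = LoRALinear(model[0])

orphans = trainable_not_in_optimizer(model, optimizer)
print(orphans)

# FIX: build the optimizer after the adapter is attached, from trainable params only.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
orphans_after_fix = trainable_not_in_optimizer(model, optimizer)
print(orphans_after_fix)
```

In the buggy ordering, the adapter tensors receive gradients on every step, but no optimizer ever applies them, so the loss curve gives no obvious hint.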

Install

pip install guardtower            # core (PyTorch only)
pip install "guardtower[hf]"      # + Hugging Face Trainer integration (quoted so shells like zsh don't expand the brackets)

Use it

Drop it into the Hugging Face `Trainer` you already run

One line audits the model on the first real batch and can abort before the expensive part starts:

from transformers import Trainer
from guardtower.integrations.huggingface import GuardtowerCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds,
    callbacks=[GuardtowerCallback(raise_on_error=True)],
)
trainer.train()

If the adapter isn't wired up, the run stops immediately:

[guardtower] pre-flight audit on first batch:
────────────────────────────────────────────────────────────────
guardtower: FAIL  (1 fail · 0 warn · 2 info)
────────────────────────────────────────────────────────────────
  ✗ FAIL lora_optimizer  [LORA-003]
      4 trainable adapter tensor(s) are NOT in the optimizer and will
      never update. Build the optimizer from the PEFT model AFTER
      get_peft_model(...)
────────────────────────────────────────────────────────────────

Or call it directly

import guardtower

# Quick, training-free LoRA check:
print(guardtower.lora_summary(model, optimizer=optimizer))

# Full pre-flight (runs one step):
report = guardtower.audit(
    model,
    lambda: loss_fn(model(x), y),   # closure returning a scalar loss
    optimizer=optimizer,
    inputs=x,
)
report.raise_if_errors()            # fail fast in CI / before launching

Keep a cheap NaN watch on a real run

Leave a monitor on for the first steps so a blow-up is pinpointed to the exact module the instant it happens, instead of surfacing as a loss=nan several layers downstream:

with guardtower.monitor(model, on_nonfinite="raise"):
    for batch in loader:
        train_step(batch)
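For intuition about what "pinpointed to the exact module" means, here is a minimal sketch built on PyTorch forward hooks. It assumes nothing about how `guardtower.monitor` is implemented; `watch_nonfinite`, `NonFiniteError`, and `Blowup` are hypothetical names used only for this illustration.

```python
import torch
from torch import nn


class NonFiniteError(RuntimeError):
    """Hypothetical error type for this sketch."""


def watch_nonfinite(model):
    """Attach forward hooks that raise at the first module emitting NaN/Inf.

    Returns the hook handles so the caller can remove them later.
    """
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                raise NonFiniteError(f"non-finite output in module '{name}'")
        return hook

    return [m.register_forward_hook(make_hook(n))
            for n, m in model.named_modules() if n]  # skip the root module


class Blowup(nn.Module):
    """Toy layer that overflows: exp(100) exceeds the float32 maximum, giving inf."""

    def forward(self, x):
        return torch.exp(x.abs() + 100.0)


model = nn.Sequential(nn.Linear(4, 4), Blowup(), nn.Linear(4, 1))
handles = watch_nonfinite(model)

caught = None
try:
    model(torch.ones(2, 4))
except NonFiniteError as err:
    caught = str(err)
finally:
    for h in handles:
        h.remove()  # always detach the hooks again

print(caught)
```

Because the hook fires on the offending module's own output, the error names the layer where the overflow starts rather than the final loss, where it would otherwise first become visible.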

The `Report` object

| attribute / method | meaning |
| --- | --- |
| `report.ok` | `True` if there are no blocking (FAIL) findings |
| `report.errors / warnings / infos` | findings by severity |
| `report.by_check(name)` | findings from a specific check |
| `report.raise_if_errors()` | raise `GuardtowerError` on any FAIL (chains) |
| `report.to_dict()` | JSON-friendly dump for logging |

Each finding carries a catalog_id linking to its catalog entry.

The catalog

Hit a new failure? Add an entry to guardtower/catalog.py and a check that emits its id. guardtower.catalog() returns the list; guardtower.catalog_markdown() regenerates CATALOG.md.

Test

pytest -q                         # no GPU needed
# or, without pytest installed:
python tests/test_checks.py

License

Apache-2.0. See LICENSE.

Project details

Release history

This version

0.1.0

Jun 21, 2026

Download files

Source Distribution

guardtower-0.1.0.tar.gz (212.1 kB)

Uploaded Jun 21, 2026 Source

Built Distribution

guardtower-0.1.0-py3-none-any.whl (26.9 kB)

Uploaded Jun 21, 2026 Python 3

File details

Details for the file guardtower-0.1.0.tar.gz.

File metadata

Download URL: guardtower-0.1.0.tar.gz
Upload date: Jun 21, 2026
Size: 212.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for guardtower-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`659dd5954e7cc781e13e983ab14567aed9d2d66007e8b7250ada945aed9e2d5c`
MD5	`0ae4f9274b004eb8b913ff898487ddfe`
BLAKE2b-256	`51a0b4978cfc038e920f834863dff36535120f4e4c872c0b133a296d2574fd14`

Provenance

The following attestation bundles were made for guardtower-0.1.0.tar.gz:

Publisher: publish.yml on tjark-neumann/guardtower

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: guardtower-0.1.0.tar.gz
- Subject digest: 659dd5954e7cc781e13e983ab14567aed9d2d66007e8b7250ada945aed9e2d5c
- Sigstore transparency entry: 1902123174
- Sigstore integration time: Jun 21, 2026
Source repository:
- Permalink: tjark-neumann/guardtower@4b8e768e675b9ddef5fc872701b989e291f54f10
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/tjark-neumann
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4b8e768e675b9ddef5fc872701b989e291f54f10
- Trigger Event: release

File details

Details for the file guardtower-0.1.0-py3-none-any.whl.

File metadata

Download URL: guardtower-0.1.0-py3-none-any.whl
Upload date: Jun 21, 2026
Size: 26.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for guardtower-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f9a8a4217f7e321f80aa28f2bb369fa0e1cd7574a83e767a6182964da5d9918e`
MD5	`1375f9d7a754e7d69a154dd99e5b0564`
BLAKE2b-256	`0cb700344b6bcfe1163f049aa0b8935ced64d4ce3bec1303d389d9346523691f`

Provenance

The following attestation bundles were made for guardtower-0.1.0-py3-none-any.whl:

Publisher: publish.yml on tjark-neumann/guardtower

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: guardtower-0.1.0-py3-none-any.whl
- Subject digest: f9a8a4217f7e321f80aa28f2bb369fa0e1cd7574a83e767a6182964da5d9918e
- Sigstore transparency entry: 1902123233
- Sigstore integration time: Jun 21, 2026
Source repository:
- Permalink: tjark-neumann/guardtower@4b8e768e675b9ddef5fc872701b989e291f54f10
- Branch / Tag: refs/tags/0.1.0
- Owner: https://github.com/tjark-neumann
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4b8e768e675b9ddef5fc872701b989e291f54f10
- Trigger Event: release
