Minimal-decision tools for reproducible, debuggable training experiments.
TrainKeeper
Training-Time System Guardrails for Reliable AI
TrainKeeper is a training-time reliability framework for machine learning systems.
It adds lightweight guardrails around existing training code to make experiments:
- reproducible
- debuggable
- data-safe
- training-stable
- and system-verifiable
without replacing your stack.
TrainKeeper focuses on what most frameworks ignore:
👉 what happens inside the training loop.
🚨 Why TrainKeeper exists
Most critical ML failures are silent:
- non-deterministic experiments
- unnoticed data corruption or drift
- exploding / vanishing gradients
- NaN loss propagation
- broken resumes and unreproducible results
TrainKeeper turns training into a controlled system rather than a script.
It does this by providing:
- experiment control
- data integrity checks
- training-time instrumentation
- automatic failure capture
- and system-level validation scenarios
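As a rough illustration of what training-time instrumentation can catch, the sketch below checks a single training step for a non-finite loss and for an exploding or vanishing gradient norm. It is written in plain Python and is not TrainKeeper's API: the function name `check_step` and the thresholds `max_grad_norm` and `min_grad_norm` are assumptions chosen for this example.

```python
import math


def check_step(loss, grads, max_grad_norm=1e3, min_grad_norm=1e-8):
    """Hypothetical guard: list the problems found in one training step.

    `loss` is a float and `grads` is a flat list of gradient values.
    An empty list means the step looks healthy.
    """
    problems = []
    if not math.isfinite(loss):
        problems.append("non-finite loss")

    # Global L2 norm across all gradient values.
    norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(norm):
        problems.append("non-finite gradient")
    elif norm > max_grad_norm:
        problems.append("exploding gradient")
    elif norm < min_grad_norm:
        problems.append("vanishing gradient")
    return problems


if __name__ == "__main__":
    print(check_step(0.5, [0.1, -0.2]))
    print(check_step(float("nan"), [0.1]))
    print(check_step(1.0, [3e3, 4e3]))
```

Calling a guard like this once per step and stopping on the first non-empty result keeps a NaN from silently propagating through later updates.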
📦 Install
pip install trainkeeper
Optional extras:
pip install trainkeeper[torch]
pip install trainkeeper[wandb]
pip install trainkeeper[mlflow]
⚡ Quick start
from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    print("TrainKeeper is running.")
    # your normal training loop

if __name__ == "__main__":
    train()
Each run automatically produces:
- experiment.yaml, run.json
- system.json, env.txt
- seeds.json, run.sh
- checkpoints and failure reports
No pipeline rewrite. No framework lock-in.
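To make the idea of a reproducibility wrapper concrete, here is a minimal standard-library sketch of a decorator that fixes a seed and records run metadata before calling the training function. It is not TrainKeeper's implementation: `reproducible_sketch`, its `seed` and `out_dir` parameters, and the JSON contents are assumptions for illustration. Only the file names `seeds.json` and `system.json` are taken from the artifact list above.

```python
import functools
import json
import platform
import random
import sys
from pathlib import Path


def reproducible_sketch(seed=0, out_dir="run_artifacts"):
    """Hypothetical decorator: seed the RNG and record run metadata."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            out = Path(out_dir)
            out.mkdir(parents=True, exist_ok=True)

            # Fix the seed before the training function runs.
            random.seed(seed)

            # Record what would be needed to replay the run later.
            (out / "seeds.json").write_text(
                json.dumps({"python_random": seed})
            )
            (out / "system.json").write_text(json.dumps({
                "python": sys.version.split()[0],
                "platform": platform.platform(),
            }))
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@reproducible_sketch(seed=123)
def train():
    # Stand-in for a training loop: draws three pseudo-random "losses".
    return [random.random() for _ in range(3)]


if __name__ == "__main__":
    # Two runs with the same seed produce the same values.
    print(train() == train())
```

A real framework also has to seed NumPy, PyTorch, and dataloader workers. This sketch covers only Python's `random` module.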
🧠 Core runtime modules
| Module | Purpose |
|---|---|
| experiment | reproducible runs, environment capture, replay |
| datacheck | schema enforcement, drift detection, data profiling |
| debugger | training hooks, instability detection, failure snapshots |
| trainutils | deterministic dataloaders, mixed precision, checkpoints |
| monitor | runtime metrics and behavior tracking |
| pkg | export helpers (ONNX, TorchScript, packaging) |
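The schema enforcement and drift detection listed for `datacheck` can be pictured with a small standard-library sketch. The functions `check_schema` and `mean_shift` are hypothetical names for this example and are not the `datacheck` API. Drift is approximated here by a simple shift in the mean, which is one of many possible drift measures.

```python
import math
from statistics import mean, stdev


def check_schema(rows, schema):
    """Return (row_index, column, reason) for every violation of `schema`.

    `rows` is a list of dicts and `schema` maps column name to expected type.
    """
    violations = []
    for i, row in enumerate(rows):
        for column, expected_type in schema.items():
            if column not in row:
                violations.append((i, column, "missing"))
            elif not isinstance(row[column], expected_type):
                violations.append((i, column, "wrong type"))
            elif isinstance(row[column], float) and math.isnan(row[column]):
                violations.append((i, column, "NaN"))
    return violations


def mean_shift(reference, current):
    """Shift of the current mean, in units of the reference standard deviation.

    Assumes `reference` has at least two values and non-zero spread.
    """
    return abs(mean(current) - mean(reference)) / stdev(reference)


if __name__ == "__main__":
    schema = {"x": float, "label": int}
    rows = [
        {"x": 1.0, "label": 0},
        {"x": float("nan"), "label": 1},
        {"x": "bad", "label": 1},
        {"label": 0},
    ]
    print(check_schema(rows, schema))
    print(mean_shift([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Comparing a new batch against a trusted reference sample this way makes a silent distribution change visible before it reaches the training loop.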
🖥 CLI
tk init
tk run -- python train.py
tk replay <exp-id> -- python train.py
tk compare <exp-a> <exp-b>
tk repro-summary <runs-dir>
tk doctor
The CLI exposes TrainKeeper as a system tool, not just a library.
🧪 System validation (what makes TrainKeeper different)
TrainKeeper is not only a framework.
It is validated through a multi-scenario reliability suite (in the GitHub repo):
Scenario 1 — Reproducibility Lab
Deterministic execution, resume behavior, experiment traceability.
Scenario 2 — Data Corruption Lab
Schema violations, NaNs, label shift, silent distribution drift.
Scenario 3 — Training Robustness Lab
Exploding gradients, NaN loss, optimizer instability, bad batch capture.
These scenarios are orchestrated by a system hardening layer that produces:
- unified summaries
- failure matrices
- cross-scenario system reports
TrainKeeper therefore tests itself.
What ships where:
- PyPI package: runtime framework only
- Scenarios and system tests: repository only
🏗 Architecture
TrainKeeper inserts a guardrail layer between your training code and the system.
User Training Code
↓
TrainKeeper Runtime (experiment, datacheck, debugger, trainutils)
↓
Structured Artifacts & Reports
↓
System Validation Layer (scenarios + system tests)
(Full architecture diagram is available in the GitHub repository.)
🎓 Typical use cases
- research reproducibility & experiment audits
- training-time debugging
- data integrity enforcement
- reliability testing for ML systems
- controlled failure experiments
- AI systems research platforms
🔗 Project links
- GitHub: https://github.com/mosh3eb/TrainKeeper
- Issues & roadmap: https://github.com/mosh3eb/TrainKeeper/issues