Minimal-decision tools for reproducible, debuggable training experiments.

TrainKeeper

Training-Time System Guardrails for Reliable AI

TrainKeeper is a training-time reliability framework for machine learning systems.
It adds lightweight guardrails around existing training code to make experiments:

  • reproducible
  • debuggable
  • data-safe
  • training-stable
  • and system-verifiable

without replacing your stack.

TrainKeeper focuses on what most frameworks ignore:
👉 what happens inside the training loop.


🚨 Why TrainKeeper exists

Many of the most damaging ML training failures are silent:

  • non-deterministic experiments
  • unnoticed data corruption or drift
  • exploding / vanishing gradients
  • NaN loss propagation
  • broken resumes and unreproducible results

TrainKeeper turns training into a controlled system rather than a script.

It does this by providing:

  • experiment control
  • data integrity checks
  • training-time instrumentation
  • automatic failure capture
  • and system-level validation scenarios

📦 Install

pip install trainkeeper

Optional extras:

pip install "trainkeeper[torch]"
pip install "trainkeeper[wandb]"
pip install "trainkeeper[mlflow]"

⚡ Quick start

from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    print("TrainKeeper is running.")
    # your normal training loop

if __name__ == "__main__":
    train()

Each run automatically produces:

  • experiment.yaml, run.json
  • system.json, env.txt
  • seeds.json, run.sh
  • checkpoints and failure reports

No pipeline rewrite. No framework lock-in.
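
To make the artifact list above concrete, here is a minimal sketch of what capturing seeds and system information for a run can look like using only the standard library. It is not TrainKeeper's implementation: the function name `capture_run_metadata` and the JSON fields are assumptions chosen for illustration, and only the file names `seeds.json` and `system.json` come from the list above.

```python
import json
import platform
import random
import sys
from pathlib import Path


def capture_run_metadata(seed: int, out_dir: str) -> dict:
    """Seed Python's RNG and record the seed and basic system facts.

    Hypothetical helper: the JSON layout here is illustrative and is not
    the schema TrainKeeper writes.
    """
    random.seed(seed)  # makes later random.* calls repeatable

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    seeds = {"python_random": seed}
    system = {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }

    (out / "seeds.json").write_text(json.dumps(seeds, indent=2))
    (out / "system.json").write_text(json.dumps(system, indent=2))
    return {"seeds": seeds, "system": system}
```

Calling `capture_run_metadata(42, "runs/demo")` before the training loop would leave two small JSON files next to the run, which is enough to re-seed a later run with the same value. A real setup would also seed NumPy and the deep learning framework in use.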


🧠 Core runtime modules

Module      Purpose
experiment  reproducible runs, environment capture, replay
datacheck   schema enforcement, drift detection, data profiling
debugger    training hooks, instability detection, failure snapshots
trainutils  deterministic dataloaders, mixed precision, checkpoints
monitor     runtime metrics and behavior tracking
pkg         export helpers (ONNX, TorchScript, packaging)
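
As an illustration of the kind of checks the datacheck row describes, the sketch below shows schema enforcement and a very simple drift signal in plain Python. It does not use or mirror the `datacheck` API: `check_schema` and `mean_shift` are hypothetical names, and the drift measure (mean shift in units of the reference standard deviation) is one simple choice among many.

```python
import math
from statistics import mean, pstdev


def check_schema(rows: list[dict], schema: dict[str, type]) -> list[str]:
    """Report missing columns, wrong types, and NaN values, row by row."""
    errors = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            if column not in row:
                errors.append(f"row {i}: missing '{column}'")
                continue
            value = row[column]
            if not isinstance(value, expected):
                errors.append(f"row {i}: '{column}' is not {expected.__name__}")
            elif isinstance(value, float) and math.isnan(value):
                errors.append(f"row {i}: '{column}' is NaN")
    return errors


def mean_shift(reference: list[float], current: list[float]) -> float:
    """Distance between the two means, in reference standard deviations."""
    spread = pstdev(reference)
    if spread == 0:
        raise ValueError("reference has zero variance")
    return abs(mean(current) - mean(reference)) / spread
```

A training script could run `check_schema` on each incoming batch and compare `mean_shift` for key features against a threshold picked for the dataset, failing loudly instead of training on corrupted or shifted data.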

🖥 CLI

tk init
tk run -- python train.py
tk replay <exp-id> -- python train.py
tk compare <exp-a> <exp-b>
tk repro-summary <runs-dir>
tk doctor

The CLI exposes TrainKeeper as a system tool, not just a library.


🧪 System validation (what makes TrainKeeper different)

TrainKeeper is not only a framework.
It is validated through a multi-scenario reliability suite (in the GitHub repo):

Scenario 1 — Reproducibility Lab
Deterministic execution, resume behavior, experiment traceability.

Scenario 2 — Data Corruption Lab
Schema violations, NaNs, label shift, silent distribution drift.

Scenario 3 — Training Robustness Lab
Exploding gradients, NaN loss, optimizer instability, bad batch capture.

These scenarios are orchestrated by a system hardening layer that produces:

  • unified summaries
  • failure matrices
  • cross-scenario system reports

TrainKeeper therefore tests itself.

  • PyPI package: runtime framework only
  • Scenarios and system tests: available only in the GitHub repository
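
The failure modes named in Scenario 3 (NaN loss, exploding gradients) can be illustrated with a small per-step health check. This is a sketch under stated assumptions, not the repository's scenario code or the `debugger` module: `check_step` is a hypothetical function, gradients are passed as a flat list of floats, and the threshold of 1e3 on the L2 norm is an arbitrary example value.

```python
import math


def check_step(loss: float, grads: list[float], max_grad_norm: float = 1e3) -> list[str]:
    """Return the problems found in one training step (empty list if healthy)."""
    problems = []

    if not math.isfinite(loss):
        problems.append("non-finite loss")

    if not all(math.isfinite(g) for g in grads):
        problems.append("non-finite gradient")
    else:
        # L2 norm over all gradient values; a large norm signals instability.
        norm = math.sqrt(sum(g * g for g in grads))
        if norm > max_grad_norm:
            problems.append(f"exploding gradient (norm={norm:.3g})")

    return problems
```

In a training loop, a non-empty result would be the point at which to stop, save the offending batch, and write a failure report rather than letting NaNs propagate through later steps.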

🏗 Architecture

TrainKeeper inserts a guardrail layer between your training code and the system.

User Training Code
        ↓
TrainKeeper Runtime (experiment, datacheck, debugger, trainutils)
        ↓
Structured Artifacts & Reports
        ↓
System Validation Layer (scenarios + system tests)

(Full architecture diagram is available in the GitHub repository.)


🎓 Typical use cases

  • research reproducibility & experiment audits
  • training-time debugging
  • data integrity enforcement
  • reliability testing for ML systems
  • controlled failure experiments
  • AI systems research platforms
