trainkeeper

Production-grade ML training toolkit with distributed training, GPU profiling, smart checkpointing, and interactive dashboards

These details have not been verified by PyPI

Project description

Production-Grade Training Guardrails for PyTorch

Reproducible • Debuggable • Distributed • Efficient

TrainKeeper is a minimal-decision, high-signal toolkit for building robust ML training systems. It adds guardrails inside your training loops without replacing your existing stack (PyTorch, Lightning, Accelerate).

⚡️ Why TrainKeeper?

Many training failures happen silently inside the training loop: non-determinism, data drift, unstable gradients, and inconsistent environments. TrainKeeper targets these with composable modules that need no configuration.

🔒 Zero-Surprise Reproducibility: Automatic seed setting, environment capture, and git state locking.
🛡️ Data Integrity: Schema inference and drift detection catch data problems before training wastes GPU hours.
🚅 Distributed Made Easy: Auto-configured DDP and FSDP with a single line of code.
📉 Resource Efficiency: GPU memory profiling and smart checkpointing that respects disk limits.
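
To make the Data Integrity idea concrete, here is a minimal sketch in plain Python of schema inference plus a mean-shift drift check. The names `infer_schema`, `mean_shift`, and `check_batch`, and the threshold of 3 standard deviations, are illustrative assumptions. This is not TrainKeeper's API and may differ from how it detects drift.

```python
import statistics
from typing import Any


def infer_schema(rows: list[dict[str, Any]]) -> dict[str, str]:
    """Map each column name to the type name of its first non-None value."""
    schema: dict[str, str] = {}
    for row in rows:
        for key, value in row.items():
            if key not in schema and value is not None:
                schema[key] = type(value).__name__
    return schema


def mean_shift(reference: list[float], current: list[float]) -> float:
    """Distance between the two means, in units of the reference std dev."""
    ref_mean = statistics.fmean(reference)
    cur_mean = statistics.fmean(current)
    ref_std = statistics.pstdev(reference)
    if ref_std == 0.0:
        # A constant reference column: any change in the mean counts as drift.
        return 0.0 if cur_mean == ref_mean else float("inf")
    return abs(cur_mean - ref_mean) / ref_std


def check_batch(reference_rows, current_rows, column, max_shift=3.0) -> list[str]:
    """Return human-readable problems; an empty list means the batch looks fine."""
    problems = []
    if infer_schema(reference_rows) != infer_schema(current_rows):
        problems.append("schema mismatch")
    shift = mean_shift(
        [row[column] for row in reference_rows],
        [row[column] for row in current_rows],
    )
    if shift > max_shift:
        problems.append(f"drift in {column!r}: mean moved {shift:.1f} reference std devs")
    return problems
```

Running a check like this on each incoming dataset before the first training step is what lets a bad batch fail fast instead of after hours of compute.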

📦 Installation

pip install trainkeeper

🚀 Quick Start

Wrap your entry point to effectively "freeze" the experimental conditions:

from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    print("TrainKeeper is running: Experiment is now reproducible.")

if __name__ == "__main__":
    train()
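
For intuition about what a decorator of this kind can do, the following is a rough standard-library sketch. The names `reproducible`, `current_git_commit`, and `run_record` are hypothetical and are not taken from TrainKeeper; the real `run_reproducible` may capture more, and may work differently.

```python
import functools
import platform
import random
import subprocess
import sys
from typing import Optional


def current_git_commit() -> Optional[str]:
    """Return the current commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def reproducible(seed: int = 0, capture_git: bool = True):
    """Seed Python's RNG and record the environment before each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Only the stdlib RNG is seeded here. A real training script would
            # also seed NumPy and PyTorch (np.random.seed, torch.manual_seed).
            random.seed(seed)
            wrapper.run_record = {
                "seed": seed,
                "python": sys.version.split()[0],
                "platform": platform.platform(),
                "git_commit": current_git_commit() if capture_git else None,
            }
            return func(*args, **kwargs)

        wrapper.run_record = None  # filled in on the first call
        return wrapper
    return decorator
```

Because the seed is reset on every call, two calls to the decorated function draw the same random numbers, and `run_record` holds enough context to tell two runs apart later.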

✨ Features at a Glance

1. Distributed Training (DDP & FSDP)

Stop fighting with torchrun.

from trainkeeper.distributed import distributed_training, wrap_model_fsdp

# MyModel stands in for your own torch.nn.Module subclass.
with distributed_training() as dist_config:
    model = MyModel()
    model = wrap_model_fsdp(model, dist_config)  # FSDP with auto-wrapping
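
As background, launchers such as torchrun describe each worker through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows one way a helper could turn them into a config object. `DistEnv` and `read_dist_env` are hypothetical names, and whether `distributed_training()` reads the environment this way is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes

    @property
    def is_distributed(self) -> bool:
        return self.world_size > 1

    @property
    def is_main_process(self) -> bool:
        return self.rank == 0


def read_dist_env(environ=None) -> DistEnv:
    """Read launcher variables, falling back to a single-process setup."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )
```

The single-process fallback means the same script runs unchanged under plain `python` and under a multi-process launcher, which is the property a one-line distributed wrapper needs.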

2. GPU Memory Profiler

Find leaks and optimize batch sizes automatically.

from trainkeeper.gpu_profiler import GPUProfiler

profiler = GPUProfiler()
profiler.start()
# ... training loop ...
print(profiler.stop().summary())
# Output: "Fragmentation detected (35%). Suggestion: Empty cache at epoch end."
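
One common proxy for a fragmentation figure like the one above is the share of memory that PyTorch's caching allocator has reserved but is not currently using. The sketch below computes that ratio. `fragmentation_ratio`, `describe`, and the 30% threshold are illustrative assumptions, and `GPUProfiler` may measure fragmentation differently.

```python
from typing import Optional


def fragmentation_ratio(allocated_bytes: int, reserved_bytes: int) -> float:
    """Fraction of reserved memory that is not backing live tensors."""
    if reserved_bytes <= 0:
        return 0.0
    unused = max(reserved_bytes - allocated_bytes, 0)
    return unused / reserved_bytes


def describe(allocated_bytes: int, reserved_bytes: int, threshold: float = 0.30) -> str:
    """Turn the ratio into a short, human-readable hint."""
    ratio = fragmentation_ratio(allocated_bytes, reserved_bytes)
    if ratio >= threshold:
        return (f"{ratio:.0%} of reserved memory is unused; "
                "consider torch.cuda.empty_cache() between epochs.")
    return f"{ratio:.0%} of reserved memory is unused; no action suggested."


def current_cuda_ratio() -> Optional[float]:
    """Read the live numbers from PyTorch, or None without torch or a GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return fragmentation_ratio(
        torch.cuda.memory_allocated(), torch.cuda.memory_reserved()
    )
```

Reserved-minus-allocated is only an approximation: it also counts cached blocks the allocator can reuse without trouble, so treat a high value as a prompt to investigate rather than proof of a leak.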

3. Interactive Dashboard

Explore experiments, compare metrics, and analyze drift.

pip install "trainkeeper[dashboard]"
tk dashboard

🔗 Links

GitHub Repository: mosh3eb/TrainKeeper
Full Documentation: Read the Docs

Release history

0.3.0 (this version): Feb 18, 2026
0.2.3: Jan 29, 2026

Download files

Source Distribution

trainkeeper-0.3.0.tar.gz (74.8 kB)

Uploaded Feb 18, 2026 Source

Built Distribution

trainkeeper-0.3.0-py3-none-any.whl (89.2 kB)

Uploaded Feb 18, 2026 Python 3

File details

Details for the file trainkeeper-0.3.0.tar.gz.

File metadata

Download URL: trainkeeper-0.3.0.tar.gz
Upload date: Feb 18, 2026
Size: 74.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for trainkeeper-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`7171b2fadea78ce22e81687eb50e71d0d2db7c7020caf4b0b7fb383d19e2efbd`
MD5	`e7074dc87632c738cbdaca302f60a46e`
BLAKE2b-256	`d0791d585ac7251bb43524c7fc6956838f6e38b478a9b8f0ff35f04deaa6ac7c`

File details

Details for the file trainkeeper-0.3.0-py3-none-any.whl.

File metadata

Download URL: trainkeeper-0.3.0-py3-none-any.whl
Upload date: Feb 18, 2026
Size: 89.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.0

File hashes

Hashes for trainkeeper-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`101785e81e7f1f90fb7beea341b878f040e3f00f48298b39af67971259da2401`
MD5	`c0b79fc1ad6ccf72522eb7419cd78ffb`
BLAKE2b-256	`5df2408894183492623fb955e56eb784b620d21faabc6eecc721eb0b5ed7e66e`
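
The published digests can be used to check a downloaded file before installing it. Below is a small sketch using only Python's standard library. The helper names `sha256_of_file` and `matches_published_hash` are made up for this example, and the file path in the usage note is assumed to be wherever you saved the download.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """Compare the file's SHA256 against a digest copied from the page."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```

For example, `matches_published_hash("trainkeeper-0.3.0-py3-none-any.whl", "101785e81e7f1f90fb7beea341b878f040e3f00f48298b39af67971259da2401")` should return `True` for an intact copy of the wheel listed above.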
