Production-grade ML training toolkit with distributed training, GPU profiling, smart checkpointing, and interactive dashboards
Production-Grade Training Guardrails for PyTorch
Reproducible • Debuggable • Distributed • Efficient
TrainKeeper is a minimal-decision, high-signal toolkit for building robust ML training systems. It adds guardrails inside your training loops without replacing your existing stack (PyTorch, Lightning, Accelerate).
⚡️ Why TrainKeeper?
Many training failures are silent: non-determinism, data drift, unstable gradients, and inconsistent environments rarely raise an error inside the training loop. TrainKeeper addresses them with composable modules that need no configuration.
- 🔒 Zero-Surprise Reproducibility: Automatic seed setting, environment capture, and git state locking.
- 🛡️ Data Integrity: Schema inference and drift detection catch data problems before training wastes GPU hours.
- 🚅 Distributed Made Easy: Auto-configured DDP and FSDP with a single line of code.
- 📉 Resource Efficiency: GPU memory profiling and smart checkpointing that respects disk limits.
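To make the data-integrity idea concrete, here is a minimal sketch of schema inference plus a mean-shift drift check in plain Python. It is not TrainKeeper's implementation: the function names (`infer_schema`, `mean_shift`, `check_drift`) and the threshold of 3 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def infer_schema(rows):
    """Map each column name to the set of Python type names seen in it."""
    schema = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def mean_shift(baseline, current):
    """Shift of the current mean, measured in baseline standard deviations."""
    spread = pstdev(baseline)
    if spread == 0:
        return 0.0 if mean(current) == mean(baseline) else float("inf")
    return abs(mean(current) - mean(baseline)) / spread


def check_drift(baseline_rows, current_rows, column, threshold=3.0):
    """Return a list of problems; an empty list means nothing was flagged."""
    # A schema mismatch is reported first, because comparing statistics
    # across different column types would not be meaningful.
    if infer_schema(baseline_rows) != infer_schema(current_rows):
        return ["schema changed"]
    shift = mean_shift(
        [row[column] for row in baseline_rows],
        [row[column] for row in current_rows],
    )
    if shift > threshold:
        return [f"{column} mean shifted by {shift:.1f} baseline std devs"]
    return []
```

Running a check like this before the first training step is cheap compared with discovering the problem after several epochs.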
📦 Installation
pip install trainkeeper
🚀 Quick Start
Wrap your entry point to effectively "freeze" the experimental conditions:
from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    print("TrainKeeper is running: Experiment is now reproducible.")

if __name__ == "__main__":
    train()
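For intuition about what "freezing" the experimental conditions can involve, the following standard-library sketch seeds the random number generator and records the interpreter, platform, and git commit before the wrapped function runs. It is an illustration under assumptions, not the code behind `run_reproducible`: the `reproducible` decorator, the `git_commit` helper, and the `snapshot` attribute are hypothetical names.

```python
import functools
import platform
import random
import subprocess
import sys


def git_commit():
    """Return the current commit hash, or None if git or a repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def reproducible(seed=0):
    """Hypothetical decorator: seed Python's RNG and snapshot the environment."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Only the standard-library RNG is seeded here; a real training
            # script would also seed NumPy and PyTorch (e.g. torch.manual_seed).
            random.seed(seed)
            wrapper.snapshot = {
                "seed": seed,
                "python": sys.version.split()[0],
                "platform": platform.platform(),
                "git_commit": git_commit(),
            }
            return fn(*args, **kwargs)
        wrapper.snapshot = None
        return wrapper
    return decorator


@reproducible(seed=123)
def sample():
    return [random.random() for _ in range(3)]
```

Because the seed is reset on every call, two calls to `sample()` return the same values, and `sample.snapshot` holds the conditions recorded for the latest run.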
✨ Features at a Glance
1. Distributed Training (DDP & FSDP)
Stop fighting with torchrun.
from trainkeeper.distributed import distributed_training, wrap_model_fsdp

with distributed_training() as dist_config:
    model = MyModel()
    model = wrap_model_fsdp(model, dist_config)  # FSDP with auto-wrapping!
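As background on what auto-configuration has to discover, launchers such as torchrun pass each worker its position through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below reads them with single-process defaults. `DistEnv` and `read_dist_env` are hypothetical names, and nothing here is taken from TrainKeeper's `dist_config`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the launcher-provided process layout."""
    rank: int
    local_rank: int
    world_size: int

    @property
    def is_distributed(self):
        return self.world_size > 1

    @property
    def is_main_process(self):
        return self.rank == 0


def read_dist_env(env=None):
    """Read rank information, falling back to a single-process layout."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

Falling back to rank 0 and world size 1 lets the same script run unchanged on a laptop and under a multi-GPU launcher.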
2. GPU Memory Profiler
Find leaks and optimize batch sizes automatically.
from trainkeeper.gpu_profiler import GPUProfiler
profiler = GPUProfiler()
profiler.start()
# ... training loop ...
print(profiler.stop().summary())
# Output: "Fragmentation detected (35%). Suggestion: Empty cache at epoch end."
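One rough way to arrive at a figure like the 35% above is to compare memory reserved by the caching allocator with memory actually held by tensors. The sketch below assumes that definition, which may differ from what `GPUProfiler` computes. `fragmentation_ratio`, `summarize`, and the 30% threshold are hypothetical.

```python
def fragmentation_ratio(allocated_bytes, reserved_bytes):
    """Share of reserved memory not currently backing live tensors.

    With PyTorch, the inputs could come from torch.cuda.memory_allocated()
    and torch.cuda.memory_reserved(). This is only a proxy for fragmentation.
    """
    if reserved_bytes <= 0:
        return 0.0
    return max(0.0, (reserved_bytes - allocated_bytes) / reserved_bytes)


def summarize(allocated_bytes, reserved_bytes, threshold=0.3):
    """Turn the ratio into a one-line, human-readable hint."""
    ratio = fragmentation_ratio(allocated_bytes, reserved_bytes)
    if ratio > threshold:
        return (
            f"Fragmentation detected ({ratio:.0%}). "
            "Suggestion: empty the cache at epoch end."
        )
    return f"Fragmentation OK ({ratio:.0%})."
```

Taking plain byte counts as input keeps the calculation testable on machines without a GPU.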
3. Interactive Dashboard
Explore experiments, compare metrics, and analyze drift.
pip install "trainkeeper[dashboard]"
tk dashboard
🔗 Links
- GitHub Repository: mosh3eb/TrainKeeper
- Full Documentation: Read the Docs
File details
Details for the file trainkeeper-0.3.0.tar.gz.
File metadata
- Download URL: trainkeeper-0.3.0.tar.gz
- Upload date:
- Size: 74.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7171b2fadea78ce22e81687eb50e71d0d2db7c7020caf4b0b7fb383d19e2efbd |
| MD5 | e7074dc87632c738cbdaca302f60a46e |
| BLAKE2b-256 | d0791d585ac7251bb43524c7fc6956838f6e38b478a9b8f0ff35f04deaa6ac7c |
File details
Details for the file trainkeeper-0.3.0-py3-none-any.whl.
File metadata
- Download URL: trainkeeper-0.3.0-py3-none-any.whl
- Upload date:
- Size: 89.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 101785e81e7f1f90fb7beea341b878f040e3f00f48298b39af67971259da2401 |
| MD5 | c0b79fc1ad6ccf72522eb7419cd78ffb |
| BLAKE2b-256 | 5df2408894183492623fb955e56eb784b620d21faabc6eecc721eb0b5ed7e66e |