Production-grade ML training toolkit with distributed training, GPU profiling, smart checkpointing, and interactive dashboards

Production-Grade Training Guardrails for PyTorch

Reproducible • Debuggable • Distributed • Efficient


TrainKeeper is a minimal-decision, high-signal toolkit for building robust ML training systems. It adds guardrails inside your training loops without replacing your existing stack (PyTorch, Lightning, Accelerate).

⚡️ Why TrainKeeper?

Most training failures happen silently inside the training loop: non-determinism, data drift, unstable gradients, and inconsistent environments. TrainKeeper addresses each of these with zero-config, composable modules.

  • 🔒 Zero-Surprise Reproducibility: Automatic seed setting, environment capture, and git state locking.
  • 🛡️ Data Integrity: Schema inference and drift detection catch data problems before training wastes GPU hours.
  • 🚅 Distributed Made Easy: Auto-configured DDP and FSDP with a single line of code.
  • 📉 Resource Efficiency: GPU memory profiling and smart checkpointing that respects disk limits.
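
To illustrate the last bullet, checkpointing that respects a disk limit can be sketched as pruning the oldest files until the directory fits a byte budget. This is not TrainKeeper's implementation: `prune_checkpoints`, the `*.pt` pattern, and the `max_bytes` parameter are hypothetical names chosen for the example.

```python
from pathlib import Path


def prune_checkpoints(ckpt_dir, max_bytes, pattern="*.pt"):
    """Delete the oldest checkpoints until the directory fits in max_bytes.

    Hypothetical helper for illustration; not part of TrainKeeper's API.
    The newest checkpoint is always kept, even if it alone exceeds the budget.
    Returns the list of deleted paths.
    """
    # Newest first, by modification time.
    files = sorted(
        Path(ckpt_dir).glob(pattern),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    deleted = []
    used = 0
    budget_exceeded = False
    for index, path in enumerate(files):
        size = path.stat().st_size
        if index == 0 or (not budget_exceeded and used + size <= max_bytes):
            used += size  # this checkpoint stays on disk
        else:
            # Once one checkpoint does not fit, every older one goes too.
            budget_exceeded = True
            path.unlink()
            deleted.append(path)
    return deleted
```

Calling `prune_checkpoints("checkpoints", max_bytes=5 * 1024**3)` after each save would keep the directory near a 5 GiB budget while never deleting the most recent checkpoint.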

📦 Installation

pip install trainkeeper

🚀 Quick Start

Decorate your entry point to "freeze" the experimental conditions (seeds, environment, and git state):

from trainkeeper.experiment import run_reproducible

@run_reproducible(auto_capture_git=True)
def train():
    print("TrainKeeper is running: Experiment is now reproducible.")

if __name__ == "__main__":
    train()
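
To make concrete what "freezing" the conditions can involve, here is a minimal sketch of a decorator that seeds the random number generators and records the environment and git commit before calling the wrapped function. It is not TrainKeeper's implementation: `seed_everything`, `capture_environment`, `reproducible_sketch`, and the `seed` parameter are hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import functools
import platform
import random
import subprocess
import sys


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    except ImportError:
        pass


def capture_environment():
    """Return a small snapshot of the interpreter, platform, and git commit."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = None  # git is missing, or this is not a repository
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "git_commit": commit,
    }


def reproducible_sketch(seed=0):
    """Hypothetical decorator: seed RNGs and record the environment first."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            seed_everything(seed)
            wrapper.environment = capture_environment()
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@reproducible_sketch(seed=42)
def sample():
    return random.random()
```

Because the seed is reset on every call, `sample()` returns the same value each time, and `sample.environment` holds the snapshot taken for the latest run.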

✨ Features at a Glance

1. Distributed Training (DDP & FSDP)

Stop fighting with torchrun.

from trainkeeper.distributed import distributed_training, wrap_model_fsdp

with distributed_training() as dist_config:
    model = MyModel()  # MyModel is your own torch.nn.Module subclass
    model = wrap_model_fsdp(model, dist_config)  # FSDP with auto-wrapping!
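
For context on what automatic configuration has to discover: launchers such as torchrun tell each worker its position through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below reads them with single-process fallbacks. `DistInfo` and `read_dist_env` are hypothetical names, and nothing here is claimed to match how `distributed_training()` works internally.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistInfo:
    """Hypothetical container for the values a launcher provides."""
    rank: int        # global index of this process
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes

    @property
    def is_distributed(self):
        return self.world_size > 1

    @property
    def is_main_process(self):
        return self.rank == 0


def read_dist_env(env=None):
    """Read launcher variables, falling back to a single-process setup."""
    env = os.environ if env is None else env
    return DistInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

A typical use is to guard logging and checkpoint writes with `read_dist_env().is_main_process` so that only one worker touches the disk.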

2. GPU Memory Profiler

Find leaks and optimize batch sizes automatically.

from trainkeeper.gpu_profiler import GPUProfiler

profiler = GPUProfiler()
profiler.start()
# ... training loop ...
print(profiler.stop().summary())
# Output: "Fragmentation detected (35%). Suggestion: Empty cache at epoch end."
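
A fragmentation figure like the one in the sample output can be approximated by comparing the memory reserved by PyTorch's caching allocator with the memory actually allocated to tensors. The sketch below is a rough proxy under that assumption, not TrainKeeper's method; `fragmentation_ratio` and `current_fragmentation` are hypothetical helpers.

```python
def fragmentation_ratio(allocated_bytes, reserved_bytes):
    """Fraction of reserved memory that is not backing live tensors."""
    if reserved_bytes <= 0:
        return 0.0
    return max(0.0, (reserved_bytes - allocated_bytes) / reserved_bytes)


def current_fragmentation(device=None):
    """Read the allocator counters from PyTorch.

    Returns None when PyTorch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return fragmentation_ratio(
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )
```

A persistently high ratio at the end of an epoch is the kind of signal that would motivate a suggestion such as calling `torch.cuda.empty_cache()`.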

3. Interactive Dashboard

Explore experiments, compare metrics, and analyze drift.

pip install "trainkeeper[dashboard]"
tk dashboard
