Git-like deterministic checkpointing for ML training

Gradient

Git-like deterministic checkpointing for ML training with anchor + delta checkpoints, forkable branches, and a workspace/repo hierarchy.

Watch the Demo Video

Visit the site

Highlights

  • Anchor + delta checkpointing to reduce storage by up to 80%.
  • Deterministic resume (model state, RNG, optimizer, scheduler).
  • Branching and forking from any checkpoint ref.
  • Workspace/repo hierarchy for organizing multiple models.
  • Auto-create mode: specify a workspace and a repo, and both are created automatically.
  • Git-style CLI with workspace and repo management commands.
  • Manifest-based run metadata for dashboards and tooling.
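
The anchor + delta idea from the first highlight can be sketched in a few lines. This is a minimal illustration, not Gradient's storage format: plain dicts of floats stand in for tensors, the helper names `make_delta` and `apply_delta` are hypothetical, and the sketch assumes the set of parameter names does not change between checkpoints.

```python
# Illustrative only: dicts of floats stand in for model tensors,
# and none of these names come from the gradient package.

def make_delta(base, current):
    """Keep only the entries whose values differ from the base state."""
    return {name: value for name, value in current.items() if base.get(name) != value}


def apply_delta(base, delta):
    """Rebuild a full state by overlaying a delta on top of its base."""
    restored = dict(base)
    restored.update(delta)
    return restored


# Full snapshot written once (the "anchor").
anchor = {"layer1.weight": [0.1, 0.2], "layer1.bias": [0.0], "layer2.weight": [0.5]}
# Later state in which only one entry changed.
step10 = {"layer1.weight": [0.1, 0.2], "layer1.bias": [0.05], "layer2.weight": [0.5]}

delta = make_delta(anchor, step10)      # only the changed entry is stored
restored = apply_delta(anchor, delta)   # anchor + delta reproduces step10
```

The saving comes from the delta holding one entry where the full state holds three; how much a real run saves depends on how many values actually change between commits.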

Install

pip install gradient-desc

Required dependency: torch.

Quick Start

Zero Setup (Auto-Create Mode)

The simplest way to get started is to specify a workspace path and a repo name:

import torch
import torch.nn as nn
import torch.optim as optim
from gradient import GradientEngine

model = nn.Linear(4, 1)
opt = optim.Adam(model.parameters(), lr=1e-3)

# Both workspace and repo are auto-created!
engine = GradientEngine.attach(
    model, opt,
    workspace="./my_workspace",
    repo="my_model"
)
engine.autocommit(every=5)

start = engine.current_step
for step in range(start + 1, start + 21):
    loss = (model(torch.randn(32, 4)) ** 2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
    engine.maybe_commit(step)

CLI-Initialized Workflow

For more control, initialize workspace and repo explicitly:

# Initialize workspace
gradient workspace init ./ml-experiments

# Create a repo for your model
cd ml-experiments
gradient repo init gpt4 --description "GPT-4 training runs"

# Check status
gradient workspace status

Then in your training script:

from gradient import GradientEngine

# Auto-discovers workspace/repo from current directory
engine = GradientEngine.attach(model, optimizer)

Workspace/Repo Hierarchy

Gradient organizes checkpoints in a Git-like hierarchy:

my_workspace/           # Workspace (contains multiple repos)
├── .gradient/          # Workspace marker
│   └── config.json
├── gpt4/               # Repo (one model)
│   ├── .gradient-repo/ # Repo marker
│   │   └── config.json
│   ├── manifest.json
│   ├── ckpt_main_s0.pt
│   └── ckpt_main_s100.pt
└── llama/              # Another repo
    └── ...

  • Workspace: contains multiple repos (one per model or project).
  • Repo: contains the branches and checkpoints for a single model.
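
Auto-discovery from the current directory can be pictured as walking up the tree until the marker directories shown above are found. The sketch below assumes that behaviour; `find_marker` and `locate` are hypothetical helpers written for illustration and are not the package's `find_workspace` or `find_repo`, whose signatures are not documented here.

```python
from pathlib import Path


def find_marker(start, marker):
    """Return the nearest directory at or above `start` that contains `marker`, else None."""
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).is_dir():
            return candidate
    return None


def locate(start="."):
    """Return (workspace_dir, repo_dir) for a starting directory; either may be None."""
    workspace = find_marker(start, ".gradient")    # workspace marker from the tree above
    repo = find_marker(start, ".gradient-repo")    # repo marker from the tree above
    return workspace, repo
```

Called from inside `my_workspace/gpt4/`, `locate()` would return the workspace directory and the `gpt4` repo directory under these assumptions.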

CLI

Workspace Commands

gradient workspace init [path]        # Initialize a new workspace
gradient workspace status             # Show all repos in workspace

Repo Commands

gradient repo init <name> [-d DESC]   # Create a new repo in workspace
gradient repo list                    # List all repos

Training Commands

gradient status                       # Show current repo status
gradient resume <ref> -- python train.py
gradient fork <from_ref> <new_branch> [--reset-optimizer] [--seed N] -- python train.py

Checkpoint Refs

Refs use the format branch@step, with latest as a special shorthand:

  • main@100 - step 100 on main branch
  • experiment@50 - step 50 on experiment branch
  • latest - most recent checkpoint on current branch
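
For tooling that needs to handle refs itself, the format above is simple to parse. The function below is a sketch under the assumption that steps are non-negative integers and that branch names contain no `@`; `parse_ref` is a hypothetical helper, not part of the gradient API.

```python
def parse_ref(ref, current_branch="main"):
    """Split a ref into (branch, step). The step is None for the `latest` shorthand."""
    if ref == "latest":
        return current_branch, None
    branch, sep, step = ref.partition("@")
    if not sep or not branch or not step.isdecimal():
        raise ValueError(f"expected 'branch@step' or 'latest', got {ref!r}")
    return branch, int(step)
```

Under these assumptions, `parse_ref("main@100")` yields the branch `main` and step `100`, while `parse_ref("latest", "experiment")` defers the step lookup to the caller.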

Environment Variables

The CLI sets these variables to hand context off to the training script:

  • GRADIENT_WORKSPACE: workspace path
  • GRADIENT_REPO: repo name
  • GRADIENT_RESUME_REF: checkpoint ref to resume from
  • GRADIENT_BRANCH: branch name override
  • GRADIENT_AUTOCOMMIT: auto-commit interval
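
`GradientEngine.attach` is documented as respecting these variables, so a training script normally does not need to read them. For logging or debugging, they can be inspected directly. The sketch below assumes `GRADIENT_AUTOCOMMIT` holds an integer string; `read_handoff` is a hypothetical helper.

```python
import os


def read_handoff(environ=os.environ):
    """Collect the handoff variables; unset ones come back as None."""
    every = environ.get("GRADIENT_AUTOCOMMIT")
    return {
        "workspace": environ.get("GRADIENT_WORKSPACE"),
        "repo": environ.get("GRADIENT_REPO"),
        "resume_ref": environ.get("GRADIENT_RESUME_REF"),
        "branch": environ.get("GRADIENT_BRANCH"),
        # Assumption: the interval is passed as a plain integer string.
        "autocommit_every": int(every) if every else None,
    }
```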

Public API

Import Surface

from gradient import (
    GradientEngine,
    GradientConfig,
    # Workspace/Repo management
    WorkspaceConfig,
    RepoConfig,
    init_workspace,
    init_repo,
    find_workspace,
    find_repo,
    resolve_context,
)

GradientEngine.attach

Attach to a model and optimizer for checkpointing:

# Auto-create mode (simplest)
engine = GradientEngine.attach(
    model, optimizer,
    workspace="./my_workspace",
    repo="my_model"
)

# Auto-discover from current directory
engine = GradientEngine.attach(model, optimizer)

# With explicit config
engine = GradientEngine.attach(
    model, optimizer,
    scheduler=lr_scheduler,
    config=GradientConfig(
        workspace_path="./my_workspace",
        repo_name="my_model",
        branch="experiment",
    )
)

Behavior:

  • Auto-creates workspace and repo if both are explicitly provided
  • Auto-discovers from current directory if inside an initialized repo
  • Respects CLI environment variables for handoff
  • Creates manifest.json on first attach

Checkpoint Operations

engine.commit(step, message="")      # Write checkpoint (anchor or delta)
engine.resume("main@100")            # Resume from ref
engine.resume_latest()               # Resume latest on current branch
engine.fork(
    from_ref="main@100",
    new_branch="experiment",
    reset_optimizer=False,
    reset_scheduler=False,
    reset_rng_seed=None,
    message=""
)
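
Deterministic resume requires restoring random number generator state along with model, optimizer, and scheduler state. The principle can be shown with Python's standard `random` module standing in for the framework RNGs; this is an illustration of the idea, not of how Gradient stores RNG state.

```python
import random

rng = random.Random(0)
for _ in range(3):          # consume some randomness, as a few training steps would
    rng.random()

saved = rng.getstate()      # capture the generator state at the "checkpoint"
expected = [rng.random() for _ in range(3)]   # what an uninterrupted run draws next

rng.setstate(saved)         # "resume": put the generator back where it was
replayed = [rng.random() for _ in range(3)]   # the resumed run draws the same values
```

Without the `setstate` call, the resumed run would continue from a different point in the random stream and diverge from the original.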

Training Loop Helpers

engine.autocommit(every=10)          # Set auto-commit interval
engine.maybe_commit(step)            # Commit if step matches interval
engine.current_step                  # Step resumed from (0 for fresh run)
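
One plausible reading of "commit if step matches interval" is that a commit happens whenever the step is a multiple of the configured interval. The class below sketches that reading; it is an assumption about the behaviour, and `IntervalCommitter` is a hypothetical stand-in that records steps instead of writing checkpoints.

```python
class IntervalCommitter:
    """Toy stand-in that records which steps would be committed."""

    def __init__(self):
        self.every = None
        self.committed = []

    def autocommit(self, every):
        if every <= 0:
            raise ValueError("interval must be a positive integer")
        self.every = every

    def maybe_commit(self, step):
        # Assumption: a step "matches" when it is a multiple of the interval.
        if self.every is not None and step % self.every == 0:
            self.committed.append(step)
            return True
        return False


committer = IntervalCommitter()
committer.autocommit(every=5)
for step in range(1, 21):
    committer.maybe_commit(step)
```

Under this reading, a 20-step loop with `every=5` commits at steps 5, 10, 15, and 20.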

Properties

engine.workspace_path                # Path to workspace
engine.repo_name                     # Current repo name
engine.repo_path                     # Full path to repo
engine.branch                        # Current branch name

Extensibility

Register external state (RL envs, curriculum, etc.):

engine.register_state(
    "env_state",
    getter=lambda: env.get_state(),
    setter=lambda s: env.set_state(s)
)
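
The getter/setter pattern can be understood as a small registry: on commit every getter is called and its result saved, and on resume every setter receives its saved value. The sketch below shows that pattern in isolation; `StateRegistry`, `capture`, and `restore` are hypothetical names, and the curriculum dict is made-up example state.

```python
class StateRegistry:
    """Toy registry mapping a name to a (getter, setter) pair."""

    def __init__(self):
        self._hooks = {}

    def register_state(self, name, getter, setter):
        self._hooks[name] = (getter, setter)

    def capture(self):
        """Call every getter and collect the results by name."""
        return {name: getter() for name, (getter, _) in self._hooks.items()}

    def restore(self, snapshot):
        """Hand each saved value back to the matching setter."""
        for name, value in snapshot.items():
            self._hooks[name][1](value)


curriculum = {"difficulty": 1}


def get_curriculum():
    return dict(curriculum)   # copy, so later mutation does not alter the snapshot


def set_curriculum(saved):
    curriculum.clear()
    curriculum.update(saved)


registry = StateRegistry()
registry.register_state("curriculum", getter=get_curriculum, setter=set_curriculum)

snapshot = registry.capture()   # taken at "commit" time
curriculum["difficulty"] = 3    # training continues and mutates the state
registry.restore(snapshot)      # "resume" puts the saved value back
```

Returning a copy from the getter matters: a getter that returns the live object would let later mutations leak into the saved snapshot.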

GradientConfig

GradientConfig(
    workspace_path="./my_workspace",
    repo_name="my_model",
    branch="main",
    checkpoint_every=None,
    delta_optimizer=True,
    reanchor_interval=None,
    compression="auto",  # "off" | "auto" | "aggressive"
    strict_resume=True,
)

Notes:

  • delta_optimizer: store optimizer state as deltas (disabled in the current implementation for safety).
  • reanchor_interval: force a new anchor after N delta checkpoints.
  • compression: lightweight delta compression mode ("off", "auto", or "aggressive").
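
The effect of `reanchor_interval` can be sketched as a small decision rule: the first checkpoint is an anchor, later ones are deltas, and once N deltas have accumulated the next checkpoint is an anchor again. This is an assumed policy written for illustration; `next_checkpoint_type` and `simulate` are hypothetical and do not describe the library's internals.

```python
def next_checkpoint_type(deltas_since_anchor, reanchor_interval=None):
    """Decide the next checkpoint type. `deltas_since_anchor` is None before any anchor exists."""
    if deltas_since_anchor is None:
        return "anchor"
    if reanchor_interval is not None and deltas_since_anchor >= reanchor_interval:
        return "anchor"
    return "delta"


def simulate(num_commits, reanchor_interval=None):
    """Return the sequence of checkpoint types for a run of commits."""
    kinds = []
    deltas = None
    for _ in range(num_commits):
        kind = next_checkpoint_type(deltas, reanchor_interval)
        deltas = 0 if kind == "anchor" else deltas + 1
        kinds.append(kind)
    return kinds
```

With `reanchor_interval=2`, six commits alternate as anchor, delta, delta, anchor, delta, delta; with `None`, only the first commit is an anchor. Shorter intervals cost more storage but shorten the chain of deltas needed to rebuild a state.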

Manifest Format

manifest.json is created in each repo and updated on every commit:

{
  "repo_name": "my_model",
  "checkpoints": [
    {
      "step": 10,
      "branch": "main",
      "file": "ckpt_main_s10.pt",
      "type": "delta"
    }
  ]
}
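
Because the manifest is plain JSON, dashboards and scripts can query it without importing the library. The sketch below uses only the fields shown above (`step`, `branch`, `file`, `type`); the sample entries are made up for illustration and `latest_checkpoint` is a hypothetical helper.

```python
import json

# Made-up sample in the shape shown above.
manifest_text = """
{
  "repo_name": "my_model",
  "checkpoints": [
    {"step": 0,  "branch": "main", "file": "ckpt_main_s0.pt",  "type": "anchor"},
    {"step": 10, "branch": "main", "file": "ckpt_main_s10.pt", "type": "delta"},
    {"step": 5,  "branch": "experiment", "file": "ckpt_experiment_s5.pt", "type": "anchor"}
  ]
}
"""


def latest_checkpoint(manifest, branch):
    """Return the entry with the highest step on `branch`, or None if the branch is empty."""
    entries = [c for c in manifest["checkpoints"] if c["branch"] == branch]
    return max(entries, key=lambda c: c["step"]) if entries else None


manifest = json.loads(manifest_text)
latest_main = latest_checkpoint(manifest, "main")
```

For a real repo, replace the inline string with the contents of the repo's `manifest.json`.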

Demo Scripts

# Full auto-create demo (simplest)
python demo/train_auto.py

# Standard training
python demo/train.py

# Deeper model example
python demo/train_deeper.py

# Forking example
python demo/train_fork.py

Tests

pytest

Download files

Source Distribution

gradient_desc-0.1.6.tar.gz (22.1 kB)

Built Distribution

gradient_desc-0.1.6-py3-none-any.whl (23.3 kB)

Provenance

The following attestation bundles were made for gradient_desc-0.1.6.tar.gz:

Publisher: publish.yml on malhar2805/Gradient
