Gradient
Git-like deterministic checkpointing for ML training with anchor + delta checkpoints, forkable branches, and a workspace/repo hierarchy.
Highlights
- Anchor + delta checkpointing to reduce storage by up to 80% (a conceptual sketch follows this list).
- Deterministic resume (model state, RNG, optimizer, scheduler).
- Branching and forking from any checkpoint ref.
- Workspace/repo hierarchy for organizing multiple models.
- Auto-create mode: just specify a workspace and repo, and everything is created automatically.
- Git-style CLI with workspace and repo management commands.
- Manifest-based run metadata for dashboards and tooling.
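To make the anchor + delta idea concrete, here is a minimal sketch, not Gradient's actual on-disk format: a full state_dict is written as an anchor, and later checkpoints store only the elementwise difference from it, which encodes far more compactly than a full copy.

import torch

def make_delta(anchor_state, current_state):
    # Keep only what changed since the anchor; compact encodings of this
    # difference are where the storage savings come from.
    return {k: current_state[k] - anchor_state[k] for k in anchor_state}

def apply_delta(anchor_state, delta):
    # Reconstruct the full state from anchor + delta.
    return {k: anchor_state[k] + delta[k] for k in anchor_state}

anchor = {"w": torch.zeros(3)}
current = {"w": torch.tensor([0.1, 0.0, -0.2])}
delta = make_delta(anchor, current)
assert torch.allclose(apply_delta(anchor, delta)["w"], current["w"])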
Install
pip install gradient-desc
Required dependency: torch.
Quick Start
Zero Setup (Auto-Create Mode)
The simplest way to get started is to specify a workspace and repo name:
import torch
import torch.nn as nn
import torch.optim as optim

from gradient import GradientEngine

model = nn.Linear(4, 1)
opt = optim.Adam(model.parameters(), lr=1e-3)

# Both workspace and repo are auto-created!
engine = GradientEngine.attach(
    model, opt,
    workspace="./my_workspace",
    repo="my_model",
)

engine.autocommit(every=5)
start = engine.current_step

for step in range(start + 1, start + 21):
    loss = (model(torch.randn(32, 4)) ** 2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
    engine.maybe_commit(step)
CLI-Initialized Workflow
For more control, initialize workspace and repo explicitly:
# Initialize workspace
gradient workspace init ./ml-experiments
# Create a repo for your model
cd ml-experiments
gradient repo init gpt4 --description "GPT-4 training runs"
# Check status
gradient workspace status
Then in your training script:
from gradient import GradientEngine
# Auto-discovers workspace/repo from current directory
engine = GradientEngine.attach(model, optimizer)
Workspace/Repo Hierarchy
Gradient organizes checkpoints in a Git-like hierarchy:
my_workspace/              # Workspace (contains multiple repos)
├── .gradient/             # Workspace marker
│   └── config.json
├── gpt4/                  # Repo (one model)
│   ├── .gradient-repo/    # Repo marker
│   │   └── config.json
│   ├── manifest.json
│   ├── ckpt_main_s0.pt
│   └── ckpt_main_s100.pt
└── llama/                 # Another repo
    └── ...
- Workspace: Contains multiple repos (one per model/project)
- Repo: Contains branches and checkpoints for a single model
CLI
Workspace Commands
gradient workspace init [path] # Initialize a new workspace
gradient workspace status # Show all repos in workspace
Repo Commands
gradient repo init <name> [-d DESC] # Create a new repo in workspace
gradient repo list # List all repos
Auth Commands
gradient login [--token TOKEN] [--verify-url URL]
gradient login verifies your access token with the remote auth endpoint, stores
the token in your OS keyring (gradient-cli service), and writes non-secret
session metadata to ~/.gradient/auth.json.
Training Commands
gradient status # Show current repo status
gradient resume <ref> -- python train.py
gradient fork <from_ref> <new_branch> [--reset-optimizer] [--seed N] -- python train.py
Checkpoint Refs
Refs use the format branch@step:
- main@100: step 100 on the main branch
- experiment@50: step 50 on the experiment branch
- latest: the most recent checkpoint on the current branch
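To illustrate the format only (this helper is hypothetical, not part of the package's API):

def parse_ref(ref: str):
    # "latest" carries no explicit branch or step; the engine resolves it
    # to the newest checkpoint on the current branch.
    if ref == "latest":
        return None, None
    branch, _, step = ref.partition("@")
    return branch, int(step)

assert parse_ref("main@100") == ("main", 100)
assert parse_ref("experiment@50") == ("experiment", 50)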
Environment Variables
Set by the CLI for training script handoff:
- GRADIENT_WORKSPACE: workspace path
- GRADIENT_REPO: repo name
- GRADIENT_RESUME_REF: checkpoint ref to resume from
- GRADIENT_BRANCH: branch name override
- GRADIENT_AUTOCOMMIT: auto-commit interval
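These are ordinary environment variables, so a script can also inspect them directly before attaching (the engine reads them for you); for example:

import os

# Values written by the CLI before it launches the training command; all optional.
workspace = os.environ.get("GRADIENT_WORKSPACE")
repo = os.environ.get("GRADIENT_REPO")
resume_ref = os.environ.get("GRADIENT_RESUME_REF")  # e.g. "main@100"
print(f"handoff: workspace={workspace} repo={repo} resume={resume_ref}")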
Optional override for gradient login:
GRADIENT_AUTH_VERIFY_URL: token verification endpoint URL (defaults to production endpoint)
Public API
Import Surface
from gradient import (
    GradientEngine,
    GradientConfig,
    # Workspace/Repo management
    WorkspaceConfig,
    RepoConfig,
    init_workspace,
    init_repo,
    find_workspace,
    find_repo,
    resolve_context,
)
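The workspace/repo helpers can also be used programmatically. Their exact signatures are not documented here, so the call shapes below are assumptions modeled on the CLI commands:

from gradient import init_workspace, init_repo, resolve_context

# Assumed call shapes, mirroring `gradient workspace init` / `gradient repo init`.
init_workspace("./ml-experiments")
init_repo("gpt4", workspace="./ml-experiments", description="GPT-4 training runs")

# resolve_context presumably combines env vars, CWD discovery, and arguments.
ctx = resolve_context()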
GradientEngine.attach
Attach to a model and optimizer for checkpointing:
# Auto-create mode (simplest)
engine = GradientEngine.attach(
    model, optimizer,
    workspace="./my_workspace",
    repo="my_model",
)

# Auto-discover from current directory
engine = GradientEngine.attach(model, optimizer)

# With explicit config
engine = GradientEngine.attach(
    model, optimizer,
    scheduler=lr_scheduler,
    config=GradientConfig(
        workspace_path="./my_workspace",
        repo_name="my_model",
        branch="experiment",
    ),
)
Behavior:
- Auto-creates workspace and repo if both are explicitly provided
- Auto-discovers from current directory if inside an initialized repo
- Respects CLI environment variables for handoff
- Creates manifest.json on first attach
Checkpoint Operations
engine.commit(step, message="")   # Write checkpoint (anchor or delta)
engine.resume("main@100")         # Resume from ref
engine.resume_latest()            # Resume latest on current branch

engine.fork(
    from_ref="main@100",
    new_branch="experiment",
    reset_optimizer=False,
    reset_scheduler=False,
    reset_rng_seed=None,
    message="",
)
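For example, branching an experiment off an existing checkpoint and continuing training there (assuming fork switches the engine to the new branch):

# Branch off step 100 of main with a fresh RNG seed, keeping the optimizer.
engine.fork(
    from_ref="main@100",
    new_branch="lr_sweep",
    reset_rng_seed=1234,
    message="try a higher learning rate",
)

# Subsequent commits land on the new branch.
engine.autocommit(every=10)
start = engine.current_step
for step in range(start + 1, start + 101):
    loss = train_step(...)   # your training step
    engine.maybe_commit(step)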
Training + Commit Patterns
# Periodic auto-commit
engine.autocommit(every=10)
start = engine.current_step
for step in range(start + 1, start + 1001):
    loss = train_step(...)
    engine.maybe_commit(step)

# Manual milestone commits
for step in range(start + 1, start + 501):
    loss = train_step(...)
    if step in {1, 50, 100, 250, 500}:
        engine.commit(step, message=f"milestone step {step}")
Training Loop Helpers
engine.autocommit(every=10) # Set auto-commit interval
engine.maybe_commit(step) # Commit if step matches interval
engine.current_step # Step resumed from (0 for fresh run)
Properties
engine.workspace_path # Path to workspace
engine.repo_name # Current repo name
engine.repo_path # Full path to repo
engine.branch # Current branch name
Extensibility
Register external state (RL envs, curriculum, etc.):
engine.register_state(
    "env_state",
    getter=lambda: env.get_state(),
    setter=lambda s: env.set_state(s),
)
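For instance, a curriculum schedule whose progress should survive a resume (the sampler class here is illustrative; registered state presumably just needs to serialize with the checkpoint):

class CurriculumSampler:
    # Illustrative stateful component.
    def __init__(self):
        self.level = 0

    def get_state(self):
        return {"level": self.level}

    def set_state(self, state):
        self.level = state["level"]

sampler = CurriculumSampler()
engine.register_state(
    "curriculum",
    getter=sampler.get_state,
    setter=sampler.set_state,
)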
GradientConfig
GradientConfig(
    workspace_path="./my_workspace",
    repo_name="my_model",
    branch="main",
    reanchor_interval=None,
    compression="auto",   # "off" | "auto" | "aggressive"
)
Notes:
- reanchor_interval: force a new anchor after N delta checkpoints
- compression: lightweight delta compression mode
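For example, a long run might cap delta chains and opt into heavier compression (the values here are illustrative):

config = GradientConfig(
    workspace_path="./my_workspace",
    repo_name="my_model",
    branch="main",
    reanchor_interval=20,       # new anchor after 20 consecutive deltas
    compression="aggressive",   # documented modes: "off" | "auto" | "aggressive"
)
engine = GradientEngine.attach(model, opt, config=config)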
Manifest Format
manifest.json is created in each repo and updated on every commit:
{
  "repo_name": "my_model",
  "checkpoints": [
    {
      "step": 10,
      "branch": "main",
      "file": "ckpt_main_s10.pt",
      "type": "delta"
    }
  ]
}
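Because the manifest is plain JSON, dashboards and tooling can consume it directly; a minimal sketch, assuming only the fields shown above:

import json
from pathlib import Path

manifest = json.loads(Path("my_workspace/my_model/manifest.json").read_text())

# Group checkpoint steps by branch for display.
by_branch = {}
for ckpt in manifest["checkpoints"]:
    by_branch.setdefault(ckpt["branch"], []).append(ckpt["step"])

for branch, steps in by_branch.items():
    print(f"{branch}: steps {sorted(steps)}")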