Lightweight artifact integrity and drift detection for ML/data pipelines
trakr
Artifact integrity and drift detection for ML and data pipelines.
A small CLI that hashes the things you care about, snapshots them alongside your environment, and yells when anything moves. That's it.
Why this exists
I kept getting bitten by the same thing: a model I trained two weeks ago behaves slightly differently in prod than it did on my laptop. A dataset "didn't change", except the file modified-time says it did. Someone in staging swapped a config and we noticed three days later.
The tools I had didn't help much. DVC versions things, MLflow tracks experiments, git-lfs stores blobs — none of them answer the simple question "is this file the same one I snapshotted last Tuesday?".
trakr is the boring, 500-line answer to that. SHA-256 a file. Save the hash and the env. Compare them later. Exit non-zero if anything moved.
It is meant to live next to your existing pipeline tools, not replace them.
How it compares
| trakr | DVC | MLflow | git-lfs | W&B | |
|---|---|---|---|---|---|
| SHA-256 integrity verification | ✅ | ||||
| Drift diff between runs | ✅ | ||||
| Python + package env capture | ✅ | ✅ | ✅ | ||
| S3 (no-download ETag check) | ✅ | ✅ | ✅ | ✅ | |
| Zero config to start | ✅ | ||||
| Non-zero exit on drift (CI-friendly) | ✅ | ✅ | |||
| JSON output | ✅ | ✅ | ✅ | ||
| Pre-commit hook | ✅ | ✅ | ✅ | ||
| Runs without a server or DB | ✅ | ✅ | ✅ | ||
| 3 dependencies, no daemon | ✅ | ✅ | |||
| Dataset / model versioning | ✅ | ✅ | ✅ | ✅ | |
| Experiment tracking & metrics | ✅ | ✅ | |||
| Model registry | ✅ | ✅ | |||
| Visualization dashboards | ✅ | ✅ |
Use DVC to version, MLflow to track experiments, W&B for dashboards. Use trakr to make sure the artifacts they point at are the ones you actually expect.
Install
pip install trakr
For S3 support:
pip install "trakr[s3]"
Needs Python 3.10 or newer.
Quickstart
trakr init
trakr track model.pkl --name model --type model
trakr track data/train.csv --name training --type dataset
trakr snapshot
trakr verify
That's the whole thing. init creates a .trakr/ directory in your repo,
track registers a path under an alias, snapshot writes a manifest of the
current state, and verify re-hashes everything and compares.
When something drifts, verify exits 1 and prints what changed:
Verify against run_2026-04-23-001
┏━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Artifact ┃ Status ┃ Detail ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ model │ ✗ mismatch │ hash changed │
│ training │ ✓ verified │ hash match │
└──────────┴────────────┴──────────────┘
✗ Drift detected — some artifacts do not match the latest snapshot.
That non-zero exit is the whole point — drop trakr verify into a CI step
and a drifted artifact stops the pipeline.
Commands
trakr init
Creates .trakr/ (config, manifests folder, cache folder). Deliberately not
idempotent: running it a second time is an error, because if .trakr/ is
already there, you probably don't want to clobber it.
trakr track <path> --name <alias> [--type <type>]
Register a path. Local files are hashed; s3://bucket/key paths are tracked
by ETag and size (no download). The --type is a free-form label —
model, dataset, config, whatever you find useful in the diff later.
Re-running with the same --name overwrites the entry, which is convenient
when you're iterating on what to track.
trakr snapshot
Writes .trakr/manifests/run_<id>.yaml with:
- the current hash + size of each tracked local file
- ETag + size + last-modified for each S3 object
- the Python version and every installed package
- a timestamp and an auto-incrementing run id (YYYY-MM-DD-NNN)
Files larger than 10 MB get a progress bar while they're being hashed.
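Two of the pieces a snapshot collects, the environment and the run id, can be sketched with the standard library alone. The function names below are hypothetical, and the sketch assumes the NNN counter restarts at 001 each day; the real manifest layout may differ.

```python
import platform
from datetime import date
from importlib import metadata


def capture_environment() -> dict:
    """Collect the Python version and every installed distribution."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def next_run_id(existing: list[str], today: date) -> str:
    """Return the next YYYY-MM-DD-NNN id (assumed to be a per-day counter)."""
    prefix = today.isoformat()
    used = [
        int(run_id.rsplit("-", 1)[1])
        for run_id in existing
        if run_id.startswith(prefix + "-")
    ]
    return f"{prefix}-{max(used, default=0) + 1:03d}"
```

For example, `next_run_id(["2026-04-23-001"], date(2026, 4, 23))` gives `"2026-04-23-002"`.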
trakr verify [--json]
Re-runs the same collection against the latest manifest and compares.
Prints a table by default. With --json it prints a machine-readable
result, useful in CI:
trakr verify --json | jq '.status'
# "ok" or "drift"
Exit code is 0 on a clean verify, 1 if anything mismatched.
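If your pipeline glue is Python rather than shell, the same check might look like the sketch below. It relies only on what is documented above (the `status` field being "ok" or "drift", and the 0/1 exit code); the function names are made up for illustration, and any other fields in the JSON are deliberately not assumed.

```python
import json
import subprocess


def is_clean(verify_json: str) -> bool:
    """True when the JSON printed by `trakr verify --json` reports "ok"."""
    return json.loads(verify_json).get("status") == "ok"


def run_verify() -> bool:
    """Run `trakr verify --json` (trakr must be on PATH) and report the result."""
    proc = subprocess.run(
        ["trakr", "verify", "--json"],
        capture_output=True,
        text=True,
    )
    # Exit code 0 means a clean verify, 1 means drift; the JSON should agree.
    return proc.returncode == 0 and is_clean(proc.stdout)
```

Checking both the exit code and the status is belt and braces; in a plain CI step the exit code alone is enough.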
trakr diff <run1> <run2>
Tree view of what changed between two snapshots — added, removed, changed artifacts, plus environment differences (Python version, package versions).
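The comparison behind a diff like this boils down to set arithmetic on two mappings. Here is a rough sketch, assuming each snapshot has been reduced to a plain `{name: hash}` dict (that shape and the function name are assumptions, not the actual manifest format):

```python
def diff_artifacts(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify names as added, removed, or changed between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Present in both snapshots, but the recorded value differs.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

The same function works for the environment side if you feed it `{package: version}` mappings instead of hashes.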
trakr list [--json]
Shows what you're currently tracking, with the last-known hash from the latest snapshot if there is one.
trakr history [--limit N]
Recent snapshots, newest first. Default limit is 20.
trakr status [--json]
Quick summary panel: how many artifacts you track, when the last snapshot was, whether anything has drifted since. Cheap to run, good for a shell prompt or a Makefile target.
trakr untrack <name>
Stop tracking an artifact. Doesn't delete anything from disk or from old
manifests, just removes it from config.yaml.
Configuration
.trakr/config.yaml is plain YAML and is meant to be hand-edited:
pipeline: default          # free-form label, shown in manifests
hash_algorithm: sha256     # reserved; sha256 is the only one wired up today
log_level: info            # default; CLI flags override
artifacts:
  - name: model
    path: model.pkl
    type: model
  - name: training-data
    path: s3://my-bucket/data.csv
    type: dataset
Environment variables
| Variable | What it does |
|---|---|
| TRAKR_DIR | Use a different directory than ./.trakr/. |
| TRAKR_LOG_LEVEL | debug / info / warning / error. |
| TRAKR_NO_COLOR | Disable colored output. (Also respects standard NO_COLOR.) |
| AWS_* | Whatever boto3 uses: credentials, region, profile. |
Global flags
trakr --version
trakr -v <cmd> # debug logging
trakr -q <cmd> # quiet (warnings only)
trakr --trakr-dir /path # custom .trakr/ location
These go before the subcommand: trakr -v snapshot, not trakr snapshot -v.
CI integration
GitHub Actions
name: verify-artifacts
on: [push, pull_request]
jobs:
  trakr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install trakr
      - run: trakr verify
Pre-commit
repos:
  - repo: https://github.com/your-org/trakr
    rev: v0.1.0
    hooks:
      - id: trakr-verify
GitLab CI
verify-artifacts:
  image: python:3.12
  script:
    - pip install trakr
    - trakr verify
Building from source
git clone https://github.com/your-org/trakr.git
cd trakr
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,s3]"
pytest # 30 tests
ruff check src/ tests/
The layout is small:
src/trakr/
  cli/        # typer commands and rich rendering (the user-facing layer)
  core/       # hashing, manifest, environment (pure logic, no UI)
  handlers/   # one module per artifact source (local, s3)
Adding a new handler is roughly: drop a file in handlers/ that exposes
get_artifact_info(path) -> dict, then teach _get_handler in cli/commands.py
to recognize the new prefix. Tests welcome.
Contributing
PRs welcome. Read CONTRIBUTING.md for the dev setup, coding style (ruff), and what we look for in a PR.
The project is intentionally small. If a feature requires a server, a daemon, or a fourth dependency, it probably belongs in a downstream tool — open an issue first so we can talk it through.
By participating you agree to the Code of Conduct.
Roadmap
Things on the list, in roughly the order I'd reach for them:
- trakr doctor: diagnose common setup problems
- glob patterns in trakr track
- BLAKE3 and SHA-512 as alternative hash algorithms
- remote manifest storage (S3/GCS backend)
- trakr verify --run <id> to verify against a specific run
- trakr show <run_id> to pretty-print a single manifest
- a demo/ directory with an end-to-end sample pipeline
- a published GitHub Action
These are all reasonable starter PRs — open an issue if you want to take one.
License
MIT.