
Lightweight artifact integrity and drift detection for ML/data pipelines

trakr

Artifact integrity and drift detection for ML and data pipelines.

A small CLI that hashes the things you care about, snapshots them alongside your environment, and yells when anything moves. That's it.


Why this exists

I kept getting bitten by the same thing: a model I trained two weeks ago behaves slightly differently in prod than it did on my laptop. A dataset "didn't change", except the file modified-time says it did. Someone in staging swapped a config and we noticed three days later.

The tools I had didn't help much. DVC versions things, MLflow tracks experiments, git-lfs stores blobs — none of them answer the simple question "is this file the same one I snapshotted last Tuesday?".

trakr is the boring, 500-line answer to that. SHA-256 a file. Save the hash and the env. Compare them later. Exit non-zero if anything moved.

It is meant to live next to your existing pipeline tools, not replace them.

How it compares

Tools compared: trakr, DVC, MLflow, git-lfs, W&B.

Criteria:

  • SHA-256 integrity verification
  • Drift diff between runs
  • Python + package env capture
  • S3 (no-download ETag check)
  • Zero config to start
  • Non-zero exit on drift (CI-friendly)
  • JSON output
  • Pre-commit hook
  • Runs without a server or DB
  • 3 dependencies, no daemon
  • Dataset / model versioning
  • Experiment tracking & metrics
  • Model registry
  • Visualization dashboards

Use DVC to version, MLflow to track experiments, W&B for dashboards. Use trakr to make sure the artifacts they point at are the ones you actually expect.

Install

pip install trakr

For S3 support:

pip install "trakr[s3]"

Needs Python 3.10 or newer.

Quickstart

trakr init
trakr track model.pkl --name model --type model
trakr track data/train.csv --name training --type dataset
trakr snapshot
trakr verify

That's the whole thing. init creates a .trakr/ directory in your repo, track registers a path under an alias, snapshot writes a manifest of the current state, and verify re-hashes everything and compares.

When something drifts, verify exits 1 and prints what changed:

  Verify against run_2026-04-23-001
┏━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Artifact ┃ Status     ┃ Detail       ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ model    │ ✗ mismatch │ hash changed │
│ training │ ✓ verified │ hash match   │
└──────────┴────────────┴──────────────┘
✗ Drift detected — some artifacts do not match the latest snapshot.

That non-zero exit is the whole point — drop trakr verify into a CI step and a drifted artifact stops the pipeline.

Commands

trakr init

Creates .trakr/ (config, manifests folder, cache folder). Deliberately not idempotent: running it a second time is an error, because if .trakr/ is already there you probably don't want to clobber it.

trakr track <path> --name <alias> [--type <type>]

Register a path. Local files are hashed; s3://bucket/key paths are tracked by ETag and size (no download). The --type is a free-form label — model, dataset, config, whatever you find useful in the diff later.
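
To make the "no download" part concrete, here is a minimal sketch of how an ETag and size can be read with a single HEAD request. It is not trakr's actual handler: the function names and the returned keys are assumptions, and only boto3's head_object call is real. Keep in mind that an ETag is an opaque change marker rather than a content hash (for multipart uploads it is not the MD5 of the object).

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def s3_fingerprint(uri: str) -> dict:
    """Fetch ETag, size and last-modified via HEAD; the object body is never read."""
    import boto3  # imported lazily so local-only use does not need boto3

    bucket, key = split_s3_uri(uri)
    head = boto3.client("s3").head_object(Bucket=bucket, Key=key)
    return {
        # Key names below are illustrative, not trakr's manifest schema.
        "etag": head["ETag"].strip('"'),  # S3 returns the ETag wrapped in quotes
        "size": head["ContentLength"],
        "last_modified": head["LastModified"].isoformat(),
    }
```

Credentials and region come from the usual AWS_* environment, exactly as boto3 resolves them.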

Re-running with the same --name overwrites the entry, which is convenient when you're iterating on what to track.

trakr snapshot

Writes .trakr/manifests/run_<id>.yaml with:

  • the current hash + size of each tracked local file
  • ETag + size + last-modified for each S3 object
  • the Python version and every installed package
  • a timestamp and an auto-incrementing run id (YYYY-MM-DD-NNN)
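
As a rough illustration of the last two bullets, the sketch below captures the Python version and installed packages with the standard library and derives the next run id in YYYY-MM-DD-NNN form. The function names and dictionary keys are hypothetical; trakr's real manifest schema may differ.

```python
import platform
import re
from datetime import date
from importlib.metadata import distributions


def capture_environment() -> dict:
    """Return the Python version and a name -> version map of installed packages."""
    packages = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def next_run_id(existing_ids: list[str], today: date) -> str:
    """Return today's next id, counting from 001 and ignoring other days."""
    pattern = re.compile(rf"^{today.isoformat()}-(\d{{3}})$")
    used = [int(m.group(1)) for rid in existing_ids if (m := pattern.match(rid))]
    return f"{today.isoformat()}-{max(used, default=0) + 1:03d}"
```

For example, next_run_id(["2026-04-23-001"], date(2026, 4, 23)) gives "2026-04-23-002", while an empty list gives "2026-04-23-001".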

Files larger than 10 MB get a progress bar while they're being hashed.
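
Hashing itself is simple. A minimal sketch, assuming the file is read in fixed-size chunks so large models never have to fit in memory: the on_progress callback stands in for the progress bar, and reading "10 MB" as 10 * 1024 * 1024 bytes is an assumption, as are all names here.

```python
import hashlib
from pathlib import Path

# Assumed interpretation of the 10 MB threshold mentioned above.
PROGRESS_THRESHOLD = 10 * 1024 * 1024


def sha256_file(path: Path, chunk_size: int = 1024 * 1024, on_progress=None) -> str:
    """Return the hex SHA-256 of a file, reading it chunk by chunk."""
    size = path.stat().st_size
    digest = hashlib.sha256()
    done = 0
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            done += len(chunk)
            # Only report progress for files above the threshold.
            if on_progress is not None and size > PROGRESS_THRESHOLD:
                on_progress(done, size)
    return digest.hexdigest()
```

The result is the same digest you would get from hashing the whole file at once; chunking only bounds memory use.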

trakr verify [--json]

Re-runs the same collection against the latest manifest and compares. Prints a table by default. With --json it prints a machine-readable result, useful in CI:

trakr verify --json | jq '.status'
# "ok" or "drift"

Exit code is 0 on a clean verify, 1 if anything mismatched.
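
If your pipeline is driven from Python rather than shell, the JSON output can be consumed along these lines. This sketch relies only on the top-level status field shown above ("ok" or "drift"); it assumes nothing about other keys in the payload, and has_drift is a hypothetical helper, not part of trakr.

```python
import json


def has_drift(verify_json: str) -> bool:
    """Return True when the status reported by `trakr verify --json` is "drift"."""
    status = json.loads(verify_json).get("status")
    if status not in ("ok", "drift"):
        # Fail loudly instead of treating an unknown status as clean.
        raise ValueError(f"unexpected status: {status!r}")
    return status == "drift"
```

The input would typically be the captured stdout of trakr verify --json, for example from subprocess.run(..., capture_output=True, text=True). The exit code remains the simpler signal when you only need pass/fail.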

trakr diff <run1> <run2>

Tree view of what changed between two snapshots — added, removed, changed artifacts, plus environment differences (Python version, package versions).
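
The comparison behind that tree can be sketched with plain set operations. This is an illustration, not trakr's code: it assumes each snapshot has been reduced to a name -> hash mapping, and the same function works for a package -> version mapping when comparing environments.

```python
def diff_artifacts(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Group names by how they differ between two name -> value maps."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Present in both snapshots, but with a different value.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Names that are present in both maps with identical values do not appear in any of the three lists.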

trakr list [--json]

Shows what you're currently tracking, with the last-known hash from the latest snapshot if there is one.

trakr history [--limit N]

Recent snapshots, newest first. Default limit is 20.

trakr status [--json]

Quick summary panel: how many artifacts you track, when the last snapshot was, whether anything has drifted since. Cheap to run, good for a shell prompt or a Makefile target.

trakr untrack <name>

Stop tracking an artifact. Doesn't delete anything from disk or from old manifests, just removes it from config.yaml.

Configuration

.trakr/config.yaml is plain YAML and is meant to be hand-edited:

pipeline: default            # free-form label, shown in manifests
hash_algorithm: sha256       # reserved; sha256 is the only one wired up today
log_level: info              # default; CLI flags override
artifacts:
  - name: model
    path: model.pkl
    type: model
  - name: training-data
    path: s3://my-bucket/data.csv
    type: dataset

Environment variables

Variable           What it does
TRAKR_DIR          Use a different directory than ./.trakr/.
TRAKR_LOG_LEVEL    debug / info / warning / error.
TRAKR_NO_COLOR     Disable colored output. (Also respects standard NO_COLOR.)
AWS_*              Whatever boto3 uses: credentials, region, profile.

Global flags

trakr --version
trakr -v <cmd>          # debug logging
trakr -q <cmd>          # quiet (warnings only)
trakr --trakr-dir /path <cmd>  # custom .trakr/ location

These go before the subcommand: trakr -v snapshot, not trakr snapshot -v.

CI integration

GitHub Actions

name: verify-artifacts
on: [push, pull_request]
jobs:
  trakr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install trakr
      - run: trakr verify

Pre-commit

repos:
  - repo: https://github.com/your-org/trakr
    rev: v0.1.0
    hooks:
      - id: trakr-verify

GitLab CI

verify-artifacts:
  image: python:3.12
  script:
    - pip install trakr
    - trakr verify

Building from source

git clone https://github.com/your-org/trakr.git
cd trakr
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,s3]"

pytest          # 30 tests
ruff check src/ tests/

The layout is small:

src/trakr/
  cli/        # typer commands and rich rendering — the user-facing layer
  core/       # hashing, manifest, environment — pure logic, no UI
  handlers/   # one module per artifact source (local, s3)

Adding a new handler is roughly: drop a file in handlers/ that exposes get_artifact_info(path) -> dict, then teach _get_handler in cli/commands.py to recognize the new prefix. Tests welcome.
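
A rough sketch of that shape, for orientation only. The gs:// handler, the registry, and the returned keys are all hypothetical; the real _get_handler in cli/commands.py may be structured differently.

```python
import hashlib
from pathlib import Path
from typing import Callable

Handler = Callable[[str], dict]


def local_info(path: str) -> dict:
    """Stand-in local handler: hash and size of a file on disk."""
    data = Path(path).read_bytes()  # fine for a sketch; real code would hash in chunks
    return {"source": "local", "hash": hashlib.sha256(data).hexdigest(), "size": len(data)}


def gcs_info(path: str) -> dict:
    """Hypothetical new handler for gs:// paths; the remote lookup is left out."""
    raise NotImplementedError(f"metadata lookup not implemented for {path}")


# Prefix -> handler. Paths without a known prefix fall back to the local handler.
HANDLERS: dict[str, Handler] = {"gs://": gcs_info}


def get_handler(path: str) -> Handler:
    """Return the handler whose prefix matches the path."""
    for prefix, handler in HANDLERS.items():
        if path.startswith(prefix):
            return handler
    return local_info
```

With a table like this, supporting a new source comes down to one new function plus one new entry in the mapping.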

Contributing

PRs welcome. Read CONTRIBUTING.md for the dev setup, coding style (ruff), and what we look for in a PR.

The project is intentionally small. If a feature requires a server, a daemon, or a fourth dependency, it probably belongs in a downstream tool — open an issue first so we can talk it through.

By participating you agree to the Code of Conduct.

Roadmap

Things on the list, in roughly the order I'd reach for them:

  • trakr doctor — diagnose common setup problems
  • glob patterns in trakr track
  • BLAKE3 and SHA-512 as alternative hash algorithms
  • remote manifest storage (S3/GCS backend)
  • trakr verify --run <id> to verify against a specific run
  • trakr show <run_id> to pretty-print a single manifest
  • a demo/ directory with an end-to-end sample pipeline
  • a published GitHub Action

These are all reasonable starter PRs — open an issue if you want to take one.

License

MIT.

Download files


Source Distribution

trakr-0.1.0.tar.gz (23.2 kB view details)

Uploaded Source

Built Distribution


trakr-0.1.0-py3-none-any.whl (23.8 kB view details)

Uploaded Python 3

File details

Details for the file trakr-0.1.0.tar.gz.

File metadata

  • Download URL: trakr-0.1.0.tar.gz
  • Upload date:
  • Size: 23.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for trakr-0.1.0.tar.gz
Algorithm Hash digest
SHA256 b3fa464b19b99df6fd6ce38b21e66382a0eceb696da338ba711a1c09fbeb16f4
MD5 17ca1eb2e96b5964d1fd25a9e86f12b8
BLAKE2b-256 56a326a6ea773d3b391add32ea37d6a50b9deef7c2b735a3b6c0dc81279b2471


Provenance

The following attestation bundles were made for trakr-0.1.0.tar.gz:

Publisher: publish.yml on poorna-prakash-sr/trakr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file trakr-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: trakr-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 23.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for trakr-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 83418fbb069a95919a9f92e0dfd6a81fd0f46d3727f9b1b5cb72f0fa9f86ecf0
MD5 63e4da3f40df39cdc11c9397029e8bbc
BLAKE2b-256 f736ec4233b7a99924a9f07c383e6f8ad54c73d3f16cb742edf174527acebffc


Provenance

The following attestation bundles were made for trakr-0.1.0-py3-none-any.whl:

Publisher: publish.yml on poorna-prakash-sr/trakr

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
