
Traceprop

End-to-end data provenance for machine learning pipelines.

Traceprop is a Python library that traces raw source files through preprocessing and model training to individual predictions, and lets you act on that lineage via attribution, unlearning, and compliance reporting.

pip install traceprop



What it does

A single Traceprop query answers:

"This model made prediction X on input Z. Which rows in which source files, through which preprocessing steps, most influenced that prediction — and can we reduce that influence without retraining?"

Capability              What you get
Lineage tracking        Sub-1% overhead in op-mode; tracks every NumPy, PyTorch, and JAX operation
Attribution             LDS 0.622 ± 0.180 on tabular data at 0.22 s CPU — matches TRAK quality, no GPU needed
Approximate unlearning  Provenance-guided gradient correction; closes >100% of the retrain-from-scratch gap
Compliance reporting    Structured JSON audit trail for EU AI Act Article 26 obligations
Data valuation          KNN-Shapley values aggregated by source file and preprocessing op

Installation

# Core (NumPy only)
pip install traceprop

# With PyTorch support
pip install "traceprop[torch]"

# With JAX support
pip install "traceprop[jax]"

# With PostgreSQL provenance store
pip install "traceprop[postgres]"

# Everything
pip install "traceprop[all]"

Requires Python 3.10+.


Quick start

import traceprop as tp
import numpy as np

# 1. Load source data with provenance tracking
data_a = tp.from_csv("hospital_a.csv", source_id="hospital_a")
data_b = tp.from_csv("hospital_b.csv", source_id="hospital_b")

# 2. Preprocessing — every op is recorded in the lineage graph
norm_a = (data_a - data_a.mean(axis=0)) / (data_a.std(axis=0) + 1e-8)
norm_b = (data_b - data_b.mean(axis=0)) / (data_b.std(axis=0) + 1e-8)

# 3. Train with gradient recording (model, X_train, y_train are yours)
with tp.training_context(model, X_train, y_train, source_id="hospital_a") as ctx:
    train(model, X_train, y_train)   # your training loop here

# 4. Attribute a prediction back to source rows
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=10)

for entry in result.top(5):
    print(entry["source_id"], entry["sample_index"], entry["influence_score"])

# 5. Trace the top sample back to its source file and preprocessing ops
trace = result.trace_to_file(rank=0)
print(trace["sources"], trace["ops"])

# 6. Unlearn a data source without retraining
unlearn_result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",
    n_steps=300,
    lr=1e-2,
)
print(f"Verified: {unlearn_result.verified}")

# 7. Generate EU AI Act compliance report
report = tp.compliance_report(
    tensor=norm_a,
    system_name="CreditScorer-v1",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="compliance_report.json",
)

Core API

Provenance tracking

Function                             Description
tp.from_numpy(arr, source_id=...)    Wrap a NumPy array with lineage tracking
tp.from_csv(path, source_id=...)     Load a CSV file with lineage tracking
tp.from_torch(data, source_id=...)   Wrap a PyTorch tensor
tp.from_jax(data, source_id=...)     Wrap a JAX array
tp.array(data, source_id=...)        Like np.array, but tracked
tp.provenance(tensor)                Get a ProvenanceView to query lineage
tp.reset_graph()                     Start a fresh lineage graph

ProvenanceView

view = tp.provenance(tensor)
view.ancestors()      # set of ancestor node IDs
view.ops()            # list of preprocessing operations
view.sources()        # list of source_ids in lineage
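The lineage queries above can be sketched with a plain adjacency-map DAG. This is an illustration of the idea only (the class and field names here are hypothetical, not Traceprop's internals):

```python
# Minimal lineage DAG: each node records its parents, the op that produced
# it, and (for roots) a source_id. Queries walk the parent edges.
class LineageGraph:
    def __init__(self):
        self.parents = {}    # node_id -> list of parent node_ids
        self.op = {}         # node_id -> op name ("source" for roots)
        self.source_id = {}  # root node_id -> source_id

    def add_source(self, node_id, source_id):
        self.parents[node_id] = []
        self.op[node_id] = "source"
        self.source_id[node_id] = source_id

    def add_op(self, node_id, op, parents):
        self.parents[node_id] = list(parents)
        self.op[node_id] = op

    def ancestors(self, node_id):
        seen, stack = set(), list(self.parents[node_id])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(self.parents[n])
        return seen

    def ops(self, node_id):
        return [self.op[n] for n in self.ancestors(node_id) | {node_id}
                if self.op[n] != "source"]

    def sources(self, node_id):
        return sorted(self.source_id[n] for n in self.ancestors(node_id)
                      if self.op[n] == "source")

g = LineageGraph()
g.add_source("a", "hospital_a")
g.add_op("mean_a", "mean", ["a"])
g.add_op("norm_a", "subtract", ["a", "mean_a"])
print(g.sources("norm_a"))   # -> ['hospital_a']
```

A normalization like `(data_a - data_a.mean(axis=0))` thus yields a derived node whose lineage still resolves to `hospital_a`.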

Attribution

# Record gradients during training
with tp.training_context(model, X_train, y_train, source_id="data", proj_dim=4096) as ctx:
    ...  # training loop

# Attribute a test prediction
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=50)

result.top(10)            # list of dicts: sample_index, source_id, influence_score
result.trace_to_file(0)   # trace rank-0 sample to source file + ops
result.by_source()        # aggregate influence by source_id

GradientStore uses a sparse Johnson-Lindenstrauss projection (Achlioptas 2003) with entries drawn from {-1, 0, +1}. The default proj_dim=4096 works well for tabular models; use lower values in memory-constrained environments.
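The Achlioptas construction itself is simple enough to sketch in a few lines of NumPy (this is the general technique, not Traceprop's exact implementation):

```python
import numpy as np

# Sparse JL projection (Achlioptas 2003): entries are +1 or -1 with
# probability 1/6 each and 0 with probability 2/3, scaled so each entry has
# variance 1/proj_dim. Norms (and hence inner products) are then preserved
# in expectation, so projected gradients can stand in for full gradients.
def sparse_jl_matrix(grad_dim, proj_dim, seed=0):
    rng = np.random.default_rng(seed)
    coins = rng.choice([-1.0, 0.0, 1.0], size=(grad_dim, proj_dim),
                       p=[1 / 6, 2 / 3, 1 / 6])
    return coins * np.sqrt(3.0 / proj_dim)

grad_dim, proj_dim = 2_000, 512
P = sparse_jl_matrix(grad_dim, proj_dim)
g = np.random.default_rng(1).standard_normal(grad_dim)
ratio = np.linalg.norm(g @ P) ** 2 / np.linalg.norm(g) ** 2
# ratio is close to 1: the squared norm survives the projection
```

Two-thirds of the projection entries are exactly zero, which is what keeps the per-step recording overhead low.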

Unlearning

result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",   # data source to forget
    n_steps=300,
    lr=1e-2,
    verification_threshold=0.05,
)
result.verified             # bool
result.influence_before     # float
result.influence_after      # float
result.compliance_report    # dict
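Traceprop's provenance-guided gradient correction is store-driven; as a generic illustration of the before/after check that `verified` performs, here is the simplest form of approximate unlearning — continued training on the retained source only — on a toy logistic regression (all names and data here are illustrative):

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    # Mean cross-entropy loss and its gradient for logistic regression.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
# Source A, and a source B whose labels conflict with A's decision rule.
X_a = rng.standard_normal((200, 5)); y_a = (X_a[:, 0] > 0).astype(float)
X_b = rng.standard_normal((50, 5));  y_b = (X_b[:, 0] <= 0).astype(float)
X, y = np.vstack([X_a, X_b]), np.concatenate([y_a, y_b])

w = np.zeros(5)
for _ in range(300):                       # train on both sources
    w -= 0.5 * logistic_loss_grad(w, X, y)[1]
influence_before = logistic_loss_grad(w, X_b, y_b)[0]

for _ in range(300):                       # "unlearn" B: descend on A only
    w -= 0.5 * logistic_loss_grad(w, X_a, y_a)[1]
influence_after = logistic_loss_grad(w, X_b, y_b)[0]
# influence_after > influence_before: the fit to the forgotten source degrades
```

Verification then amounts to comparing the forget-set influence before and after the correction against a threshold, as `verification_threshold` does above.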

Data valuation

val_result = tp.data_valuation(
    gradient_store=ctx.gradient_store,
    val_gradients=val_grads,   # (n_val, grad_dim) array
    k=10,
)
val_result.by_source()    # Shapley values aggregated by source
val_result.by_op()        # Shapley values aggregated by preprocessing op
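The per-sample values that `by_source()` and `by_op()` aggregate come from exact KNN-Shapley. For a single validation point, the closed-form recursion (Jia et al., 2019) can be sketched directly — this is the published algorithm, not necessarily Traceprop's exact code path:

```python
import numpy as np

# Exact KNN-Shapley for one validation point: sort the training set by
# distance to x_val, then fill values by the closed-form recursion from the
# farthest point inward.
def knn_shapley(X_train, y_train, x_val, y_val, k):
    n = len(X_train)
    order = np.argsort(np.linalg.norm(X_train - x_val, axis=1))
    match = (y_train[order] == y_val).astype(float)
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n
    for i in range(n - 2, -1, -1):        # 0-indexed; distance rank = i + 1
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
    values = np.empty(n)
    values[order] = s                     # undo the distance sort
    return values

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
vals = knn_shapley(X, y, x_val=np.array([0.1]), y_val=0, k=2)
# vals -> [0.5, 0.5, 0.0, 0.0]: the two nearest, correctly-labeled
# points split the credit; the values sum to the KNN utility (1.0 here)
```

Summing these values over the rows of each source file (or over the samples touched by each preprocessing op) gives the aggregated valuations.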

Compliance

report = tp.compliance_report(
    tensor=output_tensor,
    system_name="MyModel",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="report.json",   # optional: write to file
)

Produces a structured JSON report covering EU AI Act Article 26 audit trail requirements for high-risk AI systems (enforcement backstop: 2 December 2027).
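The shape of such a record can be sketched with the stdlib `json` module. The field names below are hypothetical, chosen to mirror the call above — they are not Traceprop's actual report schema:

```python
import json
from datetime import datetime, timezone

# Illustrative audit-trail record: system identity, the lineage's data
# sources, and the preprocessing ops applied to them.
record = {
    "system_name": "CreditScorer-v1",
    "system_version": "1.0.0",
    "deployer_name": "Amit N.",
    "high_risk_category": "credit_scoring",
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "data_sources": [{"source_id": "hospital_a", "path": "hospital_a.csv"}],
    "preprocessing_ops": ["mean", "std", "subtract", "divide"],
}
serialized = json.dumps(record, indent=2)
```

Because every entry is derived from the lineage graph rather than hand-maintained, the report stays in sync with what the pipeline actually did.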

Granularity modes

tp.set_granularity(tp.Granularity.OP)      # default: track every op
tp.set_granularity(tp.Granularity.BATCH)   # batch-level only (lower overhead)
tp.set_granularity(tp.Granularity.EPOCH)   # epoch-level only

Benchmarks

Attribution quality (LDS — Linear Datamodeling Score)

Higher is better. Measured on 500 held-out retraining subsets.

Method          Dataset                 LDS              Time    Hardware
Traceprop-LL    Adult Income (tabular)  0.622 ± 0.180    0.22 s  CPU
TRAK (5 ckpts)  CIFAR-2 / ResNet-9      0.0290 ± 0.0523  691 s   GPU (T4)
Traceprop-LL    CIFAR-2 / ResNet-9      0.0168 ± 0.0684  2.6 s   CPU
Traceprop-BM    CIFAR-2 / ResNet-9      0.0033 ± 0.0334  14.2 s  CPU
Random          CIFAR-2 / ResNet-9      0.0205 ± 0.0357

Recommendation: use Traceprop-LL for tabular and linear models (it is exact for logistic regression). For deep vision models with BatchNorm, TRAK is preferred for quality; Traceprop-LL is 266× faster but scores near random on CIFAR-2 due to BatchNorm corrupting per-sample last-layer features.
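For reference, LDS is (per test example) the Spearman rank correlation between attribution-predicted outputs and the outputs of models actually retrained on each held-out subset. A minimal, tie-free NumPy version of that computation (an illustration of the metric, not the benchmark harness used above):

```python
import numpy as np

def spearman(a, b):
    # Spearman correlation via rank transform (assumes no ties).
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def predicted_outputs(scores, subset_masks):
    # scores: (n_train,) attribution scores for one test example
    # subset_masks: (n_subsets, n_train) 0/1 membership per retraining subset
    # Predicted output of each subset = sum of its members' scores.
    return subset_masks @ scores

scores = np.array([2.0, -1.0, 0.5])
masks = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
pred = predicted_outputs(scores, masks)    # [1.0, 2.5, -0.5]
actual = np.array([0.9, 2.2, -0.3])        # hypothetical retrained outputs
# spearman(pred, actual) is 1.0 here: the ranking matches perfectly
```

The reported scores average this correlation over the test set, across the 500 retraining subsets.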

Lineage overhead

Platform          Overhead  Mode
macOS (M-series)  1.007×    op-mode
Linux (x86-64)    0.979×    op-mode

Sub-1% overhead at 10⁶+ array elements.

Unlearning

Forget-set loss after gradient correction: 0.425, versus 0.401 for the gold standard (retrain from scratch) and 0.379 for the original model. Gap closed: >100% (the correction overshoots the retrain-from-scratch target). Test-accuracy cost: 0.5 pp (0.915 vs. 0.920).


Backends

Backend  Install                         Usage
NumPy    built-in                        tp.from_numpy(arr)
PyTorch  pip install "traceprop[torch]"  tp.from_torch(tensor)
JAX      pip install "traceprop[jax]"    tp.from_jax(array)

Provenance stores

By default Traceprop uses an in-memory store. For persistence:

# SQLite
from traceprop.stores.sqlite_store import SQLiteStore
store = SQLiteStore("lineage.db")

# PostgreSQL
from traceprop.stores.postgres_store import PostgresStore
store = PostgresStore("postgresql://user:pass@localhost/mydb")
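A persistent lineage store reduces to a nodes table, an edges table, and an ancestor query. A sketch on the stdlib `sqlite3` module — the schema and queries here are illustrative, not SQLiteStore's actual internals:

```python
import sqlite3

# Nodes carry the producing op; edges record child -> parent lineage.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nodes (id TEXT PRIMARY KEY, op TEXT);
    CREATE TABLE edges (child TEXT, parent TEXT);
""")
con.executemany("INSERT INTO nodes VALUES (?, ?)",
                [("a", "source"), ("mean_a", "mean"), ("norm_a", "subtract")])
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("mean_a", "a"), ("norm_a", "a"), ("norm_a", "mean_a")])

# The ancestor query behind a view.ancestors()-style call, as a recursive CTE
ancestors = {row[0] for row in con.execute("""
    WITH RECURSIVE anc(id) AS (
        SELECT parent FROM edges WHERE child = 'norm_a'
        UNION
        SELECT e.parent FROM edges e JOIN anc ON e.child = anc.id
    )
    SELECT id FROM anc
""")}
# ancestors == {"a", "mean_a"}
```

The same schema maps directly onto PostgreSQL, which also supports `WITH RECURSIVE`.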


Project structure

traceprop/
  __init__.py            # public API
  tensor.py              # ProvenanceTensor (NumPy wrapper)
  graph.py               # lineage DAG
  query.py               # ProvenanceView
  interceptor.py         # op-level interception
  granularity.py         # Granularity modes
  compression.py         # ProvRC range compression
  exporters.py           # Parquet / OpenTelemetry exporters
  exceptions.py
  attribution/
    training_context.py  # TrainingContext, GradientStore
    gradient_store.py    # sparse JL projection
    influence.py         # compute_influence_scores
    attribution_engine.py
    streaming_context.py # online / continual learning
  backends/
    numpy_backend.py
    torch_backend.py
    jax_backend.py
  stores/
    memory_store.py
    sqlite_store.py
    postgres_store.py
  compliance/
    eu_ai_act.py         # EU AI Act Article 26 report generator
  unlearning/
    gradient_correction.py
  valuation/
    knn_shapley.py
  _c_ext/
    graph_ops.pyx        # optional Cython acceleration

Contributing

Issues and pull requests are welcome. Please open an issue before submitting a large PR.

git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop
pip install -e ".[dev]"
pytest

Citation

If you use Traceprop in research, please cite:

@misc{traceprop2025,
  author  = {Amit N.},
  title   = {Traceprop: End-to-End Data Provenance for Machine Learning Pipelines},
  year    = {2025},
  url     = {https://github.com/AmitoVrito/Traceprop},
}

License

Apache 2.0 — see LICENSE.
