# Traceprop

End-to-end data provenance for machine learning pipelines.

Traceprop is a Python library that connects raw source files, through preprocessing and model training, to individual predictions, and lets you act on that lineage via attribution, unlearning, and compliance reporting.

```bash
pip install traceprop
```
## What it does

A single Traceprop query answers:

> "This model made prediction X on input Z. Which rows in which source files, through which preprocessing steps, most influenced that prediction — and can we reduce that influence without retraining?"
| Capability | What you get |
|---|---|
| Lineage tracking | Sub-1% overhead in op-mode; tracks every NumPy, PyTorch, and JAX operation |
| Attribution | LDS 0.622 ± 0.180 on tabular data at 0.22 s CPU — matches TRAK quality, no GPU needed |
| Approximate unlearning | Provenance-guided gradient correction; closes >100% of the retrain-from-scratch gap |
| Compliance reporting | Structured JSON audit trail for EU AI Act Article 26 obligations |
| Data valuation | KNN-Shapley values aggregated by source file and preprocessing op |
## Installation

```bash
# Core (NumPy only)
pip install traceprop

# With PyTorch support
pip install "traceprop[torch]"

# With JAX support
pip install "traceprop[jax]"

# With PostgreSQL provenance store
pip install "traceprop[postgres]"

# Everything
pip install "traceprop[all]"
```

Requires Python 3.10+.
## Quick start

```python
import numpy as np
import traceprop as tp

# 1. Load source data with provenance tracking
data_a = tp.from_csv("hospital_a.csv", source_id="hospital_a")
data_b = tp.from_csv("hospital_b.csv", source_id="hospital_b")

# 2. Preprocessing — every op is recorded in the lineage graph
norm_a = (data_a - data_a.mean(axis=0)) / (data_a.std(axis=0) + 1e-8)
norm_b = (data_b - data_b.mean(axis=0)) / (data_b.std(axis=0) + 1e-8)

# 3. Train with gradient recording
with tp.training_context(source_id="hospital_a") as ctx:
    train(model, X_train, y_train)  # your training loop here

# 4. Attribute a prediction back to source rows
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=10)
for entry in result.top(5):
    print(entry["source_id"], entry["sample_index"], entry["influence_score"])

# 5. Trace the top sample back to its source file and preprocessing ops
trace = result.trace_to_file(rank=0)
print(trace["sources"], trace["ops"])

# 6. Unlearn a data source without retraining
unlearn_result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",
    n_steps=300,
    lr=1e-2,
)
print(f"Verified: {unlearn_result.verified}")

# 7. Generate an EU AI Act compliance report
report = tp.compliance_report(
    tensor=norm_a,
    system_name="CreditScorer-v1",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="compliance_report.json",
)
```
## Core API

### Provenance tracking

| Function | Description |
|---|---|
| `tp.from_numpy(arr, source_id=...)` | Wrap a NumPy array with lineage tracking |
| `tp.from_csv(path, source_id=...)` | Load a CSV with lineage tracking |
| `tp.from_torch(data, source_id=...)` | Wrap a PyTorch tensor |
| `tp.from_jax(data, source_id=...)` | Wrap a JAX array |
| `tp.array(data, source_id=...)` | Like `np.array`, but tracked |
| `tp.provenance(tensor)` | Get a `ProvenanceView` to query lineage |
| `tp.reset_graph()` | Start a fresh lineage graph |
### ProvenanceView

```python
view = tp.provenance(tensor)
view.ancestors()  # set of ancestor node IDs
view.ops()        # list of preprocessing operations
view.sources()    # list of source_ids in the lineage
```
### Attribution

```python
# Record gradients during training
with tp.training_context(model, X_train, y_train, source_id="data", proj_dim=4096) as ctx:
    ...  # training loop

# Attribute a test prediction
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=50)

result.top(10)           # list of dicts: sample_index, source_id, influence_score
result.trace_to_file(0)  # trace the rank-0 sample to its source file + ops
result.by_source()       # aggregate influence by source_id
```
`GradientStore` uses a sparse Johnson-Lindenstrauss projection (Achlioptas 2003) with entries in {−1, 0, +1}. The default `proj_dim=4096` works well for tabular models; use lower values in memory-constrained environments.
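The projection can be sketched in a few lines. This is the standard Achlioptas construction, not Traceprop's internal `GradientStore` code; `sparse_jl_matrix` and the dimensions below are illustrative:

```python
import numpy as np

def sparse_jl_matrix(grad_dim: int, proj_dim: int, seed: int = 0) -> np.ndarray:
    """Achlioptas-style sparse random projection matrix.

    Entries are +1 with probability 1/6, 0 with probability 2/3, and
    -1 with probability 1/6, scaled by sqrt(3 / proj_dim) so that
    projected inner products are unbiased estimates of the originals.
    """
    rng = np.random.default_rng(seed)
    coins = rng.choice([-1.0, 0.0, 1.0],
                       size=(grad_dim, proj_dim),
                       p=[1 / 6, 2 / 3, 1 / 6])
    return coins * np.sqrt(3.0 / proj_dim)

# Norms (and inner products) survive the projection approximately.
rng = np.random.default_rng(1)
g = rng.standard_normal(5_000)       # a flattened per-sample gradient
P = sparse_jl_matrix(5_000, 1_024)   # project 5000-dim down to 1024-dim
g_proj = g @ P
rel_err = abs(g_proj @ g_proj - g @ g) / (g @ g)  # small relative error
```

Two-thirds of the matrix entries are zero, which is what keeps the projection cheap relative to a dense Gaussian one.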
### Unlearning

```python
result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",  # data source to forget
    n_steps=300,
    lr=1e-2,
    verification_threshold=0.05,
)

result.verified           # bool
result.influence_before   # float
result.influence_after    # float
result.compliance_report  # dict
```
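One plausible reading of these fields, sketched below: unlearning counts as verified when the influence remaining after correction is a small fraction of the original. This is a hypothetical criterion consistent with the API above, not Traceprop's documented rule; the 0.05 default mirrors `verification_threshold=0.05`:

```python
def is_verified(influence_before: float, influence_after: float,
                threshold: float = 0.05) -> bool:
    """Hypothetical check: residual influence below threshold * original."""
    if influence_before == 0.0:
        return True  # nothing to forget
    return abs(influence_after) <= threshold * abs(influence_before)

print(is_verified(12.4, 0.31))  # → True  (residual ~2.5% of original)
print(is_verified(12.4, 3.10))  # → False (~25% residual)
```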
### Data valuation

```python
val_result = tp.data_valuation(
    gradient_store=ctx.gradient_store,
    val_gradients=val_grads,  # (n_val, grad_dim) array
    k=10,
)

val_result.by_source()  # Shapley values aggregated by source
val_result.by_op()      # Shapley values aggregated by preprocessing op
```
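The per-sample values behind these aggregates can be illustrated with the closed-form KNN-Shapley recursion of Jia et al. (2019); Traceprop's implementation details may differ, and the function below is a self-contained sketch for a single test point:

```python
import numpy as np

def knn_shapley(X_train, y_train, x_test, y_test, k=10):
    """Exact KNN-Shapley values for one test point (Jia et al., 2019).

    Training points are sorted by distance to the test point; values
    are filled in from the farthest point back to the nearest.
    """
    n = len(X_train)
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))
    match = (y_train[order] == y_test).astype(float)  # label agreement
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-indexed rank of the i-th nearest point
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
    values = np.zeros(n)
    values[order] = s  # map back to original training order
    return values

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
vals = knn_shapley(X, y, x_test=np.array([0.1]), y_test=0, k=2)
# The two nearest, correctly labeled points share all the value:
print(vals)        # → [0.5 0.5 0.  0. ]
print(vals.sum())  # → 1.0 (efficiency: sums to the k-NN utility of the full set)
```

Aggregating these per-sample values by `source_id` or by preprocessing op is then a straightforward group-by over the lineage graph.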
### Compliance

```python
report = tp.compliance_report(
    tensor=output_tensor,
    system_name="MyModel",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="report.json",  # optional: write to file
)
```
Produces a structured JSON report covering EU AI Act Article 26 audit trail requirements for high-risk AI systems (enforcement backstop: 2 December 2027).
## Granularity modes

```python
tp.set_granularity(tp.Granularity.OP)     # default: track every op
tp.set_granularity(tp.Granularity.BATCH)  # batch-level only (lower overhead)
tp.set_granularity(tp.Granularity.EPOCH)  # epoch-level only
```
## Benchmarks

### Attribution quality (LDS, Linear Datamodeling Score)

Higher is better. Measured on 500 held-out retraining subsets.
| Method | Dataset | LDS | Time | Hardware |
|---|---|---|---|---|
| Traceprop-LL | Adult Income (tabular) | 0.622 ± 0.180 | 0.22 s | CPU |
| TRAK (5 ckpts) | CIFAR-2 / ResNet-9 | 0.0290 ± 0.0523 | 691 s | GPU (T4) |
| Traceprop-LL | CIFAR-2 / ResNet-9 | 0.0168 ± 0.0684 | 2.6 s | CPU |
| Traceprop-BM | CIFAR-2 / ResNet-9 | 0.0033 ± 0.0334 | 14.2 s | CPU |
| Random | CIFAR-2 / ResNet-9 | 0.0205 ± 0.0357 | — | — |
Recommendation: use Traceprop-LL for tabular and linear models (it is exact for logistic regression). For deep vision models with BatchNorm, TRAK is preferred for quality; Traceprop-LL is 266× faster but scores near random on CIFAR-2 due to BatchNorm corrupting per-sample last-layer features.
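For intuition on why last-layer attribution can be exact for logistic regression: the per-sample gradient has the closed form (σ(x·w) − y)·x, so no autodiff is needed and influence can be scored by gradient inner products. A minimal sketch under those assumptions (not Traceprop's code; the Hessian preconditioning used by TRAK-style estimators is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(w, X, y):
    # Gradient of the logistic loss w.r.t. w, one row per sample:
    # (sigmoid(x @ w) - y) * x  -- exact closed form.
    resid = sigmoid(X @ w) - y
    return resid[:, None] * X

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = (X[:, 0] > 0).astype(float)
w = rng.standard_normal(5) * 0.1  # stand-in for trained weights

G = per_sample_grads(w, X, y)               # (100, 5) training gradients
g_test = per_sample_grads(w, X[:1], y[:1])[0]
scores = G @ g_test                         # influence scores for one test point
top5 = np.argsort(-scores)[:5]              # most influential training rows
```

BatchNorm breaks this picture because each sample's last-layer features depend on the rest of the batch, which is why the per-sample factorization above no longer holds for deep vision models.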
### Lineage overhead

| Platform | Overhead | Mode |
|---|---|---|
| macOS (M-series) | 1.007× | op-mode |
| Linux (x86-64) | 0.979× | op-mode |

Overhead stays below 1% at 10⁶+ array elements; the sub-1.0× Linux figure is within measurement noise.
### Unlearning

After gradient correction, forget-set loss is 0.425, versus 0.401 for the gold standard (retrain from scratch) and 0.379 for the original model. The correction closes >100% of the retrain-from-scratch gap, overshooting the retrain baseline. Test accuracy drops by 0.5 pp (0.920 → 0.915).
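Assuming "gap closed" means the rise in forget-set loss relative to the retrain baseline (an interpretation of the reported numbers, not Traceprop's stated formula), the figure reproduces as:

```python
original, retrain, corrected = 0.379, 0.401, 0.425
gap_closed = (corrected - original) / (retrain - original)
print(f"{gap_closed:.0%}")  # → 209%, i.e. the correction overshoots the retrain baseline
```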
## Backends

| Backend | Install | Usage |
|---|---|---|
| NumPy | built-in | `tp.from_numpy(arr)` |
| PyTorch | `pip install "traceprop[torch]"` | `tp.from_torch(tensor)` |
| JAX | `pip install "traceprop[jax]"` | `tp.from_jax(array)` |
## Provenance stores

By default, Traceprop uses an in-memory store. For persistence:

```python
# SQLite
from traceprop.stores.sqlite_store import SQLiteStore
store = SQLiteStore("lineage.db")

# PostgreSQL
from traceprop.stores.postgres_store import PostgresStore
store = PostgresStore("postgresql://user:pass@localhost/mydb")
```
## Examples

- `examples/full_pipeline_demo.py` — full end-to-end demo: two hospital CSVs → preprocessing → training → attribution → unlearning → compliance report
- `notebooks/tabular_logistic_lds_colab.ipynb` — LDS benchmark on Adult Income (Colab, CPU)
- `notebooks/cifar2_resnet9_lds_colab.ipynb` — LDS benchmark on CIFAR-2/ResNet-9 (Colab, GPU T4)
- `notebooks/homecredit_multisource_provenance_colab.ipynb` — multi-source provenance case study (3-table credit risk data)
## Project structure

```
traceprop/
    __init__.py               # public API
    tensor.py                 # ProvenanceTensor (NumPy wrapper)
    graph.py                  # lineage DAG
    query.py                  # ProvenanceView
    interceptor.py            # op-level interception
    granularity.py            # Granularity modes
    compression.py            # ProvRC range compression
    exporters.py              # Parquet / OpenTelemetry exporters
    exceptions.py
    attribution/
        training_context.py   # TrainingContext, GradientStore
        gradient_store.py     # sparse JL projection
        influence.py          # compute_influence_scores
        attribution_engine.py
        streaming_context.py  # online / continual learning
    backends/
        numpy_backend.py
        torch_backend.py
        jax_backend.py
    stores/
        memory_store.py
        sqlite_store.py
        postgres_store.py
    compliance/
        eu_ai_act.py          # EU AI Act Article 26 report generator
    unlearning/
        gradient_correction.py
    valuation/
        knn_shapley.py
    _c_ext/
        graph_ops.pyx         # optional Cython acceleration
```
## Contributing

Issues and pull requests are welcome. Please open an issue before submitting a large PR.

```bash
git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop
pip install -e ".[dev]"
pytest
```
## Citation

If you use Traceprop in research, please cite:

```bibtex
@misc{traceprop2025,
  author = {Amit N.},
  title  = {Traceprop: End-to-End Data Provenance for Machine Learning Pipelines},
  year   = {2025},
  url    = {https://github.com/AmitoVrito/Traceprop},
}
```
## License

Apache 2.0 — see LICENSE.