
Traceprop

End-to-end data provenance for machine learning pipelines.

Traceprop is a Python library that traces raw source files through preprocessing and model training to individual predictions, and lets you act on that lineage via attribution, unlearning, and compliance reporting.

pip install traceprop



What it does

A single Traceprop query answers:

"This model made prediction X on input Z. Which rows in which source files, through which preprocessing steps, most influenced that prediction — and can we reduce that influence without retraining?"

Capability              What you get
Lineage tracking        Sub-1% overhead in op-mode; tracks every NumPy, PyTorch, and JAX operation
Attribution             LDS 0.622 ± 0.180 on tabular data at 0.22 s CPU — matches TRAK quality, no GPU needed
Approximate unlearning  Provenance-guided gradient correction; closes >100% of the retrain-from-scratch gap
Compliance reporting    Structured JSON audit trail for EU AI Act Article 26 obligations
Data valuation          KNN-Shapley values aggregated by source file and preprocessing op

Installation

# Core (NumPy only)
pip install traceprop

# With PyTorch support
pip install "traceprop[torch]"

# With JAX support
pip install "traceprop[jax]"

# With PostgreSQL provenance store
pip install "traceprop[postgres]"

# Everything
pip install "traceprop[all]"

Requires Python 3.10+.


Quick start

import traceprop as tp
import numpy as np

# 1. Load source data with provenance tracking
data_a = tp.from_csv("hospital_a.csv", source_id="hospital_a")
data_b = tp.from_csv("hospital_b.csv", source_id="hospital_b")

# 2. Preprocessing — every op is recorded in the lineage graph
norm_a = (data_a - data_a.mean(axis=0)) / (data_a.std(axis=0) + 1e-8)
norm_b = (data_b - data_b.mean(axis=0)) / (data_b.std(axis=0) + 1e-8)

# 3. Train with gradient recording (model, X_train, y_train are yours)
with tp.training_context(model, X_train, y_train, source_id="hospital_a") as ctx:
    train(model, X_train, y_train)   # your training loop here

# 4. Attribute a prediction back to source rows
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=10)

for entry in result.top(5):
    print(entry["source_id"], entry["sample_index"], entry["influence_score"])

# 5. Trace the top sample back to its source file and preprocessing ops
trace = result.trace_to_file(rank=0)
print(trace["sources"], trace["ops"])

# 6. Unlearn a data source without retraining
unlearn_result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",
    n_steps=300,
    lr=1e-2,
)
print(f"Verified: {unlearn_result.verified}")

# 7. Generate EU AI Act compliance report
report = tp.compliance_report(
    tensor=norm_a,
    system_name="CreditScorer-v1",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="compliance_report.json",
)

Core API

Provenance tracking

Function                             Description
tp.from_numpy(arr, source_id=...)    Wrap a NumPy array with lineage tracking
tp.from_csv(path, source_id=...)     Load a CSV file with lineage tracking
tp.from_torch(data, source_id=...)   Wrap a PyTorch tensor
tp.from_jax(data, source_id=...)     Wrap a JAX array
tp.array(data, source_id=...)        Like np.array, but tracked
tp.provenance(tensor)                Get a ProvenanceView to query lineage
tp.reset_graph()                     Start a fresh lineage graph

ProvenanceView

view = tp.provenance(tensor)
view.ancestors()      # set of ancestor node IDs
view.ops()            # list of preprocessing operations
view.sources()        # list of source_ids in lineage
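The lineage queries above can be sketched with a plain adjacency-map DAG. This is an illustration of the idea only (the class and field names here are hypothetical, not Traceprop's internals):

```python
# Minimal lineage DAG: each node records its parents, the op that produced
# it, and (for roots) a source_id. Queries walk the parent edges.
class LineageGraph:
    def __init__(self):
        self.parents = {}    # node_id -> list of parent node_ids
        self.op = {}         # node_id -> op name ("source" for roots)
        self.source_id = {}  # root node_id -> source_id

    def add_source(self, node_id, source_id):
        self.parents[node_id] = []
        self.op[node_id] = "source"
        self.source_id[node_id] = source_id

    def add_op(self, node_id, op, parents):
        self.parents[node_id] = list(parents)
        self.op[node_id] = op

    def ancestors(self, node_id):
        seen, stack = set(), list(self.parents[node_id])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(self.parents[n])
        return seen

    def ops(self, node_id):
        return [self.op[n] for n in self.ancestors(node_id) | {node_id}
                if self.op[n] != "source"]

    def sources(self, node_id):
        return sorted(self.source_id[n] for n in self.ancestors(node_id)
                      if self.op[n] == "source")

g = LineageGraph()
g.add_source("a", "hospital_a")
g.add_op("mean_a", "mean", ["a"])
g.add_op("norm_a", "subtract", ["a", "mean_a"])
print(g.sources("norm_a"))   # -> ['hospital_a']
```

A normalization like `(data_a - data_a.mean(axis=0))` thus yields a derived node whose lineage still resolves to `hospital_a`.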

Attribution

# Record gradients during training
with tp.training_context(model, X_train, y_train, source_id="data", proj_dim=4096) as ctx:
    ...  # training loop

# Attribute a test prediction
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=50)

result.top(10)            # list of dicts: sample_index, source_id, influence_score
result.trace_to_file(0)   # trace rank-0 sample to source file + ops
result.by_source()        # aggregate influence by source_id

GradientStore uses a sparse Johnson-Lindenstrauss projection (Achlioptas 2003) with entries drawn from {-1, 0, +1}. The default proj_dim=4096 works well for tabular models; use lower values in memory-constrained environments.
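The Achlioptas construction itself is simple enough to sketch in a few lines of NumPy (this is the general technique, not Traceprop's exact implementation):

```python
import numpy as np

# Sparse JL projection (Achlioptas 2003): entries are +1 or -1 with
# probability 1/6 each and 0 with probability 2/3, scaled so each entry has
# variance 1/proj_dim. Norms (and hence inner products) are then preserved
# in expectation, so projected gradients can stand in for full gradients.
def sparse_jl_matrix(grad_dim, proj_dim, seed=0):
    rng = np.random.default_rng(seed)
    coins = rng.choice([-1.0, 0.0, 1.0], size=(grad_dim, proj_dim),
                       p=[1 / 6, 2 / 3, 1 / 6])
    return coins * np.sqrt(3.0 / proj_dim)

grad_dim, proj_dim = 2_000, 512
P = sparse_jl_matrix(grad_dim, proj_dim)
g = np.random.default_rng(1).standard_normal(grad_dim)
ratio = np.linalg.norm(g @ P) ** 2 / np.linalg.norm(g) ** 2
# ratio is close to 1: the squared norm survives the projection
```

Two-thirds of the projection entries are exactly zero, which is what keeps the per-step recording overhead low.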

Unlearning

result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",   # data source to forget
    n_steps=300,
    lr=1e-2,
    verification_threshold=0.05,
)
result.verified             # bool
result.influence_before     # float
result.influence_after      # float
result.compliance_report    # dict
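Traceprop's provenance-guided gradient correction is store-driven; as a generic illustration of the before/after check that `verified` performs, here is the simplest form of approximate unlearning — continued training on the retained source only — on a toy logistic regression (all names and data here are illustrative):

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    # Mean cross-entropy loss and its gradient for logistic regression.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
# Source A, and a source B whose labels conflict with A's decision rule.
X_a = rng.standard_normal((200, 5)); y_a = (X_a[:, 0] > 0).astype(float)
X_b = rng.standard_normal((50, 5));  y_b = (X_b[:, 0] <= 0).astype(float)
X, y = np.vstack([X_a, X_b]), np.concatenate([y_a, y_b])

w = np.zeros(5)
for _ in range(300):                       # train on both sources
    w -= 0.5 * logistic_loss_grad(w, X, y)[1]
influence_before = logistic_loss_grad(w, X_b, y_b)[0]

for _ in range(300):                       # "unlearn" B: descend on A only
    w -= 0.5 * logistic_loss_grad(w, X_a, y_a)[1]
influence_after = logistic_loss_grad(w, X_b, y_b)[0]
# influence_after > influence_before: the fit to the forgotten source degrades
```

Verification then amounts to comparing the forget-set influence before and after the correction against a threshold, as `verification_threshold` does above.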

Data valuation

val_result = tp.data_valuation(
    gradient_store=ctx.gradient_store,
    val_gradients=val_grads,   # (n_val, grad_dim) array
    k=10,
)
val_result.by_source()    # Shapley values aggregated by source
val_result.by_op()        # Shapley values aggregated by preprocessing op
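The per-sample values that `by_source()` and `by_op()` aggregate come from exact KNN-Shapley. For a single validation point, the closed-form recursion (Jia et al., 2019) can be sketched directly — this is the published algorithm, not necessarily Traceprop's exact code path:

```python
import numpy as np

# Exact KNN-Shapley for one validation point: sort the training set by
# distance to x_val, then fill values by the closed-form recursion from the
# farthest point inward.
def knn_shapley(X_train, y_train, x_val, y_val, k):
    n = len(X_train)
    order = np.argsort(np.linalg.norm(X_train - x_val, axis=1))
    match = (y_train[order] == y_val).astype(float)
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n
    for i in range(n - 2, -1, -1):        # 0-indexed; distance rank = i + 1
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
    values = np.empty(n)
    values[order] = s                     # undo the distance sort
    return values

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
vals = knn_shapley(X, y, x_val=np.array([0.1]), y_val=0, k=2)
# vals -> [0.5, 0.5, 0.0, 0.0]: the two nearest, correctly-labeled
# points split the credit; the values sum to the KNN utility (1.0 here)
```

Summing these values over the rows of each source file (or over the samples touched by each preprocessing op) gives the aggregated valuations.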

Compliance

report = tp.compliance_report(
    tensor=output_tensor,
    system_name="MyModel",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="report.json",   # optional: write to file
)

Produces a structured JSON report covering EU AI Act Article 26 audit trail requirements for high-risk AI systems (enforcement backstop: 2 December 2027).
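The shape of such a record can be sketched with the stdlib `json` module. The field names below are hypothetical, chosen to mirror the call above — they are not Traceprop's actual report schema:

```python
import json
from datetime import datetime, timezone

# Illustrative audit-trail record: system identity, the lineage's data
# sources, and the preprocessing ops applied to them.
record = {
    "system_name": "CreditScorer-v1",
    "system_version": "1.0.0",
    "deployer_name": "Amit N.",
    "high_risk_category": "credit_scoring",
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "data_sources": [{"source_id": "hospital_a", "path": "hospital_a.csv"}],
    "preprocessing_ops": ["mean", "std", "subtract", "divide"],
}
serialized = json.dumps(record, indent=2)
```

Because every entry is derived from the lineage graph rather than hand-maintained, the report stays in sync with what the pipeline actually did.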

Granularity modes

tp.set_granularity(tp.Granularity.OP)      # default: track every op
tp.set_granularity(tp.Granularity.BATCH)   # batch-level only (lower overhead)
tp.set_granularity(tp.Granularity.EPOCH)   # epoch-level only

Benchmarks

Attribution quality (LDS — Linear Datamodeling Score)

Higher is better. Measured on 500 held-out retraining subsets.

Method          Dataset                 LDS              Time    Hardware
Traceprop-LL    Adult Income (tabular)  0.622 ± 0.180    0.22 s  CPU
TRAK (5 ckpts)  CIFAR-2 / ResNet-9      0.0290 ± 0.0523  691 s   GPU (T4)
Traceprop-LL    CIFAR-2 / ResNet-9      0.0168 ± 0.0684  2.6 s   CPU
Traceprop-BM    CIFAR-2 / ResNet-9      0.0033 ± 0.0334  14.2 s  CPU
Random          CIFAR-2 / ResNet-9      0.0205 ± 0.0357

Recommendation: use Traceprop-LL for tabular and linear models (it is exact for logistic regression). For deep vision models with BatchNorm, TRAK is preferred for quality; Traceprop-LL is 266× faster but scores near random on CIFAR-2 due to BatchNorm corrupting per-sample last-layer features.
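For reference, LDS is (per test example) the Spearman rank correlation between attribution-predicted outputs and the outputs of models actually retrained on each held-out subset. A minimal, tie-free NumPy version of that computation (an illustration of the metric, not the benchmark harness used above):

```python
import numpy as np

def spearman(a, b):
    # Spearman correlation via rank transform (assumes no ties).
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def predicted_outputs(scores, subset_masks):
    # scores: (n_train,) attribution scores for one test example
    # subset_masks: (n_subsets, n_train) 0/1 membership per retraining subset
    # Predicted output of each subset = sum of its members' scores.
    return subset_masks @ scores

scores = np.array([2.0, -1.0, 0.5])
masks = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
pred = predicted_outputs(scores, masks)    # [1.0, 2.5, -0.5]
actual = np.array([0.9, 2.2, -0.3])        # hypothetical retrained outputs
# spearman(pred, actual) is 1.0 here: the ranking matches perfectly
```

The reported scores average this correlation over the test set, across the 500 retraining subsets.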

Lineage overhead

Platform          Overhead  Mode
macOS (M-series)  1.007×    op-mode
Linux (x86-64)    0.979×    op-mode

Sub-1% overhead at 10⁶+ array elements.

Unlearning

Forget-set loss after gradient correction: 0.425, versus 0.401 for the gold standard (retrain from scratch) and 0.379 for the original model. Gap closed: >100% (the correction overshoots the retrain-from-scratch target). Test-accuracy cost: 0.5 pp (0.915 vs. 0.920).


Backends

Backend  Install                         Usage
NumPy    built-in                        tp.from_numpy(arr)
PyTorch  pip install "traceprop[torch]"  tp.from_torch(tensor)
JAX      pip install "traceprop[jax]"    tp.from_jax(array)

Provenance stores

By default Traceprop uses an in-memory store. For persistence:

# SQLite
from traceprop.stores.sqlite_store import SQLiteStore
store = SQLiteStore("lineage.db")

# PostgreSQL
from traceprop.stores.postgres_store import PostgresStore
store = PostgresStore("postgresql://user:pass@localhost/mydb")
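A persistent lineage store reduces to a nodes table, an edges table, and an ancestor query. A sketch on the stdlib `sqlite3` module — the schema and queries here are illustrative, not SQLiteStore's actual internals:

```python
import sqlite3

# Nodes carry the producing op; edges record child -> parent lineage.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nodes (id TEXT PRIMARY KEY, op TEXT);
    CREATE TABLE edges (child TEXT, parent TEXT);
""")
con.executemany("INSERT INTO nodes VALUES (?, ?)",
                [("a", "source"), ("mean_a", "mean"), ("norm_a", "subtract")])
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("mean_a", "a"), ("norm_a", "a"), ("norm_a", "mean_a")])

# The ancestor query behind a view.ancestors()-style call, as a recursive CTE
ancestors = {row[0] for row in con.execute("""
    WITH RECURSIVE anc(id) AS (
        SELECT parent FROM edges WHERE child = 'norm_a'
        UNION
        SELECT e.parent FROM edges e JOIN anc ON e.child = anc.id
    )
    SELECT id FROM anc
""")}
# ancestors == {"a", "mean_a"}
```

The same schema maps directly onto PostgreSQL, which also supports `WITH RECURSIVE`.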


Project structure

traceprop/
  __init__.py            # public API
  tensor.py              # ProvenanceTensor (NumPy wrapper)
  graph.py               # lineage DAG
  query.py               # ProvenanceView
  interceptor.py         # op-level interception
  granularity.py         # Granularity modes
  compression.py         # ProvRC range compression
  exporters.py           # Parquet / OpenTelemetry exporters
  exceptions.py
  attribution/
    training_context.py  # TrainingContext, GradientStore
    gradient_store.py    # sparse JL projection
    influence.py         # compute_influence_scores
    attribution_engine.py
    streaming_context.py # online / continual learning
  backends/
    numpy_backend.py
    torch_backend.py
    jax_backend.py
  stores/
    memory_store.py
    sqlite_store.py
    postgres_store.py
  compliance/
    eu_ai_act.py         # EU AI Act Article 26 report generator
  unlearning/
    gradient_correction.py
  valuation/
    knn_shapley.py
  _c_ext/
    graph_ops.pyx        # optional Cython acceleration

Contributing

Issues and pull requests are welcome. Please open an issue before submitting a large PR.

git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop
pip install -e ".[dev]"
pytest

Citation

If you use Traceprop in research, please cite:

@misc{traceprop2025,
  author  = {Amit N.},
  title   = {Traceprop: End-to-End Data Provenance for Machine Learning Pipelines},
  year    = {2025},
  url     = {https://github.com/AmitoVrito/Traceprop},
}

License

Apache 2.0 — see LICENSE.
