Computation-level data lineage, gradient attribution, and provenance-guided unlearning in production ML

Traceprop

Traceprop is a Python library that traces data from raw source files through preprocessing and model training to individual predictions, and lets you act on that lineage via attribution, unlearning, and compliance reporting.

pip install traceprop

🤗 Live Demo

Try Traceprop interactively — no install needed:

huggingface.co/spaces/Nautiverse/traceprop-demo

The demo covers all three core capabilities on the Wisconsin Breast Cancer dataset (CPU-only):

Tab	What it shows
🎯 Attribution	Pick any test sample and see the top-K training points that drove the prediction, with influence scores computed in milliseconds
🗂️ Provenance	Adjust a multi-source preprocessing pipeline and watch the lineage graph update live
🧹 Unlearning	Choose a training sample to forget — see loss increase on that sample while test accuracy is preserved

Run the demo locally

git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop/hf_space
pip install -r requirements.txt
python app.py
# → opens at http://127.0.0.1:7860

What it does

A single Traceprop query answers:

"This model made prediction X on input Z. Which rows in which source files, through which preprocessing steps, most influenced that prediction - and can we reduce that influence without retraining?"

Capability	What you get
Lineage tracking	Sub-1% overhead in op-mode; tracks every NumPy, PyTorch, and JAX operation
Attribution	LDS 0.976 on Covertype 50K (5.2 s) and 0.884 on Adult Income (0.6 s), on CPU with no GPU needed
Source-stratified attribution (SS)	Aggregates per-sample scores to source-file level; 100% correct-source P@1 on a realistic 3-table ETL schema at 0.07 ms/query (0.89 ms/query on the 3-source synthetic benchmark)
Approximate unlearning	Provenance-guided gradient correction; closes >100% of the retrain-from-scratch gap on real data
Compliance reporting	Structured JSON audit trail for EU AI Act Article 26 obligations
Data valuation	KNN-Shapley values aggregated by source file and preprocessing op

Installation

# Core (NumPy only)
pip install traceprop

# With PyTorch support
pip install "traceprop[torch]"

# With JAX support
pip install "traceprop[jax]"

# With PostgreSQL provenance store
pip install "traceprop[postgres]"

# Everything
pip install "traceprop[all]"

Requires Python 3.10+.

Quick start

import traceprop as tp
import numpy as np

# 1. Load source data with provenance tracking
data_a = tp.from_csv("hospital_a.csv", source_id="hospital_a")
data_b = tp.from_csv("hospital_b.csv", source_id="hospital_b")

# 2. Preprocessing — every op is recorded in the lineage graph
norm_a = (data_a - data_a.mean(axis=0)) / (data_a.std(axis=0) + 1e-8)
norm_b = (data_b - data_b.mean(axis=0)) / (data_b.std(axis=0) + 1e-8)

# 3. Train with gradient recording
#    (model, X_train, y_train, and train() stand in for your own model, data, and loop)
with tp.training_context(source_id="hospital_a") as ctx:
    train(model, X_train, y_train)

# 4. Attribute a prediction back to source rows
#    (test_gradient stands in for the gradient vector of the prediction to explain)
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=10)

for entry in result.top(5):
    print(entry["source_id"], entry["sample_index"], entry["influence_score"])

# 5. Trace the top sample back to its source file and preprocessing ops
trace = result.trace_to_file(rank=0)
print(trace["sources"], trace["ops"])

# 6. Unlearn a data source without retraining
unlearn_result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",
    n_steps=300,
    lr=1e-2,
)
print(f"Verified: {unlearn_result.verified}")

# 7. Generate EU AI Act compliance report
report = tp.compliance_report(
    tensor=norm_a,
    system_name="CreditScorer-v1",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="compliance_report.json",
)

Core API

Provenance tracking

Function	Description
`tp.from_numpy(arr, source_id=...)`	Wrap a NumPy array with lineage tracking
`tp.from_csv(path, source_id=...)`	Load CSV with lineage tracking
`tp.from_torch(data, source_id=...)`	Wrap a PyTorch tensor
`tp.from_jax(data, source_id=...)`	Wrap a JAX array
`tp.array(data, source_id=...)`	Like `np.array` but tracked
`tp.provenance(tensor)`	Get a `ProvenanceView` to query lineage
`tp.reset_graph()`	Start a fresh lineage graph

ProvenanceView

view = tp.provenance(tensor)
view.ancestors()      # set of ancestor node IDs
view.ops()            # list of preprocessing operations
view.sources()        # list of source_ids in lineage
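
To make the idea of op-level lineage concrete, here is a toy sketch of how a wrapper can carry source IDs and an operation log through arithmetic. The `Tracked` class, its fields, and the `_raw` helper are hypothetical and written for illustration only; they are not Traceprop's `ProvenanceTensor` or its lineage graph, which may work quite differently.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Tracked:
    """Hypothetical toy wrapper: an array plus the sources and ops that produced it."""
    data: np.ndarray
    sources: frozenset
    ops: tuple = ()

    def _derive(self, data, op, other=None):
        # Merge lineage from both operands, then append the new op.
        if isinstance(other, Tracked):
            return Tracked(data, self.sources | other.sources, self.ops + other.ops + (op,))
        return Tracked(data, self.sources, self.ops + (op,))

    def __add__(self, other):
        return self._derive(self.data + _raw(other), "add", other)

    def __sub__(self, other):
        return self._derive(self.data - _raw(other), "sub", other)

    def __truediv__(self, other):
        return self._derive(self.data / _raw(other), "div", other)

    def mean(self, axis=None):
        return self._derive(self.data.mean(axis=axis), "mean")

    def std(self, axis=None):
        return self._derive(self.data.std(axis=axis), "std")


def _raw(x):
    # Unwrap a Tracked operand; pass plain numbers and arrays through.
    return x.data if isinstance(x, Tracked) else x


a = Tracked(np.array([[1.0, 2.0], [3.0, 4.0]]), frozenset({"hospital_a"}))
b = Tracked(np.array([[5.0, 6.0], [7.0, 8.0]]), frozenset({"hospital_b"}))

# Same normalization as in the quick start, now with lineage attached.
norm_a = (a - a.mean(axis=0)) / (a.std(axis=0) + 1e-8)
combined = norm_a + b

print(norm_a.ops)
print(sorted(combined.sources))
```

A flat tuple of op names is the simplest thing that works here; a real lineage store would more plausibly keep a DAG so that shared intermediate results are not duplicated.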

Attribution

# Record gradients during training
with tp.training_context(model, X_train, y_train, source_id="data", proj_dim=4096) as ctx:
    ...  # training loop

# Attribute a test prediction
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=50)

result.top(10)            # list of dicts: sample_index, source_id, influence_score
result.trace_to_file(0)   # trace rank-0 sample to source file + ops
result.by_source()        # aggregate influence by source_id

GradientStore uses a sparse Johnson-Lindenstrauss projection (Achlioptas 2003) whose entries are drawn from {-1, 0, +1}. The default proj_dim=4096 works well for tabular models; use lower values in memory-constrained environments.
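
As a rough illustration of what such a projection does, the sketch below builds an Achlioptas-style random matrix in plain NumPy and checks that vector norms survive the projection approximately. The function name `sparse_jl_matrix`, the dimensions, and the random stand-in gradients are assumptions made for this example; how GradientStore constructs and applies its own matrix is not shown in this README.

```python
import numpy as np


def sparse_jl_matrix(grad_dim, proj_dim, seed=0):
    """Achlioptas-style projection: entries sqrt(3 / proj_dim) * {-1, 0, +1},
    drawn with probabilities 1/6, 2/3, 1/6."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 0.0, 1.0], size=(grad_dim, proj_dim), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / proj_dim) * signs


rng = np.random.default_rng(1)
grads = rng.standard_normal((8, 2000))   # stand-in for per-sample gradients
R = sparse_jl_matrix(grad_dim=2000, proj_dim=512)
projected = grads @ R                    # shape (8, 512)

# Ratios close to 1 mean the projection roughly preserves gradient norms.
norm_ratio = np.linalg.norm(projected, axis=1) / np.linalg.norm(grads, axis=1)
print(projected.shape, norm_ratio.round(2))
```

Because two thirds of the entries are zero, the projection is cheaper to apply than a dense Gaussian one, and a smaller `proj_dim` trades accuracy of the preserved geometry for memory.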

Unlearning

result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",   # data source to forget
    n_steps=300,
    lr=1e-2,
    verification_threshold=0.05,
)
result.verified             # bool
result.influence_before     # float
result.influence_after      # float
result.compliance_report    # dict
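
For intuition about what "forgetting a source" means numerically, the following sketch applies plain gradient ascent on the forget-set loss of a logistic regression, a common approximate-unlearning baseline. It is not Traceprop's provenance-guided gradient correction, whose details are not given here; the synthetic data, the toy `source` labels, and the step sizes are all assumptions chosen for the example.

```python
import numpy as np


def log_loss(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def log_loss_grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)


rng = np.random.default_rng(0)
X = rng.standard_normal((400, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
y = (X @ true_w + 0.3 * rng.standard_normal(400) > 0).astype(float)
source = np.where(np.arange(400) < 100, "hospital_a", "hospital_b")  # toy source labels

# Train on all data with plain gradient descent.
w = np.zeros(5)
for _ in range(500):
    w -= 0.5 * log_loss_grad(w, X, y)

# "Forget" hospital_a: a few small ascent steps on that source's loss only.
forget = source == "hospital_a"
w_unlearned = w.copy()
for _ in range(20):
    w_unlearned += 0.05 * log_loss_grad(w_unlearned, X[forget], y[forget])

loss_before = log_loss(w, X[forget], y[forget])
loss_after = log_loss(w_unlearned, X[forget], y[forget])
print(f"forget-set loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Selecting the forget set by source label mirrors the `source_id` argument above. A rising forget-set loss is the same signal the benchmark tables report, though a real verification step would also check that accuracy on retained data holds up.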

Data valuation

val_result = tp.data_valuation(
    gradient_store=ctx.gradient_store,
    val_gradients=val_grads,   # (n_val, grad_dim) array
    k=10,
)
val_result.by_source()    # Shapley values aggregated by source
val_result.by_op()        # Shapley values aggregated by preprocessing op
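
The sketch below shows the general KNN-Shapley recursion (Jia et al., 2019) for a single validation point, followed by a per-source sum. It operates on generic feature vectors with Euclidean distance; whether Traceprop applies the same recursion to projected gradients, and how it averages over validation points, is an assumption not confirmed by this README. The function name `knn_shapley` and the toy data are illustrative.

```python
import numpy as np


def knn_shapley(train_x, train_y, test_x, test_y, k):
    """Exact Shapley value of each training point for one test point,
    under the unweighted KNN utility (recursion from Jia et al., 2019)."""
    n = len(train_y)
    order = np.argsort(np.linalg.norm(train_x - test_x, axis=1))  # nearest first
    match = (train_y[order] == test_y).astype(float)

    sorted_values = np.zeros(n)
    sorted_values[-1] = match[-1] / n          # farthest point
    for i in range(n - 2, -1, -1):             # walk back toward the nearest point
        rank = i + 1                           # 1-based rank by distance
        sorted_values[i] = sorted_values[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

    values = np.zeros(n)
    values[order] = sorted_values              # undo the distance sort
    return values


train_x = np.array([[0.1], [0.2], [0.3]])
train_y = np.array([1, 0, 1])
source_ids = np.array(["hospital_a", "hospital_a", "hospital_b"])

values = knn_shapley(train_x, train_y, test_x=np.array([0.0]), test_y=1, k=2)
by_source = {s: float(values[source_ids == s].sum()) for s in np.unique(source_ids)}
print(values, by_source)
```

The values sum to the KNN utility of the full training set (here 0.5, since one of the two nearest neighbours has the right label), which is a quick sanity check on any implementation.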

Compliance

report = tp.compliance_report(
    tensor=output_tensor,
    system_name="MyModel",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="report.json",   # optional: write to file
)

Produces a structured JSON report covering EU AI Act Article 26 audit trail requirements for high-risk AI systems (enforcement backstop: 2 December 2027).

Granularity modes

tp.set_granularity(tp.Granularity.OP)      # default: track every op
tp.set_granularity(tp.Granularity.BATCH)   # batch-level only (lower overhead)
tp.set_granularity(tp.Granularity.EPOCH)   # epoch-level only

Benchmarks

Attribution quality (LDS — Linear Datamodeling Score)

Higher is better. Measured on 500 held-out retraining subsets.
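
For readers unfamiliar with the metric, the sketch below computes LDS in the usual way: for one test example, sum the attribution scores of the training points in each retraining subset, then take the Spearman rank correlation between those sums and the outputs of models actually retrained on the subsets. The synthetic "retrained outputs" and the helper names `spearman` and `lds` are assumptions for illustration; the exact protocol behind the numbers in the tables that follow is not spelled out here.

```python
import numpy as np


def spearman(a, b):
    # Rank correlation via double argsort; assumes no ties (continuous values).
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def lds(scores, subset_masks, retrained_outputs):
    # Additive prediction for each subset: sum of scores of its members.
    predicted = subset_masks.astype(float) @ scores
    return spearman(predicted, retrained_outputs)


rng = np.random.default_rng(0)
n_train, n_subsets = 200, 100
true_effect = rng.standard_normal(n_train)            # synthetic per-sample effect
subset_masks = rng.random((n_subsets, n_train)) < 0.5
# Stand-in for outputs of models retrained on each subset.
retrained_outputs = subset_masks @ true_effect + 0.01 * rng.standard_normal(n_subsets)

lds_good = lds(true_effect, subset_masks, retrained_outputs)
lds_random = lds(rng.standard_normal(n_train), subset_masks, retrained_outputs)
print(f"informative scores: {lds_good:.3f}, random scores: {lds_random:.3f}")
```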

Tabular / linear models

Method	Dataset	LDS	Std	Time	Hardware
Traceprop-LL	Adult Income (n=6K, d=105)	0.622	±0.180	0.22 s	CPU
Traceprop-LL + TRAK est.	Adult Income (n=6K, d=105)	0.884	±0.096	0.6 s	CPU
Traceprop-LL	Covertype (n=50K, d=54)	0.7513	±0.1292	3.4 s	CPU
Traceprop-LL + TRAK est.	Covertype (n=50K, d=54)	0.9763	±0.1052	5.2 s	CPU
Traceprop-BM	Adult Income	0.0127	±0.0436	0.16 s	CPU
Random	—	~0.000	—	—	—

Deep vision — end-to-end (BatchNorm)

Method	Dataset	LDS	Std	Time	Hardware
TRAK (5 ckpts)	CIFAR-2 / ResNet-9	0.0290	±0.0523	691 s	GPU (T4)
Traceprop-LL	CIFAR-2 / ResNet-9	0.0168	±0.0684	2.6 s	CPU
Traceprop-BM	CIFAR-2 / ResNet-9	0.0033	±0.0334	14.2 s	CPU
Random	CIFAR-2 / ResNet-9	0.0205	±0.0357	—	—

Deep vision — frozen backbone + linear probe (no BatchNorm)

Method	Dataset	LDS	Std	Time	Hardware
Traceprop-LL (dot)	CIFAR-2 / frozen ResNet-18	0.2642	±0.1037	10.2 s	CPU
Traceprop-LL + TRAK est.	CIFAR-2 / frozen ResNet-18	0.2307	±0.0459	1.4 s	CPU
Random	—	0.0018	—	—	—

PyTorch MLP

Method	Dataset	LDS	Std	Time	Hardware
Traceprop-LL + TRAK est.	MNIST 4 vs 9 (784→256→1, n=6K)	0.1930	±0.0581	0.82 s	CPU
Random	—	0.0005	—	—	—

Recommendation: Traceprop-LL is exact for linear models and frozen-backbone architectures (no BatchNorm). Use it for tabular data — it matches or beats TRAK at CPU speeds. For end-to-end deep vision with BatchNorm, TRAK is preferred; Traceprop-LL is 266× faster but scores near random due to BatchNorm corrupting per-sample gradients. The fix is a frozen backbone: LDS improves 15.7× (0.0168 → 0.2642).

Lineage overhead

Platform	Overhead	Mode
macOS (M-series)	1.007×	op-mode
Linux (x86-64)	0.979×	op-mode

Sub-1% overhead at 10⁶+ array elements.

Unlearning

Dataset	Method	Forget-set Loss	Gap Closed	Test Acc.
Synthetic (n=1K)	Original	0.379	—	0.920
Synthetic (n=1K)	Gold (retrain)	0.401	100%	—
Synthetic (n=1K)	Traceprop	0.425	>100%	0.915
Synthetic (n=1K)	Random	0.382	17%	—
Adult Income (n=6K)	Original	3.225	—	0.840
Adult Income (n=6K)	Gold (retrain)	3.858	100%	—
Adult Income (n=6K)	Traceprop	4.284	>100% (167%)	0.842
Adult Income (n=6K)	Random	3.233	1.2%	—
Covertype (n=50K)	Original	2.163	—	0.760
Covertype (n=50K)	Gold (retrain)	2.402	100%	—
Covertype (n=50K)	Traceprop	2.698	>100% (224%)	0.749
Covertype (n=50K)	Random	2.162	−0.4%	—

Provenance-guided gradient correction closes more than 100% of the retrain-from-scratch gap on all three datasets, meaning the forget-set loss ends up higher than that of the retrained model. Test accuracy is preserved on Adult Income (0.842 vs. 0.840 original) and drops by 1.1 pp on Covertype (0.749 vs. 0.760 original). The random-sample baseline closes between −0.4% and 17% of the gap.
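
The "Gap Closed" column appears to be the rise in forget-set loss relative to the rise achieved by retraining from scratch. That reading is an inference from the table rather than a stated definition, but it reproduces the Adult Income and Covertype figures, as this small check shows (the function name `gap_closed` is illustrative):

```python
def gap_closed(original_loss, retrain_loss, method_loss):
    """Fraction of the original-to-retrain loss gap covered by the method."""
    return (method_loss - original_loss) / (retrain_loss - original_loss)


# (original, gold retrain, Traceprop) forget-set losses from the table above.
rows = {
    "Adult Income": (3.225, 3.858, 4.284),
    "Covertype": (2.163, 2.402, 2.698),
}
percentages = {name: round(100 * gap_closed(*losses)) for name, losses in rows.items()}
print(percentages)
```

Values above 100% indicate overshoot past the retrained model, not a closer match to it.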

Source-stratified attribution (SS)

Traceprop-SS answers "which source file drove this prediction?" by aggregating per-sample TRAK scores to source-file level via the lineage graph. No prior attribution system (TRAK, LogIX, dattri) exposes source-file-level influence.
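
The aggregation step itself can be sketched in a few lines of NumPy: given per-sample influence scores and the source each sample came from, sum the scores per source and pick the largest. The function name `aggregate_by_source` and the toy scores are hypothetical; in Traceprop the sample-to-source mapping would come from the lineage graph rather than a hand-written array.

```python
import numpy as np


def aggregate_by_source(scores, source_ids):
    """Sum per-sample influence scores within each source."""
    labels, inverse = np.unique(source_ids, return_inverse=True)
    totals = np.bincount(inverse, weights=scores, minlength=len(labels))
    return dict(zip(labels.tolist(), totals.tolist()))


scores = np.array([0.5, -0.1, 0.25, 0.25, 0.1])
source_ids = np.array(["bureau", "application", "bureau", "prev_app", "application"])

by_source = aggregate_by_source(scores, source_ids)
top_source = max(by_source, key=by_source.get)
print(by_source, top_source)
```

Summing is only one possible aggregate; a mean or a sum of absolute scores would rank sources differently when sources differ greatly in size.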

Controlled synthetic validation (exp17b — known ground truth, injected signal)

Schema	Bureau P@1	Latency	Speedup vs loop
3-source (bureau / application / prev_app, n=19,850)	0.970	0.89 ms/query	92×

Realistic ETL validation (exp21 — Home Credit schema, domain-motivated labels, no injected signal)

Schema	Correct-source P@1	Baseline	Latency
3-table (bureau / prev_app / application, n=9,800)	1.000	0.333	0.07 ms/query

Disjoint feature columns per source table (bureau: count/overdue_rate/log_amount; prev_app: count/approval_rate/log_amount; application: employment/region) produce orthogonal gradient subspaces. Traceprop-SS exploits this structure without any knowledge of which columns belong to which table.

Backends

Backend	Install	Usage
NumPy	built-in	`tp.from_numpy(arr)`
PyTorch	`pip install "traceprop[torch]"`	`tp.from_torch(tensor)`
JAX	`pip install "traceprop[jax]"`	`tp.from_jax(array)`

Provenance stores

By default Traceprop uses an in-memory store. For persistence:

# SQLite
from traceprop.stores.sqlite_store import SQLiteStore
store = SQLiteStore("lineage.db")

# PostgreSQL
from traceprop.stores.postgres_store import PostgresStore
store = PostgresStore("postgresql://user:pass@localhost/mydb")

Examples

examples/full_pipeline_demo.py — full end-to-end demo: two hospital CSVs → preprocessing → training → attribution → unlearning → compliance report
notebooks/tabular_logistic_lds_colab.ipynb — LDS benchmark on Adult Income (Colab, CPU)
notebooks/cifar2_resnet9_lds_colab.ipynb — LDS benchmark on CIFAR-2/ResNet-9 (Colab, GPU T4)
notebooks/homecredit_multisource_provenance_colab.ipynb — multi-source provenance case study (3-table credit risk data)

Project structure

traceprop/
  __init__.py            # public API
  tensor.py              # ProvenanceTensor (NumPy wrapper)
  graph.py               # lineage DAG
  query.py               # ProvenanceView
  interceptor.py         # op-level interception
  granularity.py         # Granularity modes
  compression.py         # ProvRC range compression
  exporters.py           # Parquet / OpenTelemetry exporters
  exceptions.py
  attribution/
    training_context.py  # TrainingContext, GradientStore
    gradient_store.py    # sparse JL projection
    influence.py         # compute_influence_scores
    attribution_engine.py
    streaming_context.py # online / continual learning
  backends/
    numpy_backend.py
    torch_backend.py
    jax_backend.py
  stores/
    memory_store.py
    sqlite_store.py
    postgres_store.py
  compliance/
    eu_ai_act.py         # EU AI Act Article 26 report generator
  unlearning/
    gradient_correction.py
  valuation/
    knn_shapley.py
  _c_ext/
    graph_ops.pyx        # optional Cython acceleration

Contributing

Issues and pull requests are welcome. Please open an issue before submitting a large PR.

git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop
pip install -e ".[dev]"
pytest

Citation

If you use Traceprop in research, please cite:

@misc{traceprop2027,
  author    = {Amit Nautiyal},
  title     = {{Traceprop}: Computation-Level Data Lineage, Gradient Attribution,
               and Provenance-Guided Unlearning in Production {ML}},
  year      = {2027},
  doi       = {10.5281/zenodo.20036000},
  url       = {https://zenodo.org/records/20036000},
  note      = {Software: \url{https://pypi.org/project/traceprop/}}
}

A Zenodo preprint is available at https://zenodo.org/records/20036000 (DOI: 10.5281/zenodo.20036000).

License

Apache 2.0 — see LICENSE.

Release history

0.7.0 (this version): Jun 21, 2026
0.6.0: May 20, 2026
0.5.0: May 4, 2026

Download files

Source Distribution

traceprop-0.7.0.tar.gz (286.9 kB)

Uploaded Jun 21, 2026 Source

Built Distribution

traceprop-0.7.0-py3-none-any.whl (55.7 kB)

Uploaded Jun 21, 2026 Python 3

File details

Details for the file traceprop-0.7.0.tar.gz.

File metadata

Download URL: traceprop-0.7.0.tar.gz
Upload date: Jun 21, 2026
Size: 286.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for traceprop-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`97dc5a63b7586e492c148003087612189a6ba8dedfd24992226f68e6c56bcd3c`
MD5	`13b5c65dbaf56fecee8391af13a33367`
BLAKE2b-256	`2c26ecf51e0f1f115829175bafecfd7785177fdeb5f2b60033bedde8b5e52dba`

File details

Details for the file traceprop-0.7.0-py3-none-any.whl.

File metadata

Download URL: traceprop-0.7.0-py3-none-any.whl
Upload date: Jun 21, 2026
Size: 55.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for traceprop-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1cef319e7628b3fc6ed6c06a6b7ceb77c5884d5d5a07c2fa52a22e4671fa3613`
MD5	`b2a5142d31ef765648e4d4c83bf198fd`
BLAKE2b-256	`df28aa64a13e56311fea1296998b035d99f9e5b698b730f23d98da711dc99122`
