Computation-level data lineage, gradient attribution, and provenance-guided unlearning in production ML
Traceprop
Traceprop is a Python library that traces data lineage from raw source files, through preprocessing and model training, to individual predictions. It then lets you act on that lineage via attribution, unlearning, and compliance reporting.
pip install traceprop
🤗 Live Demo
Try Traceprop interactively — no install needed:
huggingface.co/spaces/Nautiverse/traceprop-demo
The demo covers all three core capabilities on the Wisconsin Breast Cancer dataset (CPU-only):
| Tab | What it shows |
|---|---|
| 🎯 Attribution | Pick any test sample — see top-K training points that drove the prediction, with influence scores in milliseconds |
| 🗂️ Provenance | Adjust a multi-source preprocessing pipeline and watch the lineage graph update live |
| 🧹 Unlearning | Choose a training sample to forget — see loss increase on that sample while test accuracy is preserved |
Run the demo locally
git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop/hf_space
pip install -r requirements.txt
python app.py
# → opens at http://127.0.0.1:7860
What it does
A single Traceprop query answers:
"This model made prediction X on input Z. Which rows in which source files, through which preprocessing steps, most influenced that prediction - and can we reduce that influence without retraining?"
| Capability | What you get |
|---|---|
| Lineage tracking | Sub-1% overhead in op-mode; tracks every NumPy, PyTorch, and JAX operation |
| Attribution | LDS 0.976 on Covertype 50K, 0.884 on Adult Income (Traceprop-LL + TRAK est.), in 0.6–5.2 s on CPU; no GPU needed |
| Source-stratified attribution (SS) | Aggregates per-sample scores to source-file level; 100% correct-source P@1 on a realistic 3-table ETL schema at 0.07 ms/query (0.970 P@1 at 0.89 ms/query on the controlled synthetic schema) |
| Approximate unlearning | Provenance-guided gradient correction; closes >100% of the retrain-from-scratch gap on real data |
| Compliance reporting | Structured JSON audit trail for EU AI Act Article 26 obligations |
| Data valuation | KNN-Shapley values aggregated by source file and preprocessing op |
Installation
# Core (NumPy only)
pip install traceprop
# With PyTorch support
pip install "traceprop[torch]"
# With JAX support
pip install "traceprop[jax]"
# With PostgreSQL provenance store
pip install "traceprop[postgres]"
# Everything
pip install "traceprop[all]"
Requires Python 3.10+.
Quick start
import traceprop as tp
import numpy as np
# 1. Load source data with provenance tracking
data_a = tp.from_csv("hospital_a.csv", source_id="hospital_a")
data_b = tp.from_csv("hospital_b.csv", source_id="hospital_b")
# 2. Preprocessing — every op is recorded in the lineage graph
norm_a = (data_a - data_a.mean(axis=0)) / (data_a.std(axis=0) + 1e-8)
norm_b = (data_b - data_b.mean(axis=0)) / (data_b.std(axis=0) + 1e-8)
# 3. Train with gradient recording
with tp.training_context(source_id="hospital_a") as ctx:
    train(model, X_train, y_train)  # your training loop here
# 4. Attribute a prediction back to source rows
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=10)
for entry in result.top(5):
    print(entry["source_id"], entry["sample_index"], entry["influence_score"])
# 5. Trace the top sample back to its source file and preprocessing ops
trace = result.trace_to_file(rank=0)
print(trace["sources"], trace["ops"])
# 6. Unlearn a data source without retraining
unlearn_result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",
    n_steps=300,
    lr=1e-2,
)
print(f"Verified: {unlearn_result.verified}")
# 7. Generate EU AI Act compliance report
report = tp.compliance_report(
    tensor=norm_a,
    system_name="CreditScorer-v1",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="compliance_report.json",
)
Core API
Provenance tracking
| Function | Description |
|---|---|
| tp.from_numpy(arr, source_id=...) | Wrap a NumPy array with lineage tracking |
| tp.from_csv(path, source_id=...) | Load CSV with lineage tracking |
| tp.from_torch(data, source_id=...) | Wrap a PyTorch tensor |
| tp.from_jax(data, source_id=...) | Wrap a JAX array |
| tp.array(data, source_id=...) | Like np.array but tracked |
| tp.provenance(tensor) | Get a ProvenanceView to query lineage |
| tp.reset_graph() | Start a fresh lineage graph |
ProvenanceView
view = tp.provenance(tensor)
view.ancestors() # set of ancestor node IDs
view.ops() # list of preprocessing operations
view.sources() # list of source_ids in lineage
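To make the three queries concrete, here is a minimal sketch of a lineage DAG stored as a node-to-parents mapping. It is not Traceprop's internal representation: the `LINEAGE` dict, the node IDs, and the `ancestors`, `sources`, and `ops` helpers are hypothetical names chosen for illustration, and the graph mirrors the normalization step from the quick start.

```python
# Hypothetical toy lineage graph: each node maps to (op, parents).
# This is an illustration, not Traceprop's internal data structure.
LINEAGE = {
    "hospital_a": ("source", []),
    "mean_a": ("mean", ["hospital_a"]),
    "centered_a": ("subtract", ["hospital_a", "mean_a"]),
    "std_a": ("std", ["hospital_a"]),
    "norm_a": ("divide", ["centered_a", "std_a"]),
}


def ancestors(node_id, graph=LINEAGE):
    """Return every node reachable by following parent links from node_id."""
    seen = set()
    stack = list(graph[node_id][1])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current][1])
    return seen


def sources(node_id, graph=LINEAGE):
    """Source nodes are the parentless nodes in the node's lineage."""
    candidates = ancestors(node_id, graph) | {node_id}
    return sorted(n for n in candidates if not graph[n][1])


def ops(node_id, graph=LINEAGE):
    """Distinct non-source operations that contributed to node_id."""
    candidates = ancestors(node_id, graph) | {node_id}
    return sorted({graph[n][0] for n in candidates if graph[n][1]})


norm_ancestors = ancestors("norm_a")
norm_sources = sources("norm_a")
norm_ops = ops("norm_a")
print(norm_ancestors, norm_sources, norm_ops)
```

An iterative traversal with a `seen` set visits each node once even when several paths lead to the same ancestor, as happens here with `hospital_a`.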
Attribution
# Record gradients during training
with tp.training_context(model, X_train, y_train, source_id="data", proj_dim=4096) as ctx:
    ...  # training loop
# Attribute a test prediction
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=50)
result.top(10) # list of dicts: sample_index, source_id, influence_score
result.trace_to_file(0) # trace rank-0 sample to source file + ops
result.by_source() # aggregate influence by source_id
GradientStore compresses per-sample gradients with a sparse Johnson-Lindenstrauss projection (Achlioptas 2003) whose matrix entries are drawn from {-1, 0, +1}. The default proj_dim=4096 works well for tabular models; use a lower value in memory-constrained environments.
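The projection idea can be sketched in plain NumPy. The block below follows the Achlioptas construction (entries +1 and -1 with probability 1/6 each, 0 with probability 2/3, scaled by sqrt(3 / proj_dim)); whether GradientStore uses exactly these probabilities and this scaling is an assumption, and `achlioptas_matrix` is a hypothetical helper name, not part of the Traceprop API.

```python
import numpy as np


def achlioptas_matrix(grad_dim, proj_dim, seed=0):
    """Sparse random projection matrix with entries in {-1, 0, +1} (scaled).

    Assumed construction: P(+1) = P(-1) = 1/6, P(0) = 2/3, so that each
    scaled entry has variance 1 / proj_dim and squared norms are preserved
    in expectation.
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice(
        np.array([-1, 0, 1], dtype=np.int8),
        size=(grad_dim, proj_dim),
        p=[1 / 6, 2 / 3, 1 / 6],
    )
    return signs.astype(np.float32) * np.sqrt(3.0 / proj_dim)


grad_dim, proj_dim = 2000, 1024
R = achlioptas_matrix(grad_dim, proj_dim)

# Two stand-in "gradients"; in practice these would be per-sample gradients.
data_rng = np.random.default_rng(1)
g1 = data_rng.normal(size=grad_dim)
g2 = data_rng.normal(size=grad_dim)

# Project both vectors and compare norms before and after.
p1, p2 = g1 @ R, g2 @ R
norm_ratio = np.linalg.norm(p1) / np.linalg.norm(g1)
print(f"norm ratio after projection: {norm_ratio:.3f}")
print(f"dot product before: {g1 @ g2:.2f}, after: {p1 @ p2:.2f}")
```

A smaller proj_dim shrinks the stored vectors proportionally but widens the spread of the norm ratio, which is the memory/accuracy trade-off mentioned above.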
Unlearning
result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",  # data source to forget
    n_steps=300,
    lr=1e-2,
    verification_threshold=0.05,
)
result.verified # bool
result.influence_before # float
result.influence_after # float
result.compliance_report # dict
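The exact correction rule behind tp.unlearn is not spelled out here, so the following is only a generic illustration of approximate unlearning: plain gradient ascent on the forget-set loss of a logistic regression model. This is a swapped-in baseline technique, not Traceprop's algorithm. The parameter names `n_steps` and `lr` mirror the call above, but their semantics in Traceprop may differ, and all helper functions are hypothetical.

```python
import numpy as np


def logistic_loss(w, X, y):
    """Mean binary cross-entropy for labels in {0, 1}."""
    z = X @ w
    return float(np.mean(np.logaddexp(0.0, z) - y * z))


def logistic_grad(w, X, y):
    """Gradient of logistic_loss with respect to w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)


def fit(X, y, n_steps=500, lr=0.1):
    """Ordinary gradient descent on the full training set."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= lr * logistic_grad(w, X, y)
    return w


def ascend_on_forget_set(w, X_forget, y_forget, n_steps=300, lr=1e-2):
    """Assumed baseline: move the weights uphill on the forget-set loss."""
    w = w.copy()
    for _ in range(n_steps):
        w += lr * logistic_grad(w, X_forget, y_forget)
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) + 0.5 * rng.normal(size=200) > 0).astype(float)
forget = slice(0, 20)  # pretend the first 20 rows come from the source to forget

w_original = fit(X, y)
w_unlearned = ascend_on_forget_set(w_original, X[forget], y[forget])

loss_before = logistic_loss(w_original, X[forget], y[forget])
loss_after = logistic_loss(w_unlearned, X[forget], y[forget])
print(f"forget-set loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Because the logistic loss is convex, every ascent step raises the forget-set loss. A practical method also has to limit the damage to the retained data, which this sketch does not attempt.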
Data valuation
val_result = tp.data_valuation(
    gradient_store=ctx.gradient_store,
    val_gradients=val_grads,  # (n_val, grad_dim) array
    k=10,
)
val_result.by_source() # Shapley values aggregated by source
val_result.by_op() # Shapley values aggregated by preprocessing op
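For intuition, the sketch below computes exact KNN-Shapley values for a single validation point using the closed-form recursion for unweighted KNN classifiers (Jia et al. 2019), then sums them per source. It measures Euclidean distance on raw features rather than on projected gradients, and whether Traceprop uses this exact variant is an assumption. `knn_shapley_single` and the toy data are hypothetical.

```python
import numpy as np


def knn_shapley_single(train_x, train_y, test_x, test_y, k):
    """Exact Shapley value of each training point for one test point."""
    n = len(train_y)
    # Sort training points from nearest to farthest from the test point.
    order = np.argsort(np.linalg.norm(train_x - test_x, axis=1))
    match = (train_y[order] == test_y).astype(float)

    sorted_values = np.zeros(n)
    sorted_values[n - 1] = match[n - 1] / n
    # Walk back from the farthest point to the nearest one.
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank of the point at sorted position i
        sorted_values[i] = (
            sorted_values[i + 1]
            + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        )

    values = np.zeros(n)
    values[order] = sorted_values  # undo the sort
    return values


train_x = np.array([[0.0], [1.0], [2.0]])
train_y = np.array([1, 0, 1])
source_ids = np.array([0, 0, 1])  # hypothetical: rows 0-1 from source 0, row 2 from source 1

values = knn_shapley_single(train_x, train_y, np.array([0.1]), test_y=1, k=1)
by_source = np.bincount(source_ids, weights=values)
print(values, by_source)
```

The values sum to the utility of the full training set (here 1.0, because the nearest neighbour has the correct label), which is the efficiency property of Shapley values.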
Compliance
report = tp.compliance_report(
    tensor=output_tensor,
    system_name="MyModel",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="report.json",  # optional: write to file
)
Produces a structured JSON report covering EU AI Act Article 26 audit trail requirements for high-risk AI systems (enforcement backstop: 2 December 2027).
Granularity modes
tp.set_granularity(tp.Granularity.OP) # default: track every op
tp.set_granularity(tp.Granularity.BATCH) # batch-level only (lower overhead)
tp.set_granularity(tp.Granularity.EPOCH) # epoch-level only
Benchmarks
Attribution quality (LDS — Linear Datamodeling Score)
Higher is better. Measured on 500 held-out retraining subsets.
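As a reminder of what the metric measures: for one test point, LDS is the Spearman rank correlation between the outputs of models retrained on random training subsets and the outputs predicted by summing attribution scores over each subset. The sketch below uses synthetic numbers in place of real retraining runs. The 50% subset fraction, the noise level, and all names are assumptions made for illustration, not the benchmark configuration.

```python
import numpy as np


def spearman(a, b):
    """Spearman rank correlation (assumes no ties, true for continuous data)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def lds_single_test_point(scores, subset_masks, retrained_outputs):
    """LDS for one test point.

    scores:            (n_train,) attribution scores for the test point
    subset_masks:      (n_subsets, n_train) 0/1 membership of each subset
    retrained_outputs: (n_subsets,) model output after retraining on each subset
    """
    predicted = subset_masks @ scores  # additive datamodel prediction
    return spearman(predicted, retrained_outputs)


rng = np.random.default_rng(0)
n_train, n_subsets = 200, 500
masks = (rng.random((n_subsets, n_train)) < 0.5).astype(float)

# Synthetic stand-in for retraining: outputs are additive in a hidden
# per-sample effect, plus noise.
true_effect = rng.normal(size=n_train)
outputs = masks @ true_effect + rng.normal(size=n_subsets)

lds_good = lds_single_test_point(true_effect, masks, outputs)
lds_random = lds_single_test_point(rng.normal(size=n_train), masks, outputs)
print(f"LDS with informative scores: {lds_good:.3f}")
print(f"LDS with random scores:      {lds_random:.3f}")
```

In a real benchmark the per-test-point values are averaged over many test points, which is presumably what the LDS and Std columns below summarize.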
Tabular / linear models
| Method | Dataset | LDS | Std | Time | Hardware |
|---|---|---|---|---|---|
| Traceprop-LL | Adult Income (n=6K, d=105) | 0.622 | ±0.180 | 0.22 s | CPU |
| Traceprop-LL + TRAK est. | Adult Income (n=6K, d=105) | 0.884 | ±0.096 | 0.6 s | CPU |
| Traceprop-LL | Covertype (n=50K, d=54) | 0.7513 | ±0.1292 | 3.4 s | CPU |
| Traceprop-LL + TRAK est. | Covertype (n=50K, d=54) | 0.9763 | ±0.1052 | 5.2 s | CPU |
| Traceprop-BM | Adult Income | 0.0127 | ±0.0436 | 0.16 s | CPU |
| Random | — | ~0.000 | — | — | — |
Deep vision — end-to-end (BatchNorm)
| Method | Dataset | LDS | Std | Time | Hardware |
|---|---|---|---|---|---|
| TRAK (5 ckpts) | CIFAR-2 / ResNet-9 | 0.0290 | ±0.0523 | 691 s | GPU (T4) |
| Traceprop-LL | CIFAR-2 / ResNet-9 | 0.0168 | ±0.0684 | 2.6 s | CPU |
| Traceprop-BM | CIFAR-2 / ResNet-9 | 0.0033 | ±0.0334 | 14.2 s | CPU |
| Random | CIFAR-2 / ResNet-9 | 0.0205 | ±0.0357 | — | — |
Deep vision — frozen backbone + linear probe (no BatchNorm)
| Method | Dataset | LDS | Std | Time | Hardware |
|---|---|---|---|---|---|
| Traceprop-LL (dot) | CIFAR-2 / frozen ResNet-18 | 0.2642 | ±0.1037 | 10.2 s | CPU |
| Traceprop-LL + TRAK est. | CIFAR-2 / frozen ResNet-18 | 0.2307 | ±0.0459 | 1.4 s | CPU |
| Random | — | 0.0018 | — | — | — |
PyTorch MLP
| Method | Dataset | LDS | Std | Time | Hardware |
|---|---|---|---|---|---|
| Traceprop-LL + TRAK est. | MNIST 4 vs 9 (784→256→1, n=6K) | 0.1930 | ±0.0581 | 0.82 s | CPU |
| Random | — | 0.0005 | — | — | — |
Recommendation: Traceprop-LL is exact for linear models and frozen-backbone architectures (no BatchNorm). Use it for tabular data, where it matches or beats TRAK at CPU speeds. For end-to-end deep vision with BatchNorm, TRAK is preferred: Traceprop-LL is 266× faster (2.6 s vs. 691 s) but its LDS of 0.0168 is no better than the random baseline (0.0205), because BatchNorm makes each sample's gradient depend on the other samples in the batch and so corrupts per-sample gradients. The fix is a frozen backbone with a linear probe, which improves LDS 15.7× (0.0168 → 0.2642).
Lineage overhead
| Platform | Overhead | Mode |
|---|---|---|
| macOS (M-series) | 1.007× | op-mode |
| Linux (x86-64) | 0.979× | op-mode |
Sub-1% overhead at 10⁶+ array elements.
Unlearning
| Dataset | Method | Forget-set Loss | Gap Closed | Test Acc. |
|---|---|---|---|---|
| Synthetic (n=1K) | Original | 0.379 | — | 0.920 |
| Synthetic (n=1K) | Gold (retrain) | 0.401 | 100% | — |
| Synthetic (n=1K) | Traceprop | 0.425 | >100% | 0.915 |
| Synthetic (n=1K) | Random | 0.382 | 17% | — |
| Adult Income (n=6K) | Original | 3.225 | — | 0.840 |
| Adult Income (n=6K) | Gold (retrain) | 3.858 | 100% | — |
| Adult Income (n=6K) | Traceprop | 4.284 | >100% (167%) | 0.842 |
| Adult Income (n=6K) | Random | 3.233 | 1.2% | — |
| Covertype (n=50K) | Original | 2.163 | — | 0.760 |
| Covertype (n=50K) | Gold (retrain) | 2.402 | 100% | — |
| Covertype (n=50K) | Traceprop | 2.698 | >100% (224%) | 0.749 |
| Covertype (n=50K) | Random | 2.162 | −0.4% | — |
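Provenance-guided gradient correction closes more than 100% of the retrain-from-scratch gap on all three datasets. In other words, the forget-set loss ends above that of the retrained model, so the method overshoots rather than matching retraining exactly. Test accuracy is essentially unchanged on Adult Income (0.842 vs. 0.840 original) and drops slightly on Synthetic (0.915 vs. 0.920, 0.5 pp) and Covertype (0.749 vs. 0.760, 1.1 pp). The random-sample baseline closes between −0.4% and 17% of the gap.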
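The "Gap Closed" column can be reproduced from the forget-set losses in the table. The formula below is inferred from those numbers rather than taken from the library, and `gap_closed` is a hypothetical helper: it expresses how far a method moved the forget-set loss from the original model toward the retrained model.

```python
def gap_closed(original_loss, retrain_loss, method_loss):
    """Percentage of the original-to-retrain loss gap closed by a method."""
    return 100.0 * (method_loss - original_loss) / (retrain_loss - original_loss)


# Forget-set losses copied from the table above.
adult_traceprop = gap_closed(3.225, 3.858, 4.284)
adult_random = gap_closed(3.225, 3.858, 3.233)
covertype_traceprop = gap_closed(2.163, 2.402, 2.698)
covertype_random = gap_closed(2.163, 2.402, 2.162)

print(f"Adult Income, Traceprop: {adult_traceprop:.0f}%")
print(f"Adult Income, Random:    {adult_random:.1f}%")
print(f"Covertype, Traceprop:    {covertype_traceprop:.0f}%")
print(f"Covertype, Random:       {covertype_random:.1f}%")
```

A value above 100% means the forget-set loss rose past the retrained model's loss, and a negative value means the loss moved in the wrong direction.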
Source-stratified attribution (SS)
Traceprop-SS answers "which source file drove this prediction?" by aggregating per-sample TRAK scores to source-file level via the lineage graph. No prior attribution system (TRAK, LogIX, dattri) exposes source-file-level influence.
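The aggregation step can be sketched as a single matrix product. The block below assumes that source-level influence is the sum of per-sample scores within each source; the actual aggregation rule used by Traceprop-SS may differ. The function names and the toy scores are hypothetical.

```python
import numpy as np


def aggregate_by_source(scores, source_ids, n_sources):
    """Sum per-sample scores within each source.

    scores:     (n_queries, n_train) per-sample attribution scores
    source_ids: (n_train,) integer source label of each training row
    returns:    (n_queries, n_sources) source-level influence
    """
    one_hot = np.eye(n_sources)[source_ids]  # (n_train, n_sources)
    return scores @ one_hot


def precision_at_1(source_scores, true_sources):
    """Fraction of queries whose top-ranked source is the correct one."""
    return float(np.mean(source_scores.argmax(axis=1) == true_sources))


# Two queries, five training rows drawn from three sources.
scores = np.array([
    [0.5, 0.1, -0.2, 0.9, 0.3],
    [0.1, 0.7, 0.6, -0.1, 0.0],
])
source_ids = np.array([0, 1, 1, 2, 0])

source_scores = aggregate_by_source(scores, source_ids, n_sources=3)
p_at_1 = precision_at_1(source_scores, true_sources=np.array([2, 1]))
print(source_scores)
print(f"P@1 = {p_at_1:.2f}")
```

Handling all queries in one matrix product instead of a Python loop over queries is the kind of vectorization that a "speedup vs loop" figure would measure.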
Controlled synthetic validation (exp17b — known ground truth, injected signal)
| Schema | Bureau P@1 | Latency | Speedup vs loop |
|---|---|---|---|
| 3-source (bureau / application / prev_app, n=19,850) | 0.970 | 0.89 ms/query | 92× |
Realistic ETL validation (exp21 — Home Credit schema, domain-motivated labels, no injected signal)
| Schema | Correct-source P@1 | Baseline | Latency |
|---|---|---|---|
| 3-table (bureau / prev_app / application, n=9,800) | 1.000 | 0.333 | 0.07 ms/query |
Disjoint feature columns per source table (bureau: count/overdue_rate/log_amount; prev_app: count/approval_rate/log_amount; application: employment/region) produce orthogonal gradient subspaces. Traceprop-SS exploits this structure without any knowledge of which columns belong to which table.
Backends
| Backend | Install | Usage |
|---|---|---|
| NumPy | built-in | tp.from_numpy(arr) |
| PyTorch | pip install "traceprop[torch]" | tp.from_torch(tensor) |
| JAX | pip install "traceprop[jax]" | tp.from_jax(array) |
Provenance stores
By default Traceprop uses an in-memory store. For persistence:
# SQLite
from traceprop.stores.sqlite_store import SQLiteStore
store = SQLiteStore("lineage.db")
# PostgreSQL
from traceprop.stores.postgres_store import PostgresStore
store = PostgresStore("postgresql://user:pass@localhost/mydb")
Examples
- examples/full_pipeline_demo.py: full end-to-end demo (two hospital CSVs → preprocessing → training → attribution → unlearning → compliance report)
- notebooks/tabular_logistic_lds_colab.ipynb: LDS benchmark on Adult Income (Colab, CPU)
- notebooks/cifar2_resnet9_lds_colab.ipynb: LDS benchmark on CIFAR-2/ResNet-9 (Colab, GPU T4)
- notebooks/homecredit_multisource_provenance_colab.ipynb: multi-source provenance case study (3-table credit risk data)
Project structure
traceprop/
    __init__.py                  # public API
    tensor.py                    # ProvenanceTensor (NumPy wrapper)
    graph.py                     # lineage DAG
    query.py                     # ProvenanceView
    interceptor.py               # op-level interception
    granularity.py               # Granularity modes
    compression.py               # ProvRC range compression
    exporters.py                 # Parquet / OpenTelemetry exporters
    exceptions.py
    attribution/
        training_context.py      # TrainingContext, GradientStore
        gradient_store.py        # sparse JL projection
        influence.py             # compute_influence_scores
        attribution_engine.py
        streaming_context.py     # online / continual learning
    backends/
        numpy_backend.py
        torch_backend.py
        jax_backend.py
    stores/
        memory_store.py
        sqlite_store.py
        postgres_store.py
    compliance/
        eu_ai_act.py             # EU AI Act Article 26 report generator
    unlearning/
        gradient_correction.py
    valuation/
        knn_shapley.py
    _c_ext/
        graph_ops.pyx            # optional Cython acceleration
Contributing
Issues and pull requests are welcome. Please open an issue before submitting a large PR.
git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop
pip install -e ".[dev]"
pytest
Citation
If you use Traceprop in research, please cite:
@misc{traceprop2027,
author = {Amit Nautiyal},
title = {{Traceprop}: Computation-Level Data Lineage, Gradient Attribution,
and Provenance-Guided Unlearning in Production {ML}},
year = {2027},
doi = {10.5281/zenodo.20036000},
url = {https://zenodo.org/records/20036000},
note = {Software: \url{https://pypi.org/project/traceprop/}}
}
A Zenodo preprint is available at https://zenodo.org/records/20036000 (DOI: 10.5281/zenodo.20036000).
License
Apache 2.0 — see LICENSE.