Computation-level data lineage, gradient attribution, and provenance-guided unlearning in production ML

Traceprop

Traceprop is a Python library that traces raw source files through preprocessing and model training to individual predictions, and lets you act on that lineage via attribution, unlearning, and compliance reporting.

pip install traceprop


🤗 Live Demo

Try Traceprop interactively — no install needed:

huggingface.co/spaces/Nautiverse/traceprop-demo

The demo covers all three core capabilities on the Wisconsin Breast Cancer dataset (CPU-only):

| Tab | What it shows |
| --- | --- |
| 🎯 Attribution | Pick any test sample to see the top-K training points that drove its prediction, with influence scores computed in milliseconds |
| 🗂️ Provenance | Adjust a multi-source preprocessing pipeline and watch the lineage graph update live |
| 🧹 Unlearning | Choose a training sample to forget, then see the loss on that sample increase while test accuracy is preserved |

Run the demo locally

git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop/hf_space
pip install -r requirements.txt
python app.py
# → opens at http://127.0.0.1:7860

What it does

A single Traceprop query answers:

"This model made prediction X on input Z. Which rows in which source files, through which preprocessing steps, most influenced that prediction - and can we reduce that influence without retraining?"

| Capability | What you get |
| --- | --- |
| Lineage tracking | Sub-1% overhead in op-mode; tracks every NumPy, PyTorch, and JAX operation |
| Attribution | LDS 0.976 on Covertype (n=50K) in 5.2 s and 0.884 on Adult Income in 0.6 s, on CPU with no GPU needed |
| Source-stratified attribution (SS) | Aggregates per-sample scores to source-file level; 100% correct-source P@1 on a realistic 3-table ETL schema at 0.07 ms/query (0.89 ms/query on the synthetic 3-source schema) |
| Approximate unlearning | Provenance-guided gradient correction; closes >100% of the retrain-from-scratch gap on real data |
| Compliance reporting | Structured JSON audit trail for EU AI Act Article 26 obligations |
| Data valuation | KNN-Shapley values aggregated by source file and preprocessing op |

Installation

# Core (NumPy only)
pip install traceprop

# With PyTorch support
pip install "traceprop[torch]"

# With JAX support
pip install "traceprop[jax]"

# With PostgreSQL provenance store
pip install "traceprop[postgres]"

# Everything
pip install "traceprop[all]"

Requires Python 3.10+.


Quick start

import traceprop as tp
import numpy as np

# 1. Load source data with provenance tracking
data_a = tp.from_csv("hospital_a.csv", source_id="hospital_a")
data_b = tp.from_csv("hospital_b.csv", source_id="hospital_b")

# 2. Preprocessing — every op is recorded in the lineage graph
norm_a = (data_a - data_a.mean(axis=0)) / (data_a.std(axis=0) + 1e-8)
norm_b = (data_b - data_b.mean(axis=0)) / (data_b.std(axis=0) + 1e-8)

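# 3. Train with gradient recording
# `model`, `X_train`, `y_train`, and `train` are placeholders for your own objects.
with tp.training_context(source_id="hospital_a") as ctx:
    train(model, X_train, y_train)   # your training loop here

# 4. Attribute a prediction back to source rows
# `test_gradient` is a placeholder for the gradient of the test sample you want to explain.
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=10)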

for entry in result.top(5):
    print(entry["source_id"], entry["sample_index"], entry["influence_score"])

# 5. Trace the top sample back to its source file and preprocessing ops
trace = result.trace_to_file(rank=0)
print(trace["sources"], trace["ops"])

# 6. Unlearn a data source without retraining
unlearn_result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",
    n_steps=300,
    lr=1e-2,
)
print(f"Verified: {unlearn_result.verified}")

# 7. Generate EU AI Act compliance report
report = tp.compliance_report(
    tensor=norm_a,
    system_name="CreditScorer-v1",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="compliance_report.json",
)

Core API

Provenance tracking

| Function | Description |
| --- | --- |
| tp.from_numpy(arr, source_id=...) | Wrap a NumPy array with lineage tracking |
| tp.from_csv(path, source_id=...) | Load CSV with lineage tracking |
| tp.from_torch(data, source_id=...) | Wrap a PyTorch tensor |
| tp.from_jax(data, source_id=...) | Wrap a JAX array |
| tp.array(data, source_id=...) | Like np.array but tracked |
| tp.provenance(tensor) | Get a ProvenanceView to query lineage |
| tp.reset_graph() | Start a fresh lineage graph |

ProvenanceView

view = tp.provenance(tensor)
view.ancestors()      # set of ancestor node IDs
view.ops()            # list of preprocessing operations
view.sources()        # list of source_ids in lineage
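
To make the three queries concrete, here is a minimal, self-contained sketch of a lineage DAG that answers the same three questions. It is an illustration only, not Traceprop's implementation: the `ToyGraph` class, its `add` method, and the node names are hypothetical, and the toy sorts its results alphabetically purely to keep the output deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    """Hypothetical lineage DAG: each node records its op, parents, and source."""
    nodes: dict = field(default_factory=dict)

    def add(self, node_id, op=None, parents=(), source_id=None):
        self.nodes[node_id] = {"op": op, "parents": tuple(parents), "source_id": source_id}
        return node_id

    def ancestors(self, node_id):
        """All nodes reachable by following parent edges (node_id itself excluded)."""
        seen, stack = set(), list(self.nodes[node_id]["parents"])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.nodes[current]["parents"])
        return seen

    def ops(self, node_id):
        """Ops recorded along the lineage of node_id, including its own op."""
        lineage = self.ancestors(node_id) | {node_id}
        return sorted(self.nodes[n]["op"] for n in lineage if self.nodes[n]["op"])

    def sources(self, node_id):
        """source_ids of the raw inputs that feed node_id."""
        lineage = self.ancestors(node_id) | {node_id}
        return sorted({self.nodes[n]["source_id"] for n in lineage if self.nodes[n]["source_id"]})


g = ToyGraph()
g.add("a", source_id="hospital_a")
g.add("b", source_id="hospital_b")
g.add("a_centered", op="subtract", parents=["a"])
g.add("merged", op="concatenate", parents=["a_centered", "b"])

print(g.ancestors("merged"), g.ops("merged"), g.sources("merged"))
```

The point of the sketch is that all three lookups are reachability queries over parent edges, which is why a tensor's sources and ops can be recovered long after the preprocessing code has run.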

Attribution

# Record gradients during training
with tp.training_context(model, X_train, y_train, source_id="data", proj_dim=4096) as ctx:
    ...  # training loop

# Attribute a test prediction
engine = tp.attribution_engine(ctx.gradient_store)
result = engine.attribute(test_gradient, top_k=50)

result.top(10)            # list of dicts: sample_index, source_id, influence_score
result.trace_to_file(0)   # trace rank-0 sample to source file + ops
result.by_source()        # aggregate influence by source_id

GradientStore uses a sparse Johnson-Lindenstrauss projection (Achlioptas 2003) with {-1, 0, +1} coins. Default proj_dim=4096 works well for tabular models; use lower values for memory-constrained environments.
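
As a rough illustration of what such a projection does, the NumPy sketch below builds an Achlioptas-style matrix and checks that gradient norms survive the projection. It is a sketch under stated assumptions, not the GradientStore code: the function name `sparse_jl_matrix`, the dense matrix layout, and the toy sizes (2000-dimensional gradients, 512 projected dimensions) are all hypothetical.

```python
import numpy as np


def sparse_jl_matrix(grad_dim, proj_dim, seed=0):
    """Achlioptas-style projection: entries are sqrt(3 / proj_dim) * {-1, 0, +1},
    drawn with probabilities 1/6, 2/3, 1/6."""
    rng = np.random.default_rng(seed)
    coins = rng.choice([-1.0, 0.0, 1.0], size=(grad_dim, proj_dim), p=[1 / 6, 2 / 3, 1 / 6])
    return coins * np.sqrt(3.0 / proj_dim)


rng = np.random.default_rng(1)
grads = rng.normal(size=(4, 2000))       # four toy per-sample gradients
R = sparse_jl_matrix(grad_dim=2000, proj_dim=512)
projected = grads @ R                     # shape (4, 512)

# Squared norms are preserved in expectation, so these ratios should be close to 1.
norm_ratio = (projected ** 2).sum(axis=1) / (grads ** 2).sum(axis=1)
print(norm_ratio)
```

Two thirds of the entries are zero, so the projection is cheap to apply, and a smaller `proj_dim` trades memory for a noisier estimate of gradient inner products.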

Unlearning

result = tp.unlearn(
    gradient_store=ctx.gradient_store,
    source_id="hospital_a",   # data source to forget
    n_steps=300,
    lr=1e-2,
    verification_threshold=0.05,
)
result.verified             # bool
result.influence_before     # float
result.influence_after      # float
result.compliance_report    # dict

Data valuation

val_result = tp.data_valuation(
    gradient_store=ctx.gradient_store,
    val_gradients=val_grads,   # (n_val, grad_dim) array
    k=10,
)
val_result.by_source()    # Shapley values aggregated by source
val_result.by_op()        # Shapley values aggregated by preprocessing op
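
For readers unfamiliar with KNN-Shapley, the sketch below implements the standard closed-form recursion for an unweighted K-NN utility (Jia et al., 2019) for a single validation point, then sums the values per source. It is not Traceprop's implementation: how the library defines distances and labels internally is not described here, and the names `knn_shapley_single` and `aggregate_by_source` are hypothetical.

```python
import numpy as np


def knn_shapley_single(distances, train_labels, test_label, k):
    """Shapley value of each training point for one validation point
    under an unweighted K-NN utility."""
    n = len(distances)
    order = np.argsort(distances)                      # nearest first
    match = (np.asarray(train_labels)[order] == test_label).astype(float)

    values_sorted = np.zeros(n)
    values_sorted[n - 1] = match[n - 1] / n            # farthest point
    for i in range(n - 2, -1, -1):                     # walk towards the nearest point
        rank = i + 1                                   # 1-based rank of point i
        values_sorted[i] = (
            values_sorted[i + 1]
            + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        )

    values = np.zeros(n)
    values[order] = values_sorted                      # restore original order
    return values


def aggregate_by_source(values, source_ids):
    """Sum per-sample Shapley values for each source_id."""
    totals = {}
    for value, source in zip(values, source_ids):
        totals[source] = totals.get(source, 0.0) + float(value)
    return totals


distances = [0.9, 0.1, 0.5]                            # toy distances to one validation point
labels = [1, 1, 0]
sources = ["hospital_a", "hospital_b", "hospital_a"]
values = knn_shapley_single(distances, labels, test_label=1, k=2)
by_source = aggregate_by_source(values, sources)
print(by_source)
```

Because Shapley values are additive, summing them per source (or per preprocessing op) gives a valid group-level value, which is what makes the aggregated views meaningful.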

Compliance

report = tp.compliance_report(
    tensor=output_tensor,
    system_name="MyModel",
    system_version="1.0.0",
    deployer_name="Amit N.",
    high_risk_category="credit_scoring",
    output_path="report.json",   # optional: write to file
)

Produces a structured JSON report covering EU AI Act Article 26 audit trail requirements for high-risk AI systems (enforcement backstop: 2 December 2027).

Granularity modes

tp.set_granularity(tp.Granularity.OP)      # default: track every op
tp.set_granularity(tp.Granularity.BATCH)   # batch-level only (lower overhead)
tp.set_granularity(tp.Granularity.EPOCH)   # epoch-level only

Benchmarks

Attribution quality (LDS — Linear Datamodeling Score)

Higher is better. Measured on 500 held-out retraining subsets.
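
As a reminder of what the metric measures, the sketch below computes LDS for one test sample in the usual way: attribution scores are summed over each retraining subset, and the sums are rank-correlated with the outputs of models actually retrained on those subsets. The benchmark harness itself is not shown in this README, so the function names, the toy sizes, and the synthetic stand-in for retraining are assumptions.

```python
import numpy as np


def spearman(x, y):
    """Spearman rank correlation for tie-free inputs (Pearson on ranks)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


def lds(scores, subset_masks, retrained_outputs):
    """scores: (n_train,) attribution scores for one test sample.
    subset_masks: (n_subsets, n_train) boolean membership of each retraining subset.
    retrained_outputs: (n_subsets,) model output on the test sample after
    retraining on each subset."""
    predicted = subset_masks.astype(float) @ scores    # additive prediction per subset
    return spearman(predicted, retrained_outputs)


rng = np.random.default_rng(0)
n_train, n_subsets = 200, 50
true_effect = rng.normal(size=n_train)                 # toy ground-truth contributions
masks = rng.random((n_subsets, n_train)) < 0.5         # each subset keeps about half the data
outputs = masks.astype(float) @ true_effect            # stand-in for real retraining

perfect = lds(true_effect, masks, outputs)             # scores equal to the true effects
inverted = lds(-true_effect, masks, outputs)           # scores with the wrong sign
print(perfect, inverted)
```

An LDS near 1 means the scores predict how the model's output changes when training data is removed; a value near 0 means they carry no more information than the random baseline.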

Tabular / linear models

| Method | Dataset | LDS | Std | Time | Hardware |
| --- | --- | --- | --- | --- | --- |
| Traceprop-LL | Adult Income (n=6K, d=105) | 0.622 | ±0.180 | 0.22 s | CPU |
| Traceprop-LL + TRAK est. | Adult Income (n=6K, d=105) | 0.884 | ±0.096 | 0.6 s | CPU |
| Traceprop-LL | Covertype (n=50K, d=54) | 0.7513 | ±0.1292 | 3.4 s | CPU |
| Traceprop-LL + TRAK est. | Covertype (n=50K, d=54) | 0.9763 | ±0.1052 | 5.2 s | CPU |
| Traceprop-BM | Adult Income | 0.0127 | ±0.0436 | 0.16 s | CPU |
| Random | | ~0.000 | | | |

Deep vision — end-to-end (BatchNorm)

| Method | Dataset | LDS | Std | Time | Hardware |
| --- | --- | --- | --- | --- | --- |
| TRAK (5 ckpts) | CIFAR-2 / ResNet-9 | 0.0290 | ±0.0523 | 691 s | GPU (T4) |
| Traceprop-LL | CIFAR-2 / ResNet-9 | 0.0168 | ±0.0684 | 2.6 s | CPU |
| Traceprop-BM | CIFAR-2 / ResNet-9 | 0.0033 | ±0.0334 | 14.2 s | CPU |
| Random | CIFAR-2 / ResNet-9 | 0.0205 | ±0.0357 | | |

Deep vision — frozen backbone + linear probe (no BatchNorm)

| Method | Dataset | LDS | Std | Time | Hardware |
| --- | --- | --- | --- | --- | --- |
| Traceprop-LL (dot) | CIFAR-2 / frozen ResNet-18 | 0.2642 | ±0.1037 | 10.2 s | CPU |
| Traceprop-LL + TRAK est. | CIFAR-2 / frozen ResNet-18 | 0.2307 | ±0.0459 | 1.4 s | CPU |
| Random | | 0.0018 | | | |

PyTorch MLP

| Method | Dataset | LDS | Std | Time | Hardware |
| --- | --- | --- | --- | --- | --- |
| Traceprop-LL + TRAK est. | MNIST 4 vs 9 (784→256→1, n=6K) | 0.1930 | ±0.0581 | 0.82 s | CPU |
| Random | | 0.0005 | | | |

Recommendation: Traceprop-LL is exact for linear models and for frozen-backbone architectures (no BatchNorm). Use it for tabular data, where it matches or beats TRAK at CPU speeds. For end-to-end deep vision with BatchNorm, TRAK is preferred: Traceprop-LL is 266× faster (2.6 s vs. 691 s) but scores near random (LDS 0.0168 vs. 0.0205 for the random baseline) because BatchNorm corrupts per-sample gradients. The fix is a frozen backbone with a linear probe, which improves LDS 15.7× (0.0168 → 0.2642).

Lineage overhead

| Platform | Overhead | Mode |
| --- | --- | --- |
| macOS (M-series) | 1.007× | op-mode |
| Linux (x86-64) | 0.979× | op-mode |

Sub-1% overhead at 10⁶+ array elements.

Unlearning

| Dataset | Method | Forget-set Loss | Gap Closed | Test Acc. |
| --- | --- | --- | --- | --- |
| Synthetic (n=1K) | Original | 0.379 | | 0.920 |
| Synthetic (n=1K) | Gold (retrain) | 0.401 | 100% | |
| Synthetic (n=1K) | Traceprop | 0.425 | >100% | 0.915 |
| Synthetic (n=1K) | Random | 0.382 | 17% | |
| Adult Income (n=6K) | Original | 3.225 | | 0.840 |
| Adult Income (n=6K) | Gold (retrain) | 3.858 | 100% | |
| Adult Income (n=6K) | Traceprop | 4.284 | >100% (167%) | 0.842 |
| Adult Income (n=6K) | Random | 3.233 | 1.2% | |
| Covertype (n=50K) | Original | 2.163 | | 0.760 |
| Covertype (n=50K) | Gold (retrain) | 2.402 | 100% | |
| Covertype (n=50K) | Traceprop | 2.698 | >100% (224%) | 0.749 |
| Covertype (n=50K) | Random | 2.162 | −0.4% | |

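Provenance-guided gradient correction closes more than 100% of the retrain-from-scratch gap on all three datasets, meaning the forget-set loss ends above that of the retrained model. Test accuracy stays within about 1 pp of the original (Synthetic: 0.915 vs. 0.920; Adult Income: 0.842 vs. 0.840; Covertype: 0.749 vs. 0.760, a 1.1 pp drop). The random-sample baseline closes 17% of the gap on the synthetic data and about 0% on Adult Income and Covertype.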
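
The Gap Closed column is consistent with a simple ratio of forget-set losses. The formula below is inferred from the table values rather than taken from Traceprop's source, and the helper name `gap_closed` is hypothetical; it reproduces the 167% and 224% figures reported above.

```python
def gap_closed(loss_original, loss_retrained, loss_method):
    """Fraction of the original-to-retrained forget-set loss gap closed by a method."""
    return (loss_method - loss_original) / (loss_retrained - loss_original)


adult = gap_closed(loss_original=3.225, loss_retrained=3.858, loss_method=4.284)
covertype = gap_closed(loss_original=2.163, loss_retrained=2.402, loss_method=2.698)
print(f"Adult Income: {adult:.0%}, Covertype: {covertype:.0%}")
# → Adult Income: 167%, Covertype: 224%
```

A value of 100% means the method raises the forget-set loss exactly as far as retraining without the forgotten data; values above 100% indicate an overshoot.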

Source-stratified attribution (SS)

Traceprop-SS answers "which source file drove this prediction?" by aggregating per-sample TRAK scores to source-file level via the lineage graph. No prior attribution system (TRAK, LogIX, dattri) exposes source-file-level influence.
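
The aggregation step can be pictured with plain dictionaries. The sketch below sums per-sample scores by `source_id` and picks the top source, which is the quantity a P@1 metric would compare against the known correct source. The entry shape follows the dicts shown for `result.top()`; the helper names and the toy numbers are made up for illustration and this is not the library's vectorized implementation.

```python
from collections import defaultdict


def influence_by_source(entries):
    """Sum per-sample influence scores for each source_id."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry["source_id"]] += entry["influence_score"]
    return dict(totals)


def top_source(entries):
    """Source with the largest total influence (the rank-1 prediction)."""
    totals = influence_by_source(entries)
    return max(totals, key=totals.get)


# Toy per-sample attribution results; the scores are invented.
entries = [
    {"sample_index": 12, "source_id": "bureau", "influence_score": 0.40},
    {"sample_index": 7, "source_id": "application", "influence_score": 0.25},
    {"sample_index": 3, "source_id": "bureau", "influence_score": 0.20},
    {"sample_index": 9, "source_id": "prev_app", "influence_score": -0.05},
]
totals = influence_by_source(entries)
best = top_source(entries)
print(best, totals)
```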

Controlled synthetic validation (exp17b — known ground truth, injected signal)

| Schema | Bureau P@1 | Latency | Speedup vs loop |
| --- | --- | --- | --- |
| 3-source (bureau / application / prev_app, n=19,850) | 0.970 | 0.89 ms/query | 92× |

Realistic ETL validation (exp21 — Home Credit schema, domain-motivated labels, no injected signal)

| Schema | Correct-source P@1 | Baseline | Latency |
| --- | --- | --- | --- |
| 3-table (bureau / prev_app / application, n=9,800) | 1.000 | 0.333 | 0.07 ms/query |

Disjoint feature columns per source table (bureau: count/overdue_rate/log_amount; prev_app: count/approval_rate/log_amount; application: employment/region) produce orthogonal gradient subspaces. Traceprop-SS exploits this structure without any knowledge of which columns belong to which table.
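
The orthogonality claim is easy to check for a linear model, where the per-sample gradient of the logistic loss is a scalar multiple of the input row. The sketch below assumes, as a simplification, that each row is nonzero only in the columns of its own source table; the six-column layout and the function name `logistic_grad` are hypothetical and are not taken from the exp21 setup.

```python
import numpy as np


def logistic_grad(w, x, y):
    """Per-sample gradient of the logistic loss with respect to w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x


rng = np.random.default_rng(0)
w = rng.normal(size=6)

# Hypothetical layout: columns 0-2 belong to "bureau", columns 3-5 to "prev_app".
x_bureau = np.zeros(6)
x_bureau[:3] = rng.normal(size=3)
x_prev = np.zeros(6)
x_prev[3:] = rng.normal(size=3)

g_bureau = logistic_grad(w, x_bureau, y=1.0)
g_prev = logistic_grad(w, x_prev, y=0.0)

cross = float(g_bureau @ g_prev)      # zero: the two gradients share no nonzero column
same = float(g_bureau @ g_bureau)     # strictly positive
print(cross, same)
```

Since gradients from different tables have zero inner product in this setting, a dot-product influence score between a test gradient and a training gradient is nonzero only when both touch the same table's columns, which is one way to see why source-level aggregation can separate the tables.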


Backends

| Backend | Install | Usage |
| --- | --- | --- |
| NumPy | built-in | tp.from_numpy(arr) |
| PyTorch | pip install "traceprop[torch]" | tp.from_torch(tensor) |
| JAX | pip install "traceprop[jax]" | tp.from_jax(array) |

Provenance stores

By default Traceprop uses an in-memory store. For persistence:

# SQLite
from traceprop.stores.sqlite_store import SQLiteStore
store = SQLiteStore("lineage.db")

# PostgreSQL
from traceprop.stores.postgres_store import PostgresStore
store = PostgresStore("postgresql://user:pass@localhost/mydb")

Examples


Project structure

traceprop/
  __init__.py            # public API
  tensor.py              # ProvenanceTensor (NumPy wrapper)
  graph.py               # lineage DAG
  query.py               # ProvenanceView
  interceptor.py         # op-level interception
  granularity.py         # Granularity modes
  compression.py         # ProvRC range compression
  exporters.py           # Parquet / OpenTelemetry exporters
  exceptions.py
  attribution/
    training_context.py  # TrainingContext, GradientStore
    gradient_store.py    # sparse JL projection
    influence.py         # compute_influence_scores
    attribution_engine.py
    streaming_context.py # online / continual learning
  backends/
    numpy_backend.py
    torch_backend.py
    jax_backend.py
  stores/
    memory_store.py
    sqlite_store.py
    postgres_store.py
  compliance/
    eu_ai_act.py         # EU AI Act Article 26 report generator
  unlearning/
    gradient_correction.py
  valuation/
    knn_shapley.py
  _c_ext/
    graph_ops.pyx        # optional Cython acceleration

Contributing

Issues and pull requests are welcome. Please open an issue before submitting a large PR.

git clone https://github.com/AmitoVrito/Traceprop.git
cd Traceprop
pip install -e ".[dev]"
pytest

Citation

If you use Traceprop in research, please cite:

@misc{traceprop2027,
  author    = {Amit Nautiyal},
  title     = {{Traceprop}: Computation-Level Data Lineage, Gradient Attribution,
               and Provenance-Guided Unlearning in Production {ML}},
  year      = {2027},
  doi       = {10.5281/zenodo.20036000},
  url       = {https://zenodo.org/records/20036000},
  note      = {Software: \url{https://pypi.org/project/traceprop/}}
}

A Zenodo preprint is available at https://zenodo.org/records/20036000 (DOI: 10.5281/zenodo.20036000).


License

Apache 2.0 — see LICENSE.
