Automatic ML data lineage tracking with zero manual logging

AutoLineage

Zero-code data lineage for Python ML pipelines.

AutoLineage automatically records every DataFrame operation, model training step, and metric evaluation across pandas, scikit-learn, and PySpark, then detects anomalies and pinpoints root causes when something goes wrong. One import activates 288 hooks (64 for pandas, 175 for scikit-learn, 49 for PySpark). No decorators, no wrapper classes, no configuration files.

import autolineage.auto        # that's the whole setup

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

df = pd.read_csv("data.csv").dropna()
X = df.drop(columns=['target'])
y = df['target']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier().fit(X_tr, y_tr)
preds = model.predict(X_te)
score = f1_score(y_te, preds)

# AutoLineage has tracked every line above into one DAG.
from autolineage.auto import get_tracker
get_tracker().visualize()      # opens an interactive lineage graph

[Figure: interactive lineage graph]

Click any node to see operation metadata, shape changes, and upstream dependencies. Export to JSON, Graphviz DOT, Mermaid markup, or self-contained HTML.

Why AutoLineage?

ML pipelines fail silently. A model whose F1 drops from 0.98 to 0.00 invites hours of print(df.shape) debugging. Existing tools require explicit instrumentation (MLflow), track only files (DVC), or cover only a single stage (Evidently, Arize). No existing tool automatically records the complete path from read_csv through f1_score in one graph and then tells you which operation caused a metric to drop.

AutoLineage closes that gap.

Compared to other tools

Capability	AutoLineage	MLflow	Evidently	OpenLineage	DataLineagePy
Zero code changes	Yes	No	No	No	No (wrapper)
Operation-level	Yes	No	No	Job-level	Yes
Cross-framework	pandas + sklearn + PySpark	—	—	Spark only	pandas only
End-to-end trace	Yes	No	No	No	No
Anomaly detection	Yes	No	Drift only	No	No
Root-cause localization	Yes	No	No	No	No
Interactive visualization	Yes	Web UI	Web UI	Web UI	No

Catching pipeline bugs automatically

This is what AutoLineage is for. Run python examples/anomaly_demo.py and watch a single-line filter bug get detected and localized:

[Figure: anomaly detection terminal output]

The demo runs the same pipeline twice, once cleanly and once with a corrupted filter. AutoLineage catches the row-count anomaly and the F1 collapse, and identifies the exact line that caused both. No manual instrumentation, no print statements.

Installation

pip install autolineage

Optional extras:

pip install "autolineage[jupyter]"   # rich notebook output
pip install "autolineage[dev]"       # tests and benchmarks (quotes keep shells such as zsh from expanding the brackets)

Quick Start

1. Automatic tracking (one line)

import autolineage.auto         # put this before any pandas / sklearn / pyspark imports

# Use pandas / sklearn / pyspark normally — every operation is tracked
import pandas as pd
df = pd.read_csv("data.csv").dropna().drop_duplicates()

from autolineage.auto import get_tracker
get_tracker().visualize()       # opens HTML graph in your browser

Why first? import autolineage.auto patches framework methods at import time. If you write from sklearn.metrics import f1_score before this line, your local f1_score reference will bypass the wrapper. AutoLineage will warn you when this happens, but the easiest fix is to put import autolineage.auto at the top of your file.
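The import-order caveat follows from how Python binds names, and it can be reproduced without AutoLineage. The sketch below uses a throwaway module (fake_metrics, a hypothetical stand-in for sklearn.metrics) and a hand-written wrapper in place of AutoLineage's hooks. It illustrates the mechanism only and is not AutoLineage's actual patching code.

```python
import sys
import types

# Stand-in for a library module such as sklearn.metrics (hypothetical).
fake_metrics = types.ModuleType("fake_metrics")
fake_metrics.f1_score = lambda y_true, y_pred: 1.0
sys.modules["fake_metrics"] = fake_metrics

# Imported BEFORE patching: this name is bound to the original function.
from fake_metrics import f1_score as early_ref

calls = []

def patch(module, name):
    """Replace module.<name> with a wrapper that records each call."""
    original = getattr(module, name)

    def wrapper(*args, **kwargs):
        calls.append(name)
        return original(*args, **kwargs)

    setattr(module, name, wrapper)

patch(fake_metrics, "f1_score")

# Imported AFTER patching: this name is bound to the wrapper.
from fake_metrics import f1_score as late_ref

early_ref([1], [1])   # bypasses the wrapper, nothing is recorded
late_ref([1], [1])    # goes through the wrapper
print(calls)          # ['f1_score']
```

Only the reference created after patching is recorded, which is why a from-import placed above import autolineage.auto escapes tracking.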

2. Visualize the lineage

from autolineage.auto import get_tracker

tracker = get_tracker()

tracker.visualize()                          # interactive HTML, opens in browser
tracker.visualize("trace.html")              # custom path, no browser pop-up
tracker.to_dot()                             # Graphviz DOT
tracker.to_mermaid()                         # Markdown-friendly Mermaid

In Jupyter notebooks, putting the tracker as the last expression in a cell auto-renders a summary table:

get_tracker()  # in a Jupyter cell — produces a rich HTML table inline
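Inline rendering of this kind is normally done through Jupyter's rich display protocol: when the last expression in a cell has a _repr_html_ method, the notebook shows its return value as HTML. The sketch below shows that protocol on a hypothetical TrackerSummary class. Whether AutoLineage's tracker uses exactly this method, and what columns its table has, are assumptions.

```python
from html import escape

class TrackerSummary:
    """Hypothetical stand-in for a tracker object; not AutoLineage's class."""

    def __init__(self, records):
        self.records = records  # list of (category, operation) tuples

    def _repr_html_(self):
        # Jupyter calls this method, when present, to render the last
        # expression of a cell as HTML instead of plain repr() text.
        rows = "".join(
            f"<tr><td>{i}</td><td>{escape(cat)}</td><td>{escape(op)}</td></tr>"
            for i, (cat, op) in enumerate(self.records, start=1)
        )
        return (
            "<table><tr><th>#</th><th>category</th><th>operation</th></tr>"
            f"{rows}</table>"
        )

summary = TrackerSummary([("io", "read_csv"), ("transform", "dropna")])
html = summary._repr_html_()
print(html)
```

Outside a notebook the method is simply never called, so the same object still prints with its ordinary repr.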

3. Anomaly detection

from autolineage.core.analyzer import LineageAnalyzer

analyzer = LineageAnalyzer(tracker)            # tracker = get_tracker(), as in step 2
analyzer.load_baseline("baseline.json")        # compare against a saved healthy run (see step 5)
anomalies = analyzer.detect_anomalies()

for a in anomalies:
    print(f"[{a.severity}] {a.message}")
# [critical] filter row change: -47,500 (baseline: -50, 94900% deviation)
# [critical] f1_score dropped from 0.9842 to 0.0000 (-100.0%)
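The percentages in the sample messages are consistent with a plain relative-deviation formula applied to magnitudes. The helper below (deviation_pct, a hypothetical name) reproduces both figures. That AutoLineage computes them this way is an assumption based on the numbers alone.

```python
def deviation_pct(observed: float, baseline: float) -> float:
    """Relative deviation of observed from baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (observed - baseline) / abs(baseline) * 100.0

# Row change of the filter step: 47,500 rows dropped vs. 50 in the baseline.
row_dev = deviation_pct(abs(-47_500), abs(-50))

# f1_score: 0.0000 in this run vs. 0.9842 in the baseline.
f1_dev = deviation_pct(0.0000, 0.9842)

print(f"{row_dev:.0f}%")   # 94900%
print(f"{f1_dev:.1f}%")    # -100.0%
```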

4. Root-cause localization

cause = analyzer.localize_root_cause("f1_score")
print(cause.explanation)
# "The most likely cause of f1_score degradation (from 0.9842 to 0.0000)
#  is 'filter' at step 5. Row change was -47,500 (baseline: -50)."
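One simple way to arrive at an explanation like this is to rank the steps upstream of the metric by how far their row change departs from the baseline. The sketch below does only that. The Step class, the scoring rule, and the dropna and drop_duplicates values are all hypothetical; only the filter figures come from the explanation above, and AutoLineage's real localization logic may differ.

```python
from dataclasses import dataclass

@dataclass
class Step:
    index: int
    name: str
    row_change: int
    baseline_row_change: int

def most_deviant_step(steps):
    """Return the step whose row change deviates most from its baseline."""
    def score(step):
        denom = max(abs(step.baseline_row_change), 1)  # avoid division by zero
        return abs(step.row_change - step.baseline_row_change) / denom
    return max(steps, key=score)

steps = [
    Step(3, "dropna", -120, -118),          # illustrative values
    Step(4, "drop_duplicates", -10, -10),   # illustrative values
    Step(5, "filter", -47_500, -50),        # figures from the explanation above
]

suspect = most_deviant_step(steps)
print(f"'{suspect.name}' at step {suspect.index}")   # 'filter' at step 5
```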

5. Save a fingerprint for future comparison

analyzer.save_fingerprint("baseline.json")     # after a healthy run

# Next run, in a different process (new_tracker is that run's tracker, e.g. from get_tracker()):
analyzer = LineageAnalyzer(new_tracker)
analyzer.load_baseline("baseline.json")
anomalies = analyzer.detect_anomalies()
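Because the baseline is a file, the healthy run and the later run do not need to share a process. The sketch below shows that round trip with the standard json module. The file layout (a mapping from step name to row change), the helper names, and the 50% tolerance are assumptions made for illustration; they are not AutoLineage's fingerprint format.

```python
import json
import os
import tempfile

def save_fingerprint(path, steps):
    """Write per-step row changes to a JSON file (hypothetical format)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"steps": steps}, f, indent=2)

def load_fingerprint(path):
    """Read back the mapping written by save_fingerprint."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["steps"]

def changed_steps(baseline, current, tolerance=0.5):
    """Names of steps that vanished or whose row change moved by more than
    `tolerance` relative to the baseline."""
    flagged = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None:
            flagged.append(name)  # step disappeared from the pipeline
        elif abs(cur - base) > tolerance * max(abs(base), 1):
            flagged.append(name)
    return flagged

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "baseline.json")
    save_fingerprint(path, {"dropna": -118, "filter": -50})   # healthy run
    baseline = load_fingerprint(path)                         # later run

flagged = changed_steps(baseline, {"dropna": -120, "filter": -47_500})
print(flagged)   # ['filter']
```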

What Gets Tracked

pandas (64 hooks): read_csv, to_csv, read_parquet, to_parquet, dropna, fillna, merge, concat, groupby + aggregations, drop_duplicates, boolean filtering, assign, sort_values, pivot_table, melt, plus 40+ more.

scikit-learn (175 hooks): train_test_split, estimator fit / predict / predict_proba / score across 30+ classes (RandomForest, LogisticRegression, DecisionTree, SVC, KNN, GradientBoosting, etc.), 18 preprocessor classes, 15 metric functions.

PySpark (49 hooks): DataFrame transforms, groupBy + aggregations, join variants, reader / writer methods, actions.

See autolineage/hooks/ for the full list.

Example Output

On a 284K-row credit card fraud detection pipeline (paper/credit_card_pipeline.py):

 1. [io        ] read_csv -> (284807, 31)                    [1280ms]
 2. [transform ] drop_duplicates (-1,081 rows)                [827ms]
 3. [transform ] filter (-284 rows)
 4. [transform ] assign -> 36 cols                              [1ms]
 5. [transform ] select -> 34 cols
 6. [split     ] train_test_split (80/20)                     [218ms]
 7. [preprocess] StandardScaler.fit_transform                 [201ms]
 8. [preprocess] StandardScaler.transform                      [17ms]
 9. [train     ] RandomForestClassifier.fit                 [88637ms]
10. [train     ] LogisticRegression.fit                      [1138ms]
11. [predict   ] RandomForestClassifier.predict               [332ms]
12. [predict   ] LogisticRegression.predict                     [4ms]
13. [predict   ] RandomForestClassifier.predict_proba         [311ms]
14. [evaluate  ] accuracy_score    = 0.9995
15. [evaluate  ] precision_score   = 0.8824
16. [evaluate  ] recall_score      = 0.7895
17. [evaluate  ] f1_score          = 0.8333
18. [evaluate  ] roc_auc_score     = 0.9871

18 clean records. Zero noise. End-to-end trace from CSV to metrics.

Architecture

Plugin-based. Each hook provider is a single file implementing BaseHookProvider. Adding a new library takes about 200 lines and no changes to the core.

   User Code (unchanged)
           |
   Hook Providers (pandas | sklearn | pyspark | ...)
           |
   UnifiedTracker + TransformationRecord
           |
   LineageAnalyzer  →  anomalies, root causes, fingerprints
   Visualizer       →  HTML / DOT / Mermaid / Jupyter

Performance

Per-operation instrumentation cost on a 37-operation pipeline (Intel i7-12700H, Python 3.12, pandas 3.0):

Condition	Mean time per call	95% CI
Baseline (no instrumentation)	263.5 µs	± 8.8 µs
With AutoLineage	348.2 µs	± 9.0 µs
Overhead	84.7 µs / op	[78, 91] µs

At production data scales (≥10⁵ rows), end-to-end overhead becomes indistinguishable from baseline variance because framework computation dominates wall-clock time. See paper/scaling_results.csv for the full scaling study.
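A per-call overhead figure like the one in the table can be estimated by timing the same function with and without a recording wrapper. The sketch below does this with the standard timeit module. The instrument wrapper and the work function are hypothetical stand-ins, so the timings it prints will not match the table; the last lines only restate the table's own arithmetic.

```python
import functools
import timeit

def instrument(func, log):
    """Wrap func so each call appends a small record to log."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        log.append(func.__name__)
        return result
    return wrapper

def work(n=200):
    """Cheap stand-in for a framework operation."""
    return sum(range(n))

log = []
wrapped = instrument(work, log)

calls = 20_000
base_us = timeit.timeit(work, number=calls) / calls * 1e6
inst_us = timeit.timeit(wrapped, number=calls) / calls * 1e6
print(f"baseline {base_us:.2f} us, instrumented {inst_us:.2f} us")

# The table's own arithmetic: 348.2 - 263.5 = 84.7 us, about 32% of baseline.
overhead_us = round(348.2 - 263.5, 1)
overhead_pct = round(overhead_us / 263.5 * 100, 1)
print(overhead_us, overhead_pct)
```

The relative cost shrinks as the wrapped operation gets more expensive, which matches the scaling remark above: a fixed per-call cost is small next to an operation that takes milliseconds or seconds.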

Limitations

Single-process. Pipelines spanning multiple machines require manual trace correlation. OpenTelemetry export is planned.
Monkey-patching is version-sensitive. Tested against pandas 2.x / 3.x, scikit-learn 1.x, PySpark 3.x / 4.x.
Import order matters. import autolineage.auto must come before from sklearn.metrics import f1_score (or any other hooked symbol) — otherwise the local reference will bypass the wrapper. AutoLineage will warn you when this happens.
C-extension code is invisible. Operations that execute entirely in compiled code without re-entering Python (e.g., certain numpy reductions) are not captured.
Python-only. R, Julia, Java are out of scope.

Contributing

Add a new library in 5 steps:

1. Create autolineage/hooks/your_lib_hooks.py
2. Subclass BaseHookProvider
3. Implement install(tracker) and uninstall()
4. Register in autolineage/hooks/registry.py
5. Open a PR

See autolineage/hooks/pandas_io.py for the smallest working example (~110 LoC).
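The steps above can be sketched as a provider that swaps a library function for a recording wrapper on install and restores it on uninstall. Everything below is a self-contained illustration under assumptions: BaseHookProvider is redefined locally with only the two methods named above, your_lib is a fake module, and the tracker is a plain list rather than AutoLineage's tracker. Check the real base class and registry for the actual signatures.

```python
import types
from abc import ABC, abstractmethod

class BaseHookProvider(ABC):
    """Local stand-in: only install(tracker) and uninstall() are assumed."""

    @abstractmethod
    def install(self, tracker): ...

    @abstractmethod
    def uninstall(self): ...

# Hypothetical library to hook; stands in for a real third-party module.
your_lib = types.ModuleType("your_lib")
your_lib.load = lambda path: [1, 2, 3]

class YourLibHooks(BaseHookProvider):
    def __init__(self):
        self._originals = {}

    def install(self, tracker):
        original = your_lib.load
        self._originals["load"] = original

        def tracked_load(*args, **kwargs):
            result = original(*args, **kwargs)
            # The tracker is a plain list in this sketch.
            tracker.append(("io", "load", len(result)))
            return result

        your_lib.load = tracked_load

    def uninstall(self):
        # Restore every patched attribute so the library behaves as before.
        for name, original in self._originals.items():
            setattr(your_lib, name, original)
        self._originals.clear()

tracker = []
hooks = YourLibHooks()
hooks.install(tracker)
your_lib.load("data.bin")      # recorded
hooks.uninstall()
your_lib.load("data.bin")      # not recorded
print(tracker)                 # [('io', 'load', 3)]
```

Keeping the originals in a dictionary makes uninstall a plain restore loop, which is useful in tests that need the unpatched library back.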

Development

git clone https://github.com/kishanraj41/autolineage
cd autolineage
pip install -e ".[dev]"
pytest tests/                      # 51 tests
python examples/anomaly_demo.py    # full end-to-end demo

License

MIT

Citation

If you use AutoLineage in your research, please cite:

@misc{vandhavasi2026autolineage,
  title={AutoLineage: Operation-Level Data Lineage for Python ML Pipelines via Import-Time Hooking},
  author={Vandhavasi, Kishan Raj},
  year={2026},
  eprint={2604.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.SE}
}

Release history

0.6.1 (May 23, 2026)
0.4.1 (Apr 27, 2026)
0.4.0 (Apr 27, 2026, yanked; the version described on this page)
0.3.0 (Apr 17, 2026)
0.2.0 (Feb 28, 2026)
0.1.0 (Feb 17, 2026)
