Automatic ML data lineage tracking with zero manual logging

AutoLineage

Zero-code data lineage for Python ML pipelines.

AutoLineage automatically records every DataFrame operation, model training step, and metric evaluation across pandas, scikit-learn, and PySpark — and then detects anomalies and pinpoints root causes when something goes wrong. One import activates 288 hooks. No decorators, no wrapper classes, no configuration files.

import autolineage.auto        # that's the whole setup

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

df = pd.read_csv("data.csv").dropna()
X = df.drop(columns=['target'])
y = df['target']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier().fit(X_tr, y_tr)
preds = model.predict(X_te)
score = f1_score(y_te, preds)

# AutoLineage has tracked every line above into one DAG.
from autolineage.auto import get_tracker
get_tracker().visualize()      # opens an interactive lineage graph

[Figure: interactive lineage graph]

Click any node to see operation metadata, shape changes, and upstream dependencies. Export to JSON, Graphviz DOT, Mermaid markup, or self-contained HTML.


Why AutoLineage?

ML pipelines fail silently. A model whose F1 drops from 0.98 to 0.00 invites hours of print(df.shape) debugging. Existing tools require explicit instrumentation (MLflow), track only files (DVC), or cover only a single stage (Evidently, Arize). No existing tool automatically records the complete path from read_csv through f1_score in one graph and then tells you which operation caused a metric to drop.

AutoLineage closes that gap.

Compared to other tools

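| Capability                | AutoLineage                | MLflow | Evidently  | OpenLineage | DataLineagePy |
|---------------------------|----------------------------|--------|------------|-------------|---------------|
| Zero code changes         | Yes                        | No     | No         | No          | No (wrapper)  |
| Operation-level           | Yes                        | No     | No         | Job-level   | Yes           |
| Cross-framework           | pandas + sklearn + PySpark |        |            | Spark only  | pandas only   |
| End-to-end trace          | Yes                        | No     | No         | No          | No            |
| Anomaly detection         | Yes                        | No     | Drift only | No          | No            |
| Root-cause localization   | Yes                        | No     | No         | No          | No            |
| Interactive visualization | Yes                        | Web UI | Web UI     | Web UI      | No            |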

Catching pipeline bugs automatically

This is what AutoLineage is for. Run python examples/anomaly_demo.py and watch a single-line filter bug get detected and localized:

[Figure: anomaly detection terminal output]

The demo runs the same pipeline twice — once cleanly, once with a corrupted filter — and AutoLineage catches the row-count anomaly, the F1 collapse, and identifies the exact line that caused both. No manual instrumentation, no print statements.


Installation

pip install autolineage

Optional extras:

pip install autolineage[jupyter]   # rich notebook output
pip install autolineage[dev]       # tests and benchmarks

Quick Start

1. Automatic tracking (one line)

import autolineage.auto         # put this first, before importing pandas / sklearn / pyspark

# Use pandas / sklearn / pyspark normally — every operation is tracked
import pandas as pd
df = pd.read_csv("data.csv").dropna().drop_duplicates()

from autolineage.auto import get_tracker
get_tracker().visualize()       # opens HTML graph in your browser

Why first? import autolineage.auto patches framework methods at import time. If you write from sklearn.metrics import f1_score before this line, your local f1_score reference will bypass the wrapper. AutoLineage will warn you when this happens, but the easiest fix is to put import autolineage.auto at the top of your file.

2. Visualize the lineage

tracker = get_tracker()

tracker.visualize()                          # interactive HTML, opens in browser
tracker.visualize("trace.html")              # custom path, no browser pop-up
tracker.to_dot()                             # Graphviz DOT
tracker.to_mermaid()                         # Markdown-friendly Mermaid

In Jupyter notebooks, putting the tracker as the last expression in a cell auto-renders a summary table:

get_tracker()  # in a Jupyter cell — produces a rich HTML table inline

3. Anomaly detection

from autolineage.core.analyzer import LineageAnalyzer

analyzer = LineageAnalyzer(tracker)
analyzer.load_baseline("baseline.json")        # compare against a saved healthy run
anomalies = analyzer.detect_anomalies()

for a in anomalies:
    print(f"[{a.severity}] {a.message}")
# [critical] filter row change: -47,500 (baseline: -50, 94900% deviation)
# [critical] f1_score dropped from 0.9842 to 0.0000 (-100.0%)
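The percentages in the sample output are consistent with a plain relative-change calculation against the baseline. The helper below is a sketch under that assumption; deviation_pct is a hypothetical name, and AutoLineage's actual formula is not documented here.

```python
def deviation_pct(observed: float, baseline: float) -> float:
    """Relative change in magnitude of `observed` versus `baseline`, in percent.

    Assumed formula, chosen because it reproduces the sample output.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (abs(observed) - abs(baseline)) / abs(baseline) * 100.0

# Row-count change at the filter step: -47,500 rows vs. a baseline of -50.
row_dev = deviation_pct(-47_500, -50)

# f1_score: 0.0000 vs. a baseline of 0.9842.
f1_dev = deviation_pct(0.0, 0.9842)

print(f"{row_dev:.0f}%")   # 94900%
print(f"{f1_dev:.1f}%")    # -100.0%
```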

4. Root-cause localization

cause = analyzer.localize_root_cause("f1_score")
print(cause.explanation)
# "The most likely cause of f1_score degradation (from 0.9842 to 0.0000)
#  is 'filter' at step 5. Row change was -47,500 (baseline: -50)."
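One plausible way to arrive at an answer like this is to walk the pipeline in order and report the first step whose row change departs sharply from the baseline. The sketch below does that for a linear pipeline. Step, earliest_anomalous_step, the 50% threshold, and every number except the filter values are assumptions for illustration; this is not AutoLineage's actual algorithm, which may also follow upstream dependencies in the DAG.

```python
from dataclasses import dataclass

@dataclass
class Step:
    index: int
    name: str
    baseline_rows: int   # row change in the healthy run
    observed_rows: int   # row change in the current run

def earliest_anomalous_step(steps, threshold_pct=50.0):
    """Return the first step (in pipeline order) whose row change deviates
    from its baseline by more than threshold_pct, or None if none does."""
    for step in sorted(steps, key=lambda s: s.index):
        if step.baseline_rows == 0:
            # No rows changed in the baseline: any change now is suspicious.
            if step.observed_rows != 0:
                return step
            continue
        dev = (abs(step.observed_rows - step.baseline_rows)
               / abs(step.baseline_rows) * 100.0)
        if dev > threshold_pct:
            return step
    return None

# Illustrative values; only the filter numbers mirror the sample output.
steps = [
    Step(3, "dropna", -120, -120),
    Step(4, "drop_duplicates", -30, -30),
    Step(5, "filter", -50, -47_500),
    Step(6, "train_test_split", 0, 0),
]
suspect = earliest_anomalous_step(steps)
print(suspect.name, suspect.index)   # filter 5
```

Picking the earliest deviating step rather than the largest one reflects the idea that downstream anomalies are usually consequences of the first one.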

5. Save a fingerprint for future comparison

analyzer.save_fingerprint("baseline.json")     # after a healthy run

# Next run, in a different process, after the pipeline has executed:
new_tracker = get_tracker()
analyzer = LineageAnalyzer(new_tracker)
analyzer.load_baseline("baseline.json")
anomalies = analyzer.detect_anomalies()

What Gets Tracked

pandas (64 hooks): read_csv, to_csv, read_parquet, to_parquet, dropna, fillna, merge, concat, groupby + aggregations, drop_duplicates, boolean filtering, assign, sort_values, pivot_table, melt, plus 40+ more.

scikit-learn (175 hooks): train_test_split, estimator fit / predict / predict_proba / score across 30+ classes (RandomForest, LogisticRegression, DecisionTree, SVC, KNN, GradientBoosting, etc.), 18 preprocessor classes, 15 metric functions.

PySpark (49 hooks): DataFrame transforms, groupBy + aggregations, join variants, reader / writer methods, actions.

See autolineage/hooks/ for the full list.


Example Output

On a 284K-row credit card fraud detection pipeline (paper/credit_card_pipeline.py):

 1. [io        ] read_csv -> (284807, 31)                    [1280ms]
 2. [transform ] drop_duplicates (-1,081 rows)                [827ms]
 3. [transform ] filter (-284 rows)
 4. [transform ] assign -> 36 cols                              [1ms]
 5. [transform ] select -> 34 cols
 6. [split     ] train_test_split (80/20)                     [218ms]
 7. [preprocess] StandardScaler.fit_transform                 [201ms]
 8. [preprocess] StandardScaler.transform                      [17ms]
 9. [train     ] RandomForestClassifier.fit                 [88637ms]
10. [train     ] LogisticRegression.fit                      [1138ms]
11. [predict   ] RandomForestClassifier.predict               [332ms]
12. [predict   ] LogisticRegression.predict                     [4ms]
13. [predict   ] RandomForestClassifier.predict_proba         [311ms]
14. [evaluate  ] accuracy_score    = 0.9995
15. [evaluate  ] precision_score   = 0.8824
16. [evaluate  ] recall_score      = 0.7895
17. [evaluate  ] f1_score          = 0.8333
18. [evaluate  ] roc_auc_score     = 0.9871

18 clean records. Zero noise. End-to-end trace from CSV to metrics.


Architecture

Plugin-based. Each library is a single file implementing BaseHookProvider. Adding new libraries requires ~200 lines and zero changes to the core.

   User Code (unchanged)
           |
   Hook Providers (pandas | sklearn | pyspark | ...)
           |
   UnifiedTracker + TransformationRecord
           |
   LineageAnalyzer  →  anomalies, root causes, fingerprints
   Visualizer       →  HTML / DOT / Mermaid / Jupyter

Performance

Per-operation instrumentation cost on a 37-operation pipeline (Intel i7-12700H, Python 3.12, pandas 3.0):

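| Condition                     | Mean time per call | 95% CI      |
|-------------------------------|--------------------|-------------|
| Baseline (no instrumentation) | 263.5 µs           | ± 8.8 µs    |
| With AutoLineage              | 348.2 µs           | ± 9.0 µs    |
| Overhead                      | 84.7 µs / op       | [78, 91] µs |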

At production data scales (≥10⁵ rows), end-to-end overhead becomes indistinguishable from baseline variance because framework computation dominates wall-clock time. See paper/scaling_results.csv for the full scaling study.
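The arithmetic behind these figures is short enough to check by hand. The sketch below recomputes the overhead from the two reported means and, assuming each of the 37 operations runs once, estimates the total added time for the benchmark pipeline.

```python
baseline_us = 263.5       # mean time per call without instrumentation
instrumented_us = 348.2   # mean time per call with AutoLineage
n_ops = 37                # operations in the benchmark pipeline

overhead_us = instrumented_us - baseline_us
relative_pct = overhead_us / baseline_us * 100
# Assumption: every operation is called exactly once per pipeline run.
total_overhead_ms = overhead_us * n_ops / 1000

print(f"{overhead_us:.1f} µs per op")          # 84.7 µs per op
print(f"{relative_pct:.1f}% per call")         # 32.1% per call
print(f"{total_overhead_ms:.2f} ms in total")  # 3.13 ms in total
```

A fixed cost of roughly 3 ms is small next to a step like the 88,637 ms RandomForestClassifier.fit in the example output, which is why the overhead disappears into run-to-run variance on large data.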


Limitations

  • Single-process. Pipelines spanning multiple machines require manual trace correlation. OpenTelemetry export is planned.
  • Monkey-patching is version-sensitive. Tested against pandas 2.x / 3.x, scikit-learn 1.x, PySpark 3.x / 4.x.
  • Import order matters. import autolineage.auto must come before from sklearn.metrics import f1_score (or any other hooked symbol) — otherwise the local reference will bypass the wrapper. AutoLineage will warn you when this happens.
  • C-extension code is invisible. Operations that execute entirely in compiled code without re-entering Python (e.g., certain numpy reductions) are not captured.
  • Python-only. R, Julia, Java are out of scope.

Contributing

Add a new library in 5 steps:

  1. Create autolineage/hooks/your_lib_hooks.py
  2. Subclass BaseHookProvider
  3. Implement install(tracker) and uninstall()
  4. Register in autolineage/hooks/registry.py
  5. Open a PR

See autolineage/hooks/pandas_io.py for the smallest working example (~110 LoC).
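To make steps 2 and 3 concrete, here is a self-contained sketch of the install / uninstall pattern. Everything in it is a stand-in: this BaseHookProvider, ListTracker, its record method, and the toy Table class are hypothetical and only mirror the shape described above. Check the real base class in the repository before copying any signature.

```python
import functools

class BaseHookProvider:
    """Stand-in for AutoLineage's base class; the real interface may differ."""
    def install(self, tracker):
        raise NotImplementedError

    def uninstall(self):
        raise NotImplementedError

class ListTracker:
    """Hypothetical tracker that just collects records in a list."""
    def __init__(self):
        self.records = []

    def record(self, operation, **metadata):
        self.records.append({"operation": operation, **metadata})

class Table:
    """Toy library class to be hooked (stands in for e.g. a DataFrame)."""
    def __init__(self, rows):
        self.rows = list(rows)

    def drop_none(self):
        return Table(r for r in self.rows if r is not None)

class TableHooks(BaseHookProvider):
    def __init__(self):
        self._original = None

    def install(self, tracker):
        original = Table.drop_none
        self._original = original

        @functools.wraps(original)
        def wrapper(table):
            result = original(table)
            # Record the shape change after the real method has run.
            tracker.record("drop_none",
                           rows_in=len(table.rows),
                           rows_out=len(result.rows))
            return result

        Table.drop_none = wrapper

    def uninstall(self):
        # Restore the unpatched method so the library behaves as before.
        if self._original is not None:
            Table.drop_none = self._original
            self._original = None

demo_tracker = ListTracker()
hooks = TableHooks()
hooks.install(demo_tracker)
Table([1, None, 3]).drop_none()   # recorded
hooks.uninstall()
Table([None]).drop_none()         # not recorded: original method restored
print(demo_tracker.records)
# [{'operation': 'drop_none', 'rows_in': 3, 'rows_out': 2}]
```

Keeping a reference to the original method inside the provider is what makes uninstall() a clean reversal, which matters for tests that install and remove hooks repeatedly.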


Development

git clone https://github.com/kishanraj41/autolineage
cd autolineage
pip install -e ".[dev]"
pytest tests/                      # 51 tests
python examples/anomaly_demo.py    # full end-to-end demo

License

MIT


Citation

If you use AutoLineage in your research, please cite:

@misc{vandhavasi2026autolineage,
  title={AutoLineage: Operation-Level Data Lineage for Python ML Pipelines via Import-Time Hooking},
  author={Vandhavasi, Kishan Raj},
  year={2026},
  eprint={2604.XXXXX},
  archivePrefix={arXiv},
  primaryClass={cs.SE}
}
