
Automatic ML data lineage tracking with zero manual logging


AutoLineage

Zero-code data lineage for Python ML pipelines.

AutoLineage automatically records every DataFrame operation, model training step, and metric evaluation across pandas, scikit-learn, and PySpark. One import activates 288 hooks. No decorators, no wrapper classes, no configuration files.

import autolineage.auto        # that's the whole setup

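import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
df = df.dropna()
X, y = df.drop(columns=["target"]), df["target"]

# Hold out a test set so the score is measured on unseen rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier().fit(X_train, y_train)
preds = model.predict(X_test)
score = f1_score(y_test, preds)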

# AutoLineage has now tracked every operation above into one unified DAG.

Why AutoLineage?

ML pipelines fail silently. A model whose F1 drops from 0.95 to 0.60 invites hours of print(df.shape) debugging. Existing tools either require explicit instrumentation (MLflow), track only files (DVC), or cover only a single stage (Evidently, Arize). No existing tool records the complete path from read_csv through f1_score in one graph automatically.

AutoLineage closes that gap.

Compared to other tools

| Capability | AutoLineage | MLflow | Evidently | OpenLineage | DataLineagePy |
|---|---|---|---|---|---|
| Zero code changes | Yes | No | No | No | No (wrapper) |
| Operation-level | Yes | No | No | Job-level | Yes |
| Cross-framework | pandas + sklearn + PySpark | | | Spark only | pandas only |
| End-to-end trace | Yes | No | No | No | No |
| Anomaly detection | Yes | No | Drift only | No | No |
| Root-cause localization | Yes | No | No | No | No |

Installation

pip install autolineage

Quick Start

1. Automatic tracking (one line)

import autolineage.auto

# Use pandas and sklearn normally
import pandas as pd
df = pd.read_csv("iris.csv")
df = df.dropna().drop_duplicates()

# See what happened
from autolineage.auto import get_tracker
tracker = get_tracker()
for rec in tracker.records:
    print(f"{rec.operation}: {rec.input_shape} -> {rec.output_shape}")

2. Anomaly detection

from autolineage.core.analyzer import LineageAnalyzer

analyzer = LineageAnalyzer(tracker)
anomalies = analyzer.detect_anomalies()

for a in anomalies:
    print(f"[{a.severity}] {a.message}")
# [critical] filter removed 99.9% of rows (100000 -> 50)
# [critical] f1_score = 0.0 (model may not be learning)
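
A row-loss check like the first message above could work roughly as follows. This is a minimal sketch, not AutoLineage's implementation: the `Record` dataclass, the `detect_row_loss` function, and the 90% threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a tracked operation (not AutoLineage's class)."""
    operation: str
    rows_in: int
    rows_out: int


def detect_row_loss(records, threshold=0.90):
    """Flag operations that drop more than `threshold` of their input rows."""
    findings = []
    for rec in records:
        if rec.rows_in == 0:
            continue  # nothing to lose; also avoids dividing by zero
        lost = (rec.rows_in - rec.rows_out) / rec.rows_in
        if lost > threshold:
            findings.append(
                f"[critical] {rec.operation} removed {lost:.2%} of rows "
                f"({rec.rows_in} -> {rec.rows_out})"
            )
    return findings


records = [
    Record("dropna", 100_000, 99_000),  # 1% loss: ignored
    Record("filter", 100_000, 50),      # 99.95% loss: flagged
]
messages = detect_row_loss(records)
for m in messages:
    print(m)
```

A fixed threshold is the simplest possible rule; a baseline-relative comparison would avoid flagging filters that are expected to be aggressive.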

3. Root-cause localization

cause = analyzer.localize_root_cause(metric_name="accuracy")
print(cause.explanation)
# "The most likely cause of accuracy degradation is 'filter' at step 3.
#  Row change was -99,950 (baseline: -2,100)."
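
The idea behind an explanation like the one above can be sketched as a comparison of per-step row changes against a baseline run. The function below is an illustrative assumption, not the algorithm AutoLineage uses: it assumes both runs contain the same steps in the same order and simply picks the step whose row change deviates most.

```python
def localize_root_cause(current, baseline):
    """Return (step, operation, delta, baseline_delta) for the step whose
    row change deviates most from the baseline run.

    Both arguments are lists of (operation, row_delta) tuples in pipeline
    order. Hypothetical helper, not part of AutoLineage.
    """
    worst = None
    worst_gap = -1
    pairs = zip(current, baseline)
    for step, ((op, delta), (_, base_delta)) in enumerate(pairs, start=1):
        gap = abs(delta - base_delta)
        if gap > worst_gap:
            worst, worst_gap = (step, op, delta, base_delta), gap
    return worst


baseline = [("read_csv", 0), ("dropna", -1_000), ("filter", -2_100)]
current = [("read_csv", 0), ("dropna", -1_050), ("filter", -99_950)]

step, op, delta, base_delta = localize_root_cause(current, baseline)
print(
    f"Most likely cause: '{op}' at step {step}. "
    f"Row change was {delta:,} (baseline: {base_delta:,})."
)
```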

4. Save a fingerprint for future comparison

# After a healthy run
analyzer.save_fingerprint("baseline.json")

# On the next run
analyzer.load_baseline("baseline.json")
anomalies = analyzer.detect_anomalies()  # compared to baseline
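
To make the fingerprint idea concrete, here is a minimal sketch of saving a healthy run's summary to JSON and comparing a later run against it. The file layout (operation name mapped to row delta), the helper names, and the 50% relative tolerance are assumptions for illustration; the actual format written by `save_fingerprint` is not described here.

```python
import json
import os
import tempfile


def save_fingerprint(records, path):
    """Write an {operation: row_delta} summary of a healthy run to JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({op: delta for op, delta in records}, fh, indent=2)


def load_fingerprint(path):
    """Read a summary previously written by save_fingerprint."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def compare_to_baseline(records, baseline, tolerance=0.5):
    """Return operations whose row delta differs from the baseline by more
    than `tolerance` (relative). Operations absent from the baseline are
    skipped."""
    drifted = []
    for op, delta in records:
        if op not in baseline:
            continue
        base = baseline[op]
        scale = max(abs(base), 1)  # avoid dividing by zero for no-op baselines
        if abs(delta - base) / scale > tolerance:
            drifted.append(op)
    return drifted


healthy = [("dropna", -1_000), ("filter", -2_100)]
today = [("dropna", -1_020), ("filter", -99_950)]

path = os.path.join(tempfile.mkdtemp(), "baseline.json")
save_fingerprint(healthy, path)
drifted = compare_to_baseline(today, load_fingerprint(path))
print(drifted)
```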

What Gets Tracked

pandas (64 hooks): read_csv, to_csv, dropna, fillna, merge, concat, groupby + aggregations, drop_duplicates, filter, assign, sort_values, pivot_table, melt, plus 40+ more.

scikit-learn (175 hooks): train_test_split, estimator fit/predict/predict_proba/score (RandomForest, LogisticRegression, DecisionTree, SVC, KNN, etc.), 18 preprocessor classes, 15 metric functions.

PySpark (49 hooks): DataFrame transforms, groupBy aggregations, join variants, reader/writer methods, actions.


Example Output

On a 284K-row credit card fraud detection pipeline:

 1. [io        ] read_csv -> (284807, 31)                    [1280ms]
 2. [transform ] drop_duplicates (-1,081 rows)                [827ms]
 3. [transform ] filter (-284 rows)
 4. [transform ] assign -> 36 cols                              [1ms]
 5. [transform ] select -> 34 cols
 6. [split     ] train_test_split (80/20)                     [218ms]
 7. [preprocess] StandardScaler.fit_transform                 [201ms]
 8. [preprocess] StandardScaler.transform                      [17ms]
 9. [train     ] RandomForestClassifier.fit                 [88637ms]
10. [train     ] LogisticRegression.fit                      [1138ms]
11. [predict   ] RandomForestClassifier.predict               [332ms]
12. [predict   ] LogisticRegression.predict                     [4ms]
13. [predict   ] RandomForestClassifier.predict_proba         [311ms]
14. [evaluate  ] accuracy_score = 0.9995
15. [evaluate  ] precision_score = 0.8824
16. [evaluate  ] recall_score = 0.7895
17. [evaluate  ] f1_score = 0.8333
18. [evaluate  ] roc_auc_score = 0.9871

18 clean records. Zero noise. End-to-end trace from CSV to metrics.


Architecture

Plugin-based. Each library is a single file implementing BaseHookProvider. Adding new libraries requires ~200 lines and zero changes to the core.

User Code (unchanged)
        |
Hook Providers (pandas | sklearn | pyspark | ...)
        |
UnifiedTracker + TransformationRecord
        |
LineageAnalyzer -> anomalies, root causes, DAGs
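
The last layer of the diagram turns flat records into a graph. As a rough sketch of how that could work, assume each record carries the ids of its input objects and of its output object (this tuple layout and the helper names are hypothetical, not the `TransformationRecord` schema). Walking parent links from a metric then yields everything it depends on.

```python
# Each hypothetical record: (operation, input object ids, output object id).
records = [
    ("read_csv", [], "df0"),
    ("dropna", ["df0"], "df1"),
    ("train_test_split", ["df1"], "train"),
    ("train_test_split", ["df1"], "test"),
    ("RandomForestClassifier.fit", ["train"], "model"),
    ("RandomForestClassifier.predict", ["model", "test"], "preds"),
    ("f1_score", ["preds", "test"], "f1"),
]


def build_dag(records):
    """Map every output id to (operation, [parent ids])."""
    return {out: (op, list(inputs)) for op, inputs, out in records}


def upstream(dag, node):
    """Return every ancestor of `node`, nearest first (breadth-first)."""
    seen, order, frontier = set(), [], [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            for parent in dag.get(current, ("", []))[1]:
                if parent not in seen:
                    seen.add(parent)
                    order.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return order


dag = build_dag(records)
ancestors = upstream(dag, "f1")
print(ancestors)
```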

Performance

Benchmarked on a 37-operation pipeline (50K rows, pandas + sklearn):

| Condition | Wall time |
|---|---|
| Without AutoLineage | 0.050s |
| With AutoLineage | 0.054s |
| Overhead | 6.1% (0.08ms per operation) |
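
For readers who want to measure overhead on their own pipeline, a small harness along these lines may help. It is a sketch, not the benchmark script behind the figures above: `time_pipeline`, `overhead`, and the placeholder workload are hypothetical, and the numbers passed to `overhead` are illustrative only.

```python
import time


def time_pipeline(fn, repeats=5):
    """Return the best wall-clock time of fn() over `repeats` runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def overhead(baseline_s, tracked_s, n_operations):
    """Return (relative overhead in percent, added milliseconds per operation)."""
    extra = tracked_s - baseline_s
    return 100.0 * extra / baseline_s, 1000.0 * extra / n_operations


def pipeline():
    # Placeholder workload; substitute the real pandas + sklearn pipeline.
    sum(i * i for i in range(10_000))


elapsed = time_pipeline(pipeline)

# Illustrative inputs, not measurements: 1.0s untracked, 1.1s tracked, 10 ops.
pct, per_op_ms = overhead(baseline_s=1.0, tracked_s=1.1, n_operations=10)
print(f"{pct:.1f}% overhead, {per_op_ms:.1f} ms per operation")
```

Taking the best of several runs reduces noise from other processes. To compare conditions, time the pipeline once in a fresh interpreter without the `autolineage.auto` import and once with it.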

Limitations

  • Single-process. Pipelines spanning multiple machines require manual trace correlation. OpenTelemetry export is planned.
  • Monkey-patching is version-sensitive. Tested against pandas 2.x/3.x, scikit-learn 1.x, PySpark 3.x/4.x.
  • Python-only. R, Julia, Java are out of scope.
  • In-memory records. Long notebook sessions accumulate state.

Contributing

Add a new library in 5 steps:

  1. Create autolineage/hooks/your_lib_hooks.py
  2. Subclass BaseHookProvider
  3. Implement install(tracker) and uninstall()
  4. Add to the registry in hooks/registry.py
  5. Open a PR

See hooks/pandas_io.py for the smallest working example (~110 LoC).
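
Steps 2 and 3 can be sketched as follows. Only the names `BaseHookProvider`, `install(tracker)`, and `uninstall()` come from the steps above; the base-class body, the stand-in `Table` class, and the use of a plain list as the tracker are assumptions made so the example runs on its own.

```python
import functools
from abc import ABC, abstractmethod


class BaseHookProvider(ABC):
    """Assumed shape of the base class; the real one may differ."""

    @abstractmethod
    def install(self, tracker): ...

    @abstractmethod
    def uninstall(self): ...


class Table:
    """Stand-in for a third-party class we want to track."""

    def __init__(self, rows):
        self.rows = list(rows)

    def drop_empty(self):
        return Table(r for r in self.rows if r)


class TableHookProvider(BaseHookProvider):
    def __init__(self):
        self._original = None

    def install(self, tracker):
        if self._original is not None:
            return  # already installed; avoid double-wrapping
        original = Table.drop_empty
        self._original = original

        @functools.wraps(original)
        def wrapper(table, *args, **kwargs):
            result = original(table, *args, **kwargs)
            # Record the operation name with input and output row counts.
            tracker.append(("drop_empty", len(table.rows), len(result.rows)))
            return result

        Table.drop_empty = wrapper

    def uninstall(self):
        if self._original is not None:
            Table.drop_empty = self._original
            self._original = None


tracker = []  # a plain list stands in for the tracker
provider = TableHookProvider()
provider.install(tracker)
Table([1, 0, 2, None]).drop_empty()
provider.uninstall()
Table([0, 1]).drop_empty()  # not recorded: the hook has been removed
print(tracker)
```

Keeping a reference to the original method is what makes `uninstall()` safe: the patch can be reverted exactly, which matters given that monkey-patching is version-sensitive.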


Development

git clone https://github.com/kishanraj41/autolineage
cd autolineage
pip install -e ".[dev]"
pytest tests/test_v2.py -v     # 36 tests

License

MIT


Citation

@misc{vandhavasi2026autolineage,
  title={AutoLineage: Zero-Code End-to-End Data Lineage for ML Pipelines},
  author={Vandhavasi, Kishan Raj},
  year={2026},
  archivePrefix={arXiv},
  primaryClass={cs.SE}
}
