Automatic ML data lineage tracking with zero manual logging
AutoLineage
Zero-code data lineage for Python ML pipelines.
AutoLineage automatically records every DataFrame operation, model training step, and metric evaluation across pandas, scikit-learn, and PySpark. One import activates 288 hooks. No decorators, no wrapper classes, no configuration files.
import autolineage.auto # that's the whole setup
import pandas as pd
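from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
df = pd.read_csv("data.csv")
df = df.dropna()
X, y = df.drop(columns=['target']), df['target']
# Hold out a test set so that X_test and y_test are defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier().fit(X_train, y_train)
preds = model.predict(X_test)
score = f1_score(y_test, preds)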
# AutoLineage has now tracked every operation above into one unified DAG.
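As a rough illustration of how a hook can observe a call without changing user code, here is a minimal monkey-patching sketch. It is not AutoLineage's implementation: `TinyFrame`, `install_hook`, and the `records` list are hypothetical names invented for this example, and a stand-in class replaces pandas so the snippet runs with the standard library alone.

```python
import functools


class TinyFrame:
    """Hypothetical stand-in for a DataFrame; it only knows its rows and shape."""

    def __init__(self, rows):
        self.rows = list(rows)

    @property
    def shape(self):
        width = len(self.rows[0]) if self.rows else 0
        return (len(self.rows), width)

    def dropna(self):
        # Keep only rows that contain no None values.
        return TinyFrame(r for r in self.rows if None not in r)


records = []  # hypothetical log of (operation, input_shape, output_shape)


def install_hook(cls, method_name):
    """Replace cls.method_name with a logging wrapper and return the original."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        records.append((method_name, self.shape, getattr(result, "shape", None)))
        return result

    setattr(cls, method_name, wrapper)
    return original


install_hook(TinyFrame, "dropna")

frame = TinyFrame([(1, 2), (3, None), (5, 6)])
cleaned = frame.dropna()  # the call site is unchanged; logging is a side effect
print(records)  # [('dropna', (3, 2), (2, 2))]
```

Returning the original method from `install_hook` makes it possible to restore it later, which matters because patched methods otherwise stay patched for the life of the process.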
Why AutoLineage?
ML pipelines fail silently. When a model's F1 drops from 0.95 to 0.60, the usual response is hours of print(df.shape) debugging. Existing tools require explicit instrumentation (MLflow), track only files (DVC), or cover a single stage (Evidently, Arize). None of them automatically records the complete path from read_csv through f1_score in one graph.
AutoLineage closes that gap.
Compared to other tools
| Capability | AutoLineage | MLflow | Evidently | OpenLineage | DataLineagePy |
|---|---|---|---|---|---|
| Zero code changes | Yes | No | No | No | No (wrapper) |
| Operation-level | Yes | No | No | Job-level | Yes |
| Cross-framework | pandas + sklearn + PySpark | — | — | Spark only | pandas only |
| End-to-end trace | Yes | No | No | No | No |
| Anomaly detection | Yes | No | Drift only | No | No |
| Root-cause localization | Yes | No | No | No | No |
Installation
pip install autolineage
Quick Start
1. Automatic tracking (one line)
import autolineage.auto
# Use pandas and sklearn normally
import pandas as pd
df = pd.read_csv("iris.csv")
df = df.dropna().drop_duplicates()
# See what happened
from autolineage.auto import get_tracker
tracker = get_tracker()
for rec in tracker.records:
    print(f"{rec.operation}: {rec.input_shape} -> {rec.output_shape}")
2. Anomaly detection
from autolineage.core.analyzer import LineageAnalyzer
analyzer = LineageAnalyzer(tracker)
anomalies = analyzer.detect_anomalies()
for a in anomalies:
    print(f"[{a.severity}] {a.message}")
# [critical] filter removed 99.9% of rows (100000 -> 50)
# [critical] f1_score = 0.0 (model may not be learning)
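The kind of rule behind messages like these can be sketched in a few lines. This is an assumption about how such checks might look, not the library's actual logic: `Anomaly`, `check_row_loss`, `check_metric`, and the 90% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    severity: str
    message: str


def check_row_loss(operation, rows_in, rows_out, critical_fraction=0.9):
    """Flag an operation that removes at least `critical_fraction` of its rows."""
    if rows_in == 0:
        return None
    lost = (rows_in - rows_out) / rows_in
    if lost >= critical_fraction:
        return Anomaly(
            "critical",
            f"{operation} removed {lost:.2%} of rows ({rows_in} -> {rows_out})",
        )
    return None


def check_metric(name, value):
    """Flag a metric that is exactly zero, which often means nothing was learned."""
    if value == 0.0:
        return Anomaly("critical", f"{name} = {value} (model may not be learning)")
    return None


findings = [
    check_row_loss("filter", 100000, 50),
    check_row_loss("dropna", 1000, 990),  # small loss, not flagged
    check_metric("f1_score", 0.0),
]
for finding in findings:
    if finding is not None:
        print(f"[{finding.severity}] {finding.message}")
```

Fixed thresholds are the simplest option; comparing against a saved baseline run is less prone to false alarms on pipelines that legitimately discard most rows.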
3. Root-cause localization
cause = analyzer.localize_root_cause(metric_name="accuracy")
print(cause.explanation)
# "The most likely cause of accuracy degradation is 'filter' at step 3.
# Row change was -99,950 (baseline: -2,100)."
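One plausible way to arrive at an explanation like this is to compare each step's row change with the same step in a healthy run and pick the largest deviation. The sketch below assumes that approach. `localize` is a hypothetical helper, and the `dropna` numbers are invented for illustration; only the `filter` figures come from the example above.

```python
def localize(current, baseline):
    """Return (step, operation, row_delta, baseline_delta) for the step whose
    row change deviates most from the baseline run.

    Both arguments are lists of (operation, row_delta) in pipeline order.
    """
    worst = None
    pairs = zip(current, baseline)
    for step, ((op, delta), (_, base_delta)) in enumerate(pairs, start=1):
        deviation = abs(delta - base_delta)
        if worst is None or deviation > worst[0]:
            worst = (deviation, step, op, delta, base_delta)
    return worst[1:] if worst else None


baseline = [("read_csv", 0), ("dropna", -500), ("filter", -2100)]
current = [("read_csv", 0), ("dropna", -480), ("filter", -99950)]

step, op, delta, base = localize(current, baseline)
print(f"'{op}' at step {step}: row change {delta:,} (baseline: {base:,})")
# 'filter' at step 3: row change -99,950 (baseline: -2,100)
```

This heuristic only looks at row counts. A degraded metric caused by a schema change or a leaky feature would need additional signals such as column sets or value distributions.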
4. Save a fingerprint for future comparison
# After a healthy run
analyzer.save_fingerprint("baseline.json")
# On the next run
analyzer.load_baseline("baseline.json")
anomalies = analyzer.detect_anomalies() # compared to baseline
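To make the baseline idea concrete, here is a minimal sketch of a fingerprint stored as JSON and compared on a later run. The file layout, the function names, and the 2x tolerance are assumptions made for this example; the real baseline.json format is not documented here.

```python
import json
from pathlib import Path


def save_fingerprint(records, path):
    """Write per-step row deltas to a JSON file (hypothetical format)."""
    payload = [{"operation": op, "row_delta": delta} for op, delta in records]
    Path(path).write_text(json.dumps(payload, indent=2))


def diff_against_baseline(records, path, tolerance=2.0):
    """Return steps whose row change exceeds `tolerance` times the baseline's."""
    baseline = json.loads(Path(path).read_text())
    flagged = []
    for (op, delta), base in zip(records, baseline):
        # max(..., 1) avoids flagging every change when the baseline delta is 0.
        if abs(delta) > tolerance * max(abs(base["row_delta"]), 1):
            flagged.append((op, delta, base["row_delta"]))
    return flagged
```

A healthy run would call `save_fingerprint(records, "baseline.json")` once; later runs pass their own records to `diff_against_baseline` and treat any returned step as a candidate anomaly.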
What Gets Tracked
pandas (64 hooks): read_csv, to_csv, dropna, fillna, merge, concat, groupby + aggregations, drop_duplicates, filter, assign, sort_values, pivot_table, melt, plus 40+ more.
scikit-learn (175 hooks): train_test_split, estimator fit/predict/predict_proba/score (RandomForest, LogisticRegression, DecisionTree, SVC, KNN, etc.), 18 preprocessor classes, 15 metric functions.
PySpark (49 hooks): DataFrame transforms, groupBy aggregations, join variants, reader/writer methods, actions.
Example Output
On a 284K-row credit card fraud detection pipeline:
1. [io ] read_csv -> (284807, 31) [1280ms]
2. [transform ] drop_duplicates (-1,081 rows) [827ms]
3. [transform ] filter (-284 rows)
4. [transform ] assign -> 36 cols [1ms]
5. [transform ] select -> 34 cols
6. [split ] train_test_split (80/20) [218ms]
7. [preprocess] StandardScaler.fit_transform [201ms]
8. [preprocess] StandardScaler.transform [17ms]
9. [train ] RandomForestClassifier.fit [88637ms]
10. [train ] LogisticRegression.fit [1138ms]
11. [predict ] RandomForestClassifier.predict [332ms]
12. [predict ] LogisticRegression.predict [4ms]
13. [predict ] RandomForestClassifier.predict_proba [311ms]
14. [evaluate ] accuracy_score = 0.9995
15. [evaluate ] precision_score = 0.8824
16. [evaluate ] recall_score = 0.7895
17. [evaluate ] f1_score = 0.8333
18. [evaluate ] roc_auc_score = 0.9871
18 clean records. Zero noise. End-to-end trace from CSV to metrics.
Architecture
Plugin-based. Each library is a single file implementing BaseHookProvider. Adding new libraries requires ~200 lines and zero changes to the core.
User Code (unchanged)
|
Hook Providers (pandas | sklearn | pyspark | ...)
|
UnifiedTracker + TransformationRecord
|
LineageAnalyzer -> anomalies, root causes, DAGs
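The middle layer of this diagram can be pictured with two small dataclasses. This is a sketch under assumptions: `operation`, `input_shape`, `output_shape`, and `tracker.records` appear in the Quick Start, while the `category` field, the `record` method, and the `edges` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class TransformationRecord:
    operation: str
    category: str  # assumed field, e.g. "io", "transform", "train"
    input_shape: Optional[Tuple[int, ...]]
    output_shape: Optional[Tuple[int, ...]]


@dataclass
class UnifiedTracker:
    records: list = field(default_factory=list)

    def record(self, operation, category, input_shape, output_shape):
        """Called by every hook provider, whatever library it wraps."""
        self.records.append(
            TransformationRecord(operation, category, input_shape, output_shape)
        )

    def edges(self):
        """Link consecutive records into a chain.

        A real DAG would follow object identity so that branches and merges
        are preserved; a linear chain keeps the sketch short.
        """
        ops = [r.operation for r in self.records]
        return list(zip(ops, ops[1:]))


tracker = UnifiedTracker()
tracker.record("read_csv", "io", None, (284807, 31))
tracker.record("drop_duplicates", "transform", (284807, 31), (283726, 31))
print(tracker.edges())  # [('read_csv', 'drop_duplicates')]
```

Because every provider writes the same record type into one tracker, the analyzer does not need to know which library produced a given step.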
Performance
Benchmarked on a 37-operation pipeline (50K rows, pandas + sklearn):
| Condition | Wall time |
|---|---|
| Without AutoLineage | 0.050s |
| With AutoLineage | 0.054s |
| Overhead | 6.1% (0.08ms per operation) |
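For readers who want to repeat this measurement on their own pipeline, a timing harness along the following lines should be enough. It is a generic sketch, not the benchmark script used for the table: `time_pipeline` and `overhead` are hypothetical helpers. Because the table's wall times are rounded, recomputing the overhead from them will not reproduce the percentage exactly.

```python
import time


def time_pipeline(pipeline, repeats=5):
    """Return the best wall time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        pipeline()
        best = min(best, time.perf_counter() - start)
    return best


def overhead(baseline_s, tracked_s, n_operations):
    """Return (relative overhead in percent, added milliseconds per operation)."""
    extra = tracked_s - baseline_s
    return 100.0 * extra / baseline_s, 1000.0 * extra / n_operations


# Stand-in workload; replace with the same pipeline run with and without tracking.
baseline_s = time_pipeline(lambda: sum(range(10_000)))
print(f"best of 5: {baseline_s:.6f}s")
```

Taking the minimum over several runs reduces noise from the scheduler and caches, which matters when the whole pipeline finishes in tens of milliseconds.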
Limitations
- Single-process. Pipelines spanning multiple machines require manual trace correlation. OpenTelemetry export is planned.
- Monkey-patching is version-sensitive. Tested against pandas 2.x/3.x, scikit-learn 1.x, PySpark 3.x/4.x.
- Python-only. R, Julia, Java are out of scope.
- In-memory records. Long notebook sessions accumulate state.
Contributing
Add a new library in 5 steps:
- Create autolineage/hooks/your_lib_hooks.py
- Subclass BaseHookProvider
- Implement install(tracker) and uninstall()
- Add to the registry in hooks/registry.py
- Open a PR
See hooks/pandas_io.py for the smallest working example (~110 LoC).
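The steps above can be sketched as a skeleton provider. Only the names `BaseHookProvider`, `install(tracker)`, and `uninstall()` come from this README. The base class below is a local stand-in so the snippet runs on its own, `your_lib` is a hypothetical library, and the tracker is a plain list rather than the real tracker object.

```python
from abc import ABC, abstractmethod
from types import SimpleNamespace


class BaseHookProvider(ABC):
    """Local stand-in for the real base class; only these two methods are assumed."""

    @abstractmethod
    def install(self, tracker): ...

    @abstractmethod
    def uninstall(self): ...


# Hypothetical third-party library with one function worth tracking.
your_lib = SimpleNamespace(load=lambda path: [path, "row1", "row2"])


class YourLibHooks(BaseHookProvider):
    def __init__(self):
        self._originals = {}

    def install(self, tracker):
        original = your_lib.load
        self._originals["load"] = original  # remember it so uninstall can restore

        def tracked_load(path):
            result = original(path)
            tracker.append(("load", len(result)))  # tracker is a list in this sketch
            return result

        your_lib.load = tracked_load

    def uninstall(self):
        for name, original in self._originals.items():
            setattr(your_lib, name, original)
        self._originals.clear()


log = []
hooks = YourLibHooks()
hooks.install(log)
your_lib.load("a.csv")  # recorded
hooks.uninstall()
your_lib.load("b.csv")  # not recorded, the original function is back
print(log)  # [('load', 3)]
```

Keeping the originals in a dictionary makes `uninstall()` an exact inverse of `install()`, which is useful in tests that need a clean library state between cases.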
Development
git clone https://github.com/kishanraj41/autolineage
cd autolineage
pip install -e ".[dev]"
pytest tests/test_v2.py -v # 36 tests
License
MIT
Citation
@misc{vandhavasi2026autolineage,
title={AutoLineage: Zero-Code End-to-End Data Lineage for ML Pipelines},
author={Vandhavasi, Kishan Raj},
year={2026},
archivePrefix={arXiv},
primaryClass={cs.SE}
}