TracePipe

Row-level data lineage for pandas pipelines

Know exactly where every row went, why values changed, and how your data transformed.

Python 3.9+ · License: MIT

Getting Started · Documentation · Examples


Why TracePipe?

Data pipelines are black boxes. Rows vanish. Values change. You're left guessing.

import pandas as pd

regions = pd.read_csv("regions.csv")  # hypothetical lookup table keyed by zip
df = pd.read_csv("customers.csv")
df = df.dropna()                      # Some rows disappear
df = df.merge(regions, on="zip")      # New rows appear, some vanish
df["income"] = df["income"].fillna(0) # Values change silently
df = df[df["age"] >= 18]              # More rows gone
# What happened to customer C-789? 🤷

TracePipe gives you the complete audit trail with no changes to your existing pipeline code.


Getting Started

pip install tracepipe

import tracepipe as tp
import pandas as pd

tp.enable(mode="debug", watch=["income"])

df = pd.read_csv("customers.csv")
df = df.dropna()
df["income"] = df["income"].fillna(0)
df = df[df["age"] >= 18]

tp.check(df)  # See what happened

TracePipe Check: [OK] Pipeline healthy

Retention: 847/1000 (84.7%)
Dropped: 153 rows
  • DataFrame.dropna: 42
  • DataFrame.__getitem__[mask]: 111

Value changes: 23 cells modified
  • DataFrame.fillna: 23 (income)

That's it. One import, full visibility.
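To make the report's terms concrete, the sketch below reproduces a retention figure and per-step drop counts by hand on a tiny made-up DataFrame. It uses only pandas and is not how TracePipe computes its numbers; the `drops` dictionary and the data are illustrative assumptions.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame(
    {
        "customer_id": ["C-1", "C-2", "C-3", "C-4", "C-5"],
        "age": [34, 17, 52, None, 41],
        "income": [52000.0, 0.0, 61000.0, 48000.0, 39000.0],
    }
)
start_rows = len(df)

# Record the number of rows lost at each step by hand.
drops = {}

before = len(df)
df = df.dropna()
drops["dropna"] = before - len(df)

before = len(df)
df = df[df["age"] >= 18]
drops["age filter"] = before - len(df)

# Retention is the share of the starting rows that survive the pipeline.
retention = len(df) / start_rows
print(f"Retention: {len(df)}/{start_rows} ({retention:.1%})")
for step, count in drops.items():
    print(f"  {step}: {count}")
```

Bookkeeping like this has to be repeated around every step, which is the manual work the report is meant to replace.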


Core API

| Function | What it does |
| --- | --- |
| tp.enable() | Start tracking |
| tp.check(df) | Health check: retention, drops, changes |
| tp.trace(df, where={"id": "C-789"}) | Follow a row's complete journey |
| tp.why(df, col="income", row=5) | Explain why a cell has its current value |
| tp.report(df, "audit.html") | Export interactive HTML report |

Key Features

🔍 Zero-Code Instrumentation

TracePipe patches pandas at runtime. Your existing code works unchanged.

📊 Complete Provenance

Track drops, transforms, merges, and cell-level changes with before/after values.

🎯 Business-Key Lookups

Find rows by their values: tp.trace(df, where={"email": "alice@example.com"})

⚡ Production-Ready

1.0-2.8x overhead (varies by operation). Tested on DataFrames up to 1M rows.
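The runtime patching mentioned under Zero-Code Instrumentation can be pictured with a small monkey-patching sketch. This is only an illustration of the general technique, not TracePipe's implementation: `step_log` and `record_row_counts` are hypothetical names, and the wrapper records row counts only.

```python
import functools

import pandas as pd

# Hypothetical log of (operation, rows in, rows out); not a TracePipe structure.
step_log = []


def record_row_counts(method_name):
    """Wrap a DataFrame method so each call logs row counts before and after."""
    original = getattr(pd.DataFrame, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        # Only log calls that return a new DataFrame (inplace=True returns None).
        if isinstance(result, pd.DataFrame):
            step_log.append((f"DataFrame.{method_name}", len(self), len(result)))
        return result

    setattr(pd.DataFrame, method_name, wrapper)
    return original  # keep a handle so the patch can be undone


original_dropna = record_row_counts("dropna")
try:
    df = pd.DataFrame({"age": [34, None, 52], "income": [52000.0, 41000.0, None]})
    cleaned = df.dropna()  # the pipeline line itself is unchanged
finally:
    # Always restore the unpatched method.
    pd.DataFrame.dropna = original_dropna

print(step_log)
```

Because the wrapper sits on the class, existing calls such as `df.dropna()` are observed without edits to the pipeline. Tracking individual rows and cell values takes considerably more bookkeeping than this.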


Real-World Example

import tracepipe as tp
import pandas as pd

tp.enable(mode="debug", watch=["age", "income", "label"])

# Load and clean
df = pd.read_csv("training_data.csv")
df = df.dropna(subset=["label"])
df["income"] = df["income"].fillna(df["income"].median())
df = df[df["age"] >= 18]

# Audit
print(tp.check(df))

Retention: 8234/10000 (82.3%)
Dropped: 1766 rows
  • DataFrame.dropna: 423
  • DataFrame.__getitem__[mask]: 1343

Value changes: 892 cells
  • DataFrame.fillna: 892 (income)

# Why does this customer have a filled income?
tp.why(df, col="income", where={"customer_id": "C-789"})

Cell History: row 156, column 'income'
  Current value: 45000.0
  [i] Was null at step 1 (later recovered)

  History (1 change):
    None -> 45000.0
      by: DataFrame.fillna

Two Modes

| Mode | Use Case | What's Tracked |
| --- | --- | --- |
| CI (default) | Production pipelines | Step counts, retention rates, merge warnings |
| Debug | Development | Full row history, cell diffs, merge parents, group membership |

tp.enable(mode="ci")     # Lightweight
tp.enable(mode="debug")  # Full lineage

What's Tracked

| Operation | Coverage |
| --- | --- |
| dropna, drop_duplicates, query, df[mask] | ✅ Full |
| fillna, replace, loc[]=, iloc[]= | ✅ Full (cell diffs) |
| merge, join | ✅ Full (parent tracking) |
| groupby().agg() | ✅ Full (group membership) |
| sort_values, head, tail, sample | ✅ Full |
| apply, pipe | ⚠️ Partial |
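To see why merges need parent tracking, the pandas-only sketch below shows a single merge both duplicating and dropping rows. The data is made up, and the `indicator=True` audit is a manual workaround shown for comparison, not a description of what TracePipe does internally.

```python
import pandas as pd

# Made-up data: one zip maps to two regions, and one zip has no match.
customers = pd.DataFrame(
    {"customer_id": ["C-1", "C-2", "C-3"], "zip": ["10001", "94105", "60601"]}
)
regions = pd.DataFrame(
    {"zip": ["10001", "10001", "94105"], "region": ["NY-East", "NY-Metro", "CA-Bay"]}
)

# An inner merge (the default) duplicates C-1 and silently drops C-3.
inner = customers.merge(regions, on="zip")

# A left merge with indicator=True keeps unmatched rows and labels them.
audit = customers.merge(regions, on="zip", how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "customer_id"].tolist()

# Count how many output rows each input row produced.
fan_out = inner["customer_id"].value_counts().to_dict()

print(unmatched)
print(fan_out)
```

The row count is 3 before and after the inner merge, yet one customer was lost and another doubled, so row counts alone cannot describe what a merge did.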

Data Quality Contracts

(tp.contract()
    .expect_unique("customer_id")
    .expect_no_nulls("email")
    .expect_retention(min_rate=0.9)
    .check(df)
    .raise_if_failed())
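For readers unsure what these expectations mean, here is a rough plain-pandas reading of the same three checks. The semantics are assumed from the method names, not taken from TracePipe's code, and the data and the `results` dictionary are illustrative.

```python
import pandas as pd

# Made-up data for illustration only.
original = pd.DataFrame(
    {
        "customer_id": ["C-1", "C-2", "C-3", "C-4"],
        "email": ["a@example.com", "b@example.com", None, "d@example.com"],
    }
)
df = original.dropna(subset=["email"])

# Hand-written equivalents of the three expectations (assumed semantics).
results = {
    "unique customer_id": bool(df["customer_id"].is_unique),
    "no null email": bool(df["email"].notna().all()),
    "retention >= 0.9": len(df) / len(original) >= 0.9,
}

failed = [name for name, ok in results.items() if not ok]
print(failed)
```

Here only the retention check fails, since 3 of 4 rows survive. Note that the hand-written version has to keep `original` around to compute retention.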

Documentation

📚 Full Documentation


Known Limitations

TracePipe tracks cell mutations, merge provenance, concat provenance, and duplicate drop decisions reliably. A few patterns have limited tracking:

| Pattern | Status | Notes |
| --- | --- | --- |
| df["col"] = df["col"].fillna(0) | ✅ Tracked | Series + assignment |
| df = df.fillna({"col": 0}) | ✅ Tracked | DataFrame-level fillna |
| df.loc[mask, "col"] = val | ✅ Tracked | Conditional assignment |
| df.merge(other, on="key") | ✅ Tracked | Full provenance in debug mode |
| pd.concat([df1, df2]) | ✅ Tracked | Row IDs preserved with source DataFrame tracking (v0.4+) |
| df.drop_duplicates() | ✅ Tracked | Dropped rows map to kept representative (debug mode, v0.4+) |
| pd.concat(axis=1) | ⚠️ Partial | Full tracking only if all inputs have identical row IDs (RIDs) |
| Complex apply/pipe | ⚠️ Partial | Output tracked, internals opaque |

Contributing

git clone https://github.com/gauthierpiarrette/tracepipe.git
cd tracepipe
pip install -e ".[dev]"
pytest tests/ -v

See CONTRIBUTING for guidelines.


License

MIT License. See LICENSE.


Stop guessing where your rows went.

pip install tracepipe

⭐ Star us on GitHub if TracePipe helps your data work!
