Automatic causal tracing for Python DataFrame pipelines: find where nulls, rows, and dtypes silently changed.

dframe-trace

Find out where your data pipeline silently broke — without writing a single rule.

When you process data in pandas or polars, each step quietly reshapes it: a join introduces blank values, a filter drops rows you didn't expect, a cast turns whole numbers into decimals. These bugs don't crash your program — they just hand you wrong answers, often noticed far too late.
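
To make those three failure modes concrete, here is a small pandas-only sketch that uses no dframe-trace at all. The frames `orders` and `meta` and the `tier` column are invented for illustration; the behaviour shown is standard pandas.

```python
import pandas as pd

# Illustrative data (not from the project): id 4 has no row in `meta`.
orders = pd.DataFrame({"id": [1, 2, 3, 4], "amt": [10, 20, 30, 40]})
meta = pd.DataFrame({"id": [1, 2, 3], "region": ["EU", "US", "EU"], "tier": [1, 2, 3]})

# Left join: the unmatched id gets nulls in every column that came from `meta`.
merged = orders.merge(meta, on="id", how="left")
null_regions = int(merged["region"].isna().sum())

# An int64 column cannot hold NaN, so pandas upcasts `tier` to float64.
tier_dtype = str(merged["tier"].dtype)

# A boolean-mask filter drops rows without any warning.
filtered = merged[merged["amt"] > 15]
rows_lost = len(merged) - len(filtered)

print(null_regions, tier_dtype, rows_lost)
```

None of these steps raises an error or a warning, which is why the change usually goes unnoticed until the results look wrong.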

The usual fix is sprinkling print(df.shape) between every step and squinting at the output. dframe-trace automates that. Turn it on with one line, run your normal code, then ask questions afterward:

t.where_null_introduced("region")   # -> "merge_meta"   (the step that did it)
t.where_rows_lost()                 # -> [("filter", -1)]

No schemas, no rules, no upfront declarations. Run your code, then interrogate what happened.



How it's different

The Python data-validation space is crowded, so here's where dframe-trace fits.

Validation tools (Great Expectations, Pandera, Hamilton) check your data against rules you write in advance: "this column must never be null", "row count must stay above 1000". They're excellent, mature, and the right choice when you know your expectations.

dframe-trace takes the opposite approach: zero rules. You declare nothing. It records what every step did to your data, and you ask after the fact where something changed. It's a debugging and observability tool, not a validation framework: closer to a profiler that tracks data shape across a whole pipeline than to a schema checker.

Use Pandera or Great Expectations when you know what "correct" looks like and want to enforce it. Use dframe-trace when something is already wrong and you need to find which step did it, or when you want a cheap always-on record of how data flows through a script.

The two are complementary; nothing stops you using both.

Install

pip install dframe-trace

dframe-trace itself has no required dependencies. You bring your own pandas and/or polars.

Quick start

Decorate each pipeline step, run inside a trace() block, then interrogate it:

from dframe_trace import traced, trace

@traced("merge_meta")
def merge_meta(df):
    return df.merge(meta, on="id", how="left")   # silently introduces nulls

@traced("filter")
def filter_rows(df):
    return df[df.amt > 15]                         # silently drops rows

with trace() as t:
    df = load(None)
    df = merge_meta(df)
    df = filter_rows(df)

print(t.where_null_introduced("region"))   # -> "merge_meta"
print(t.where_rows_lost())                 # -> [("filter", -1)]
print(t.report())

t.report() prints a readable step-by-step diff:

dframe-trace report
============================================================
[0] load  (0.5 ms)
    start: 4 rows, 2 cols
[1] merge_meta  (1.4 ms)
    +cols: ['region']
    nulls region: 0 -> 1  [WARN]
[2] filter  (0.4 ms)
    rows: -1

Frictionless mode (no decorators)

Don't want to touch your functions? Patch pandas/polars once and write ordinary code — every relevant call inside a trace() block is recorded automatically:

import pandas as pd
from dframe_trace import trace, autopatch

autopatch.install()   # one line at the top of your script

with trace() as t:
    df = raw.merge(meta, on="id", how="left")   # recorded automatically
    df = df.astype({"id": "float64"})            # recorded automatically
    df = df.dropna(subset=["region"])            # recorded automatically

print(t.report())
print(t.where_null_introduced("region"))   # -> "merge"

autopatch.uninstall()   # optional: restore original methods

autopatch wraps the methods that most often cause silent bugs. Outside an active trace() block the overhead is a single is None check, so it's safe to leave installed.
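
The "single is None check" idea can be sketched in plain Python. This is not dframe-trace's actual implementation: `Frame`, `patch_method`, and `_active_trace` are hypothetical names, and the stand-in class exists only so the sketch runs without pandas.

```python
import functools

_active_trace = None  # hypothetical slot; holds a list while a session is open


class Frame:
    """Stand-in for a DataFrame so the sketch runs without pandas."""

    def __init__(self, rows):
        self.rows = list(rows)

    def dropna(self):
        return Frame(r for r in self.rows if r is not None)


def patch_method(cls, name):
    """Replace cls.<name> with a wrapper that records only while a trace is active."""
    original = getattr(cls, name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        if _active_trace is None:  # fast path: nothing is recording
            return original(self, *args, **kwargs)
        before = len(self.rows)
        result = original(self, *args, **kwargs)
        _active_trace.append((name, len(result.rows) - before))
        return result

    setattr(cls, name, wrapper)
    return original  # returned so the patch can be undone later


original_dropna = patch_method(Frame, "dropna")

Frame([1, None, 3]).dropna()   # no active session: nothing is recorded

_active_trace = []             # "open" a session
Frame([1, None, 3]).dropna()
recorded = _active_trace
_active_trace = None           # "close" it

Frame.dropna = original_dropna  # restore the original method
print(recorded)
```

Keeping the original method around is what makes an uninstall step possible; the wrapper itself adds only one comparison when no session is open.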

Use it as a CI gate

Turn a trace into a build-failing assertion in your test suite:

from dframe_trace import trace, guards

with trace() as t:
    run_pipeline()

guards.assert_no_new_nulls(t)                    # raises if a step added nulls
guards.assert_no_row_loss(t, allow={"filter"})   # allow expected row drops
guards.assert_no_silent_casts(t, allow={"astype"})

Each guard raises TraceAssertionError with a structured .violations list, so failures are precise: "merge introduced 2 null(s) in 'region'".
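
As a rough illustration of what a guard with structured violations could look like, here is a standalone sketch. `SketchAssertionError`, the dictionary keys inside each violation, and the `(before, after)` shape of the null counts are all assumptions made for this example; the real TraceAssertionError may be organised differently.

```python
class SketchAssertionError(AssertionError):
    """Hypothetical stand-in for an error that carries structured violations."""

    def __init__(self, violations):
        self.violations = violations
        super().__init__("; ".join(v["message"] for v in violations))


def assert_no_new_nulls(steps, columns=None):
    """steps: list of (step_name, null_changes); null_changes maps col -> (before, after)."""
    violations = []
    for name, null_changes in steps:
        for col, (before, after) in null_changes.items():
            if columns is not None and col not in columns:
                continue  # caller asked to check only specific columns
            if after > before:
                added = after - before
                violations.append({
                    "step": name,
                    "column": col,
                    "added": added,
                    "message": f"{name} introduced {added} null(s) in '{col}'",
                })
    if violations:
        raise SketchAssertionError(violations)


# Illustrative recorded steps: only "merge" added nulls.
steps = [("load", {}), ("merge", {"region": (0, 2)}), ("filter", {})]

try:
    assert_no_new_nulls(steps)
    message = None
except SketchAssertionError as err:
    message = str(err)
    first = err.violations[0]

print(message)
```

Because the error subclasses AssertionError, a test runner such as pytest reports it as an ordinary failed assertion, while the list stays available for programmatic inspection.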

Works with polars too

dframe-trace is backend-agnostic. autopatch.install() patches whichever of pandas / polars is installed:

import polars as pl
from dframe_trace import trace, autopatch

autopatch.install()

with trace() as t:
    df = raw.join(meta, on="id", how="left")   # eager: recorded automatically
    df = df.drop_nulls(subset=["region"])

    out = (lf.filter(pl.col("amt") > 15)        # lazy: the chain builds a plan…
             .collect())                         # …and is recorded at .collect()

print(t.where_null_introduced("region"))   # -> "join"

Eager polars DataFrame methods (join, drop_nulls, fill_null, cast, filter, sort, unique, with_columns, select, …) are traced like pandas. For lazy LazyFrame pipelines, intermediate operations only build a query plan and can't be snapshotted cheaply, so tracing happens at the .collect() boundary where the plan materializes into a real frame.

API reference

trace() — context manager. Opens a recording session; yields a Trace.

@traced(name=None, note="") — decorator for a function whose first argument is a frame and which returns a frame. Records a before/after snapshot under name (defaults to the function name).

autopatch.install(pandas=True, polars=True) — monkeypatch DataFrame methods so calls record automatically. Idempotent; safe when a library is absent. autopatch.uninstall() restores originals. autopatch.is_installed() returns the current state.

Trace methods:

  • where_null_introduced(column) → name of the first step that added nulls to column, or None.
  • where_rows_lost() → list of (step_name, negative_delta) for steps that dropped rows.
  • report() → human-readable string of every step and what changed.
  • steps → the raw list of Step objects; each has .diff() returning a dict of rows_delta, cols_added, cols_dropped, dtype_changes, null_changes, mem_delta_bytes.
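
The queries above can be pictured as simple scans over per-step diffs. The sketch below is an assumption about how that might work, not the library's code: snapshots are plain dicts, the memory figures are made up, and the value shapes inside `dtype_changes` and `null_changes` are guesses. Only the six key names come from the list above.

```python
def diff(before, after):
    """Compare two structural snapshots (plain dicts) and describe what changed."""
    b_types, a_types = before["dtypes"], after["dtypes"]
    return {
        "rows_delta": after["rows"] - before["rows"],
        "cols_added": [c for c in a_types if c not in b_types],
        "cols_dropped": [c for c in b_types if c not in a_types],
        "dtype_changes": {c: (b_types[c], a_types[c])
                          for c in b_types
                          if c in a_types and b_types[c] != a_types[c]},
        # A column that did not exist before counts as having had 0 nulls.
        "null_changes": {c: (before["nulls"].get(c, 0), n)
                         for c, n in after["nulls"].items()
                         if n != before["nulls"].get(c, 0)},
        "mem_delta_bytes": after["mem_bytes"] - before["mem_bytes"],
    }


def where_null_introduced(steps, column):
    """Name of the first step whose null count for `column` went up, else None."""
    for name, d in steps:
        before, after = d["null_changes"].get(column, (0, 0))
        if after > before:
            return name
    return None


def where_rows_lost(steps):
    """(step_name, negative_delta) for every step that dropped rows."""
    return [(name, d["rows_delta"]) for name, d in steps if d["rows_delta"] < 0]


# Illustrative snapshots for load -> merge_meta -> filter (memory values are made up).
s0 = {"rows": 4, "dtypes": {"id": "int64", "amt": "int64"},
      "nulls": {"id": 0, "amt": 0}, "mem_bytes": 64}
s1 = {"rows": 4, "dtypes": {"id": "int64", "amt": "int64", "region": "object"},
      "nulls": {"id": 0, "amt": 0, "region": 1}, "mem_bytes": 96}
s2 = {"rows": 3, "dtypes": {"id": "int64", "amt": "int64", "region": "object"},
      "nulls": {"id": 0, "amt": 0, "region": 1}, "mem_bytes": 72}

steps = [("merge_meta", diff(s0, s1)), ("filter", diff(s1, s2))]
print(where_null_introduced(steps, "region"))
print(where_rows_lost(steps))
```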

Guards (each raises guards.TraceAssertionError on violation):

  • assert_no_new_nulls(trace, columns=None)
  • assert_no_row_loss(trace, allow=None)
  • assert_no_silent_casts(trace, allow=None)

A snapshot is structural only — row count, column names, dtypes, per-column null counts, and estimated memory. No row values are ever copied or stored, which is why it's cheap enough to leave on.
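
For a sense of how little a structural snapshot needs to hold, here is a pandas sketch that collects the five fields listed above. The function name `snapshot` and the dict layout are assumptions for illustration; dframe-trace's internal representation may differ.

```python
import pandas as pd


def snapshot(df):
    """Collect structure only: no cell values are kept in the result."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "nulls": {col: int(n) for col, n in df.isna().sum().items()},
        "mem_bytes": int(df.memory_usage(deep=True).sum()),
    }


df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, None, 1.5]})
snap = snapshot(df)
print(snap)
```

The result grows with the number of columns, not the number of rows, although computing the null counts and deep memory usage still requires one pass over the data.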

Limitations (read before relying on it)

  • Boolean-mask filtering (df[df.x > 0]) is not auto-traced. That uses __getitem__, an operator we deliberately don't patch (too broad, too risky). The row loss still appears in the next recorded step's row delta, just not attributed to the filter itself. For precise attribution, move the filter into its own function and decorate that function with @traced.
  • groupby is not yet traced. It returns a GroupBy object rather than a DataFrame; tracing its terminal .agg/.sum is on the roadmap.
  • polars support is newer than the pandas support. The pandas path is thoroughly tested; please run the polars test suite against your polars version (see below) and report issues.
  • This is a young project. It's a debugging aid, not a guarantee of correctness.

Requirements

  • Python 3.9+
  • pandas and/or polars (whichever you use; neither is installed by dframe-trace)

To run the tests locally:

pip install pandas polars pytest
pip install -e .
python -m pytest tests/ -v

The polars tests auto-skip if polars isn't installed.

Roadmap (good first issues for contributors)

  • groupby terminal-method tracing
  • HTML / Mermaid lineage diagram export from a Trace
  • More guards (e.g. assert_no_schema_change)

Contributing

Issues and pull requests welcome. Fork the repo, make your change with a test, and open a PR. Good first issues are tagged in the roadmap above.

License

MIT. See the LICENSE file.
