Catch silent row-count corruption in pandas pipelines at runtime, and see it as a flow diagram.

rowflow

Catch silent row-count corruption in pandas pipelines at runtime — in one import.
When a join on a key you assumed was unique silently multiplies your rows, rowflow flags it at the exact line and draws the flow.

Have you ever…

  • joined a dimension table you assumed was unique — and quietly turned 1,000 rows into 4,700, so every total downstream was wrong?
  • watched a revenue number, a row count, or a mean come out too high and only later traced it to a merge that fanned out on a duplicated key?
  • reached for pandas' validate= — and realised you'd have to remember to add it, with the right cardinality, to every single join?

That's a many-to-many join explosion: a key duplicated on both sides multiplies rows. It's quiet — the code runs fine, the DataFrame looks plausible — and you usually find out after a wrong report has already gone out. rowflow catches it while your code is still running, and shows you the flow.

A bit of context. This is one of a handful of small tools I'm putting out — each one a problem I ran into on my own data work, and the fix I wish I'd had on hand. I wrote rowflow in an evening; it isn't trying to be everything. But a silent join explosion has cost me real hours of "why is this total wrong?" more than once, so here it is in case it saves you some. It's narrow on purpose, and honest about where it stops (the limits are spelled out below).

[Figure: a duplicated key inflated a revenue total from 140 to 240]

In examples/one_import.py, one duplicated key in a region table inflates a revenue total from an honest 140 to a wrong 240. rowflow flags the offending merge and the exact line; pandas raises nothing.
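To make the arithmetic concrete, here is a minimal pandas-only sketch of the same effect. The frames and column names are hypothetical, chosen only to reproduce the 140 and 240 figures; they are not the contents of examples/one_import.py, and the snippet does not use rowflow at all.

```python
import pandas as pd

# Hypothetical data, picked to reproduce the 140 -> 240 figures.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region_id": [1, 1, 2],
    "revenue": [60, 40, 40],
})
regions = pd.DataFrame({
    "region_id": [1, 1, 2],  # region 1 is listed twice by mistake
    "region_name": ["North", "North", "South"],
})

honest_total = orders["revenue"].sum()            # total before the join
enriched = orders.merge(regions, on="region_id")  # pandas raises nothing
inflated_total = enriched["revenue"].sum()        # total after the join

print(len(orders), "->", len(enriched), "rows")
print(honest_total, "->", inflated_total, "revenue")
```

Region 1 appears twice on both sides, so its two orders become four rows and their revenue is counted twice.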

Install

pip install rowflow              # core (pandas only)
pip install "rowflow[viz]"       # interactive Sankey diagram (plotly)
pip install "rowflow[rich]"      # prettier terminal reports

Requires Python ≥ 3.10. To try it straight from a clone (no install needed), run from the source tree:

git clone https://github.com/Tommasoaiello13/rowflow && cd rowflow
PYTHONPATH=src python examples/one_import.py

Quickstart

One line at the top of any script or notebook watches the whole run and writes rowflow.html (the Sankey) when it ends:

import pandas as pd
import rowflow.auto

# orders, customers and regions are DataFrames you have already loaded
orders   = orders.merge(customers, on="customer_id")   # clean one-to-many: fine
enriched = orders.merge(regions,   on="region_id")     # region_id duplicated -> EXPLOSION, flagged

The Sankey lands in rowflow.html in the working directory — set ROWFLOW_HTML_PATH to move it, or ROWFLOW_DISABLE=1 to switch auto mode off. It's a generated file, so add it to your .gitignore.

Or scope it to a block and get the findings back:

import rowflow

with rowflow.guard() as run:
    out = customers.merge(orders, on="customer_id")
run.render("flow.html")          # the Sankey of what happened

Gate your CI. In tests, the bundled fixture fails the build on any explosion:

def test_pipeline(no_row_explosion):   # provided fixture
    build_report()

Or, in a plain pipeline script (no pytest), make an explosion a hard error:

import rowflow
rowflow.install()
rowflow.configure(policy="raise")      # raises RowExplosionError on the first explosion
run_my_pipeline()

If a many-to-many join is intentional, declare it the idiomatic pandas way and rowflow stays silent:

left.merge(right, on="k", validate="many_to_many")   # intent declared -> no warning

When should you use it?

| Use it for… | Why |
|---|---|
| ETL / reporting pipelines with several joins | the silent fan-out that corrupts totals is exactly what it catches, at the line |
| Notebooks doing ad-hoc joins | one import, a flow diagram at the end, no scaffolding |
| A CI gate on a data pipeline | the no_row_explosion fixture fails the build if a join explodes |
| Onboarding / teaching | shows where rows multiplied, with a one-line fix |

And when not to bother: if every join in your codebase already passes an explicit validate=, rowflow has nothing to add — it just stays quiet (zero false positives). What it adds is catching the joins where someone forgot to, with no per-call ceremony and a picture of the run.
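For reference, this is roughly what the per-call validate= route looks like in plain pandas. The frames and the checked_merge helper are illustrative assumptions; the only library behaviour relied on is that merge raises pandas.errors.MergeError when the declared cardinality does not hold.

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 1, 2], "amount": [10, 20, 30]})
right = pd.DataFrame({"k": [1, 1, 2], "label": ["a", "b", "c"]})  # k=1 duplicated

def checked_merge(left, right):
    """Hypothetical helper: merge while asserting `right` has unique keys."""
    return left.merge(right, on="k", validate="many_to_one")

try:
    checked_merge(left, right)
    caught = False
except pd.errors.MergeError as exc:
    caught = True
    print("pandas refused the join:", exc)

# The same join without validate= runs quietly and fans out.
unchecked = left.merge(right, on="k")
print(len(left), "->", len(unchecked), "rows")
```

The check works well, but only on the calls where somebody remembered to add it.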

How it works

rowflow wraps pandas' merge / DataFrame.merge / DataFrame.join at runtime and, for each call, records how many rows flowed in and out, plus the exact call site. A real explosion is a many-to-many fan-out — a key value duplicated on both sides — and it is confirmed in two cheap, sound stages:

  1. an O(1) gate — only a join whose output exceeds its larger input is a candidate, so ordinary 1:1 / 1:many / many:1 joins cost nothing beyond a length comparison;
  2. a key-cardinality check confirms a true many-to-many, ruling out a legitimate one-to-many join (duplicates on one side only) and a disjoint-key outer union (rows grow, no shared duplicate).
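The two stages above can be sketched on plain lists of key values. This is an illustration of the idea under stated assumptions, not rowflow's actual implementation: the function names are hypothetical, and the row count is computed by hand for an inner join.

```python
from collections import Counter

def inner_join_rows(left_keys, right_keys):
    """Row count of an inner join: per key, left count times right count."""
    right_counts = Counter(right_keys)
    return sum(n * right_counts[k] for k, n in Counter(left_keys).items())

def looks_like_explosion(left_keys, right_keys, out_rows):
    """Hypothetical two-stage check mirroring the description above."""
    # Stage 1: cheap gate, a plain length comparison.
    if out_rows <= max(len(left_keys), len(right_keys)):
        return False
    # Stage 2: is any key value duplicated on BOTH sides?
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    return any(n > 1 and right_counts[k] > 1 for k, n in left_counts.items())

# One-to-many: duplicates on the right only, stopped at stage 1.
one_to_many = looks_like_explosion([1, 2], [1, 1, 1, 2],
                                   inner_join_rows([1, 2], [1, 1, 1, 2]))

# Many-to-many: key 1 is duplicated on both sides.
many_to_many = looks_like_explosion([1, 1, 2], [1, 1, 2],
                                    inner_join_rows([1, 1, 2], [1, 1, 2]))

# Disjoint outer union: 4 output rows from 2 + 2, but no shared duplicate.
disjoint_outer = looks_like_explosion([1, 2], [3, 4], out_rows=4)

print(one_to_many, many_to_many, disjoint_outer)
```

Only the middle case passes both stages; the disjoint outer union gets past the gate and is ruled out by the key check.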

It never mutates your data, never changes a return value, and never raises out of its own hooks — instrumented code behaves identically to uninstrumented code. Non-pandas backends (Modin, cuDF, Polars) are left untouched.
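As a rough picture of what runtime wrapping can look like, here is a small sketch that temporarily replaces DataFrame.merge with a recording wrapper. Everything in it (record_merges, the record fields, the use of sys._getframe for the call site) is an assumption made for illustration; rowflow's own hooks may be built quite differently, and this sketch only covers DataFrame.merge, not pd.merge or DataFrame.join.

```python
import contextlib
import functools
import sys

import pandas as pd

@contextlib.contextmanager
def record_merges():
    """Hypothetical recorder: wrap DataFrame.merge, log row counts per call."""
    original = pd.DataFrame.merge
    records = []

    @functools.wraps(original)
    def wrapper(self, right, *args, **kwargs):
        result = original(self, right, *args, **kwargs)  # result is returned as-is
        try:
            caller = sys._getframe(1)  # frame of the code that called .merge
            records.append({
                "left_rows": len(self),
                "right_rows": len(right),
                "out_rows": len(result),
                "line": caller.f_lineno,
            })
        except Exception:
            pass  # bookkeeping must never break the user's code
        return result

    pd.DataFrame.merge = wrapper
    try:
        yield records
    finally:
        pd.DataFrame.merge = original  # always restore the real method

a = pd.DataFrame({"k": [1, 1, 2]})
b = pd.DataFrame({"k": [1, 1, 2], "v": [1, 2, 3]})

with record_merges() as records:
    out = a.merge(b, on="k")

print(records[0]["left_rows"], records[0]["right_rows"], "->", records[0]["out_rows"])
```

The wrapper only observes: it calls the original method first, returns its result unchanged, and swallows any error in its own bookkeeping.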

How it compares

rowflow watches the join itself at runtime — the in-vs-out cardinality of the operation. Schema validators inspect a frame in isolation; lineage tools track where columns came from; the validate= argument is a per-call opt-in. None of them is a zero-config runtime guard with a flow picture.

rowflow pandas validate= pandera / Great Expectations dbt tests datalineagepy
Axis join cardinality (correctness) join cardinality frame schema/values warehouse data tests column provenance
Setup 1 import a kwarg on every join author a schema a dbt project wrap your frames
Catches silent join fan-out if you remember it ❌ (frame looks valid) ⚠️ post-hoc
Compares rows in vs out n/a
Zero-config, runtime ❌ (opt-in) ⚠️
Points at the exact line ✅ (raises) n/a
Visual flow diagram ✅ (lineage, not correctness)

Where rowflow wins: one import, it runs live, it points at the exact line, and it draws the flow. Where it doesn't: it isn't a schema validator (pandera/GE check column types and value ranges it has no opinion on), and it isn't lineage/governance. Treat it as complementary to these tools, not a replacement.

Benchmarks & analysis

All figures are produced from live runs by tools/make_figures.py and the KPIs by validation/kpi.py — no hardcoded numbers (full results).

Accuracy. On a randomized corpus: 100% recall on realistic explosions, 0% false positives across the join shapes a naive row-count rule gets wrong (1:1, 1:many, many:1, 1:many left join, disjoint outer union), and 100% suppression when validate= declares intent.

See the explosion. rowflow renders the run as a Sankey (interactive HTML via rowflow[viz]); the static view marks the offending step in red:

[Figure: rowflow flow, the second merge explodes]

Cost. The O(1) gate keeps the key check off the happy path, so overhead is the wrapper bookkeeping — sub-millisecond per merge, a single-digit percentage at 100k rows and shrinking with size:

[Figure: rowflow overhead vs input size]

What it does NOT detect, and why

rowflow flags fan-outs that actually inflate the result. Stated plainly:

| Not detected | Why | Use instead |
|---|---|---|
| A many-to-many masked by row loss (net rows don't grow) | it stays below the O(1) gate; rowflow targets fan-outs that corrupt totals | an explicit validate= on that join |
| Row changes outside merge / join (concat, dropna, filtering) | not wrapped yet; usually intended | an assertion on the row count |
| Intentional many-to-many (a deliberate cross/expand join) | flagged by default; it can't read your intent | pass validate="many_to_many", or rowflow.configure(min_fanout_ratio=…) |
| Non-pandas backends (Modin, cuDF, Polars) | only pandas is patched (a safe no-op elsewhere) | |

Silent inner-join row loss is also detectable, opt-in via rowflow.configure(detect_loss=True) (off by default to keep zero false positives). rowflow is a coverage-bounded detector of materialised row-count corruption — like a passing test, not a proof.
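The masked case is easiest to see on a small example. The sketch below uses plain pandas and hypothetical data: an inner join duplicates the rows for one key while dropping every other key, so the output never exceeds the larger input. The shared_duplicate_keys helper is an illustrative manual check, not part of rowflow.

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 1, 2, 3, 4, 5, 6, 7]})
right = pd.DataFrame({"k": [1, 1], "v": ["x", "y"]})

out = left.merge(right, on="k")  # inner join: keys 2..7 are dropped

# The length comparison sees nothing unusual: 4 rows out, 8 rows in.
grew = len(out) > max(len(left), len(right))

def shared_duplicate_keys(left, right, on):
    """Hypothetical check: key values duplicated on both sides."""
    left_dupes = left[on][left[on].duplicated()].tolist()
    right_dupes = right[on][right[on].duplicated()].tolist()
    return sorted(set(left_dupes) & set(right_dupes))

masked = shared_duplicate_keys(left, right, "k")

print("rows:", len(left), "->", len(out), "| grew:", grew)
print("keys duplicated on both sides:", masked)
```

Key 1 went from two rows to four, yet the frame shrank overall, which is why an explicit validate= on that join is the suggested guard for this case.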

Contributing & contact

Issues and pull requests are very welcome — start with CONTRIBUTING.md and the Code of Conduct. Good places to start are observers for concat / dropna / filtering, an opt-in strict key-scan mode (to catch the masked-fan-out boundary), or richer Sankey rendering. And if rowflow ever misses an explosion it should have caught — or fires on a join that's actually fine — please open an issue with a small reproducer; those are the reports I value most. You can also reach me on LinkedIn.

License

MIT © 2026 Tommaso Aiello — free to use, modify, and distribute (including commercially); keep the copyright notice; provided "as is", without warranty.
