Catch silent row-count corruption in pandas pipelines at runtime, and see it as a flow diagram.
rowflow
Catch silent row-count corruption in pandas pipelines at runtime — in one import.
When a join on a supposedly unique key silently multiplies your rows, rowflow flags it at the
exact line and draws the flow.
Have you ever…
- joined a dimension table you assumed was unique, and quietly turned 1,000 rows into 4,700, so every total downstream was wrong?
- watched a revenue number, a row count, or a mean come out too high and only later traced it to a merge that fanned out on a duplicated key?
- reached for pandas' validate=, and realised you'd have to remember to add it, with the right cardinality, to every single join?
That's a many-to-many join explosion: a key duplicated on both sides multiplies rows. It's quiet — the code runs fine, the DataFrame looks plausible — and you usually find out after a wrong report has already gone out. rowflow catches it while your code is still running, and shows you the flow.
A bit of context. This is one of a handful of small tools I'm putting out — each one a problem I ran into on my own data work, and the fix I wish I'd had on hand. I wrote rowflow in an evening; it isn't trying to be everything. But a silent join explosion has cost me real hours of "why is this total wrong?" more than once, so here it is in case it saves you some. It's narrow on purpose, and honest about where it stops (the limits are spelled out below).
In examples/one_import.py, one duplicated key in a region table inflates
a revenue total from an honest 140 to a wrong 240. rowflow flags the offending merge and
the exact line; pandas raises nothing.
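The arithmetic behind those totals can be reproduced in plain pandas. The sketch below uses hypothetical data chosen to give the same 140 and 240; the table contents and column names are assumptions, and the real examples/one_import.py may be laid out differently.

```python
import pandas as pd

# Hypothetical data (not the contents of examples/one_import.py):
# two orders share region_id 1, and the region table lists region 1 twice.
orders = pd.DataFrame({"region_id": [1, 1, 2], "revenue": [50, 50, 40]})
regions = pd.DataFrame({"region_id": [1, 1, 2], "region": ["North", "North", "South"]})

honest_total = orders["revenue"].sum()

# region_id 1 is duplicated on both sides, so each of its orders matches twice.
enriched = orders.merge(regions, on="region_id")
inflated_total = enriched["revenue"].sum()

print(len(orders), "->", len(enriched), "rows")
print(honest_total, "->", inflated_total, "revenue")
```

pandas completes the merge without a warning, which is why the inflated total tends to surface only when someone questions the report.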
Install
pip install rowflow # core (pandas only)
pip install "rowflow[viz]" # interactive Sankey diagram (plotly)
pip install "rowflow[rich]" # prettier terminal reports
Requires Python ≥ 3.10. To try it straight from a clone (no install needed), run from the source tree:
git clone https://github.com/Tommasoaiello13/rowflow && cd rowflow
PYTHONPATH=src python examples/one_import.py
Quickstart
One line at the top of any script or notebook watches the whole run and writes rowflow.html
(the Sankey) when it ends:
import pandas as pd
import rowflow.auto
orders = orders.merge(customers, on="customer_id") # clean one-to-many — fine
enriched = orders.merge(regions, on="region_id") # region_id duplicated -> EXPLOSION, flagged
The Sankey lands in rowflow.html in the working directory. Set ROWFLOW_HTML_PATH to move it, or ROWFLOW_DISABLE=1 to switch auto mode off. It's a generated file, so add it to your .gitignore.
Or scope it to a block and get the findings back:
import rowflow
with rowflow.guard() as run:
    out = customers.merge(orders, on="customer_id")
run.render("flow.html") # the Sankey of what happened
Gate your CI. In tests, the bundled fixture fails the build on any explosion:
def test_pipeline(no_row_explosion): # provided fixture
    build_report()
Or, in a plain pipeline script (no pytest), make an explosion a hard error:
import rowflow
rowflow.install()
rowflow.configure(policy="raise") # raises RowExplosionError on the first explosion
run_my_pipeline()
If a many-to-many join is intentional, declare it the idiomatic pandas way and rowflow stays silent:
left.merge(right, on="k", validate="many_to_many") # intent declared -> no warning
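For reference, this is how validate= behaves in plain pandas, independent of rowflow: a declared cardinality that the keys violate raises pandas.errors.MergeError, while "many_to_many" is accepted without any check. The frames below are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 1, 2], "x": [10, 20, 30]})
right = pd.DataFrame({"k": [1, 1, 2], "y": ["a", "b", "c"]})

# Claiming the right-hand keys are unique makes pandas check them, and raise.
try:
    left.merge(right, on="k", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

# Declaring many-to-many states the intent; pandas performs no check here.
declared = left.merge(right, on="k", validate="many_to_many")

print(raised, len(declared))
```

The second call is the form that keeps rowflow quiet, according to the paragraph above: the fan-out still happens, but it is now on record as deliberate.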
When should you use it?
| Use it for… | Why |
|---|---|
| ETL / reporting pipelines with several joins | the silent fan-out that corrupts totals is exactly what it catches, at the line |
| Notebooks doing ad-hoc joins | one import, a flow diagram at the end, no scaffolding |
| A CI gate on a data pipeline | the no_row_explosion fixture fails the build if a join explodes |
| Onboarding / teaching | shows where rows multiplied, with a one-line fix |
And when not to bother: if every join in your codebase already passes an explicit validate=,
rowflow has nothing to add — it just stays quiet (zero false positives). What it adds is catching the
joins where someone forgot to, with no per-call ceremony and a picture of the run.
How it works
rowflow wraps pandas' merge / DataFrame.merge / DataFrame.join at runtime and, for each call,
records how many rows flowed in and out, plus the exact call site. A real explosion is a
many-to-many fan-out — a key value duplicated on both sides — and it is confirmed in two cheap,
sound stages:
- an O(1) gate — only a join whose output exceeds its larger input is a candidate, so ordinary 1:1 / 1:many / many:1 joins cost nothing beyond a length comparison;
- a key-cardinality check confirms a true many-to-many, ruling out a legitimate one-to-many join (duplicates on one side only) and a disjoint-key outer union (rows grow, no shared duplicate).
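The two stages can be sketched in plain pandas. This is an illustration of the idea only, not rowflow's actual code: the function name is_join_explosion and its signature are hypothetical.

```python
import pandas as pd


def is_join_explosion(left, right, out, on):
    """Hypothetical helper: True only for a many-to-many fan-out that grew the result."""
    # Stage 1: cheap gate. An output no larger than the larger input is never a candidate.
    if len(out) <= max(len(left), len(right)):
        return False
    # Stage 2: look for a key value that is duplicated on both sides.
    keys = [on] if isinstance(on, str) else list(on)
    left_dupes = left.loc[left.duplicated(subset=keys, keep=False), keys].drop_duplicates()
    right_dupes = right.loc[right.duplicated(subset=keys, keep=False), keys].drop_duplicates()
    return not left_dupes.merge(right_dupes, on=keys).empty


# Many-to-many: key 1 is duplicated on both sides, and the output grows.
a = pd.DataFrame({"k": [1, 1, 2]})
b = pd.DataFrame({"k": [1, 1, 2]})
many_to_many = is_join_explosion(a, b, a.merge(b, on="k"), "k")

# One-to-many outer join: rows grow, but duplicates exist on one side only.
c = pd.DataFrame({"k": [1, 2, 3, 4]})
d = pd.DataFrame({"k": [1, 1, 1]})
one_to_many = is_join_explosion(c, d, c.merge(d, on="k", how="outer"), "k")

# Disjoint outer union: rows grow, but no key is shared at all.
e = pd.DataFrame({"k": [1, 2]})
f = pd.DataFrame({"k": [3, 4]})
disjoint = is_join_explosion(e, f, e.merge(f, on="k", how="outer"), "k")

print(many_to_many, one_to_many, disjoint)
```

The key scan in stage 2 only runs when the length comparison in stage 1 passes, which is what keeps ordinary joins cheap in this sketch.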
It never mutates your data, never changes a return value, and never raises out of its own hooks — instrumented code behaves identically to uninstrumented code. Non-pandas backends (Modin, cuDF, Polars) are left untouched.
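As a rough picture of what wrapping a merge at runtime can look like, here is a minimal sketch under stated assumptions. The record_merges context manager and its record fields are hypothetical, it covers DataFrame.merge only, and rowflow's own hooks may be built quite differently.

```python
import contextlib
import functools
import traceback

import pandas as pd


@contextlib.contextmanager
def record_merges():
    """Hypothetical observer: log row counts and call site for DataFrame.merge."""
    records = []
    original = pd.DataFrame.merge

    @functools.wraps(original)
    def observed_merge(self, right, *args, **kwargs):
        result = original(self, right, *args, **kwargs)  # the real merge, untouched
        try:
            caller = traceback.extract_stack(limit=2)[0]  # the frame that called merge
            records.append({
                "rows_left": len(self),
                "rows_right": len(right),
                "rows_out": len(result),
                "site": f"{caller.filename}:{caller.lineno}",
            })
        except Exception:
            pass  # bookkeeping must never break the pipeline
        return result

    pd.DataFrame.merge = observed_merge
    try:
        yield records
    finally:
        pd.DataFrame.merge = original  # always restore pandas


customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [5, 7, 9]})

with record_merges() as records:
    out = customers.merge(orders, on="customer_id")

print(records[0]["rows_left"], records[0]["rows_right"], records[0]["rows_out"])
```

The wrapper returns the original result object unchanged and swallows its own errors, mirroring the guarantees listed above.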
How it compares
rowflow watches the join itself at runtime — the in-vs-out cardinality of the operation. Schema
validators inspect a frame in isolation; lineage tools track where columns came from; the
validate= argument is a per-call opt-in. None of them is a zero-config runtime guard with a flow
picture.
| | rowflow | pandas validate= | pandera / Great Expectations | dbt tests | datalineagepy |
|---|---|---|---|---|---|
| Axis | join cardinality (correctness) | join cardinality | frame schema/values | warehouse data tests | column provenance |
| Setup | 1 import | a kwarg on every join | author a schema | a dbt project | wrap your frames |
| Catches silent join fan-out | ✅ | ✅ if you remember it | ❌ (frame looks valid) | ⚠️ post-hoc | ❌ |
| Compares rows in vs out | ✅ | n/a | ❌ | ❌ | ❌ |
| Zero-config, runtime | ✅ | ❌ (opt-in) | ❌ | ❌ | ⚠️ |
| Points at the exact line | ✅ | ✅ (raises) | n/a | ❌ | ❌ |
| Visual flow diagram | ✅ | ❌ | ❌ | ❌ | ✅ (lineage, not correctness) |
Where rowflow wins: one import, it runs live, it points at the exact line, and it draws the flow. Where it doesn't: it isn't a schema validator (pandera/GE check column types and value ranges it has no opinion on), and it isn't lineage/governance. Treat it as complementary to these tools, not a replacement.
Benchmarks & analysis
All figures are produced from live runs by tools/make_figures.py and the
KPIs by validation/kpi.py — no hardcoded numbers
(full results).
Accuracy. On a randomized corpus: 100% recall on realistic explosions, 0% false positives
across the join shapes a naive row-count rule gets wrong (1:1, 1:many, many:1, 1:many left join,
disjoint outer union), and 100% suppression when validate= declares intent.
See the explosion. rowflow renders the run as a Sankey (interactive HTML via rowflow[viz]); the
static view marks the offending step in red.
Cost. The O(1) gate keeps the key check off the happy path, so overhead is the wrapper bookkeeping: sub-millisecond per merge, a single-digit percentage at 100k rows and shrinking with size.
What it does NOT detect, and why
rowflow flags fan-outs that actually inflate the result. Stated plainly:
| Not detected | Why | Use instead |
|---|---|---|
| A many-to-many masked by row loss (net rows don't grow) | it stays below the O(1) gate; rowflow targets fan-outs that corrupt totals | an explicit validate= on that join |
| Row changes outside merge / join (concat, dropna, filtering) | not wrapped yet; usually intended | an assertion on the row count |
| Intentional many-to-many (a deliberate cross/expand join) | flagged by default; it can't read your intent | pass validate="many_to_many", or rowflow.configure(min_fanout_ratio=…) |
| Non-pandas backends (Modin, cuDF, Polars) | only pandas is patched (a safe no-op elsewhere) | — |
Silent inner-join row loss is also detectable, opt-in via rowflow.configure(detect_loss=True)
(off by default to keep zero false positives). rowflow is a coverage-bounded detector of
materialised row-count corruption — like a passing test, not a proof.
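The masked case in the first row of the table is easy to reproduce in plain pandas. The data below is invented for illustration: key 1 fans out many-to-many, but the inner join drops enough unmatched rows that the result never outgrows the larger input.

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 1, 2, 3, 4, 5], "revenue": [10, 10, 5, 5, 5, 5]})
right = pd.DataFrame({"k": [1, 1], "tag": ["a", "b"]})

out = left.merge(right, on="k")  # inner join: keys 2 to 5 are dropped

# The row-count gate described above: the output did not outgrow the larger input.
grew = len(out) > max(len(left), len(right))

# Revenue for key 1 is still doubled by the fan-out.
honest_key1 = left.loc[left["k"] == 1, "revenue"].sum()
joined_key1 = out["revenue"].sum()

# An explicit validate= on this join does catch it.
try:
    left.merge(right, on="k", validate="many_to_one")
    caught_by_validate = False
except pd.errors.MergeError:
    caught_by_validate = True

print(len(left), "->", len(out), "rows; grew:", grew)
print(honest_key1, "->", joined_key1, "revenue for key 1")
```

A row-count comparison alone cannot see this join, which is why the table recommends an explicit validate= for joins where it matters.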
References
- pandas #2690 — combinatorial explosion when merging dataframes. https://github.com/pandas-dev/pandas/issues/2690
- pandas: merge validate= parameter. https://pandas.pydata.org/docs/reference/api/pandas.merge.html
- Merge, join, concatenate and compare (pandas user guide). https://pandas.pydata.org/docs/user_guide/merging.html
- datalineagepy — column-level pandas lineage (a different axis: provenance, not correctness). https://pypi.org/project/datalineagepy/
Contributing & contact
Issues and pull requests are very welcome — start with CONTRIBUTING.md and the
Code of Conduct. Good places to start are observers for concat / dropna /
filtering, an opt-in strict key-scan mode (to catch the masked-fan-out boundary), or richer Sankey
rendering. And if rowflow ever misses an explosion it should have caught — or fires on a join that's
actually fine — please open an issue with a small reproducer; those are the reports I value most. You
can also reach me on LinkedIn.
License
MIT © 2026 Tommaso Aiello — free to use, modify, and distribute (including commercially); keep the copyright notice; provided "as is", without warranty.