Structured comparison reports for tabular data in Python — modeled on SAS PROC COMPARE.
Project description
proccompy
Structured comparison reports for tabular data in Python.
If you've ever needed to prove that two DataFrames hold the same data — after a pipeline rewrite, between source and ETL output, across a database migration, or in a daily regression check — you've probably found that pandas.DataFrame.compare() shows you cell-level diffs, assert_frame_equal gives you pass/fail, and neither tells you the thing you actually want to know: where do they differ, by how much, in which direction, and does it matter?
proccompy produces a structured diagnostic report that answers those questions in the order you'd ask them. It's modeled on SAS PROC COMPARE — a 40-year-old procedure that data analysts trust because it makes "what changed?" a 30-second scan instead of an afternoon of investigation.
import polars as pl
import proccompy as pc
from proccompy import T
base = pl.read_parquet("production.parquet")
rewrite = pl.read_parquet("new_implementation.parquet")
result = pc.compare(
base, rewrite,
id_columns="account_id",
tolerances={
"dollar_amount": T.absolute(0.01), # rounding-level diffs are OK
"rate": T.percent(0.001), # but >0.1% relative is not
},
)
result.to_lst("validation.lst") # SAS-style listing for archival / review
result.assert_matches() # raises AssertionError if they don't — for CI
Why this exists
The Python ecosystem has many ways to find differences between two DataFrames. What it doesn't have is a structured report that helps you interpret them.
pandas.DataFrame.compare()returns a multi-index DataFrame of cell-level diffs. No summary, no row-presence breakdown, no statistics.pandas.testing.assert_frame_equalraises on first difference. Tells you something is wrong, not what.datacompyhas a report, but it's a flat text dump optimized for a single screen. No per-column difference statistics, no SAS conventions, no.lstoutput, no streaming/out-of-core support.- Schema-validation tools (Great Expectations, Pandera) check one frame against rules. They don't compare two frames.
The result: every team building a data pipeline ends up writing their own ad-hoc comparison script. That script's report is always a little worse than PROC COMPARE's because PROC COMPARE has had four decades of refinement.
proccompy brings that report to Python natively — running on polars, pandas, or DuckDB; pluggable into pytest; producing .lst files SAS analysts already know how to read.
Who is this for
Data engineers doing pipeline validation. You're rewriting an implementation, migrating a database, or proving that yesterday's output matches today's. You want tolerance per column (because floating-point drift isn't "wrong"), the differences summarized and broken down, and a structured object you can assert on in CI.
ETL teams comparing output to source. You need a recurring check that your pipeline faithfully reproduces source data, with the diagnostic depth to tell rounding drift from real corruption.
SAS analysts who miss the PROC COMPARE listing. You moved to Python (willingly or not) and want a comparison report your boss will accept — .lst format, with the same sections you've read for 20 years.
Testing frameworks for data pipelines. You're building (or using) dbt tests, custom assertion harnesses, regression suites. proccompy gives you a clean programmatic CompareResult to embed: structured fields, pytest-friendly.
What the report tells you
A real PROC COMPARE listing isn't arbitrary — it's a diagnostic protocol. Each section answers a question, in the order you'd ask it:
- Dataset Summary — are we comparing what I think we're comparing?
- Variables Summary — do the schemas match?
- Observation Summary — do the same keys exist on both sides?
- Values Comparison Summary — how broken is this, in one glance?
- Variables with Unequal Values — which columns are responsible?
- Statistics for Numeric Differences — bias direction, spread, magnitude
- Comparison Results for Observations — show me the actual numbers
You can scan a 50-page .lst in 30 seconds and decide whether to investigate or close the ticket. That's the value.
Install
pip install proccompy
Requires Python 3.10+. Pulls in duckdb and polars. Add pandas only if you need it as input.
Usage
The basic comparison
import polars as pl
import proccompy as pc
base = pl.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.0, 30.0]})
compare = pl.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.5, 30.0]})
result = pc.compare(base, compare, id_columns="id")
result.matches # → False
result.summary() # → per-column polars DataFrame
print(result.report()) # → terminal-style report
Inputs can be polars DataFrames, pandas DataFrames, or DuckDB relations.
Tolerances per column
from proccompy import T
result = pc.compare(
base, compare,
id_columns="account_id",
tolerances={
"dollar_amount": T.absolute(0.01), # within 1 cent → equal
"rate": T.percent(0.001), # within 0.1% relative → equal
"category": T.exact(), # exact match required (default)
},
)
Three tolerance types: T.exact(), T.absolute(value), T.percent(value). Per-column. Defaults to exact for columns you don't specify.
Multiple ID columns
result = pc.compare(
base, compare,
id_columns=["sector", "naics_code", "year"],
)
SAS-style .lst output
result.to_lst("validation.lst")
Produces a paginated, fixed-width, ASCII-only .lst file with form-feed page breaks and SAS-style headers. Looks like real PROC COMPARE output. Open it in any text editor; attach it to a ticket; share it with anyone who used to write SAS.
Assertion for CI
def test_pipeline_rewrite_matches_production():
base = pl.read_parquet("expected.parquet")
actual = run_pipeline()
pc.compare(base, actual, id_columns="id").assert_matches()
Raises AssertionError with structured per-column failure detail. Drops into any pytest suite.
The structured result
CompareResult exposes everything as DataFrames you can post-process:
result.summary() # per-column summary
result.unequal_rows() # both-side rows with at least one diff
result.base_only_rows() # rows only in base
result.compare_only_rows() # rows only in compare
result.diff_dataset() # SAS OUT= equivalent: BASE / COMPARE / DIF / PERCENT
result.joined() # full joined frame for custom analysis
Command line
proccompy installs a proccompy console script for shell-level use — no Python required:
# Simple: compare two parquet files, report to stdout, exit 0 or 1
proccompy expected.parquet actual.parquet --id account_id
# With tolerances and multiple ID columns
proccompy old.csv new.csv --id "sector,year" \
--tolerance "amount=abs:0.01" \
--tolerance "rate=pct:0.001"
# Pipe everything to disk for archival / CI artifacts
proccompy a.parquet b.parquet --id id \
--lst report.lst \
--summary-csv summary.csv \
--diff-parquet diffs.parquet \
--quiet
Exit codes: 0 if the datasets match (within any specified tolerances), 1 if they differ in any compared value, row presence, or column overlap. Use as a CI gate:
proccompy expected.parquet actual.parquet --id id --quiet && deploy.sh
Supports parquet, CSV, and TSV input; format is inferred from extension or set with --format. Run proccompy --help for the full reference.
Per-column difference statistics
For numeric columns, the summary includes diagnostic statistics computed over matched non-null pairs (SAS PROC COMPARE STATS semantics):
summary = result.summary()
# Columns: column, base_dtype, compare_dtype, n_compared, n_unequal,
# n_null_mismatch, max_abs_diff, mean_diff, std_diff, min_diff,
# n_base_greater, n_compare_greater
These let you distinguish three failure modes that max_abs_diff alone can't:
| mean_diff | std_diff | n_base_gt | n_compare_gt | Diagnosis |
|---|---|---|---|---|
+2.0 |
0.0 |
0 |
195 |
Constant positive offset (unit conversion?) |
0.0 |
5.1 |
10 |
10 |
No bias but high spread (real drift) |
+0.1 |
0.0 |
0 |
195 |
Small uniform drift, within tolerance |
Alternatives — when to use what
| Best for | Limitation | |
|---|---|---|
pandas.DataFrame.compare() |
Quick cell-level diff in a notebook | No summary, no report, no tolerance |
pandas.testing.assert_frame_equal |
Strict pass/fail in tests | Stops at first diff, no diagnostic depth |
datacompy |
Flat text report for one-screen review | Memory-bound, global-only tolerance, no .lst, no per-column diff stats |
| Great Expectations, Pandera | Schema/distribution validation on one frame | Don't compare two frames |
| proccompy | Structured diagnostic report for pipeline validation | Out-of-core via DuckDB engine only (Spark engine not yet implemented) |
If you need a one-line cell diff, use pandas.compare(). If you need a structured report you can attach to a PR, a Jira ticket, or a SAS analyst's email, use proccompy.
What's under the hood
- DuckDB engine. The comparison is a
FULL OUTER JOINkeyed onid_columnswith per-column unequal expressions. DuckDB streams this on disk, so you're not memory-bound even for hundreds of millions of rows. - Polars I/O. Frames register zero-copy via Arrow; results come back as polars DataFrames.
- Null-safe. Uses SQL
IS NOT DISTINCT FROMsemantics. No sentinel-string hacks. - Strict duplicate-key detection. If your ID columns aren't unique, you'll get a clear error with counts, not silent wrong answers.
- Apache 2.0 licensed.
Roadmap
- v0.4 — BY groups (stratified comparison summaries)
- v0.5 —
result.to_html()self-contained HTML report - v1.0 — pytest plugin, full documentation site, PyPI release
Stable: top-level pc.compare(), T.exact/absolute/percent, CompareResult accessor methods, .lst format, CLI flags and exit codes.
Subject to change: _joined internal structure, exact .lst column widths.
Contributing
Issues and PRs welcome. Run the test suite with pytest tests/. Currently 92 tests; please add one for any new behavior.
License
Apache 2.0. See LICENSE.
Built by Kulbhushan Sharma because the PROC COMPARE report was the SAS feature I missed most after switching to Python.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file proccompy-0.4.2.tar.gz.
File metadata
- Download URL: proccompy-0.4.2.tar.gz
- Upload date:
- Size: 42.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
92391e2c3ce1561a375410d44f98d5b4ed4de58a03667d8c51e21a99918b9a21
|
|
| MD5 |
45f5cc06e383e6390d09a321134a8dd0
|
|
| BLAKE2b-256 |
69c97b5ba2b44ca30b4adeef66788500b457687ce46ae6e798dca8d31ec0400f
|
File details
Details for the file proccompy-0.4.2-py3-none-any.whl.
File metadata
- Download URL: proccompy-0.4.2-py3-none-any.whl
- Upload date:
- Size: 32.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2e63615e023ff76f0798e3bbbd9e2c4cb6638b2425ac570b833cdbfe10942133
|
|
| MD5 |
adfa6862ac85018fb4023c41c80025c9
|
|
| BLAKE2b-256 |
9f9be46a6e5c116dde8bb9b94550193524a5475f51ab759090ba79642bfb46b9
|