Structured comparison reports for tabular data in Python — modeled on SAS PROC COMPARE.

These details have not been verified by PyPI

Project links

Project description

proccompy

Structured comparison reports for tabular data in Python.

If you've ever needed to prove that two DataFrames hold the same data — after a pipeline rewrite, between source and ETL output, across a database migration, or in a daily regression check — you've probably found that pandas.DataFrame.compare() shows you cell-level diffs, assert_frame_equal gives you pass/fail, and neither tells you the thing you actually want to know: where do they differ, by how much, in which direction, and does it matter?

proccompy produces a structured diagnostic report that answers those questions in the order you'd ask them. It's modeled on SAS PROC COMPARE — a 40-year-old procedure that data analysts trust because it makes "what changed?" a 30-second scan instead of an afternoon of investigation.

import polars as pl
import proccompy as pc
from proccompy import T

base    = pl.read_parquet("production.parquet")
rewrite = pl.read_parquet("new_implementation.parquet")

result = pc.compare(
    base, rewrite,
    id_columns="account_id",
    tolerances={
        "dollar_amount": T.absolute(0.01),   # rounding-level diffs are OK
        "rate":          T.percent(0.001),   # but >0.1% relative is not
    },
)

result.to_lst("validation.lst")     # SAS-style listing for archival / review
result.assert_matches()              # raises AssertionError if they don't — for CI

Why this exists

The Python ecosystem has many ways to find differences between two DataFrames. What it doesn't have is a structured report that helps you interpret them.

pandas.DataFrame.compare() returns a multi-index DataFrame of cell-level diffs. No summary, no row-presence breakdown, no statistics.
pandas.testing.assert_frame_equal raises on first difference. Tells you something is wrong, not what.
datacompy has a report, but it's a flat text dump optimized for a single screen. No per-column difference statistics, no SAS conventions, no .lst output, no streaming/out-of-core support.
Schema-validation tools (Great Expectations, Pandera) check one frame against rules. They don't compare two frames.

The result: every team building a data pipeline ends up writing their own ad-hoc comparison script. That script's report is always a little worse than PROC COMPARE's because PROC COMPARE has had four decades of refinement.

proccompy brings that report to Python natively — running on polars, pandas, or DuckDB; pluggable into pytest; producing .lst files SAS analysts already know how to read.

Who is this for

Data engineers doing pipeline validation. You're rewriting an implementation, migrating a database, or proving that yesterday's output matches today's. You want tolerance per column (because floating-point drift isn't "wrong"), the differences summarized and broken down, and a structured object you can assert on in CI.

ETL teams comparing output to source. You need a recurring check that your pipeline faithfully reproduces source data, with the diagnostic depth to tell rounding drift from real corruption.

SAS analysts who miss the PROC COMPARE listing. You moved to Python (willingly or not) and want a comparison report your boss will accept — .lst format, with the same sections you've read for 20 years.

Testing frameworks for data pipelines. You're building (or using) dbt tests, custom assertion harnesses, regression suites. proccompy gives you a clean programmatic CompareResult to embed: structured fields, pytest-friendly.

What the report tells you

A real PROC COMPARE listing isn't arbitrary — it's a diagnostic protocol. Each section answers a question, in the order you'd ask it:

Dataset Summary — are we comparing what I think we're comparing?
Variables Summary — do the schemas match?
Observation Summary — do the same keys exist on both sides?
Values Comparison Summary — how broken is this, in one glance?
Variables with Unequal Values — which columns are responsible?
Statistics for Numeric Differences — bias direction, spread, magnitude
Comparison Results for Observations — show me the actual numbers

You can scan a 50-page .lst in 30 seconds and decide whether to investigate or close the ticket. That's the value.

Install

pip install proccompy

Requires Python 3.10+. Pulls in duckdb and polars. Add pandas only if you need it as input.

Usage

The basic comparison

import polars as pl
import proccompy as pc

base    = pl.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.0, 30.0]})
compare = pl.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.5, 30.0]})

result = pc.compare(base, compare, id_columns="id")

result.matches             # → False
result.summary()           # → per-column polars DataFrame
print(result.report())     # → terminal-style report

Inputs can be polars DataFrames, pandas DataFrames, or DuckDB relations.

Tolerances per column

from proccompy import T

result = pc.compare(
    base, compare,
    id_columns="account_id",
    tolerances={
        "dollar_amount": T.absolute(0.01),   # within 1 cent → equal
        "rate":          T.percent(0.001),   # within 0.1% relative → equal
        "category":      T.exact(),          # exact match required (default)
    },
)

Three tolerance types: T.exact(), T.absolute(value), T.percent(value). Per-column. Defaults to exact for columns you don't specify.

Multiple ID columns

result = pc.compare(
    base, compare,
    id_columns=["sector", "naics_code", "year"],
)

SAS-style .lst output

result.to_lst("validation.lst")

Produces a paginated, fixed-width, ASCII-only .lst file with form-feed page breaks and SAS-style headers. Looks like real PROC COMPARE output. Open it in any text editor; attach it to a ticket; share it with anyone who used to write SAS.

Assertion for CI

def test_pipeline_rewrite_matches_production():
    base = pl.read_parquet("expected.parquet")
    actual = run_pipeline()
    pc.compare(base, actual, id_columns="id").assert_matches()

Raises AssertionError with structured per-column failure detail. Drops into any pytest suite.

The structured result

CompareResult exposes everything as DataFrames you can post-process:

result.summary()             # per-column summary
result.unequal_rows()        # both-side rows with at least one diff
result.base_only_rows()      # rows only in base
result.compare_only_rows()   # rows only in compare
result.diff_dataset()        # SAS OUT= equivalent: BASE / COMPARE / DIF / PERCENT
result.joined()              # full joined frame for custom analysis

Command line

proccompy installs a proccompy console script for shell-level use — no Python required:

# Simple: compare two parquet files, report to stdout, exit 0 or 1
proccompy expected.parquet actual.parquet --id account_id

# With tolerances and multiple ID columns
proccompy old.csv new.csv --id "sector,year" \
    --tolerance "amount=abs:0.01" \
    --tolerance "rate=pct:0.001"

# Pipe everything to disk for archival / CI artifacts
proccompy a.parquet b.parquet --id id \
    --lst report.lst \
    --summary-csv summary.csv \
    --diff-parquet diffs.parquet \
    --quiet

Exit codes: 0 if the datasets match (within any specified tolerances), 1 if they differ in any compared value, row presence, or column overlap. Use as a CI gate:

proccompy expected.parquet actual.parquet --id id --quiet && deploy.sh

Supports parquet, CSV, and TSV input; format is inferred from extension or set with --format. Run proccompy --help for the full reference.

Per-column difference statistics

For numeric columns, the summary includes diagnostic statistics computed over matched non-null pairs (SAS PROC COMPARE STATS semantics):

summary = result.summary()
# Columns: column, base_dtype, compare_dtype, n_compared, n_unequal,
#          n_null_mismatch, max_abs_diff, mean_diff, std_diff, min_diff,
#          n_base_greater, n_compare_greater

These let you distinguish three failure modes that max_abs_diff alone can't:

mean_diff	std_diff	n_base_gt	n_compare_gt	Diagnosis
`+2.0`	`0.0`	`0`	`195`	Constant positive offset (unit conversion?)
`0.0`	`5.1`	`10`	`10`	No bias but high spread (real drift)
`+0.1`	`0.0`	`0`	`195`	Small uniform drift, within tolerance

Alternatives — when to use what

	Best for	Limitation
`pandas.DataFrame.compare()`	Quick cell-level diff in a notebook	No summary, no report, no tolerance
`pandas.testing.assert_frame_equal`	Strict pass/fail in tests	Stops at first diff, no diagnostic depth
`datacompy`	Flat text report for one-screen review	Memory-bound, global-only tolerance, no `.lst`, no per-column diff stats
Great Expectations, Pandera	Schema/distribution validation on one frame	Don't compare two frames
proccompy	Structured diagnostic report for pipeline validation	Out-of-core via DuckDB engine only (Spark engine not yet implemented)

If you need a one-line cell diff, use pandas.compare(). If you need a structured report you can attach to a PR, a Jira ticket, or a SAS analyst's email, use proccompy.

What's under the hood

DuckDB engine. The comparison is a FULL OUTER JOIN keyed on id_columns with per-column unequal expressions. DuckDB streams this on disk, so you're not memory-bound even for hundreds of millions of rows.
Polars I/O. Frames register zero-copy via Arrow; results come back as polars DataFrames.
Null-safe. Uses SQL IS NOT DISTINCT FROM semantics. No sentinel-string hacks.
Strict duplicate-key detection. If your ID columns aren't unique, you'll get a clear error with counts, not silent wrong answers.
Apache 2.0 licensed.

Roadmap

v0.4 — BY groups (stratified comparison summaries)
v0.5 — result.to_html() self-contained HTML report
v1.0 — pytest plugin, full documentation site, PyPI release

Stable: top-level pc.compare(), T.exact/absolute/percent, CompareResult accessor methods, .lst format, CLI flags and exit codes.

Subject to change: _joined internal structure, exact .lst column widths.

Contributing

Issues and PRs welcome. Run the test suite with pytest tests/. Currently 92 tests; please add one for any new behavior.

License

Apache 2.0. See LICENSE.

Built by Kulbhushan Sharma because the PROC COMPARE report was the SAS feature I missed most after switching to Python.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.4.2

May 26, 2026

0.4.1

May 25, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

proccompy-0.4.2.tar.gz (42.4 kB view details)

Uploaded May 26, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

proccompy-0.4.2-py3-none-any.whl (32.2 kB view details)

Uploaded May 26, 2026 Python 3

File details

Details for the file proccompy-0.4.2.tar.gz.

File metadata

Download URL: proccompy-0.4.2.tar.gz
Upload date: May 26, 2026
Size: 42.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for proccompy-0.4.2.tar.gz
Algorithm	Hash digest
SHA256	`92391e2c3ce1561a375410d44f98d5b4ed4de58a03667d8c51e21a99918b9a21`
MD5	`45f5cc06e383e6390d09a321134a8dd0`
BLAKE2b-256	`69c97b5ba2b44ca30b4adeef66788500b457687ce46ae6e798dca8d31ec0400f`

See more details on using hashes here.

File details

Details for the file proccompy-0.4.2-py3-none-any.whl.

File metadata

Download URL: proccompy-0.4.2-py3-none-any.whl
Upload date: May 26, 2026
Size: 32.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for proccompy-0.4.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2e63615e023ff76f0798e3bbbd9e2c4cb6638b2425ac570b833cdbfe10942133`
MD5	`adfa6862ac85018fb4023c41c80025c9`
BLAKE2b-256	`9f9be46a6e5c116dde8bb9b94550193524a5475f51ab759090ba79642bfb46b9`

See more details on using hashes here.

proccompy 0.4.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

proccompy

Why this exists

Who is this for

What the report tells you

Install

Usage

The basic comparison

Tolerances per column

Multiple ID columns

SAS-style .lst output

Assertion for CI

The structured result

Command line

Per-column difference statistics

Alternatives — when to use what

What's under the hood

Roadmap

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes