Skip to main content

Structured comparison reports for tabular data in Python — modeled on SAS PROC COMPARE.

Project description

proccompy

Structured comparison reports for tabular data in Python.

If you've ever needed to prove that two DataFrames hold the same data — after a pipeline rewrite, between source and ETL output, across a database migration, or in a daily regression check — you've probably found that pandas.DataFrame.compare() shows you cell-level diffs, assert_frame_equal gives you pass/fail, and neither tells you the thing you actually want to know: where do they differ, by how much, in which direction, and does it matter?

proccompy produces a structured diagnostic report that answers those questions in the order you'd ask them. It's modeled on SAS PROC COMPARE — a 40-year-old procedure that data analysts trust because it makes "what changed?" a 30-second scan instead of an afternoon of investigation.

import polars as pl
import proccompy as pc
from proccompy import T

base    = pl.read_parquet("production.parquet")
rewrite = pl.read_parquet("new_implementation.parquet")

result = pc.compare(
    base, rewrite,
    id_columns="account_id",
    tolerances={
        "dollar_amount": T.absolute(0.01),   # rounding-level diffs are OK
        "rate":          T.percent(0.001),   # but >0.1% relative is not
    },
)

result.to_lst("validation.lst")     # SAS-style listing for archival / review
result.assert_matches()              # raises AssertionError if they don't — for CI

Why this exists

The Python ecosystem has many ways to find differences between two DataFrames. What it doesn't have is a structured report that helps you interpret them.

  • pandas.DataFrame.compare() returns a multi-index DataFrame of cell-level diffs. No summary, no row-presence breakdown, no statistics.
  • pandas.testing.assert_frame_equal raises on first difference. Tells you something is wrong, not what.
  • datacompy has a report, but it's a flat text dump optimized for a single screen. No per-column difference statistics, no SAS conventions, no .lst output, no streaming/out-of-core support.
  • Schema-validation tools (Great Expectations, Pandera) check one frame against rules. They don't compare two frames.

The result: every team building a data pipeline ends up writing their own ad-hoc comparison script. That script's report is always a little worse than PROC COMPARE's because PROC COMPARE has had four decades of refinement.

proccompy brings that report to Python natively — running on polars, pandas, or DuckDB; pluggable into pytest; producing .lst files SAS analysts already know how to read.

Who is this for

Data engineers doing pipeline validation. You're rewriting an implementation, migrating a database, or proving that yesterday's output matches today's. You want tolerance per column (because floating-point drift isn't "wrong"), the differences summarized and broken down, and a structured object you can assert on in CI.

ETL teams comparing output to source. You need a recurring check that your pipeline faithfully reproduces source data, with the diagnostic depth to tell rounding drift from real corruption.

SAS analysts who miss the PROC COMPARE listing. You moved to Python (willingly or not) and want a comparison report your boss will accept — .lst format, with the same sections you've read for 20 years.

Testing frameworks for data pipelines. You're building (or using) dbt tests, custom assertion harnesses, regression suites. proccompy gives you a clean programmatic CompareResult to embed: structured fields, pytest-friendly.

What the report tells you

A real PROC COMPARE listing isn't arbitrary — it's a diagnostic protocol. Each section answers a question, in the order you'd ask it:

  1. Dataset Summary — are we comparing what I think we're comparing?
  2. Variables Summary — do the schemas match?
  3. Observation Summary — do the same keys exist on both sides?
  4. Values Comparison Summary — how broken is this, in one glance?
  5. Variables with Unequal Values — which columns are responsible?
  6. Statistics for Numeric Differences — bias direction, spread, magnitude
  7. Comparison Results for Observations — show me the actual numbers

You can scan a 50-page .lst in 30 seconds and decide whether to investigate or close the ticket. That's the value.

Install

pip install proccompy

Requires Python 3.10+. Pulls in duckdb and polars. Add pandas only if you need it as input.

Usage

The basic comparison

import polars as pl
import proccompy as pc

base    = pl.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.0, 30.0]})
compare = pl.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.5, 30.0]})

result = pc.compare(base, compare, id_columns="id")

result.matches             # → False
result.summary()           # → per-column polars DataFrame
print(result.report())     # → terminal-style report

Inputs can be polars DataFrames, pandas DataFrames, or DuckDB relations.

Tolerances per column

from proccompy import T

result = pc.compare(
    base, compare,
    id_columns="account_id",
    tolerances={
        "dollar_amount": T.absolute(0.01),   # within 1 cent → equal
        "rate":          T.percent(0.001),   # within 0.1% relative → equal
        "category":      T.exact(),          # exact match required (default)
    },
)

Three tolerance types: T.exact(), T.absolute(value), T.percent(value). Per-column. Defaults to exact for columns you don't specify.

Multiple ID columns

result = pc.compare(
    base, compare,
    id_columns=["sector", "naics_code", "year"],
)

SAS-style .lst output

result.to_lst("validation.lst")

Produces a paginated, fixed-width, ASCII-only .lst file with form-feed page breaks and SAS-style headers. Looks like real PROC COMPARE output. Open it in any text editor; attach it to a ticket; share it with anyone who used to write SAS.

Assertion for CI

def test_pipeline_rewrite_matches_production():
    base = pl.read_parquet("expected.parquet")
    actual = run_pipeline()
    pc.compare(base, actual, id_columns="id").assert_matches()

Raises AssertionError with structured per-column failure detail. Drops into any pytest suite.

The structured result

CompareResult exposes everything as DataFrames you can post-process:

result.summary()             # per-column summary
result.unequal_rows()        # both-side rows with at least one diff
result.base_only_rows()      # rows only in base
result.compare_only_rows()   # rows only in compare
result.diff_dataset()        # SAS OUT= equivalent: BASE / COMPARE / DIF / PERCENT
result.joined()              # full joined frame for custom analysis

Command line

proccompy installs a proccompy console script for shell-level use — no Python required:

# Simple: compare two parquet files, report to stdout, exit 0 or 1
proccompy expected.parquet actual.parquet --id account_id

# With tolerances and multiple ID columns
proccompy old.csv new.csv --id "sector,year" \
    --tolerance "amount=abs:0.01" \
    --tolerance "rate=pct:0.001"

# Pipe everything to disk for archival / CI artifacts
proccompy a.parquet b.parquet --id id \
    --lst report.lst \
    --summary-csv summary.csv \
    --diff-parquet diffs.parquet \
    --quiet

Exit codes: 0 if the datasets match (within any specified tolerances), 1 if they differ in any compared value, row presence, or column overlap. Use as a CI gate:

proccompy expected.parquet actual.parquet --id id --quiet && deploy.sh

Supports parquet, CSV, and TSV input; format is inferred from extension or set with --format. Run proccompy --help for the full reference.

Per-column difference statistics

For numeric columns, the summary includes diagnostic statistics computed over matched non-null pairs (SAS PROC COMPARE STATS semantics):

summary = result.summary()
# Columns: column, base_dtype, compare_dtype, n_compared, n_unequal,
#          n_null_mismatch, max_abs_diff, mean_diff, std_diff, min_diff,
#          n_base_greater, n_compare_greater

These let you distinguish three failure modes that max_abs_diff alone can't:

mean_diff std_diff n_base_gt n_compare_gt Diagnosis
+2.0 0.0 0 195 Constant positive offset (unit conversion?)
0.0 5.1 10 10 No bias but high spread (real drift)
+0.1 0.0 0 195 Small uniform drift, within tolerance

Alternatives — when to use what

Best for Limitation
pandas.DataFrame.compare() Quick cell-level diff in a notebook No summary, no report, no tolerance
pandas.testing.assert_frame_equal Strict pass/fail in tests Stops at first diff, no diagnostic depth
datacompy Flat text report for one-screen review Memory-bound, global-only tolerance, no .lst, no per-column diff stats
Great Expectations, Pandera Schema/distribution validation on one frame Don't compare two frames
proccompy Structured diagnostic report for pipeline validation Out-of-core via DuckDB engine only (Spark engine not yet implemented)

If you need a one-line cell diff, use pandas.compare(). If you need a structured report you can attach to a PR, a Jira ticket, or a SAS analyst's email, use proccompy.

What's under the hood

  • DuckDB engine. The comparison is a FULL OUTER JOIN keyed on id_columns with per-column unequal expressions. DuckDB streams this on disk, so you're not memory-bound even for hundreds of millions of rows.
  • Polars I/O. Frames register zero-copy via Arrow; results come back as polars DataFrames.
  • Null-safe. Uses SQL IS NOT DISTINCT FROM semantics. No sentinel-string hacks.
  • Strict duplicate-key detection. If your ID columns aren't unique, you'll get a clear error with counts, not silent wrong answers.
  • Apache 2.0 licensed.

Roadmap

  • v0.4 — BY groups (stratified comparison summaries)
  • v0.5result.to_html() self-contained HTML report
  • v1.0 — pytest plugin, full documentation site, PyPI release

Stable: top-level pc.compare(), T.exact/absolute/percent, CompareResult accessor methods, .lst format, CLI flags and exit codes.

Subject to change: _joined internal structure, exact .lst column widths.

Contributing

Issues and PRs welcome. Run the test suite with pytest tests/. Currently 92 tests; please add one for any new behavior.

License

Apache 2.0. See LICENSE.


Built by Kulbhushan Sharma because the PROC COMPARE report was the SAS feature I missed most after switching to Python.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

proccompy-0.4.1.tar.gz (37.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

proccompy-0.4.1-py3-none-any.whl (30.1 kB view details)

Uploaded Python 3

File details

Details for the file proccompy-0.4.1.tar.gz.

File metadata

  • Download URL: proccompy-0.4.1.tar.gz
  • Upload date:
  • Size: 37.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for proccompy-0.4.1.tar.gz
Algorithm Hash digest
SHA256 ecb6c1b5eebb9f31fcca07844c536d41003798c652ac48ab7b8e7ec98f044286
MD5 ddecabfb67736cd5f4dc90b18009b67f
BLAKE2b-256 5143d7ed7ad0332834df030e2604e82e084a0708250fcec00d2f44faf00b3398

See more details on using hashes here.

File details

Details for the file proccompy-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: proccompy-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 30.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for proccompy-0.4.1-py3-none-any.whl
Algorithm Hash digest
SHA256 612e6e71d8c0824f0c2f7bbbe4709857b74a7a1a62af46f1b2d2bc0c3748f5f5
MD5 cf9795c76314caaa87475a043e49588f
BLAKE2b-256 6b5b29cc0bac012728e3304f1de45a82550e418a70e0dc8b185791d67955efeb

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page