pl_compare

A tool to find the differences between two tables.

These details have not been verified by PyPI

Project description

pl_compare: Compare and find the differences between two Polars DataFrames.

Get statistical summaries and/or examples and/or a boolean to indicate:
- Schema differences
- Row differences
- Value differences
Easily works for Pandas dataframes and other tabular data formats with conversion using Apache arrow
View differences as a text report
Get differences as a Polars LazyFrame or DataFrame
Use LazyFrames for larger than memory comparisons
Specify the equality calculation that is used to dermine value differences

Installation

pip install pl_compare

Examples

Return booleans to check for schema, row and value differences

import polars as pl
from pl_compare import compare

base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)

compare_result = compare(["ID"], base_df, compare_df)
print("is_schema_unequal:", compare_result.is_schema_unequal())
print("is_rows_unequal:", compare_result.is_rows_unequal())
print("is_values_unequal:", compare_result.is_values_unequal())

output:

is_schema_unequal: True
is_rows_unequal: True
is_values_unequal: True

Schema differences summary and details

import olars as pl.
from pl_compare import ceompar

base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)

compare_result = compare(["ID"], base_df, compare_df)
print("schema_differences_summary()")
print(compare_result.schema_differences_summary())
print("schema_differences_sample()")
print(compare_result.schema_differences_sample())

output:

schema_differences_summary()
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 1     │
│ Columns in compare              ┆ 1     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema differences ┆ 1     │
└─────────────────────────────────┴───────┘
schema_differences_sample()
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ Utf8        ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘

Row differences summary and details

import olars as pl.
from pl_compare import ceompar

base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)

compare_result = compare(["ID"], base_df, compare_df)
print("row_differences_summary()")
print(compare_result.row_differences_summary())
print("row_differences_sample()")
print(compare_result.row_differences_sample())

output:

row_differences_summary()
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
row_differences_sample()
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘

Value differences summary and details

import polars as pl
from pl_compare import compare

base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)

compare_result = compare(["ID"], base_df, compare_df)
print("value_differences_summary()")
print(compare_result.value_differences_summary())
print("value_differences_sample()")
print(compare_result.value_differences_sample())

output:

value_differences_summary()
shape: (1, 2)
┌──────────────────────────────┬───────┐
│ Value Differences for Column ┆ Count │
│ ---                          ┆ ---   │
│ str                          ┆ i64   │
╞══════════════════════════════╪═══════╡
│ Example1                     ┆ 1     │
└──────────────────────────────┴───────┘
value_differences_sample()
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘

Full report

import polars as pl
from pl_compare import compare

base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)

compare_result = compare(["ID"], base_df, compare_df)
compare_result.report()

output:

Schema summary:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema differences ┆ 1     │
└─────────────────────────────────┴───────┘
Schema differences: True
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ Utf8        ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
Row summary:
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
Row differences: True
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘
Value summary:
shape: (1, 2)
┌──────────────────────────────┬───────┐
│ Value Differences for Column ┆ Count │
│ ---                          ┆ ---   │
│ str                          ┆ i64   │
╞══════════════════════════════╪═══════╡
│ Example1                     ┆ 1     │
└──────────────────────────────┴───────┘
Value differences: True
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘
All differences summary:
shape: (12, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema differences ┆ 1     │
│ Rows in base                    ┆ 3     │
│ Rows in compare                 ┆ 3     │
│ Rows only in base               ┆ 1     │
│ Rows only in compare            ┆ 1     │
│ Rows in base and compare        ┆ 2     │
│ Value diffs Col:Example1        ┆ 1     │
└─────────────────────────────────┴───────┘

Compare two pandas dataframes

import polars as pl
import pandas as pd
from pl_compare import compare

base_df = pd.DataFrame(data=
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pd.DataFrame(data=
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)

compare_result = compare(["ID"], pl.from_pandas(base_df), pl.from_pandas(compare_df))
compare_result.report()

output:

Schema summary:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema differences ┆ 1     │
└─────────────────────────────────┴───────┘
Schema differences: True
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ Utf8        ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
Row summary:
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
Row differences: True
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘
Value summary:
shape: (1, 2)
┌──────────────────────────────┬───────┐
│ Value Differences for Column ┆ Count │
│ ---                          ┆ ---   │
│ str                          ┆ i64   │
╞══════════════════════════════╪═══════╡
│ Example1                     ┆ 1     │
└──────────────────────────────┴───────┘
Value differences: True
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘
All differences summary:
shape: (12, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema differences ┆ 1     │
│ Rows in base                    ┆ 3     │
│ Rows in compare                 ┆ 3     │
│ Rows only in base               ┆ 1     │
│ Rows only in compare            ┆ 1     │
│ Rows in base and compare        ┆ 2     │
│ Value diffs Col:Example1        ┆ 1     │
└─────────────────────────────────┴───────┘

Specify a threshold to control the frandularity of hte comparison for numeric columns.

import polars as pl
from pl_compare import compare

base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1.111, 6.11, 3.11],
    }
)

compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1.114, 6.14, 3.12],
    },
)

def custom_equality_check(col: str, format: pl.DataType) -> pl.Expr:
    return (
        (pl.col(f"{col}_base") != pl.col(f"{col}_compare"))
        | (pl.col(f"{col}_base").is_null() & ~pl.col(f"{col}_compare").is_null())
        | (~pl.col(f"{col}_base").is_null() & pl.col(f"{col}_compare").is_null())
    )
print("With threshold of 0.01")
compare_result = compare(["ID"], base_df, compare_df, threshold=0.01)
print(compare_result.value_differences_sample())
print("With no threshold")
compare_result = compare(["ID"], base_df, compare_df)
print(compare_result.value_differences_sample())

output:

With threshold of 0.01
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ f64  ┆ f64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6.11 ┆ 6.14    │
└─────────┴──────────┴──────┴─────────┘
With no threshold
shape: (2, 4)
┌─────────┬──────────┬───────┬─────────┐
│ ID      ┆ variable ┆ base  ┆ compare │
│ ---     ┆ ---      ┆ ---   ┆ ---     │
│ str     ┆ str      ┆ f64   ┆ f64     │
╞═════════╪══════════╪═══════╪═════════╡
│ 123456  ┆ Example1 ┆ 1.111 ┆ 1.114   │
│ 1234567 ┆ Example1 ┆ 6.11  ┆ 6.14    │
└─────────┴──────────┴───────┴─────────┘

Example using alias for base and compare dataframes.

import polars as pl
from pl_compare import compare

base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)

compare_result = compare(["ID"], 
                         base_df, 
                         compare_df, 
                         base_alias="before_change", 
                         compare_alias="after_change")

print("value_differences_summary()")
print(compare_result.schema_differences_sample())
print("value_differences_sample()")
print(compare_result.value_differences_sample())

output:

value_differences_summary()
shape: (2, 3)
┌──────────┬──────────────────────┬─────────────────────┐
│ column   ┆ before_change_format ┆ after_change_format │
│ ---      ┆ ---                  ┆ ---                 │
│ str      ┆ str                  ┆ str                 │
╞══════════╪══════════════════════╪═════════════════════╡
│ Example2 ┆ Utf8                 ┆ Int64               │
│ Example3 ┆ null                 ┆ Int64               │
└──────────┴──────────────────────┴─────────────────────┘
value_differences_sample()
shape: (1, 4)
┌─────────┬──────────┬───────────────┬──────────────┐
│ ID      ┆ variable ┆ before_change ┆ after_change │
│ ---     ┆ ---      ┆ ---           ┆ ---          │
│ str     ┆ str      ┆ i64           ┆ i64          │
╞═════════╪══════════╪═══════════════╪══════════════╡
│ 1234567 ┆ Example1 ┆ 6             ┆ 2            │
└─────────┴──────────┴───────────────┴──────────────┘

To DO:

Linting (Ruff)
Make into python package
Add makefile for easy linting and tests
Statistics should indicate which statistics are referencing columns
Add all statistics frame to tests
Add schema differences to schema summary
Make row examples alternate between base only and compare only so that it is more readable.
Add limit value to the examples.
Updated value differences summary so that Statistic is something that makes sense.
Publish package to pypi
Add difference criterion.
Add license
Make package easy to use (i.e. so you only have to import pl_compare and then you can us pl_compare)
Add table name labels that can replace 'base' and 'compare'.
Update code to use a config dataclass that can be passed between the class and functions.
[] Write up docstrings
Write up readme (with code examples)
Add parameter to hide column differences with 0 differences.
Add flag to indicate if there are differences between the tables.
Update report so that non differences are not displayed.
[] Raise error and print examples if duplicates are present.
[] Add a count of the number of rows that have any differences to the value differences summary.
[] Add total number of value differences to the value differences summary.
[] Change id_columns to be named 'join_on' and add a test that checks that abritrary join conditions work.
[] Simplify custom equality checks and add example.
[] Test for large amounts of data
[] Benchmark for different sizes of data.
[] strict MyPy type checking
[] Github actions for testing
[] Github actions for linting
Github actions for publishing

not sure of:

[] Seperate out dev dependencies from library dependencies?
[] Change 'threshold' to be 'granularity'/'numeric granularity?'

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.10.1

Feb 28, 2026

0.10.0

Feb 21, 2026

0.9.3

Dec 12, 2025

0.9.2

Apr 17, 2025

0.9.1

Apr 3, 2025

0.9.0

Apr 3, 2025

0.8.0

Mar 10, 2025

0.7.1

Feb 17, 2025

0.7.0

Feb 16, 2025

0.6.0

Oct 27, 2024

0.5.2

Aug 5, 2024

0.5.1

Aug 2, 2024

0.5.0

Jul 19, 2024

0.4.2

Mar 15, 2024

0.4.1

Mar 15, 2024

0.3.6

Mar 2, 2024

0.3.5

Dec 20, 2023

0.3.4

Dec 20, 2023

0.3.3

Dec 18, 2023

0.3.1

Dec 3, 2023

0.2.4

Dec 3, 2023

0.2.3

Dec 3, 2023

0.2.1

Nov 24, 2023

0.1.24

Oct 27, 2023

0.1.23

Oct 27, 2023

0.1.22

Oct 13, 2023

0.1.21

Sep 29, 2023

0.1.20

Sep 29, 2023

0.1.19

Sep 29, 2023

This version

0.1.18

Sep 29, 2023

0.1.17

Sep 29, 2023

0.1.16

Sep 29, 2023

0.1.15

Sep 29, 2023

0.1.14

Sep 29, 2023

0.1.12

Sep 15, 2023

0.1.11

Sep 15, 2023

0.1.9

Sep 12, 2023

0.1.8

Sep 12, 2023

0.1.7

Sep 8, 2023

0.1.6

Sep 8, 2023

0.1.5

Sep 8, 2023

0.1.4

Sep 8, 2023

0.1.3

Sep 8, 2023

0.1.2

Sep 8, 2023

0.1.1

Sep 8, 2023

0.1.0

Sep 8, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pl_compare-0.1.18.tar.gz (11.9 kB view details)

Uploaded Sep 29, 2023 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pl_compare-0.1.18-py3-none-any.whl (11.7 kB view details)

Uploaded Sep 29, 2023 Python 3

File details

Details for the file pl_compare-0.1.18.tar.gz.

File metadata

Download URL: pl_compare-0.1.18.tar.gz
Upload date: Sep 29, 2023
Size: 11.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.4.2 CPython/3.10.12 Linux/6.1.47

File hashes

Hashes for pl_compare-0.1.18.tar.gz
Algorithm	Hash digest
SHA256	`6d71c354da5f6f6bc127168555a67521be24d8daa33c2c6c777944a06365386d`
MD5	`3e767767eccbc8308671a3c0f61495cd`
BLAKE2b-256	`461e5a509b81002ba05c98a2a7dbfb1f9b73166873e14c2eb37e45c7640d3f3a`

See more details on using hashes here.

File details

Details for the file pl_compare-0.1.18-py3-none-any.whl.

File metadata

Download URL: pl_compare-0.1.18-py3-none-any.whl
Upload date: Sep 29, 2023
Size: 11.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.4.2 CPython/3.10.12 Linux/6.1.47

File hashes

Hashes for pl_compare-0.1.18-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6991e39ddbd17addc262469f3c0aeb3ac2c2eb8f87fe96c64b74383659faa9a6`
MD5	`15e230edf8b71bf3a7832bb09e34008a`
BLAKE2b-256	`d729c6f50b77b720de19e8532ca3eb39addb3cf9a946e63929381c1f5abf651b`

See more details on using hashes here.

pl_compare 0.1.18

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

pl_compare: Compare and find the differences between two Polars DataFrames.

Installation

Examples

To DO:

not sure of:

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes