Skip to main content

A tool to find the differences between two tables.

Project description

pl_compare: Compare and find the differences between two Polars DataFrames.

Github - PyPi Page

You will find pl-compare useful if you find yourself writing various SQL/Dataframe operations to:

  • Understand how well two tables Reconcile example
  • Find the schema differences between two tables example
  • Find counts or examples of rows that exist in one table but not another example
  • Find counts or examples of value differences between two tables example
  • Assert that two tables are exactly equal (such as for an automated test) example
  • Assert that two tables have matching schemas, rows or column values example

Click for a jupyter notebook with example usage

With pl-compare you can:

  • Get statistical summaries and/or examples and/or a boolean to indicate:
    • Schema differences
    • Row differences
    • Value differences
  • Easily works for Pandas dataframes and other tabular data formats with conversion using Apache arrow
  • View differences as a text report
  • Get differences as a Polars LazyFrame or DataFrame
  • Use LazyFrames for larger than memory comparisons
  • Specify the equality calculation that is used to dermine value differences

Installation

pip install pl_compare

Examples (click to expand)

Return booleans to check for schema, row and value differences

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("is_schemas_equal:", compare_result.is_schemas_equal())
is_schemas_equal: False
>>> print("is_rows_equal:", compare_result.is_rows_equal())
is_rows_equal: False
>>> print("is_values_equal:", compare_result.is_values_equal())
is_values_equal: False
>>>

Schema differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("schemas_summary()")
schemas_summary()
>>> print(compare_result.schemas_summary())
shape: (6, 2)
┌─────────────────────────────────┬───────┐
 Statistic                        Count 
 ---                              ---   
 str                              i64   
╞═════════════════════════════════╪═══════╡
 Columns in base                  3     
 Columns in compare               4     
 Columns in base and compare      3     
 Columns only in base             0     
 Columns only in compare          1     
 Columns with schema difference...  1     
└─────────────────────────────────┴───────┘
>>> print("schemas_sample()")
schemas_sample()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
 column    base_format  compare_format 
 ---       ---          ---            
 str       str          str            
╞══════════╪═════════════╪════════════════╡
 Example2  String       Int64          
 Example3  null         Int64          
└──────────┴─────────────┴────────────────┘
>>>

Row differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("rows_summary()")
rows_summary()
>>> print(compare_result.rows_summary())
shape: (5, 2)
┌──────────────────────────┬───────┐
 Statistic                 Count 
 ---                       ---   
 str                       i64   
╞══════════════════════════╪═══════╡
 Rows in base              3     
 Rows in compare           3     
 Rows only in base         1     
 Rows only in compare      1     
 Rows in base and compare  2     
└──────────────────────────┴───────┘
>>> print("rows_sample()")
rows_sample()
>>> print(compare_result.rows_sample())
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
 ID          variable  value           
 ---         ---       ---             
 str         str       str             
╞════════════╪══════════╪═════════════════╡
 12345678    status    in base only    
 1234567810  status    in compare only 
└────────────┴──────────┴─────────────────┘
>>>

Value differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("values_summary()")
values_summary()
>>> print(compare_result.values_summary())
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
 Value Differences        Count  Percentage 
 ---                      ---    ---        
 str                      i64    f64        
╞═════════════════════════╪═══════╪════════════╡
 Total Value Differences  1      50.0       
 Example1                 1      50.0       
└─────────────────────────┴───────┴────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
 ID       variable  base  compare 
 ---      ---       ---   ---     
 str      str       i64   i64     
╞═════════╪══════════╪══════╪═════════╡
 1234567  Example1  6     2       
└─────────┴──────────┴──────┴─────────┘
>>>

Full report

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> compare_result.report()
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
<BLANKLINE>
SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
 Statistic                        Count 
 ---                              ---   
 str                              i64   
╞═════════════════════════════════╪═══════╡
 Columns in base                  3     
 Columns in compare               4     
 Columns in base and compare      3     
 Columns only in base             0     
 Columns only in compare          1     
 Columns with schema difference...  1     
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
 column    base_format  compare_format 
 ---       ---          ---            
 str       str          str            
╞══════════╪═════════════╪════════════════╡
 Example2  String       Int64          
 Example3  null         Int64          
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
 Statistic                 Count 
 ---                       ---   
 str                       i64   
╞══════════════════════════╪═══════╡
 Rows in base              3     
 Rows in compare           3     
 Rows only in base         1     
 Rows only in compare      1     
 Rows in base and compare  2     
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
 ID          variable  value           
 ---         ---       ---             
 str         str       str             
╞════════════╪══════════╪═════════════════╡
 12345678    status    in base only    
 1234567810  status    in compare only 
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
 Value Differences        Count  Percentage 
 ---                      ---    ---        
 str                      i64    f64        
╞═════════════════════════╪═══════╪════════════╡
 Total Value Differences  1      50.0       
 Example1                 1      50.0       
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
 ID       variable  base  compare 
 ---      ---       ---   ---     
 str      str       i64   i64     
╞═════════╪══════════╪══════╪═════════╡
 1234567  Example1  6     2       
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

Compare two pandas dataframes

>>> import polars as pl
>>> import pandas as pd # doctest: +SKIP
>>> from pl_compare import compare
>>>
>>> base_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )# doctest: +SKIP
>>> compare_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )# doctest: +SKIP
>>>
>>> compare_result = compare(["ID"], pl.from_pandas(base_df), pl.from_pandas(compare_df))# doctest: +SKIP
>>> compare_result.report()# doctest: +SKIP
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------

SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
 Statistic                        Count 
 ---                              ---   
 str                              i64   
╞═════════════════════════════════╪═══════╡
 Columns in base                  3     
 Columns in compare               4     
 Columns in base and compare      3     
 Columns only in base             0     
 Columns only in compare          1     
 Columns with schema differences  1     
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
 column    base_format  compare_format 
 ---       ---          ---            
 str       str          str            
╞══════════╪═════════════╪════════════════╡
 Example2  String       Int64          
 Example3  null         Int64          
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------

ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
 Statistic                 Count 
 ---                       ---   
 str                       i64   
╞══════════════════════════╪═══════╡
 Rows in base              3     
 Rows in compare           3     
 Rows only in base         1     
 Rows only in compare      1     
 Rows in base and compare  2     
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
 ID          variable  value           
 ---         ---       ---             
 str         str       str             
╞════════════╪══════════╪═════════════════╡
 12345678    status    in base only    
 1234567810  status    in compare only 
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------

VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
 Value Differences        Count  Percentage 
 ---                      ---    ---        
 str                      i64    f64        
╞═════════════════════════╪═══════╪════════════╡
 Total Value Differences  1      50.0       
 Example1                 1      50.0       
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
 ID       variable  base  compare 
 ---      ---       ---   ---     
 str      str       i64   i64     
╞═════════╪══════════╪══════╪═════════╡
 1234567  Example1  6     2       
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

Specify a threshold to control the granularity of the comparison for numeric columns.

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1.111, 6.11, 3.11],
...     }
... )
>>>
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1.114, 6.14, 3.12],
...     },
... )
>>>
>>> print("With equality_resolution of 0.01")
With equality_resolution of 0.01
>>> compare_result = compare(["ID"], base_df, compare_df, resolution=0.01)
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
 ID       variable  base  compare 
 ---      ---       ---   ---     
 str      str       f64   f64     
╞═════════╪══════════╪══════╪═════════╡
 1234567  Example1  6.11  6.14    
└─────────┴──────────┴──────┴─────────┘
>>> print("With no equality_resolution")
With no equality_resolution
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print(compare_result.values_sample())
shape: (2, 4)
┌─────────┬──────────┬───────┬─────────┐
 ID       variable  base   compare 
 ---      ---       ---    ---     
 str      str       f64    f64     
╞═════════╪══════════╪═══════╪═════════╡
 123456   Example1  1.111  1.114   
 1234567  Example1  6.11   6.14    
└─────────┴──────────┴───────┴─────────┘
>>>

Example using alias for base and compare dataframes.

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"],
...                          base_df,
...                          compare_df,
...                          base_alias="before_change",
...                          compare_alias="after_change")
>>>
>>> print("values_summary()")
values_summary()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬──────────────────────┬─────────────────────┐
 column    before_change_format  after_change_format 
 ---       ---                   ---                 
 str       str                   str                 
╞══════════╪══════════════════════╪═════════════════════╡
 Example2  String                Int64               
 Example3  null                  Int64               
└──────────┴──────────────────────┴─────────────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬───────────────┬──────────────┐
 ID       variable  before_change  after_change 
 ---      ---       ---            ---          
 str      str       i64            i64          
╞═════════╪══════════╪═══════════════╪══════════════╡
 1234567  Example1  6              2            
└─────────┴──────────┴───────────────┴──────────────┘
>>>

Assert two frames are equal for a test

>>> import polars as pl
>>> import pytest
>>> from pl_compare.compare import compare
>>>
>>> def test_example():
...     base_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 3],
...             "Example2": [1, 2, 3],
...         }
...     )
...     compare_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 9],
...             "Example2": [1, 2, 3],
...         }
...     )
...     comparison = compare(["ID"], base_df, compare_df)
...     if not comparison.is_equal():
...         raise Exception(comparison.report())
...
>>> test_example() # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 18, in test_example
Exception: --------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
No Schema differences found.
--------------------------------------------------------------------------------
No Row differences found (when joining by the supplied id_columns).
--------------------------------------------------------------------------------

VALUE DIFFERENCES:
shape: (3, 3)
┌─────────────────────────┬───────┬────────────┐
 Value Differences        Count  Percentage 
 ---                      ---    ---        
 str                      i64    f64        
╞═════════════════════════╪═══════╪════════════╡
 Total Value Differences  1      16.666667  
 Example1                 1      33.333333  
 Example2                 0      0.0        
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌──────────┬──────────┬──────┬─────────┐
 ID        variable  base  compare 
 ---       ---       ---   ---     
 str       str       i64   i64     
╞══════════╪══════════╪══════╪═════════╡
 12345678  Example1  3     9       
└──────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

To DO:

  • Linting (Ruff)
  • Make into python package
  • Add makefile for easy linting and tests
  • Statistics should indicate which statistics are referencing columns
  • Add all statistics frame to tests
  • Add schema differences to schema summary
  • Make row examples alternate between base only and compare only so that it is more readable.
  • Add limit value to the examples.
  • Updated value differences summary so that Statistic is something that makes sense.
  • Publish package to pypi
  • Add difference criterion.
  • Add license
  • Make package easy to use (i.e. so you only have to import pl_compare and then you can us pl_compare)
  • Add table name labels that can replace 'base' and 'compare'.
  • Update code to use a config dataclass that can be passed between the class and functions.
  • Write up docstrings
  • Write up readme (with code examples)
  • Add parameter to hide column differences with 0 differences.
  • Add flag to indicate if there are differences between the tables.
  • Update report so that non differences are not displayed.
  • Seperate out dev dependencies from library dependencies?
  • Change 'threshold' to be equality resolution.
  • strict MyPy type checking
  • Raise error and print examples if duplicates are present.
  • Add total number of value differences to the value differences summary.
  • Add percentage column so the value differences summary.
  • Change id_columns to be named 'join_columns'
  • Github actions for publishing
  • Update the duplication validation.
  • Fix report output when tables are exactly equal.
  • Github actions for testing
  • Github actions for linting
  • Add message when there are no columns left to be compared.
  • Add message when df's are exactly equal.
  • Add test case with exactly equal dfs.
  • Add test case with no columns being compared. Make sure an error is raised when the value_summary and value_sample methods are called.
  • [] Test for large amounts of data
  • [] Benchmark for different sizes of data.
  • [] Investigate use for very large datasets 50GB-100GB. Can this be done using LazyFrames only?
  • [] There still seems to be a bug when converting from lazy to data frame using streaming (i.e. in the convert_to_dataframe function)

Ideas:

  • [] Simplify custom equality checks and add example.
  • [] Add a count of the number of rows that have any differences to the value differences summary.
  • [] add a test that checks that abritrary join conditions work.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pl_compare-0.5.1.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

pl_compare-0.5.1-py3-none-any.whl (15.1 kB view details)

Uploaded Python 3

File details

Details for the file pl_compare-0.5.1.tar.gz.

File metadata

  • Download URL: pl_compare-0.5.1.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for pl_compare-0.5.1.tar.gz
Algorithm Hash digest
SHA256 a064e7d049e9b50210171a38dbb3796251a7eca9a451113dfb94a271f755c526
MD5 658096713c36616af594d544a8eef9b0
BLAKE2b-256 5612f819e49ad6b11d60070358fb6385bd72f5e409fad5f16a1812e9ffb11ffd

See more details on using hashes here.

File details

Details for the file pl_compare-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: pl_compare-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 15.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for pl_compare-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 262ce810ec87e70974a38988ddcd536abc7563fb2ac60e5bb110b7386862e1b7
MD5 774b4d87f7fd3e32a064b0f7e2e90cf0
BLAKE2b-256 ce1a1a58267345bb72c7b68d79f66ade1797f307dcfff5b05b34616d2dbcde42

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page