Skip to main content

A tool to find the differences between two tables.

Project description

pl_compare: Compare and find the differences between two Polars DataFrames.

Github - PyPi Page

You will find pl-compare useful if you find yourself writing various SQL/Dataframe operations to:

  • Understand how well two tables Reconcile example
  • Find the schema differences between two tables example
  • Find counts or examples of rows that exist in one table but not another example
  • Find counts or examples of value differences between two tables example
  • Assert that two tables are exactly equal (such as for an automated test) example
  • Assert that two tables have matching schemas, rows or column values example

Click for a jupyter notebook with example usage

With pl-compare you can:

  • Get statistical summaries and/or examples and/or a boolean to indicate:
    • Schema differences
    • Row differences
    • Value differences
  • Easily works for Pandas dataframes and other tabular data formats with conversion using Apache arrow
  • View differences as a text report
  • Get differences as a Polars LazyFrame or DataFrame
  • Use LazyFrames for larger than memory comparisons
  • Specify the equality calculation that is used to dermine value differences

Installation

pip install pl_compare

Examples (click to expand)

Return booleans to check for schema, row and value differences

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("is_schemas_equal:", compare_result.is_schemas_equal())
is_schemas_equal: False
>>> print("is_rows_equal:", compare_result.is_rows_equal())
is_rows_equal: False
>>> print("is_values_equal:", compare_result.is_values_equal())
is_values_equal: False
>>>

Schema differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("schemas_summary()")
schemas_summary()
>>> print(compare_result.schemas_summary())
shape: (6, 2)
┌─────────────────────────────────┬───────┐
 Statistic                        Count 
 ---                              ---   
 str                              i64   
╞═════════════════════════════════╪═══════╡
 Columns in base                  3     
 Columns in compare               4     
 Columns in base and compare      3     
 Columns only in base             0     
 Columns only in compare          1     
 Columns with schema differences  1     
└─────────────────────────────────┴───────┘
>>> print("schemas_sample()")
schemas_sample()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
 column    base_format  compare_format 
 ---       ---          ---            
 str       str          str            
╞══════════╪═════════════╪════════════════╡
 Example2  String       Int64          
 Example3  null         Int64          
└──────────┴─────────────┴────────────────┘
>>>

Row differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("rows_summary()")
rows_summary()
>>> print(compare_result.rows_summary())
shape: (5, 2)
┌──────────────────────────┬───────┐
 Statistic                 Count 
 ---                       ---   
 str                       i64   
╞══════════════════════════╪═══════╡
 Rows in base              3     
 Rows in compare           3     
 Rows only in base         1     
 Rows only in compare      1     
 Rows in base and compare  2     
└──────────────────────────┴───────┘
>>> print("rows_sample()")
rows_sample()
>>> print(compare_result.rows_sample())
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
 ID          variable  value           
 ---         ---       ---             
 str         str       str             
╞════════════╪══════════╪═════════════════╡
 12345678    status    in base only    
 1234567810  status    in compare only 
└────────────┴──────────┴─────────────────┘
>>>

Value differences summary and details

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("values_summary()")
values_summary()
>>> print(compare_result.values_summary())
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
 Value Differences        Count  Percentage 
 ---                      ---    ---        
 str                      i64    f64        
╞═════════════════════════╪═══════╪════════════╡
 Total Value Differences  1      50.0       
 Example1                 1      50.0       
└─────────────────────────┴───────┴────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
 ID       variable  base  compare 
 ---      ---       ---   ---     
 str      str       i64   i64     
╞═════════╪══════════╪══════╪═════════╡
 1234567  Example1  6     2       
└─────────┴──────────┴──────┴─────────┘
>>>

Full report

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> compare_result.report()
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
<BLANKLINE>
SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
 Statistic                        Count 
 ---                              ---   
 str                              i64   
╞═════════════════════════════════╪═══════╡
 Columns in base                  3     
 Columns in compare               4     
 Columns in base and compare      3     
 Columns only in base             0     
 Columns only in compare          1     
 Columns with schema differences  1     
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
 column    base_format  compare_format 
 ---       ---          ---            
 str       str          str            
╞══════════╪═════════════╪════════════════╡
 Example2  String       Int64          
 Example3  null         Int64          
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
 Statistic                 Count 
 ---                       ---   
 str                       i64   
╞══════════════════════════╪═══════╡
 Rows in base              3     
 Rows in compare           3     
 Rows only in base         1     
 Rows only in compare      1     
 Rows in base and compare  2     
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
 ID          variable  value           
 ---         ---       ---             
 str         str       str             
╞════════════╪══════════╪═════════════════╡
 12345678    status    in base only    
 1234567810  status    in compare only 
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
 Value Differences        Count  Percentage 
 ---                      ---    ---        
 str                      i64    f64        
╞═════════════════════════╪═══════╪════════════╡
 Total Value Differences  1      50.0       
 Example1                 1      50.0       
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
 ID       variable  base  compare 
 ---      ---       ---   ---     
 str      str       i64   i64     
╞═════════╪══════════╪══════╪═════════╡
 1234567  Example1  6     2       
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

Compare two pandas dataframes

>>> import polars as pl
>>> import pandas as pd # doctest: +SKIP
>>> from pl_compare import compare
>>>
>>> base_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )# doctest: +SKIP
>>> compare_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )# doctest: +SKIP
>>>
>>> compare_result = compare(["ID"], pl.from_pandas(base_df), pl.from_pandas(compare_df))# doctest: +SKIP
>>> compare_result.report()# doctest: +SKIP
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------

SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
 Statistic                        Count 
 ---                              ---   
 str                              i64   
╞═════════════════════════════════╪═══════╡
 Columns in base                  3     
 Columns in compare               4     
 Columns in base and compare      3     
 Columns only in base             0     
 Columns only in compare          1     
 Columns with schema differences  1     
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
 column    base_format  compare_format 
 ---       ---          ---            
 str       str          str            
╞══════════╪═════════════╪════════════════╡
 Example2  String       Int64          
 Example3  null         Int64          
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------

ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
 Statistic                 Count 
 ---                       ---   
 str                       i64   
╞══════════════════════════╪═══════╡
 Rows in base              3     
 Rows in compare           3     
 Rows only in base         1     
 Rows only in compare      1     
 Rows in base and compare  2     
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
 ID          variable  value           
 ---         ---       ---             
 str         str       str             
╞════════════╪══════════╪═════════════════╡
 12345678    status    in base only    
 1234567810  status    in compare only 
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------

VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
 Value Differences        Count  Percentage 
 ---                      ---    ---        
 str                      i64    f64        
╞═════════════════════════╪═══════╪════════════╡
 Total Value Differences  1      50.0       
 Example1                 1      50.0       
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
 ID       variable  base  compare 
 ---      ---       ---   ---     
 str      str       i64   i64     
╞═════════╪══════════╪══════╪═════════╡
 1234567  Example1  6     2       
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

Specify a threshold to control the granularity of the comparison for numeric columns.

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1.111, 6.11, 3.11],
...     }
... )
>>>
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1.114, 6.14, 3.12],
...     },
... )
>>>
>>> print("With equality_resolution of 0.01")
With equality_resolution of 0.01
>>> compare_result = compare(["ID"], base_df, compare_df, resolution=0.01)
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
 ID       variable  base  compare 
 ---      ---       ---   ---     
 str      str       f64   f64     
╞═════════╪══════════╪══════╪═════════╡
 1234567  Example1  6.11  6.14    
└─────────┴──────────┴──────┴─────────┘
>>> print("With no equality_resolution")
With no equality_resolution
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print(compare_result.values_sample())
shape: (2, 4)
┌─────────┬──────────┬───────┬─────────┐
 ID       variable  base   compare 
 ---      ---       ---    ---     
 str      str       f64    f64     
╞═════════╪══════════╪═══════╪═════════╡
 123456   Example1  1.111  1.114   
 1234567  Example1  6.11   6.14    
└─────────┴──────────┴───────┴─────────┘
>>>

Example using alias for base and compare dataframes.

>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"],
...                          base_df,
...                          compare_df,
...                          base_alias="before_change",
...                          compare_alias="after_change")
>>>
>>> print("values_summary()")
values_summary()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬──────────────────────┬─────────────────────┐
 column    before_change_format  after_change_format 
 ---       ---                   ---                 
 str       str                   str                 
╞══════════╪══════════════════════╪═════════════════════╡
 Example2  String                Int64               
 Example3  null                  Int64               
└──────────┴──────────────────────┴─────────────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬───────────────┬──────────────┐
 ID       variable  before_change  after_change 
 ---      ---       ---            ---          
 str      str       i64            i64          
╞═════════╪══════════╪═══════════════╪══════════════╡
 1234567  Example1  6              2            
└─────────┴──────────┴───────────────┴──────────────┘
>>>

Assert two frames are equal for a test

>>> import polars as pl
>>> import pytest
>>> from pl_compare.compare import compare
>>>
>>> def test_example():
...     base_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 3],
...             "Example2": [1, 2, 3],
...         }
...     )
...     compare_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 9],
...             "Example2": [1, 2, 3],
...         }
...     )
...     comparison = compare(["ID"], base_df, compare_df)
...     if not comparison.is_equal():
...         raise Exception(comparison.report())
...
>>> test_example() # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 18, in test_example
Exception: --------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
No Schema differences found.
--------------------------------------------------------------------------------
No Row differences found (when joining by the supplied id_columns).
--------------------------------------------------------------------------------

VALUE DIFFERENCES:
shape: (3, 3)
┌─────────────────────────┬───────┬────────────┐
 Value Differences        Count  Percentage 
 ---                      ---    ---        
 str                      i64    f64        
╞═════════════════════════╪═══════╪════════════╡
 Total Value Differences  1      16.666667  
 Example1                 1      33.333333  
 Example2                 0      0.0        
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌──────────┬──────────┬──────┬─────────┐
 ID        variable  base  compare 
 ---       ---       ---   ---     
 str       str       i64   i64     
╞══════════╪══════════╪══════╪═════════╡
 12345678  Example1  3     9       
└──────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>

To DO:

  • Linting (Ruff)
  • Make into python package
  • Add makefile for easy linting and tests
  • Statistics should indicate which statistics are referencing columns
  • Add all statistics frame to tests
  • Add schema differences to schema summary
  • Make row examples alternate between base only and compare only so that it is more readable.
  • Add limit value to the examples.
  • Updated value differences summary so that Statistic is something that makes sense.
  • Publish package to pypi
  • Add difference criterion.
  • Add license
  • Make package easy to use (i.e. so you only have to import pl_compare and then you can us pl_compare)
  • Add table name labels that can replace 'base' and 'compare'.
  • Update code to use a config dataclass that can be passed between the class and functions.
  • Write up docstrings
  • Write up readme (with code examples)
  • Add parameter to hide column differences with 0 differences.
  • Add flag to indicate if there are differences between the tables.
  • Update report so that non differences are not displayed.
  • Seperate out dev dependencies from library dependencies?
  • Change 'threshold' to be equality resolution.
  • strict MyPy type checking
  • Raise error and print examples if duplicates are present.
  • Add total number of value differences to the value differences summary.
  • Add percentage column so the value differences summary.
  • Change id_columns to be named 'join_columns'
  • Github actions for publishing
  • Update the duplication validation.
  • Fix report output when tables are exactly equal.
  • Github actions for testing
  • Github actions for linting
  • [] Test for large amounts of data
  • [] Benchmark for different sizes of data.
  • [] Investigate use for very large datasets 50GB-100GB. Can this be done using LazyFrames only?

Ideas:

  • [] Simplify custom equality checks and add example.
  • [] Add a count of the number of rows that have any differences to the value differences summary.
  • [] add a test that checks that abritrary join conditions work.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pl_compare-0.4.2.tar.gz (15.5 kB view hashes)

Uploaded Source

Built Distribution

pl_compare-0.4.2-py3-none-any.whl (14.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page