A tool to find the differences between two tables.
pl_compare: Compare and find the differences between two Polars DataFrames.
- Get statistical summaries, examples, and/or a boolean to indicate:
  - Schema differences
  - Row differences
  - Value differences
- Works with Pandas DataFrames and other tabular data formats by converting them through Apache Arrow
- View differences as a text report
- Get differences as a Polars LazyFrame or DataFrame
- Use LazyFrames for larger-than-memory comparisons
- Specify the equality calculation that is used to determine value differences
Installation
pip install pl_compare
Examples
Return booleans to check for schema, row and value differences
import polars as pl
from pl_compare import compare
base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)
compare_result = compare(["ID"], base_df, compare_df)
print("is_schema_unequal:", compare_result.is_schema_unequal())
print("is_rows_unequal:", compare_result.is_rows_unequal())
print("is_values_unequal:", compare_result.is_values_unequal())
output:
is_schema_unequal: True
is_rows_unequal: True
is_values_unequal: True
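To see why the row and value checks come back True for these inputs, the following plain-Python sketch recomputes them by hand from the same data. It only illustrates what the two kinds of difference mean and is not how pl_compare is implemented; the variable names are hypothetical, and restricting the value check to `Example1` (the one non-ID column with the same type in both tables) is an assumption made for this example.

```python
# Same data as the example above, keyed by the ID column.
base_rows = {
    "123456": {"Example1": 1, "Example2": "1"},
    "1234567": {"Example1": 6, "Example2": "2"},
    "12345678": {"Example1": 3, "Example2": "3"},
}
compare_rows = {
    "123456": {"Example1": 1, "Example2": 1, "Example3": 1},
    "1234567": {"Example1": 2, "Example2": 2, "Example3": 2},
    "1234567810": {"Example1": 3, "Example2": 3, "Example3": 3},
}

# Row differences: IDs that are present on one side only.
ids_only_in_base = sorted(set(base_rows) - set(compare_rows))
ids_only_in_compare = sorted(set(compare_rows) - set(base_rows))

# Value differences: for IDs present on both sides, compare Example1,
# which is an integer column in both tables (assumption: only
# same-typed columns are compared in this sketch).
shared_ids = sorted(set(base_rows) & set(compare_rows))
value_diffs = [
    (row_id, base_rows[row_id]["Example1"], compare_rows[row_id]["Example1"])
    for row_id in shared_ids
    if base_rows[row_id]["Example1"] != compare_rows[row_id]["Example1"]
]

print("IDs only in base:", ids_only_in_base)
print("IDs only in compare:", ids_only_in_compare)
print("Example1 differences (ID, base, compare):", value_diffs)
```

Each table has one ID the other lacks, and the shared ID `1234567` carries `Example1` values of 6 and 2, so both the row check and the value check report a difference.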
Schema differences summary
import polars as pl
from pl_compare import compare
base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)
compare_result = compare(["ID"], base_df, compare_df)
print(compare_result.schema_differences_summary())
output:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 1     │
│ Columns in compare              ┆ 1     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema differences ┆ 1     │
└─────────────────────────────────┴───────┘
Schema differences details
import polars as pl
from pl_compare import compare
base_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "12345678"],
        "Example1": [1, 6, 3],
        "Example2": ["1", "2", "3"],
    }
)
compare_df = pl.DataFrame(
    {
        "ID": ["123456", "1234567", "1234567810"],
        "Example1": [1, 2, 3],
        "Example2": [1, 2, 3],
        "Example3": [1, 2, 3],
    },
)
compare_result = compare(["ID"], base_df, compare_df)
print(compare_result.schema_differences_sample())
output:
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ Utf8        ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
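The two rows in this table can be reproduced by hand from the column types of the example tables. The sketch below is a plain-Python illustration under stated assumptions, not pl_compare's implementation: `schema_differences` is a hypothetical helper, the schemas are written out as dictionaries of type names, and `None` stands in for the `null` shown when a column is missing from one side.

```python
# Column name -> data type name for the two example tables.
base_schema = {"ID": "Utf8", "Example1": "Int64", "Example2": "Utf8"}
compare_schema = {
    "ID": "Utf8",
    "Example1": "Int64",
    "Example2": "Int64",
    "Example3": "Int64",
}


def schema_differences(base, compare):
    """Return (column, base_format, compare_format) for each mismatched column."""
    rows = []
    # Walk base columns first, then columns that exist only in compare.
    for column in list(base) + [c for c in compare if c not in base]:
        base_format = base.get(column)        # None if missing from base
        compare_format = compare.get(column)  # None if missing from compare
        if base_format != compare_format:
            rows.append((column, base_format, compare_format))
    return rows


differences = schema_differences(base_schema, compare_schema)
for row in differences:
    print(row)
```

`Example2` appears because its type changes from string to integer, and `Example3` appears because it exists only in the compare table.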
- pandas comparison example
- custom equality function
- use of column aliases
To do:
- Linting (Ruff)
- Make into Python package
- Add Makefile for easy linting and tests
- Statistics should indicate which statistics are referencing columns
- Add all statistics frame to tests
- Add schema differences to schema summary
- Make row examples alternate between base only and compare only so that it is more readable.
- Add limit value to the examples.
- Update value differences summary so that Statistic is something that makes sense.
- Publish package to PyPI
- Add difference criterion.
- Add license
- Make package easy to use (i.e. so you only have to import pl_compare and then you can use pl_compare)
- Add table name labels that can replace 'base' and 'compare'.
- [] Write up docstrings
- [] Write up readme (with code examples)
- [] Raise error and print examples if duplicates are present.
- [] Add a count of the number of rows that have any differences to the value differences summary.
- [] Add total number of value differences to the value differences summary.
- [] Add parameter to hide column differences with 0 differences.
- [] Update report so that non differences are (optionally) not displayed.
- [] Change id_columns to be named 'join_on' and add a test that checks that arbitrary join conditions work.
- [] Update code to use a config dataclass that can be passed between the class and functions.
- [] Test for large amounts of data
- [] Benchmark for different sizes of data.
- [] Strict mypy type checking
- [] GitHub Actions for testing
- [] GitHub Actions for linting
- [] GitHub Actions for publishing