DataFingerprint
DataFingerprint is a Python package designed to compare two datasets and generate a detailed report highlighting the differences between them. This tool is particularly useful for data validation, quality assurance, and ensuring data consistency across different sources.
Features
- Column Name Differences: Identify columns that are present in one dataset but missing in the other.
- Column Data Type Differences: Detect discrepancies in data types between corresponding columns in the two datasets.
- Row Differences: Find rows that are present in one dataset but missing in the other, or rows that have different values in corresponding columns.
- Paired Row Differences: Compare rows that have the same primary key or unique identifier in both datasets and identify differences in their values.
- Data Report: Generate a comprehensive report summarizing all the differences found between the two datasets.
| Function | Purpose | Result |
|---|---|---|
| `data_fingerprint.src.comparator.get_data_report` | Get the data report object that holds all the information about the differences | `data_fingerprint.src.models.DataReport` |
| `data_fingerprint.src.utils.get_dataframe` | Get a `polars.DataFrame` of the rows that are different (with an added source column) | `polars.DataFrame` |
| `data_fingerprint.src.utils.get_number_of_row_differences` | Get the number of different rows | `int` |
| `data_fingerprint.src.utils.get_number_of_differences_per_source` | Get the number of row differences per source | `dict[str, int]` |
| `data_fingerprint.src.utils.get_ratio_of_differences_per_source` | Get the ratio of row differences per source | `dict[str, float]` |
| `data_fingerprint.src.utils.get_column_difference_ratio` | [When grouping is used] Get the distribution of differences per column | `dict[str, float]` |
Installation
To install DataFingerprint, you can use pip:
pip install data-fingerprint
Examples
Here's a basic example of how to use DataFingerprint to compare two datasets:
import polars as pl
from data_fingerprint.src.utils import get_dataframe
from data_fingerprint.src.comparator import get_data_report
from data_fingerprint.src.models import DataReport
# Create two sample datasets
df1 = pl.DataFrame(
{"id": [1, 2, 3, 4], "name": ["Alice", "Bob", "Charlie", "George"], "age": [25, 30, 35, 26], "height": [170, 180, 175, 160], "weight": [60, 70, 75, 65]}
)
df2 = pl.DataFrame(
{"id": [1, 2, 3], "name": ["Alice", "Bob", "David"], "age": [25, 30, 35], "weight": ["60", "70", "75"], "married": [True, False, True]}
)
# Generate a data report comparing the two datasets
report: DataReport = get_data_report(df1, df2, "df_0", "df_1", grouping_columns=["id"])
print(report.model_dump_json(indent=4))
Output:
{
"df0_length": 4,
"df1_length": 3,
"df0_name": "df_0",
"df1_name": "df_1",
"comparable_columns": [
"name",
"id",
"age"
],
"column_differences": [
{
"source": "df_0",
"column_name": "married",
"difference_type": "MISSING",
"more_information": null
},
{
"source": "df_0",
"column_name": "height",
"difference_type": "EXTRA",
"more_information": null
},
{
"source": "df_0",
"column_name": "weight",
"difference_type": "DIFFERENT_TYPE",
"more_information": {
"df_0": "Int64",
"df_1": "String"
}
}
],
"row_differences": [
{
"source": "df_0",
"row": {
"age": [
26
],
"id": [
4
],
"name": [
"George"
]
},
"number_of_occurrences": 1,
"difference_type": "MISSING_ROW",
"more_information": null
},
{
"sources": [
"df_0",
"df_1"
],
"row": {
"age": [
35,
35
],
"id": [
3,
3
],
"name": [
"Charlie",
"David"
]
},
"number_of_occurrences": 2,
"grouping_columns": [
"id"
],
"column_differences": [
"name"
],
"consise_information": {
"id": [
3,
3
],
"name": [
"Charlie",
"David"
],
"source": [
"df_0",
"df_1"
]
},
"row_with_source": {
"age": [
35,
35
],
"id": [
3,
3
],
"name": [
"Charlie",
"David"
],
"source": [
"df_0",
"df_1"
]
}
}
]
}
As you can see, the DataReport contains some basic information:
- `df0_length` and `df1_length`, the lengths of the dataframes
- `df0_name` and `df1_name`, the names of the dataframes
- `comparable_columns`, the columns that are comparable between the two dataframes (in the example, `weight` is left out because its data type differs)
- `column_differences`, the columns that differ between the two dataframes and the type of each difference
- `row_differences`, the rows that differ between the two dataframes and the type of each difference

If you look closely at `column_differences`, you will see that the first dataframe is always referenced as the source. You can also see that there are different types of differences, each with more detailed information where it applies.
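The three column difference types in the example output can be illustrated with plain dictionaries. This is a minimal sketch, not the package's implementation: `classify_column_differences` and the schema dictionaries are hypothetical names, and the type names are simply strings copied from the example report.

```python
def classify_column_differences(schema0: dict[str, str], schema1: dict[str, str]) -> list[dict]:
    """Describe column differences from the point of view of the first schema."""
    differences = []
    for column in sorted(set(schema0) | set(schema1)):
        if column not in schema0:
            # The first dataframe lacks a column that the second one has.
            kind, info = "MISSING", None
        elif column not in schema1:
            # The first dataframe has a column that the second one lacks.
            kind, info = "EXTRA", None
        elif schema0[column] != schema1[column]:
            kind, info = "DIFFERENT_TYPE", {"df_0": schema0[column], "df_1": schema1[column]}
        else:
            continue  # same name and same type: a comparable column
        differences.append(
            {"source": "df_0", "column_name": column, "difference_type": kind, "more_information": info}
        )
    return differences


# Schemas of the two sample datasets (hypothetical stand-ins for the real dataframes).
schema_df0 = {"id": "Int64", "name": "String", "age": "Int64", "height": "Int64", "weight": "Int64"}
schema_df1 = {"id": "Int64", "name": "String", "age": "Int64", "weight": "String", "married": "Boolean"}

column_differences = classify_column_differences(schema_df0, schema_df1)
for difference in column_differences:
    print(difference["column_name"], difference["difference_type"])
```

Every entry is phrased relative to `df_0`, which mirrors how the report always uses the first dataframe as the source.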
If you look at `row_differences`, you will see that there are also multiple types of differences. We generally have two types:
- `RowDifference`, a difference between rows that could not be grouped, or that is reported when no grouping was used
- `RowGroupDifference`, a difference between two groups of rows that were grouped by the `grouping_columns`

A `RowDifference` carries the following information:
- `source`, the source of the row
- `row`, the original row that is different (keep in mind that it contains only the comparable columns, see `comparable_columns`)
- `number_of_occurrences`, the number of times this row is present in the source
- `difference_type`, the type of difference (`MISSING_ROW` means that the row is missing in the other dataframe)
- `more_information`, more information about the difference (usually `None`)

A `RowGroupDifference` carries the following information:
- `sources`, the sources of the grouped rows (grouped by `grouping_columns`)
- `row`, the rows present in that group (keep in mind that it contains only the comparable columns, see `comparable_columns`)
- `number_of_occurrences`, the total number of times this difference is present across all sources
- `grouping_columns`, the columns used to group the rows
- `column_differences`, the columns that are different
- `consise_information`, a dictionary with more information about the differences, containing the grouping columns, the source of each row and the `column_differences` values from the original data (`row`)
- `row_with_source`, which is the same as `row` but with the source added
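To make the distinction between the two kinds of row differences concrete, here is a minimal sketch in plain Python. It is not how the package works internally: `group_row_differences` is a hypothetical helper, rows are plain dictionaries restricted to the comparable columns, and the sketch assumes that each key appears at most once per source.

```python
from collections import defaultdict


def group_row_differences(rows0, rows1, grouping_columns):
    """Pair rows from two sources by key and report which columns differ."""
    groups = defaultdict(list)
    for source, rows in (("df_0", rows0), ("df_1", rows1)):
        for row in rows:
            key = tuple(row[column] for column in grouping_columns)
            groups[key].append((source, row))

    result = []
    for members in groups.values():
        sources = [source for source, _ in members]
        if len(set(sources)) == 1:
            # The key exists in one source only, so the row cannot be grouped.
            for source, row in members:
                result.append({"source": source, "row": row, "difference_type": "MISSING_ROW"})
            continue
        first_row = members[0][1]
        differing = [
            column for column in first_row
            if any(row[column] != first_row[column] for _, row in members)
        ]
        if differing:
            result.append(
                {
                    "sources": sources,
                    "grouping_columns": list(grouping_columns),
                    "column_differences": differing,
                }
            )
    return result


# Comparable columns of the two sample datasets.
rows_df0 = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "Charlie", "age": 35},
    {"id": 4, "name": "George", "age": 26},
]
rows_df1 = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "David", "age": 35},
]

differences = group_row_differences(rows_df0, rows_df1, ["id"])
for difference in differences:
    print(difference)
```

With `id` as the grouping column, the rows for `id` 3 form a group that differs only in `name`, while the row for `id` 4 has no partner in the second dataset and is reported on its own.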
Now that we have the differences, we can get them in tabular form (the `get_dataframe` helper listed in the function table returns a `polars.DataFrame` with an added `source` column):
shape: (3, 4)
┌─────┬─────┬─────────┬────────┐
│ age ┆ id ┆ name ┆ source │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str │
╞═════╪═════╪═════════╪════════╡
│ 26 ┆ 4 ┆ George ┆ df_0 │
│ 35 ┆ 3 ┆ Charlie ┆ df_0 │
│ 35 ┆ 3 ┆ David ┆ df_1 │
└─────┴─────┴─────────┴────────┘
We also have an option to gather more information about the differences:
- Get the number of row differences:
from data_fingerprint.src.utils import get_number_of_row_differences
print(get_number_of_row_differences(report))
Output:
3
- Get the number of differences per source:
from data_fingerprint.src.utils import get_number_of_differences_per_source
print(get_number_of_differences_per_source(report))
Output:
{'df_0': 2, 'df_1': 1}
- Get the ratio of differences per source:
from data_fingerprint.src.utils import get_ratio_of_differences_per_source
print(get_ratio_of_differences_per_source(report))
Output:
{'df_0': 0.6666666666666666, 'df_1': 0.3333333333333333}
- Get column difference ratio:
from data_fingerprint.src.utils import get_column_difference_ratio
print(get_column_difference_ratio(report))
Output:
{'age': 0.25, 'name': 0.5, 'id': 0.25}
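The per-source counts and ratios shown above are consistent with a simple relationship: each ratio is the count for that source divided by the total number of row differences. The sketch below reproduces the example numbers under that assumption. `ratio_per_source` is a hypothetical helper written for illustration, not a function of the package.

```python
def ratio_per_source(counts: dict[str, int]) -> dict[str, float]:
    """Turn per-source difference counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        # No differences at all: avoid dividing by zero.
        return {source: 0.0 for source in counts}
    return {source: count / total for source, count in counts.items()}


# Counts taken from the get_number_of_differences_per_source output above.
counts = {"df_0": 2, "df_1": 1}
ratios = ratio_per_source(counts)

print(sum(counts.values()))  # 3
print(ratios)  # {'df_0': 0.6666666666666666, 'df_1': 0.3333333333333333}
```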
License
This project is licensed under the GPLv3 License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
Contact
For any questions or feedback, please contact [your email].