DataFingerprint

DataFingerprint is a Python package designed to compare two datasets and generate a detailed report highlighting the differences between them. This tool is particularly useful for data validation, quality assurance, and ensuring data consistency across different sources.

Features

  • Column Name Differences: Identify columns that are present in one dataset but missing in the other.
  • Column Data Type Differences: Detect discrepancies in data types between corresponding columns in the two datasets.
  • Row Differences: Find rows that are present in one dataset but missing in the other, or rows that have different values in corresponding columns.
  • Paired Row Differences: Compare rows that have the same primary key or unique identifier in both datasets and identify differences in their values.
  • Data Report: Generate a comprehensive report summarizing all the differences found between the two datasets.
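
The difference categories listed above can be illustrated with a minimal, library-free sketch. This is not DataFingerprint's implementation: the helper names (`column_name_differences`, `column_type_differences`, `row_differences`) and the schema-as-dict representation are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical helpers, not part of DataFingerprint; they only illustrate the idea.

def column_name_differences(schema_a, schema_b):
    """Columns present in one schema but missing in the other."""
    return {
        "only_in_first": sorted(set(schema_a) - set(schema_b)),
        "only_in_second": sorted(set(schema_b) - set(schema_a)),
    }

def column_type_differences(schema_a, schema_b):
    """Shared columns whose declared data types disagree."""
    shared = schema_a.keys() & schema_b.keys()
    return {
        name: (schema_a[name], schema_b[name])
        for name in sorted(shared)
        if schema_a[name] != schema_b[name]
    }

def row_differences(rows_a, rows_b):
    """Rows found in only one dataset (multiset difference, so duplicates count)."""
    count_a, count_b = Counter(rows_a), Counter(rows_b)
    only_a = list((count_a - count_b).elements())
    only_b = list((count_b - count_a).elements())
    return only_a, only_b

# Schemas map column name -> type name; rows are tuples of values.
schema_a = {"id": "Int64", "name": "String", "age": "Int64"}
schema_b = {"id": "Int64", "name": "String", "age": "Float64", "city": "String"}
rows_a = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35)]
rows_b = [(1, "Alice", 25), (2, "Bob", 30), (4, "David", 40)]

print(column_name_differences(schema_a, schema_b))
print(column_type_differences(schema_a, schema_b))
print(row_differences(rows_a, rows_b))
```

A multiset difference is used for rows so that a row duplicated in one dataset but not the other still shows up as a difference; whether the package treats duplicates this way is not stated here.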
| Function | Purpose | Result |
| --- | --- | --- |
| data_fingerprint.src.comparator.get_data_report | Get data report object that has all the information about the differences | data_fingerprint.src.models.DataReport |
| data_fingerprint.src.utils.get_dataframe | Get polars.DataFrame of rows that are different (added source column) | polars.DataFrame |
| data_fingerprint.src.utils.get_number_of_row_differences | Get the number of different rows | int |
| data_fingerprint.src.utils.get_number_of_differences_per_source | Get the number of row differences per source | dict[str, int] |
| data_fingerprint.src.utils.get_ratio_of_differences_per_source | Get the ratio of row differences per source | dict[str, float] |
| data_fingerprint.src.utils.get_column_difference_ratio | [When grouping is used] Get the distribution of differences per column | dict[str, float] |
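
To make the idea of paired rows and a per-column distribution of differences concrete, here is a small sketch in plain Python. It is one plausible reading, not the package's code: the function names are hypothetical, and normalizing by the total number of differing cells is an assumption, so the package's actual definitions may differ.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not DataFingerprint's API.

def paired_differences(rows_a, rows_b, key):
    """Pair rows that share a key value and list the columns whose values differ."""
    by_key_a = {row[key]: row for row in rows_a}
    by_key_b = {row[key]: row for row in rows_b}
    diffs = {}
    for key_value in by_key_a.keys() & by_key_b.keys():
        a, b = by_key_a[key_value], by_key_b[key_value]
        changed = [col for col in a if col != key and col in b and a[col] != b[col]]
        if changed:
            diffs[key_value] = changed
    return diffs

def column_difference_distribution(diffs):
    """Share of all differing cells that falls on each column (assumed definition)."""
    counts = Counter(col for cols in diffs.values() for col in cols)
    total = sum(counts.values())
    return {col: n / total for col, n in counts.items()} if total else {}

rows_a = [
    {"id": 1, "name": "Alice", "age": 25},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": 3, "name": "Charlie", "age": 35},
]
rows_b = [
    {"id": 1, "name": "Alice", "age": 26},
    {"id": 2, "name": "Robert", "age": 31},
    {"id": 4, "name": "David", "age": 40},
]

diffs = paired_differences(rows_a, rows_b, key="id")
print(diffs)
print(column_difference_distribution(diffs))
```

Rows with ids 3 and 4 have no partner in the other dataset, so in this sketch they are left out of the paired comparison and would be reported as plain row differences instead.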

Installation

To install DataFingerprint, you can use pip:

pip install data-fingerprint
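
After installing, one way to confirm that the distribution is visible to the current interpreter is to query its metadata with the standard library. The helper name `installed_version` is an assumption introduced for this sketch; `importlib.metadata.version` is the standard-library call doing the work.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(distribution_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None

# Prints a version string if the install succeeded, otherwise None.
print(installed_version("data-fingerprint"))
```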

Usage

Here's a basic example of how to use DataFingerprint to compare two datasets:

import polars as pl

from data_fingerprint.src.utils import get_dataframe
from data_fingerprint.src.comparator import get_data_report
from data_fingerprint.src.models import DataReport

# Create two sample datasets
df1 = pl.DataFrame(
    {"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]}
)
df2 = pl.DataFrame(
    {"id": [1, 2, 4], "name": ["Alice", "Bob", "David"], "age": [25, 30, 40]}
)
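
# Generate a data report comparing the two datasets, grouping rows on "id"
report: DataReport = get_data_report(df1, df2, "df_0", "df_1", grouping_columns=["id"])

# Print the full report as indented JSON
print(report.model_dump_json(indent=4))

# Print the differing rows as a polars.DataFrame
print(get_dataframe(report))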

License

This project is licensed under the GPLv3 License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

Contact

For any questions or feedback, please contact [your email].

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data_fingerprint-0.1.7.tar.gz (22.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

data_fingerprint-0.1.7-py3-none-any.whl (24.9 kB view details)

Uploaded Python 3

File details

Details for the file data_fingerprint-0.1.7.tar.gz.

File metadata

  • Download URL: data_fingerprint-0.1.7.tar.gz
  • Upload date:
  • Size: 22.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for data_fingerprint-0.1.7.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | bc7f49953f27ddbdf0ec829449503f5c6d68b7f1a6de3a48f52ca2b815a9ced7 |
| MD5 | 340335d93715182a0e3340b49588b618 |
| BLAKE2b-256 | ec345d06b526dbedbf1609e0769dad1e2f416ddc4b9433f32897eb28a6ecf35e |
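
A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is a generic approach, not something the project prescribes; the helper names `sha256_of_file` and `matches_published_digest` are assumptions, and the file path in the usage comment presumes the archive sits in the current directory.

```python
import hashlib

def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# Usage (assumes the source archive was downloaded to the working directory):
# expected = "bc7f49953f27ddbdf0ec829449503f5c6d68b7f1a6de3a48f52ca2b815a9ced7"
# print(matches_published_digest("data_fingerprint-0.1.7.tar.gz", expected))
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for a 22.8 kB archive but keeps the helper reusable for larger downloads.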


File details

Details for the file data_fingerprint-0.1.7-py3-none-any.whl.

File metadata

  • Download URL: data_fingerprint-0.1.7-py3-none-any.whl
  • Upload date:
  • Size: 24.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure

File hashes

Hashes for data_fingerprint-0.1.7-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 1795f1a9108a4195fc54afb6c407b41c5afc76d754e0e2d43916f2fe3bae10b5 |
| MD5 | e5d36ee6e848e8e5eab332b5708db29f |
| BLAKE2b-256 | 063c684094e545e6c4773407b10e3ba33c000e8932ac77115c3732bdfe50e0c4 |

