DataFingerprint
DataFingerprint is a Python package designed to compare two datasets and generate a detailed report highlighting the differences between them. This tool is particularly useful for data validation, quality assurance, and ensuring data consistency across different sources.
Features
- Column Name Differences: Identify columns that are present in one dataset but missing in the other.
- Column Data Type Differences: Detect discrepancies in data types between corresponding columns in the two datasets.
- Row Differences: Find rows that are present in one dataset but missing in the other, or rows that have different values in corresponding columns.
- Paired Row Differences: Compare rows that have the same primary key or unique identifier in both datasets and identify differences in their values.
- Data Report: Generate a comprehensive report summarizing all the differences found between the two datasets.
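The column-level checks in the list above can be illustrated without the package. The sketch below is a plain-Python illustration of the idea, not DataFingerprint's implementation. The helper names column_name_differences and column_type_differences, and the dict-of-lists table representation, are assumptions made for this example only.

```python
def column_name_differences(left: dict, right: dict) -> dict:
    """Return the columns that exist in only one of the two tables."""
    left_cols, right_cols = set(left), set(right)
    return {
        "only_in_left": sorted(left_cols - right_cols),
        "only_in_right": sorted(right_cols - left_cols),
    }


def column_type_differences(left: dict, right: dict) -> dict:
    """For shared columns, report value types that differ between the tables."""
    differences = {}
    for column in sorted(set(left) & set(right)):
        left_types = sorted({type(value).__name__ for value in left[column]})
        right_types = sorted({type(value).__name__ for value in right[column]})
        if left_types != right_types:
            differences[column] = (left_types, right_types)
    return differences


# Hypothetical sample tables: "age" is numeric on the left and text on the right.
left = {"id": [1, 2], "age": [25, 30], "city": ["Oslo", "Rome"]}
right = {"id": [1, 2], "age": ["25", "30"], "name": ["Alice", "Bob"]}

name_diff = column_name_differences(left, right)
type_diff = column_type_differences(left, right)
print(name_diff)
print(type_diff)
```

Here the name check flags city and name, and the type check flags age, which mirrors the first two feature bullets.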
Installation
Install DataFingerprint from PyPI with pip:
pip install data-fingerprint
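To confirm that the installation succeeded, the installed version can be queried through the standard library. This is a minimal sketch that relies only on importlib.metadata and the distribution name used in the pip command above; the helper name installed_version is an assumption for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("data-fingerprint")
print(found if found else "data-fingerprint is not installed")
```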
Usage
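The following example compares two small Polars DataFrames, pairing rows on the id column through grouping_columns, and prints the resulting report as JSON and through get_dataframe: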
import polars as pl
from data_fingerprint.src.utils import get_dataframe
from data_fingerprint.src.comparator import get_data_report
from data_fingerprint.src.models import DataReport
# Create two sample datasets
df1 = pl.DataFrame(
{"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]}
)
df2 = pl.DataFrame(
{"id": [1, 2, 4], "name": ["Alice", "Bob", "David"], "age": [25, 30, 40]}
)
# Generate a data report comparing the two datasets
report: DataReport = get_data_report(df1, df2, "df_0", "df_1", grouping_columns=["id"])
print(report.model_dump_json(indent=4))
print(get_dataframe(report))
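To make clear what pairing rows on id means for the sample data above, the following plain-Python sketch reproduces the comparison by hand. It is an illustration under stated assumptions, not the package's algorithm or report format: compare_by_key and to_rows_by_key are hypothetical helpers, and tables are represented as dicts of lists rather than Polars DataFrames.

```python
def to_rows_by_key(table: dict, key: str) -> dict:
    """Index a dict-of-lists table by the values of its key column."""
    columns = list(table)
    return {
        table[key][i]: {column: table[column][i] for column in columns}
        for i in range(len(table[key]))
    }


def compare_by_key(left: dict, right: dict, key: str) -> dict:
    """Split keys into left-only, right-only, and shared-but-changed rows."""
    left_rows = to_rows_by_key(left, key)
    right_rows = to_rows_by_key(right, key)
    changed = {}
    for row_key in sorted(set(left_rows) & set(right_rows)):
        # Collect (left value, right value) for every column that differs.
        diffs = {
            column: (value, right_rows[row_key].get(column))
            for column, value in left_rows[row_key].items()
            if value != right_rows[row_key].get(column)
        }
        if diffs:
            changed[row_key] = diffs
    return {
        "only_in_left": sorted(set(left_rows) - set(right_rows)),
        "only_in_right": sorted(set(right_rows) - set(left_rows)),
        "changed": changed,
    }


# Same values as df1 and df2 in the usage example.
table_0 = {"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]}
table_1 = {"id": [1, 2, 4], "name": ["Alice", "Bob", "David"], "age": [25, 30, 40]}
result = compare_by_key(table_0, table_1, "id")
print(result)

# A hypothetical third table in which a paired row (id 2) has a changed value.
table_2 = {"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"], "age": [25, 31, 35]}
paired = compare_by_key(table_0, table_2, "id")
print(paired)
```

With id as the key, rows 1 and 2 match in both sample datasets, row 3 exists only in the first, and row 4 exists only in the second. Without a key, such rows could only be compared as whole rows.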
License
This project is licensed under the GPLv3 License. See the LICENSE file for details.
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
Contact
For any questions or feedback, please contact [your email].
File details
Details for the file data_fingerprint-0.1.5.tar.gz.
File metadata
- Download URL: data_fingerprint-0.1.5.tar.gz
- Upload date:
- Size: 22.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9fe92668625223bcdb0e8865be954c5e975121ca6dffe3737eb686023b55ce61 |
| MD5 | 25984fd276ec3214b7b1c976fde63d21 |
| BLAKE2b-256 | a0b70ae9cee9632efb619950c014181bb3cb4b0c4633583e9e6d6bfd37cffa71 |
File details
Details for the file data_fingerprint-0.1.5-py3-none-any.whl.
File metadata
- Download URL: data_fingerprint-0.1.5-py3-none-any.whl
- Upload date:
- Size: 24.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.2 CPython/3.13.2 Linux/6.8.0-1021-azure
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7aaeac07dbe2b830e96a054cf0fba46eeedb8d188c0bafd4a4b352fa3e3062bf |
| MD5 | 068271ecf4d7c34567628caf0cecd05f |
| BLAKE2b-256 | d5e968232e813c8e7c97f8a20a7dd1e311dc800be4f600725902faf1f4c04650 |