Data profiling tool with a focus on dataset comparisons
Project description
Data Comparator
Overview
Data Comparator is a pandas-based data profiling tool for quick and modular profiling of two datasets. The primary inspiration for this project was the need to quickly compare two datasets of different formats after a transformation was applied, but a range of other capabilities have been implemented and will continue to grow.
Data Comparator would be useful for the following scenarios:
- Compare old/new (or original/modified) datasets to find general differences
- Routine EDA of a dataframe
- Compare two datasets of different formats
- Profile a dataset during interactive debugging
- Compare various columns within the same dataset
- Check for specific abnormalities within a dataset
- Export a comparison in HTML form
Setup
Use pip to install the Data Comparator package:
Installation
pip install data_comparator
Running
A command line interface and graphical user interface are provided.
Command Line:
Run the following in a script:
import data_comparator.data_comparator as dc
GUI:
Run the following in a command line:
data_comparator
Comparisons can also be exported as an HTML report.
Usage
Users can load, profile, validate, and compare datasets as shown below. For the sake of example, I'm using a dataset that provides historical avocado prices.
Loading data
Data can be loaded from a file or dropped into the data column boxes in the Data Loading tab of the GUI. Note that loading happens automatically, so carefully drop each file directly into the desired box.
Load From a File
avo2020_dataset = dc.load_dataset(avo_path / "avocado2020.csv", "avo2020")
Load from a (Pandas or Spark) dataframe
avo2019_dataset = dc.load_dataset(avocado2019_df, "avo2019")
Load With Input Parameters
avo2020_adj_dataset = dc.load_dataset(
    data_source=avo_path / "avo2020_adjusted.parquet",
    data_source_name="avo2020_adjusted",
    engine="fastparquet",
    columns=["Date", "AveragePrice", "Volume", "year"]
)
Note that PyArrow is the default engine for reading parquets in Data Comparator.
Load Multiple Datasets
avo2017_path = avo_path / "avocado2017.sas7bdat"
avo2018_path = avo_path / "avocado2018.sas7bdat"
avo2017_ds, avo2018_ds = dc.load_datasets(
    avo2017_path,
    avo2018_path,
    data_source_names=["avo2017", "avo2018"],
    load_params_list=[{}, {"iterator": True, "chunksize": 1000}]
)
In the snippet above, I'm reading in the 2017 SAS file as is, and reading the 2018 one incrementally - 1000 lines at a time.
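Assuming the `load_params_list` entries are passed through to the underlying pandas reader, the chunked-read behavior can be sketched in plain pandas. The CSV built in memory below is purely illustrative:

```python
import io

import pandas as pd

# Simulate a large file with an in-memory CSV of 5,000 rows
csv_data = "AveragePrice,year\n" + "\n".join(f"{1.0 + i / 100},2018" for i in range(5000))

# chunksize=1000 makes the reader yield DataFrames of up to 1,000 rows
# each, instead of loading the whole file into memory at once
reader = pd.read_csv(io.StringIO(csv_data), chunksize=1000)
chunks = list(reader)
df = pd.concat(chunks, ignore_index=True)

print(len(chunks))  # number of chunks read
print(len(df))      # total rows after reassembly
```

Incremental reads like this trade a little speed for a much smaller peak memory footprint, which matters for large SAS or CSV files.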
Comparing Data
Data from various types can be compared with user-specified columns or all identically-named columns between the datasets. The comparisons are automatically saved for each session.
Compare Datasets
avo2020_ds = dc.get_dataset("avo2020")
avo2020_adj_ds = dc.get_dataset("avo2020_adjusted")
dc.compare_ds(avo2020_ds, avo2020_adj_ds)
Compare Files
dc.compare(
avo_path / "avocado2020.csv",
avo_path / "avo2020_adjusted.parquet"
)
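Conceptually, comparing identically-named columns between two datasets can be sketched in plain pandas. The frames and per-column statistics below are illustrative, not Data Comparator's actual implementation:

```python
import pandas as pd

old = pd.DataFrame({"AveragePrice": [1.10, 1.30], "Volume": [100, 200]})
new = pd.DataFrame({"AveragePrice": [1.20, 1.30], "Volume": [100, 250], "year": [2020, 2020]})

# Only columns present in both frames are compared;
# "year" exists only in `new`, so it is skipped
shared = old.columns.intersection(new.columns)

# Build a simple per-column comparison report
report = {
    col: {"old_mean": old[col].mean(), "new_mean": new[col].mean()}
    for col in shared
}
print(report)
```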
Example Output
Other Features
Some metadata for each dataset/comparison object is provided. Here, I use a cosmetic product dataset to illustrate some use cases.
Quick Dataset Summary
Basic metadata and summary information is provided for the dataset object.
skin_care_ds = dc.get_dataset("skin_care")
skin_care_ds.get_summary()
{'path': PosixPath('/path/to/cosmetics_data/skinproduct_vfdemo.sas7bdat'),
'format': 'sas7bdat',
'size': '13.56 MB',
'columns': {'ProductKey': <components.dataset.StringColumn at 0x7f9a05442d30>,
'DistributionCenter': <components.dataset.StringColumn at 0x7f9a0543fe80>,
'DATE_CHAR': <components.dataset.StringColumn at 0x7f9a021ac820>,
'Discount': <components.dataset.NumericColumn at 0x7f9a085c5490>,
'Revenue': <components.dataset.NumericColumn at 0x7f9a085c5280>},
'ds_name': 'skin_care',
'load_time': '0:00:01.062732'}
The dataset object is subscriptable, so you can access individual columns via subscripting. In the snippet below, we access the summary for the Revenue column.
skin_care_ds["Revenue"].get_summary()
{'ds_name': 'skin_care',
'name': 'Revenue',
'count': 147070,
'missing': 0,
'data_type': 'NumericColumn',
'min': 0.0,
'max': 1045032.0,
'std': 118382.93241134178,
'mean': 79200.74877269327,
'zeros': 1433}
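The fields in a column summary like this map directly onto standard pandas operations. Here is a minimal sketch using made-up revenue values (not the real dataset):

```python
import pandas as pd

revenue = pd.Series([0.0, 15000.0, 250000.0, 0.0, 98000.0], name="Revenue")

summary = {
    "name": revenue.name,
    "count": int(revenue.count()),          # non-missing values
    "missing": int(revenue.isna().sum()),   # NaN count
    "min": float(revenue.min()),
    "max": float(revenue.max()),
    "std": float(revenue.std()),
    "mean": float(revenue.mean()),
    "zeros": int((revenue == 0).sum()),
}
print(summary)
```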
Perform Checks
I've added some basic data validations for various data types. Use the perform_check() method to run the validations. Note that String-type comparisons can be computationally expensive; consider using the row_limit flag when performing checks on columns of String type.
skin_care_ds["Revenue"].perform_check()
{'pot_outliers': '4035',
'susp_skewness': '2.939470744411452',
'susp_zero_count': ''}
I'm still working out the kinks with some of the checks (numeric checks like the above, to be exact). Check src/validation_config.json to manage validations.
Coming Attractions
Updates and fixes (mostly here) will be forthcoming. This was a random project that I started for my own practical use in the field, so I'm certainly open to collaboration/feedback. You can drop a comment or find my email below.
Authors
- Demerrick Moton (dmoton3.14@gmail.com)
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for data_comparator-0.8.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 23863779c2eecb5d2ee1b38150c62f65a8432bc5ccd9ab92d52cd5ad5ec92798
MD5 | 6f4494cc7fb1fc0a0c9038f7b6722f33
BLAKE2b-256 | d85b39bbf7305ba3981e120cae33828d8e6fb9e684aaf470c4432568f5fa2666