Data profiling tool with a focus on dataset comparisons

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Data Comparator

Overview

Data Comparator is a pandas-based data profiling tool for quick and modular profiling of two datasets. The primary inspiration for this project was quickly comparing two datasets from a number of different formats after some transformation was applied, but a range of capabilities have been, and will continue to be, implemented.

Data Comparator is useful in the following scenarios:

Compare old/new (or original/modified) datasets to find general differences
Routine EDA of a dataframe
Compare two datasets of different formats
Profile a dataset during interactive debugging
Compare various columns within the same dataset
Check for specific abnormalities within a dataset

Setup

Use pip to install the Data Comparator package:

Installation

pip install data_comparator

Running

A command line interface and graphical user interface are provided.

Command Line (import the module in a Python session or script):

import data_comparator.data_comparator as dc

GUI:

python -m data_comparator.app

[Image: GUI data loading]

[Image: GUI data detail example]

Usage

Users can load, profile, validate, and compare datasets as shown below. For the sake of example, I'm using a dataset of historical avocado prices.

Loading data

Data can be loaded from a file in code or, in the GUI, dropped into the data column boxes in the first tab. Note that loading starts automatically on drop, so take care to drop each file directly into the desired box. I'm (theoretically) working on refining this.

Load From a File

avo2020_dataset = dc.load_dataset(avo_path / "avocado2020.csv", "avo2020")

Load from a (Pandas or Spark) dataframe

avo2019_dataset = dc.load_dataset(avocado2019_df, "avo2019")

Load With Input Parameters

avo2020_adj_dataset = dc.load_dataset(
    data_source=avo_path / "avo2020_adjusted.parquet",
    data_source_name="avo2020_adjusted",
    engine="fastparquet",
    columns=["Date", "AveragePrice", "Volume", "year"]
)

Note that PyArrow is the default engine for reading parquets in Data Comparator.

Load Multiple Datasets

avo2017_path = avo_path / "avocado2017.sas7bdat"
avo2018_path = avo_path / "avocado2018.sas7bdat"

avo2017_ds, avo2018_ds = dc.load_datasets(
    avo2017_path,
    avo2018_path,
    data_source_names=["avo2017", "avo2018"],
    load_params_list=[{}, {"iterator": True, "chunksize": 1000}]
)

In the snippet above, I'm reading in the 2017 SAS file as is, and reading the 2018 file incrementally, 1000 rows at a time.
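To illustrate what incremental reading means in general, here is a minimal sketch using only the Python standard library. It does not show how Data Comparator reads files internally; the helper `iter_chunks` and the in-memory sample data are hypothetical and exist only for this example.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield successive lists of at most `chunksize` rows from an iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


# Hypothetical in-memory CSV standing in for a large file on disk.
lines = ["year,AveragePrice"] + [f"2018,{1 + i / 100:.2f}" for i in range(2500)]
sample = io.StringIO("\n".join(lines))
reader = csv.DictReader(sample)

chunk_sizes = []
running_total = 0.0
for chunk in iter_chunks(reader, chunksize=1000):
    # Only one chunk is held in memory at a time.
    chunk_sizes.append(len(chunk))
    running_total += sum(float(row["AveragePrice"]) for row in chunk)

print(chunk_sizes)
```

Processing one chunk at a time keeps memory use bounded by the chunk size rather than the file size, at the cost of needing statistics that can be accumulated across chunks.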

Comparing Data

Datasets of different types can be compared on user-specified columns or on all identically named columns shared between the datasets. The comparisons are automatically saved for each session.
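The idea of comparing on identically named columns can be sketched roughly as follows. This is a standard-library illustration, not the library's implementation: the functions `common_columns` and `compare_columns`, the metrics chosen, and the toy data are all assumptions made for this example.

```python
from statistics import mean


def common_columns(left, right):
    """Return column names present in both datasets, in the left dataset's order."""
    return [name for name in left if name in right]


def compare_columns(left, right, columns=None):
    """Compare row count and mean for the chosen (or shared) columns."""
    columns = columns or common_columns(left, right)
    report = {}
    for name in columns:
        a, b = left[name], right[name]
        report[name] = {
            "count_diff": len(b) - len(a),
            "mean_diff": mean(b) - mean(a),
        }
    return report


# Hypothetical toy data: each dataset maps a column name to a list of values.
original = {"AveragePrice": [1.0, 1.5, 2.0], "Volume": [100, 200, 300]}
adjusted = {"AveragePrice": [1.0, 1.5, 2.5], "year": [2020, 2020, 2020]}

report = compare_columns(original, adjusted)
print(report)
```

Here only `AveragePrice` appears in both datasets, so it is the only column compared when no columns are specified.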

Compare Datasets

avo2020_ds = dc.get_dataset("avo2020")
avo2020_adj_ds = dc.get_dataset("avo2020_adjusted")

dc.compare_ds(avo2020_ds, avo2020_adj_ds)

Compare Files

dc.compare(
    avo_path / "avocado2020.csv",
    avo_path / "avo2020_adjusted.parquet"
)

Example Output

[Image: comparison example]

Other Features

Some metadata for each dataset/comparison object is provided. Here, I use a cosmetic product dataset to illustrate some use cases.

Quick Dataset Summary

Basic metadata and summary information is provided for the dataset object.

skin_care_ds = dc.get_dataset("skin_care")
skin_care_ds.get_summary()

{'path': PosixPath('/path/to/cosmetics_data/skinproduct_vfdemo.sas7bdat'),
 'format': 'sas7bdat',
 'size': '13.56 MB',
 'columns': {'ProductKey': <components.dataset.StringColumn at 0x7f9a05442d30>,
  'DistributionCenter': <components.dataset.StringColumn at 0x7f9a0543fe80>,
  'DATE_CHAR': <components.dataset.StringColumn at 0x7f9a021ac820>,
  'Discount': <components.dataset.NumericColumn at 0x7f9a085c5490>,
  'Revenue': <components.dataset.NumericColumn at 0x7f9a085c5280>},
 'ds_name': 'skin_care',
 'load_time': '0:00:01.062732'}

The dataset object is subscriptable, so you can access individual columns as a subscript. We're accessing the summary for the Revenue column in the snippet below.

skin_care_ds["Revenue"].get_summary()

{'ds_name': 'skin_care',
'name': 'Revenue',
'count': 147070,
'missing': 0,
'data_type': 'NumericColumn',
'min': 0.0,
'max': 1045032.0,
'std': 118382.93241134178,
'mean': 79200.74877269327,
'zeros': 1433}
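For readers curious how subscript access and a per-column summary could fit together, below is a minimal sketch. The classes are hypothetical stand-ins written for this example, not the library's own `Dataset` or `NumericColumn`; in particular, counting missing values as part of `count` and using the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


class NumericColumn:
    """Hypothetical stand-in for a numeric column; not the library's class."""

    def __init__(self, ds_name, name, values):
        self.ds_name = ds_name
        self.name = name
        self.values = values

    def get_summary(self):
        present = [v for v in self.values if v is not None]
        return {
            "ds_name": self.ds_name,
            "name": self.name,
            "count": len(self.values),  # assumption: count includes missing rows
            "missing": len(self.values) - len(present),
            "data_type": type(self).__name__,
            "min": min(present),
            "max": max(present),
            "std": stdev(present),  # sample standard deviation
            "mean": mean(present),
            "zeros": sum(1 for v in present if v == 0),
        }


class Dataset:
    """Hypothetical container that exposes its columns by subscript."""

    def __init__(self, name, columns):
        self.name = name
        self.columns = {
            col: NumericColumn(name, col, vals) for col, vals in columns.items()
        }

    def __getitem__(self, column_name):
        # Subscript access: ds["Revenue"] returns the column object.
        return self.columns[column_name]


toy_ds = Dataset("toy", {"Revenue": [0.0, 10.0, None, 30.0, 0.0]})
summary = toy_ds["Revenue"].get_summary()
print(summary)
```

Implementing `__getitem__` is what makes an object subscriptable in Python, so the column lookup reads the same way as a dictionary or DataFrame access.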

Perform Checks

I've added some basic data validations for various data types. Use the perform_check() method to run the validations. Note that checks on String columns can be computationally expensive; consider using the row_limit flag when performing checks on columns of String type.

skin_care_ds["Revenue"].perform_check()

{'pot_outliers': '4035',
 'susp_skewness': '2.939470744411452',
 'susp_zero_count': ''}

I'm still working out the kinks with some of the checks (the numeric checks, like the one above, to be exact). See src/validation_config.json to manage the validations.
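As a rough idea of what numeric checks of this kind might compute, here is a standard-library sketch. The rules used (a 1.5 x IQR fence for outliers, population skewness, a zero-share threshold) and every threshold value are assumptions for illustration; the library's actual rules are defined by its own code and configuration.

```python
from statistics import mean, pstdev, quantiles


def count_iqr_outliers(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < low or v > high)


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


def perform_checks_sketch(values, skew_limit=1.0, zero_share_limit=0.5):
    """Return a flag per check; an empty string means nothing suspicious."""
    outliers = count_iqr_outliers(values)
    skew = skewness(values)
    zero_share = sum(1 for v in values if v == 0) / len(values)
    return {
        "pot_outliers": str(outliers) if outliers else "",
        "susp_skewness": str(skew) if abs(skew) > skew_limit else "",
        "susp_zero_count": str(zero_share) if zero_share > zero_share_limit else "",
    }


# Hypothetical toy data with one extreme value.
revenue = [10, 12, 11, 13, 12, 11, 10, 12, 500]
result = perform_checks_sketch(revenue)
print(result)
```

The single extreme value is counted as a potential outlier and also drives the skewness above the assumed limit, while the zero-count check stays empty because the data contains no zeros.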

Coming Attractions

Updates and fixes (mostly here) will be forthcoming. This was a random project that I started for my own practical use in the field, so I'm certainly open to collaboration/feedback. You can drop a comment or find my email below.

Authors

Demerrick Moton (dmoton3.14@gmail.com)

Project details

Release history

0.8.0 (Nov 30, 2022)
0.7.8 (Jul 30, 2021)
0.7.7 (Jul 30, 2021)
0.5.1 (Jan 1, 2021), this version
0.5.0 (Jan 1, 2021)

Download files

Source Distribution: data-comparator-0.5.1.tar.gz (24.1 kB), uploaded Jan 1, 2021

Built Distribution: data_comparator-0.5.1-py3-none-any.whl (30.8 kB), uploaded Jan 1, 2021, Python 3

Hashes for data-comparator-0.5.1.tar.gz

Algorithm	Hash digest
SHA256	`9add222541c6d482ca6fe40469de22ffad07820ce0628079debd9532bc83687c`
MD5	`2b31bcc668a10cd262ddd0f334165e5c`
BLAKE2b-256	`185838fef3d796a9038bb2e5392c213845af9e9bfd9c905852a35260b25fedce`

Hashes for data_comparator-0.5.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`6ffdd2be428631145b23d383666f24a533bf42aa1210c554b049380503523587`
MD5	`0c30eeef170798ac9f33e3afbdf985f3`
BLAKE2b-256	`2debd411a022641363cddbb9cd31372ef5d47ec1cf31b92422cfab7947019a81`
