ydata-quality

ydata_quality is an open-source Python library for assessing Data Quality throughout the multiple stages of data pipeline development.

A holistic view of the data can only be captured by looking at it from multiple dimensions. ydata_quality evaluates each dimension in its own module and wraps all of them into a single Data Quality engine. This repository contains the core Python source code and walkthrough tutorials.

Quickstart

The source code is currently hosted on GitHub at: https://github.com/ydataai/ydata-quality

Binary installers for the latest released version are available at the Python Package Index (PyPI).

pip install ydata-quality
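
To confirm from Python that the package was installed, one option is the standard library's importlib.metadata. This is a minimal sketch, not part of ydata_quality: the helper name installed_version is hypothetical, and it assumes Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is not installed.
print(installed_version("ydata-quality"))
```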

Comprehensive quality check in a few lines of code

from ydata_quality import DataQuality
import pandas as pd

# Load the data
df = pd.read_csv('./datasets/transformed/census_10k.csv')

# Create a DataQuality object from the main class that holds all quality modules
dq = DataQuality(df=df)

# Run the tests
results = dq.evaluate()

# Output a report of the quality issues found by the engines
dq.report()

Warnings count by priority:
	Priority 1: 1 warning(s)
	Priority 2: 3 warning(s)
	TOTAL: 4 warning(s)
List of warnings sorted by priority:
	[DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns. (Priority 1: heavy impact expected)
	[EXACT DUPLICATES] Found 3 instances with exact duplicate feature values. (Priority 2: usage allowed, limited human intelligibility)
	[FLATLINES] Found 4627 flatline events with a minimun length of 5 among the columns {'marital-status', 'workclass', 'income', 'native-country', 'capital-gain', 'capital-loss', 'education', 'occupation', 'workclass2', 'sex', 'education-num', 'hours-per-week', 'relationship', 'race'}. (Priority 2: usage allowed, limited human intelligibility)
	[PREDEFINED ERRONEOUS DATA] Found 1960 ED values in the dataset. (Priority 2: usage allowed, limited human intelligibility)
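
To make the first three warnings in the report more concrete, here is a plain-Python sketch of what such checks could look like on a toy table. It is not the ydata_quality implementation: the function names, the toy data, and the definitions used (a duplicate column repeats an earlier column exactly, a duplicate row repeats an earlier row exactly, a flatline is a run of identical consecutive values) are assumptions made for illustration.

```python
from itertools import groupby


def duplicate_columns(table):
    """Return names of columns whose values exactly repeat an earlier column."""
    seen = set()
    duplicates = []
    for name, values in table.items():
        key = tuple(values)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


def exact_duplicate_rows(table):
    """Count rows that exactly repeat an earlier row."""
    rows = list(zip(*table.values()))
    return len(rows) - len(set(rows))


def flatlines(values, min_length=5):
    """Return (start_index, length) for each run of identical consecutive values."""
    events = []
    index = 0
    for _, run in groupby(values):
        length = sum(1 for _ in run)
        if length >= min_length:
            events.append((index, length))
        index += length
    return events


# Hypothetical toy data (column name -> values), not the census dataset.
table = {
    "workclass": ["Private"] * 5 + ["State-gov"],
    "workclass2": ["Private"] * 5 + ["State-gov"],
    "age": [39, 39, 39, 39, 41, 50],
}

print(duplicate_columns(table))        # columns repeating an earlier column
print(exact_duplicate_rows(table))     # rows repeating an earlier row
print(flatlines(table["workclass"]))   # runs of at least 5 identical values
```

The library's own engines may count and group these events differently, so the numbers from this sketch are not expected to match a dq.report() run.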

Examples

Here you can find walkthrough tutorials and examples to familiarize yourself with the different modules of ydata_quality.

Start Here for Quick and Overall Walkthrough

To dive into a specific module and understand how it works, see the tutorial notebooks.

Contributing

We are open to collaboration! If you want to start contributing you only need to:

Search for an issue you would like to work on. Issues for newcomers are labeled with good first issue.
Create a PR solving the issue.
We will review every PR and either accept it or ask for revisions.

You can also join the discussions on the #data-quality channel on our Slack and request features/bug fixes by opening issues on our repository.

Support

For support in using this library, please join the #help Slack channel. The Slack community is very friendly and great about quickly answering questions about the use and development of the library. Click here to join our Slack community!

License

GNU General Public License v3.0

Release history

0.1.0 (Sep 22, 2021)

0.1a1 pre-release (Sep 21, 2021), this version

Download files

Source Distribution

ydata-quality-0.1a1.tar.gz (55.2 kB)

Uploaded Sep 21, 2021 Source

Built Distribution

ydata_quality-0.1a1-py2.py3-none-any.whl (61.7 kB)

Uploaded Sep 21, 2021 Python 2 Python 3

Hashes for ydata-quality-0.1a1.tar.gz

Algorithm	Hash digest
SHA256	`b89bada90b3613c26c7fb0b2b22a7ac1ca6f6cb0aa97aece284d0c6fe6f978f3`
MD5	`b4a057ebbd48ac51009f097ba26b87d1`
BLAKE2b-256	`a522524cd75b9c12d6535beafa71413e10bc70d222c506b9995fd9d67a2363de`

Hashes for ydata_quality-0.1a1-py2.py3-none-any.whl

Algorithm	Hash digest
SHA256	`dec3c4416786ca93c01a0cc2eac36a09efeefb12637b0ebe9e6b6e281009c084`
MD5	`0f1b5290b8896cd5ddcabfee348632e3`
BLAKE2b-256	`7672ba244ff82ce970e5b663abdb56a7e7980d2fbf18eced786e5d23f7d77208`
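
A downloaded file can be checked against the SHA256 digests listed above with Python's hashlib. The following is a minimal sketch: the helper names sha256_of and matches are hypothetical, and the file path in the commented usage assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Digest copied from the table for ydata-quality-0.1a1.tar.gz.
EXPECTED_SDIST_SHA256 = "b89bada90b3613c26c7fb0b2b22a7ac1ca6f6cb0aa97aece284d0c6fe6f978f3"

# Usage (assumes the archive sits in the current directory):
# print(matches("ydata-quality-0.1a1.tar.gz", EXPECTED_SDIST_SHA256))
```

pip can perform the same comparison during installation when hashes are pinned in a requirements file, which may be more convenient than a manual check.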