
Data-Quality-Check


A Spark data quality check tool.


Requirements

  • Python 3.7+
  • Java 8+
  • Apache Spark 3.0+

Usage

Installation

pip install --upgrade data-quality-check

# Install Spark if needed
pip install pyspark
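
To verify the installation end to end, you can spin up a throwaway local SparkSession (a minimal sketch; assumes a local Java installation is available and the `master` setting is illustrative):

from pyspark.sql import SparkSession

# Start a disposable local session just to confirm pyspark is wired up correctly
spark = SparkSession.builder.master("local[1]").appName("smoke-test").getOrCreate()
print(spark.version)  # expect 3.0 or later
spark.stop()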

Quick Start

from data_quality_check.config import Config
from data_quality_check.profiler.combined_profiler import CombinedProfiler
from data_quality_check.report.renders.html.render import render_all

config_dict = {
    'dataset': {'name': 'mydb.my_table'},
    'profiling': {
        'general': {'columns': ['*']}
    }
}
config = Config().parse_obj(config_dict)
profiler = CombinedProfiler(spark, config=config)
result = profiler.run()
html = render_all(all_pr=result)

# Present in Jupyter notebooks
from IPython.display import display, HTML
display(HTML(html))

# Present in Databricks notebooks
displayHTML(html)

# Save to an HTML file
with open("report.html", "w") as f:
    f.write(html)

If you do not have a ready-to-use Spark session, use the code below to create one:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").enableHiveSupport().getOrCreate()
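
If you are running locally without a Hive metastore (for example, to profile in-memory DataFrames only), a plain local session also works; the `master` value and app name below are illustrative:

# Local session without Hive support; enough for profiling in-memory DataFrames
spark = SparkSession.builder.master("local[*]").appName("MyApp").getOrCreate()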

Development

Dependencies

| Filename | Purpose |
| --- | --- |
| requirements.txt | Package requirements |
| requirements-dev.txt | Requirements for development |

Test

PYTHONPATH=./src pytest tests/*

Build

python setup.py sdist bdist_wheel && twine check dist/*

Publish

twine upload --repository-url https://test.pypi.org/legacy/ dist/*
twine upload dist/*

Manual

Profiling Check

There are two profilers: GeneralProfiler and CustomizedProfiler. If you would like to run both on your dataset, use CombinedProfiler, which runs them together.

Combined Profiler

The easiest way to run a combined profiler (a mix of the general and customized profilers) on your dataset:

Example of running combined profiling

from data_quality_check.config import Config
from data_quality_check.profiler.combined_profiler import CombinedProfiler
from data_quality_check.report.renders.html.render import render_all

config_dict = {
    'dataset': {'name': 'my_table'},
    'profiling': {
        'general': {'columns': ['*']},
        'customized': {
            'code_check': [
                {'column': 'my_code_col', 'codes': ['A', 'B', 'C', 'D']}
            ]
        }
    }
}
config = Config().parse_obj(config_dict)
profiler = CombinedProfiler(spark, config=config)
result = profiler.run()
html = render_all(all_pr=result)

# Present in Databricks notebooks; in Jupyter, use display(HTML(html)) instead
displayHTML(html)

General Profiler

from pyspark.sql import SparkSession
from data_quality_check.config import ConfigDataset
from data_quality_check.profiler.general_profiler import GeneralProfiler

spark = SparkSession.builder.appName("SparkProfilingApp").enableHiveSupport().getOrCreate()
data = [{'name': 'Alice', 'age': 1, 'gender': 'female', 'is_new': True},
        {'name': 'Tom', 'age': 10, 'gender': 'male', 'is_new': False}]

# Run general check on spark df
df = spark.createDataFrame(data)
result_df = GeneralProfiler(spark, df=df).run(return_type='dataframe')
result_df.show()

# Run general check on spark/hive table
df.createOrReplaceTempView('my_table')
result_df = GeneralProfiler(spark, dataset_config=ConfigDataset(name='my_table')).run(return_type='dataframe')
result_df.show()
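
If you prefer a plain Python structure over a Spark DataFrame, the customized profiler below accepts `return_type='dict'`; this sketch assumes GeneralProfiler supports the same option:

import json

# Assumes GeneralProfiler.run accepts return_type='dict', mirroring CustomizedProfiler
result = GeneralProfiler(spark, df=df).run(return_type='dict')
print(json.dumps(result, indent=2, ensure_ascii=False))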

Customized Profiler

import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
from data_quality_check.config import Config, ConfigDataset, ConfigProfilingCustomized
from data_quality_check.profiler.customized_profiler import CustomizedProfiler

# Initialize spark
spark = SparkSession.builder.appName("SparkProfilingApp").enableHiveSupport().getOrCreate()
dept = [("Finance", 1),
        ("Marketing", 2),
        ("Sales", 3),
        ("IT", 4)]
deptSchema = StructType([StructField('dept_name', StringType(), True),
                         StructField('dept_id', LongType(), True)])
spark.createDataFrame(data=dept, schema=deptSchema).createOrReplaceTempView('dept')
print('dept table:')
spark.table('dept').show(truncate=False)

employee = [(1, "Amy", 1, 'male', 1000, 'amy@example.com'),
            (2, "Caro", 2, 'male', 1000, 'caro@example.com'),
            (3, "Mark", 3, 'Error', 2000, 'unknown'),
            (4, "Timi", 4, 'female', 2000, None),
            (5, "Tata", 5, 'unknown', 3000, 'bad email address'),
            (6, "Zolo", None, None, 3000, 'my-C0omplicated_EMAIL@A.ddress.xyz')]
employeeSchema = StructType([StructField('uid', LongType(), True),
                             StructField('name', StringType(), True),
                             StructField('dept_id', LongType(), True),
                             StructField('gender', StringType(), True),
                             StructField('income', LongType(), True),
                             StructField('email', StringType(), True)])
spark.createDataFrame(data=employee, schema=employeeSchema).createOrReplaceTempView('employee')
print('employee table:')
spark.table('employee').show(truncate=False)

# Specify the configuration of customized profiler
customized_config_dict = {
    'code_check': [
        {'column': 'gender', 'codes': ['male', 'female', 'unknown']}
    ],
    'key_mapping_check': [
        {'column': 'dept_id', 'target_table': 'dept', 'target_column': 'dept_id'}
    ]
}

customized_config = ConfigProfilingCustomized.parse_obj(customized_config_dict)
dataset_config = ConfigDataset.parse_obj({'name': 'employee'})

# Initialize CustomizedProfiler with configuration
customized_profiler = CustomizedProfiler(spark,
                                         dataset_config=dataset_config,
                                         customized_profiling_config=customized_config)

result = customized_profiler.run(return_type='dict')
print(json.dumps(result, indent=' ', ensure_ascii=False, allow_nan=True))
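
Since the result is a plain dict, you can also persist it for later inspection; the file name below is arbitrary:

# Write the profiling result to disk as JSON
with open('customized_profiling_result.json', 'w') as fp:
    json.dump(result, fp, indent=2, ensure_ascii=False)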

Expectation Verification

To be done.

Supported Checks and Expectations

| Profiler Type | Check Type | Render result as HTML? | Support Expectation? | Description |
| --- | --- | --- | --- | --- |
| General | Distinct Values Count | YES | Will DO | Number of unique values in a given column; equals the Unique Row Count |
| General | Null Row Count | YES | Will DO | Number of null rows in a given column |
| General | Empty Row Count | YES | Will DO | Number of empty/blank text rows in a given column |
| General | Zero Row Count | YES | Will DO | Number of 0-valued rows in a given column |
| General | Valued Row Count | YES | Will DO | Number of non-null rows in a given column |
| General | Total Row Count | YES | Will DO | Total number of rows |
| General | Unique Row Count | YES | Will DO | Number of rows that have a unique value |
| General | Duplicated Valued Row Count | YES | Will DO | Number of rows that have duplicated values |
| General | Minimum Value | YES | Will DO | Minimum value |
| General | Maximum Value | YES | Will DO | Maximum value |
| General | Mean Value | YES | Will DO | Mean/average value |
| General | Standard Deviation Value | YES | Will DO | Standard deviation of a column |
| General | Values Count | YES | Will DO | Number of values in a given column |
| Customized | Code Check | YES | Will DO | Check whether the values of a column are in the given (expected) code list |
| Customized | Key Mapping Check | Will DO | Will DO | Find the values in this table's column that do not exist in the target (another) table's column. Hint: the target table is usually a dim table |

Will DO = scheduled for development, but not implemented yet.

| Expectation Type | Scope | Description |
| --- | --- | --- |
| ExpectColumnToExist | --- | --- |
| ... | ... | ... |
