Skip to main content

Data Quality framework for Pyspark jobs

Project description

Owl Data Sanitizer: A light Spark data validation framework

license

This is a small framework for data quality validation. This first version works reading spark dataframes from local datasources like local system, s3 or hive and delivers hive tables with quality reports.

Let's follow this example:

Input data from a hive table:

+----------+--------------+--------+---------+------------------+---------+
|GENERAL_ID|          NAME|    CODE|ADDR_DESC|ULTIMATE_PARENT_ID|PARENT_ID|
+----------+--------------+--------+---------+------------------+---------+
|         1|Dummy 1 Entity|12000123|     null|              null|     null|
|         2|          null|    null|     null|                 2|        2|
|         3|          null|12000123|     null|                 3|        3|
|         4|             1|       1|     null|                 4|        4|
|         5|             1|12000123|     null|                 5|        5|
|         6|          null|       3|     null|                 6|        6|
|      null|          null|12000123|     null|                11|        7|
|         7|             2|    null|     null|                 8|        8|
+----------+--------------+--------+---------+------------------+---------+

following this validation config with 4 sections:

  1. source_table including the table metadata.
  2. correctness_validations including correctness validations per column. the rule must be a valid spark SQL expression.
  3. parent_children_constraints including children parent constrains. This means that any parent id should be valid id.
  4. compare_related_tables_list including comparison with other tables or the same table in other environments.
{
  "source_table": {
    "name": "test.data_test",
    "id_column": "GENERAL_ID",
    "output_correctness_table": "test.data_test_correctness",
    "output_completeness_table": "test.data_test_completeness",
    "output_comparison_table": "test.data_test_comparison"
  },
  "correctness_validations": [
    {
      "column": "CODE",
      "rule": "CODE is not null and CODE != '' and CODE != 'null'"
    },
    {
      "column": "NAME",
      "rule": "NAME is not null and NAME != '' and NAME != 'null'"
    },
    {
      "column": "GENERAL_ID",
      "rule": "GENERAL_ID is not null and GENERAL_ID != '' and GENERAL_ID != 'null' and CHAR_LENGTH(GENERAL_ID) < 4"
    }
  ],
  "completeness_validations": [
    {
      "column": "OVER_ALL_COUNT",
      "rule": "OVER_ALL_COUNT <= 7"
    }
  ],
  "parent_children_constraints": [
    {
      "column": "GENERAL_ID",
      "parent": "ULTIMATE_PARENT_ID"
    },
    {
      "column": "GENERAL_ID",
      "parent": "PARENT_ID"
    }
  ],
  "compare_related_tables_list": ["test.diff_df", "test.diff_df_2"]
}

Therefore, these results are delivered in two output hive tables:

a). Correctness Report.

  • You will see and output col per validation col showing either 1 when there is error or 0 when is clean.
  • Sum of error per columns.
+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+
|GENERAL_ID|IS_ERROR_CODE|IS_ERROR_NAME|IS_ERROR_GENERAL_ID|IS_ERROR_GENERAL_ID_ULTIMATE_PARENT_ID|IS_ERROR_GENERAL_ID_PARENT_ID|IS_ERROR__ROW|dt                        |IS_ERROR_CODE_SUM|IS_ERROR_NAME_SUM|IS_ERROR_GENERAL_ID_SUM|IS_ERROR_GENERAL_ID_ULTIMATE_PARENT_ID_SUM|IS_ERROR_GENERAL_ID_PARENT_ID_SUM|IS_ERROR__ROW_SUM|
+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+
|null      |0            |1            |1                  |1                                     |0                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|3         |0            |1            |0                  |0                                     |0                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|7         |1            |0            |0                  |1                                     |1                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|5         |0            |0            |0                  |0                                     |0                            |0            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|6         |0            |1            |0                  |0                                     |0                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|4         |0            |0            |0                  |0                                     |0                            |0            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|2         |1            |1            |0                  |0                                     |0                            |1            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
|1         |0            |0            |0                  |0                                     |0                            |0            |2020-04-17 09:39:04.783505|2                |4                |1                      |2                                         |1                                |5                |
+----------+-------------+-------------+-------------------+--------------------------------------+-----------------------------+-------------+--------------------------+-----------------+-----------------+-----------------------+------------------------------------------+---------------------------------+-----------------+

b) Completeness Report.

  • The overall count of the dataframe.
  • Column checking if the overall count is complete, example: IS_ERROR_OVER_ALL_COUNT.
+--------------+-----------------------+--------------------------+
|OVER_ALL_COUNT|IS_ERROR_OVER_ALL_COUNT|dt                        |
+--------------+-----------------------+--------------------------+
|8             |1                      |2020-04-17 09:39:04.783505|
+--------------+-----------------------+--------------------------+

c). Comparison of schema and values with related dataframes.

NOTE: the result includes for now only the ids that are different and a further join with the source data to see differences is needed.

+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+
|df            |missing_cols_right                |missing_cols_left|missing_vals_right|missing_vals_left|dt                        |
+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+
|test.diff_df_2|GENERAL_ID:string,ADDR_DESC:string|GENERAL_ID:int   |                  |                 |2020-04-17 09:39:07.572483|
|test.diff_df  |                                  |                 |6,7               |                 |2020-04-17 09:39:07.572483|
+--------------+----------------------------------+-----------------+------------------+-----------------+--------------------------+

Installation

Install owl sanitizer from PyPI:

pip install owl-sanitizer-data-quality

Then you can call the library.

from spark_validation.dataframe_validation.dataframe_validator import CreateHiveValidationDF
from spark_validation.common.config import Config

spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()
with open(PATH_TO_CONIGIG_FILE) as f:
        config = Config.parse(f)
CreateHiveValidationDF.validate(spark_session, config)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

owl-sanitizer-data-quality-0.2.tar.gz (8.2 kB view details)

Uploaded Source

Built Distribution

owl_sanitizer_data_quality-0.2-py3-none-any.whl (10.7 kB view details)

Uploaded Python 3

File details

Details for the file owl-sanitizer-data-quality-0.2.tar.gz.

File metadata

  • Download URL: owl-sanitizer-data-quality-0.2.tar.gz
  • Upload date:
  • Size: 8.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for owl-sanitizer-data-quality-0.2.tar.gz
Algorithm Hash digest
SHA256 aae63cd85a01272d423ab614ce7aaa79fa6bb98f5ab6e9365717c62be8878606
MD5 fb109b33751e0c0e51e86feb62350baf
BLAKE2b-256 1e0170ebd5d4a50eb1fdcf151d64a3f7d3a177ecbfb29fd2d8417a44d61bf54f

See more details on using hashes here.

File details

Details for the file owl_sanitizer_data_quality-0.2-py3-none-any.whl.

File metadata

  • Download URL: owl_sanitizer_data_quality-0.2-py3-none-any.whl
  • Upload date:
  • Size: 10.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.4.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for owl_sanitizer_data_quality-0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 2665f7791733b7e294ddb2ea0c6b25379aed096a95d6113f2e96e0c6cf10d38c
MD5 89607342b967ec9879bfedb66dc7fa37
BLAKE2b-256 deefaac77dbe7c3fb5f62551abbcd6ccc07775f72e011f60440d085cd86998ee

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page