
Data unit testing for your Python DataFrames

Project description


Documentation: https://hooqu.readthedocs.io

Source Code: https://github.com/mfcabrera/hooqu


Install

Hooqu requires Pandas >= 1.0 and Python >= 3.7. To install via pip use:

pip install hooqu

Quick Start

import pandas as pd

# data to validate
df = pd.DataFrame(
    [
        (1, "Thingy A", "awesome thing.", "high", 0),
        (2, "Thingy B", "available at http://thingb.com", None, 0),
        (3, None, None, "low", 5),
        (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
        (5, "Thingy E", None, "high", 12),
    ],
    columns=["id", "productName", "description", "priority", "numViews"],
)

Checks we want to perform:

  • there are 5 rows in total

  • values of the id attribute are never null/None and are unique

  • values of the productName attribute are never null/None

  • the priority attribute can only contain “high” or “low” as value

  • numViews should not contain negative values

  • at least half of the values in description should contain a URL

  • the median of numViews should be less than or equal to 10

In code this looks as follows:

from hooqu.checks import Check, CheckLevel, CheckStatus
from hooqu.verification_suite import VerificationSuite
from hooqu.constraints import ConstraintStatus


verification_result = (
    VerificationSuite()
    .on_data(df)
    .add_check(
        Check(CheckLevel.ERROR, "Basic Check")
        .has_size(lambda sz: sz == 5)  # we expect 5 rows
        .is_complete("id")  # should never be None/Null
        .is_unique("id")  # should not contain duplicates
        .is_complete("productName")  # should never be None/Null
        .is_contained_in("priority", ("high", "low"))
        .is_non_negative("numViews")
        # .contains_url("description", lambda d: d >= 0.5) (NOT YET IMPLEMENTED)
        .has_quantile("numViews", 0.5, lambda v: v <= 10)
    )
    .run()
)

After calling run, hooqu will compute some metrics on the data. Afterwards it invokes your assertion functions (e.g., lambda sz: sz == 5 for the size check) on these metrics to see if the constraints hold on the data.
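
The two-step flow described above (compute metrics first, then apply the assertion functions to them) can be mimicked with plain pandas. The sketch below is not hooqu's implementation; the `metrics`, `assertions` and `outcome` dictionaries and their keys are hypothetical names chosen only for illustration, and completeness is assumed to mean the fraction of non-null values.

```python
import pandas as pd

# Same sample data as in the Quick Start.
df = pd.DataFrame(
    [
        (1, "Thingy A", "awesome thing.", "high", 0),
        (2, "Thingy B", "available at http://thingb.com", None, 0),
        (3, None, None, "low", 5),
        (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
        (5, "Thingy E", None, "high", 12),
    ],
    columns=["id", "productName", "description", "priority", "numViews"],
)

# Step 1: compute the metrics (hypothetical names, plain pandas calls).
metrics = {
    "size": len(df),
    # Assumption: completeness is the fraction of non-null values.
    "completeness(productName)": df["productName"].notna().mean(),
    "median(numViews)": df["numViews"].quantile(0.5),
}

# Step 2: one assertion function per metric, as in the Check above.
assertions = {
    "size": lambda sz: sz == 5,
    "completeness(productName)": lambda c: c == 1.0,
    "median(numViews)": lambda v: v <= 10,
}

# Apply each assertion to its metric value.
outcome = {name: bool(assertions[name](value)) for name, value in metrics.items()}

for name, ok in outcome.items():
    print(f"{name}: value={metrics[name]} -> {'ok' if ok else 'FAILED'}")
```

Keeping the metric computation separate from the assertion is what makes the computed value available when a constraint does not hold.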

We can inspect the VerificationResult to see if the test found errors:

if verification_result.status == CheckStatus.SUCCESS:
    print("Alles klar: The data passed the test, everything is fine!")
else:
    print("We found errors in the data")

for check_result in verification_result.check_results.values():
    for cr in check_result.constraint_results:
        if cr.status != ConstraintStatus.SUCCESS:
            print(f"{cr.constraint}: {cr.message}")

If we run the example, we get the following output:

We found errors in the data
CompletenessConstraint(Completeness(productName)): Value 0.8 does not meet the constraint requirement.

The test found that our assumptions are violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null. Fortunately, we ran a test and found the error, so somebody should fix the data immediately :)
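
To see where the reported value 0.8 comes from and which rows need fixing, the same figure can be reproduced with plain pandas. This is only a sketch that assumes completeness is the fraction of non-null values; it does not use hooqu, and `completeness` and `offending_ids` are hypothetical variable names.

```python
import pandas as pd

# Only the two columns needed here, with the same values as the Quick Start data.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "productName": ["Thingy A", "Thingy B", None, "Thingy D", "Thingy E"],
    }
)

# Fraction of non-null values in productName (4 out of 5).
completeness = df["productName"].notna().mean()

# ids of the rows whose productName is missing.
offending_ids = df.loc[df["productName"].isna(), "id"].tolist()

print(f"completeness = {completeness:.1f}")
print(f"rows to fix: id = {offending_ids}")
```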

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. Please use GitHub issues for bug reports, feature requests, install issues, RFCs, thoughts, etc.

See the full contributing guide for more information.

Why Hooqu?

  • Easy to use declarative API to add data verification steps to your data processing pipeline.

  • The VerificationResult tells you not only which checks failed but also the values of the computed metrics, allowing for flexible handling of issues with the data.

  • Incremental metric computation allows you to compare how quality metrics change over time (planned).

  • Support for storing and loading computed metrics (planned).

References

This project is a “spiritual” port of Apache Deequ and thus tries to implement the declarative API described in the paper “Automating large-scale data quality verification” while remaining as pythonic as possible. This project does not use (py)Spark but rather Pandas (and will hopefully support other compatible dataframe implementations in the future).

Name

Jukumari (pronounced hooqumari) is the Aymara name for the spectacled bear (Tremarctos ornatus), also known as the Andean bear, Andean short-faced bear, or mountain bear.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.1.0] - 2020-08-26

Initial release. The following checks are available:

  • has_completeness

  • has_max

  • has_mean

  • has_min

  • has_quantile

  • has_size

  • has_standard_deviation

  • has_sum

  • has_uniqueness

  • is_complete

  • is_contained_in

  • is_contained_in_range

  • is_non_negative

  • is_positive

  • is_unique
