
A Python package for data quality unit testing.

Project description

{un}swamp

Description

{un}swamp is a Python library for creating Data Quality Checks that can be run against a pandas DataFrame.

Quick Examples

The following quick example shows the basic concepts of:

  • defining a Check Suite (which holds all the checks to run against a dataset)
  • defining a Check and adding it to a Check Suite
  • running the Check Suite against a dataset
  • evaluating the Check Result

Dataset

This Quick Example uses an open dataset from the City of New York containing the NYS Math Test Results by Grade - Citywide by Race-Ethnicity for the years 2006 - 2011. Further details about the dataset can be found here: https://data.cityofnewyork.us/api/views/825b-niea/. The code example in the next section performs the following steps:

  • collect the data as a pandas DataFrame
  • create a Check Suite
  • add two different Checks to that suite (one should pass, one should fail)
  • run the Check Suite against the collected dataset
  • evaluate the Check Result, which should show a pass rate of 50%

Code

import pandas as pd
from unswamp.objects.checks import CheckColumnsExists, CheckColumnValuesInSet
from unswamp.objects.core import CheckRun, CheckSuite

# We load the dataset into a pandas dataframe
data_file = "https://data.cityofnewyork.us/api/views/825b-niea/rows.csv?accessType=DOWNLOAD"
dataset = pd.read_csv(data_file)

# We generate a CheckSuite to add our checks to
meta_data = {"owner": "me"}
suite = CheckSuite("NY-Math-Grades-CheckSuite", "NY-Math-Grades", meta_data)

# We create a check that verifies the listed columns exist in the dataset
# All three columns are present, so the check will pass
columns = ["Grade", "Year", "Category"]
check = CheckColumnsExists("CHK-001-ColsExists", columns, active=True, meta_data=meta_data)
suite.add_check(check)

# We create a check that verifies all distinct values in column Year are in the provided values
# The dataset also contains the year 2011, which is not in the list, so the check will fail
column = "Year"
values = [2006, 2007, 2008, 2009, 2010]
check = CheckColumnValuesInSet("CHK-002-ColsValuesInSet", column, values, active=True, meta_data=meta_data)
suite.add_check(check)

# We run the suite against the dataset and print the pass rate
# The pass rate is expected to be 50% since one check passes and one fails
check_run = suite.run(dataset, "manual-test-run")
print(f"passed - {check_run.pass_rate*100}%")
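
To make the two checks more concrete, here is a minimal plain-pandas sketch of the same idea. It is not {un}swamp's implementation. The helper names columns_exist and column_values_in_set are hypothetical, and the small in-memory DataFrame is a made-up stand-in for the downloaded dataset so that the sketch runs without network access.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded dataset (values are made up).
dataset = pd.DataFrame({
    "Grade": ["3", "4", "5", "6"],
    "Year": [2006, 2007, 2010, 2011],
    "Category": ["A", "B", "C", "D"],
})


def columns_exist(df, columns):
    """Return True if every name in `columns` is a column of `df`."""
    return set(columns).issubset(df.columns)


def column_values_in_set(df, column, values):
    """Return True if every non-null value of `column` is in `values`."""
    return bool(df[column].dropna().isin(values).all())


# One boolean per check, mirroring the two checks added to the suite above.
results = [
    columns_exist(dataset, ["Grade", "Year", "Category"]),
    column_values_in_set(dataset, "Year", [2006, 2007, 2008, 2009, 2010]),
]

# The pass rate is the fraction of checks that passed.
pass_rate = sum(results) / len(results)
print(f"passed - {pass_rate * 100}%")
```

The first helper passes because all three columns are present. The second fails because 2011 appears in the Year column but not in the allowed values, so the pass rate is 0.5.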

Credits

security: bandit
