A Python package for data quality unit testing.
{un}swamp
Description
{un}swamp is a Python library for creating data quality checks that can be run against a pandas DataFrame.
Quick Examples
The following quick example shows the basic concepts:
- defining a Check Suite (which holds all the checks to run against a dataset)
- defining a Check and adding it to a Check Suite
- running the Check Suite against a dataset
- evaluating the Check Result
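To make these concepts concrete, here is a minimal sketch in plain pandas of what a check suite does conceptually. It does not use {un}swamp's API and is not how the library is implemented; the helper names (`columns_exist`, `values_in_set`, `run_checks`) are hypothetical and chosen only for illustration.

```python
import pandas as pd


def columns_exist(df, columns):
    """Hypothetical check: pass if every expected column is present."""
    return set(columns).issubset(df.columns)


def values_in_set(df, column, allowed):
    """Hypothetical check: pass if every non-null value in `column` is allowed."""
    return bool(df[column].dropna().isin(allowed).all())


def run_checks(df, checks):
    """Run each (name, function) pair; return per-check results and the pass rate."""
    results = {name: check(df) for name, check in checks}
    pass_rate = sum(results.values()) / len(results) if results else 0.0
    return results, pass_rate


# Small in-memory dataset standing in for a real one
df = pd.DataFrame({"Grade": [3, 4, 5], "Year": [2006, 2010, 2011]})

# A "suite" here is just a list of named checks (an assumption of this sketch)
checks = [
    ("cols-exist", lambda d: columns_exist(d, ["Grade", "Year"])),
    ("year-in-set", lambda d: values_in_set(d, "Year", [2006, 2010])),
]

results, pass_rate = run_checks(df, checks)
print(results, pass_rate)
```

In this toy version one check passes and one fails, so the pass rate is the fraction of passing checks, which mirrors the idea behind the quick example.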
Dataset
For this quick example we use an open dataset from the City of New York that contains the "NYS Math Test Results by Grade - Citywide by Race-Ethnicity" for the years 2006 to 2011. Further details about the dataset can be found here: https://data.cityofnewyork.us/api/views/825b-niea/. The code example below performs the following steps:
- load the data as a pandas DataFrame
- create a Check Suite
- add two different Checks to that suite (one expected to pass, one expected to fail)
- run the Check Suite against the loaded dataset
- evaluate the Check Result, which should show a pass rate of 50%
Code
import pandas as pd
from unswamp.objects.checks import CheckColumnsExists, CheckColumnValuesInSet
from unswamp.objects.core import CheckSuite
# We load the dataset into a pandas dataframe
data_file = "https://data.cityofnewyork.us/api/views/825b-niea/rows.csv?accessType=DOWNLOAD"
dataset = pd.read_csv(data_file)
# We generate a CheckSuite to add our checks to
meta_data = {"owner": "me"}
suite = CheckSuite("NY-Math-Grades-CheckSuite", "NY-Math-Grades", meta_data)
# We create a check that verifies the listed columns exist in the dataset
# All three columns are present, so this check will pass
columns = ["Grade", "Year", "Category"]
check = CheckColumnsExists("CHK-001-ColsExists", columns, active=True, meta_data=meta_data)
suite.add_check(check)
# We create a check that verifies all distinct values in column Year are in the provided values
# The year 2011 occurs in the data but is missing from the provided values, so this check will fail
column = "Year"
values = [2006, 2007, 2008, 2009, 2010]
check = CheckColumnValuesInSet("CHK-002-ColsValuesInSet", column, values, active=True, meta_data=meta_data)
suite.add_check(check)
# We run the suite against the dataset and print the pass rate
# The pass rate is expected to be 50%, since one check passes and one fails
check_run = suite.run(dataset, "manual-test-run")
print(f"passed - {check_run.pass_rate*100}%")
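When a values-in-set check fails, it can help to see which values caused the failure. The following sketch uses only pandas, not {un}swamp, and a small in-memory frame stands in for the downloaded dataset; the helper name `unexpected_values` is hypothetical.

```python
import pandas as pd


def unexpected_values(df, column, allowed):
    """Return the sorted distinct non-null values of `column` that are not in `allowed`."""
    values = df[column].dropna()
    return sorted(values[~values.isin(allowed)].unique().tolist())


# Illustrative stand-in for the Year column of the real dataset
sample = pd.DataFrame({"Year": [2006, 2007, 2011, 2011]})
allowed_years = [2006, 2007, 2008, 2009, 2010]

print(unexpected_values(sample, "Year", allowed_years))  # prints [2011]
```

Applied to the real dataset, the same helper could be called with `dataset`, `column`, and `values` from the example above.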