A Python package for data quality unit testing.
Project description
{un}swamp
Description
{un}swamp is a Python library for creating Data Quality Checks that can be run against a pandas DataFrame.
Quick Examples
The following quick example shows the basic workflow:
- define a Check Suite (which holds all the checks to run against a dataset)
- define a Check and add it to the Check Suite
- run the Check Suite against a dataset
- evaluate the Check Result
Dataset
For this Quick Example we use an open dataset from the City of New York that contains the NYS Math Test Results by Grade - Citywide by Race-Ethnicity for the years 2006 - 2011. Further details about the dataset can be found here: https://data.cityofnewyork.us/api/views/825b-niea/.
The code example in the following section performs these steps:
- load the data into a pandas DataFrame
- create a Check Suite
- add two different Checks to that suite (one is expected to pass, one is expected to fail)
- run the Check Suite against the loaded dataset
- evaluate the Check Result, which should show a pass rate of 50%
Code
import pandas as pd
from unswamp.objects.checks import CheckColumnsExists, CheckColumnValuesInSet
from unswamp.objects.core import CheckRun, CheckSuite
# We load the dataset into a pandas dataframe
data_file = "https://data.cityofnewyork.us/api/views/825b-niea/rows.csv?accessType=DOWNLOAD"
dataset = pd.read_csv(data_file)
# We create a CheckSuite to add our checks to
meta_data = {"owner": "me"}
suite = CheckSuite("NY-Math-Grades-CheckSuite", "NY-Math-Grades", meta_data)
# We create a check that verifies that the listed columns exist in the dataset
# All three columns are present, so the check will pass
columns = ["Grade", "Year", "Category"]
check = CheckColumnsExists("CHK-001-ColsExists", columns, active=True, meta_data=meta_data)
suite.add_check(check)
# We create a check that verifies that all distinct values in column Year are in the provided values
# The dataset contains the year 2011, which is not in the provided values, so the check will fail
column = "Year"
values = [2006, 2007, 2008, 2009, 2010]
check = CheckColumnValuesInSet("CHK-002-ColsValuesInSet", column, values, active=True, meta_data=meta_data)
suite.add_check(check)
# We run the suite against the dataset and print the pass rate
# The pass rate is expected to be 50% since one check passes and one check fails
check_run = suite.run(dataset, "manual-test-run")
print(f"passed - {check_run.pass_rate*100}%")
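To make it clearer what the two checks verify, the same conditions can be sketched in plain pandas, without {un}swamp. This is only an illustration of the logic, not how the library implements its checks. The small in-memory DataFrame is a hypothetical stand-in for the downloaded dataset, and the variable names are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded dataset (not the real data).
dataset = pd.DataFrame(
    {
        "Grade": ["3", "4", "5"],
        "Year": [2006, 2010, 2011],
        "Category": ["Asian", "Black", "Hispanic"],
    }
)

# Condition 1: all required columns are present in the DataFrame.
required_columns = ["Grade", "Year", "Category"]
columns_exist = set(required_columns).issubset(dataset.columns)

# Condition 2: every distinct value in "Year" is in the allowed set.
allowed_years = {2006, 2007, 2008, 2009, 2010}
unexpected_years = set(dataset["Year"].unique().tolist()) - allowed_years
values_in_set = not unexpected_years

# Pass rate: number of passing conditions divided by the total number.
results = [columns_exist, values_in_set]
pass_rate = sum(results) / len(results)
print(f"passed - {pass_rate * 100}%")
```

With this stand-in data the first condition holds and the second does not, because 2011 is not in the allowed set. That gives a pass rate of 0.5.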
File details
Details for the file unswamp-1.0.7.2.2.tar.gz.
File metadata
- Download URL: unswamp-1.0.7.2.2.tar.gz
- Upload date:
- Size: 24.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.5.0.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.8.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | e484a5615d2f4634759385ca86060ae1fa8b5aaedd6a10d0dace65f63369f78e
MD5 | 20b6e78ee57bdb0952b5c05189a7eae9
BLAKE2b-256 | 53ff8d175d32d963a0a12dcb0f09bd696fd0f61b515bfe4f1c5684ea85a91afe
File details
Details for the file unswamp-1.0.7.2.2-py3-none-any.whl.
File metadata
- Download URL: unswamp-1.0.7.2.2-py3-none-any.whl
- Upload date:
- Size: 51.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.5.0.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.8.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 214d3aea0be255e0cc132b61b0091558f13950535fe996c6cd3085e104edb188
MD5 | e77f48e6e14bbf31575998a526779014
BLAKE2b-256 | b434cef6334bbc26c23b103c315e92288b5e6d109454bcd916311893625aad41
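The published digests can be used to confirm that a downloaded file is intact. The following is a minimal sketch using only Python's standard `hashlib` module. The helper names `sha256_of_file` and `matches_published_digest` are hypothetical, as is the assumption that the wheel was saved to the current working directory.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Assumed local download location; the comparison only runs if the file exists.
wheel = Path("unswamp-1.0.7.2.2-py3-none-any.whl")
published = "214d3aea0be255e0cc132b61b0091558f13950535fe996c6cd3085e104edb188"
if wheel.exists():
    print(matches_published_digest(wheel, published))
```

A result of `False` would suggest that the file was corrupted or is not the one the digest was published for.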