A comprehensive library for data quality checks
Overview
Whistler is an open source data quality and profiling tool. It lets you profile raw data of any size, from megabytes to gigabytes or even terabytes, by running all profiling on the Apache Spark execution engine.
🐣 Getting Started
1. Install Whistler
pip install dq-whistler
2. Create a Spark dataframe for your data
# Sample Data
Age,Description
1,"abc"
2,"abc1"
3,
4,"abc4"
10,"xyz"
12,"null"
17,"abc"
20,"abc3"
23,
# You can read data from any source that Apache Spark supports
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").csv("<your path>")
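If you want to try the steps with the sample data above, it first has to exist as a file. The sketch below writes it to a temporary CSV using only the standard library. The helper names `write_sample_csv` and `preview` and the file name `sample.csv` are hypothetical and not part of Whistler.

```python
import csv
import tempfile
from pathlib import Path

# Sample data copied verbatim from this README.
SAMPLE_CSV = '''Age,Description
1,"abc"
2,"abc1"
3,
4,"abc4"
10,"xyz"
12,"null"
17,"abc"
20,"abc3"
23,
'''


def write_sample_csv(directory):
    """Write the sample data to <directory>/sample.csv and return its path."""
    path = Path(directory) / "sample.csv"
    path.write_text(SAMPLE_CSV)
    return str(path)


def preview(path):
    """Parse the file with the csv module so the rows can be inspected."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


with tempfile.TemporaryDirectory() as tmp:
    sample_path = write_sample_csv(tmp)
    rows = preview(sample_path)
    print(len(rows) - 1, "data rows written to", sample_path)
```

The returned path can be used in place of "<your path>" in the spark.read call, as long as the file still exists when Spark reads it.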
3. Create a config as a Python list of dicts (one dict per column), or read it from a JSON file
config = [
    {
        "name": "Age",
        "datatype": "number",
        "constraints": [
            {
                "name": "gt_eq",
                "values": 5
            },
            {
                "name": "is_in",
                "values": [1, 23]
            }
        ]
    },
    {
        "name": "Description",
        "datatype": "string",
        "constraints": [
            {
                "name": "regex",
                "values": "([A-Za-z]+)"
            },
            {
                "name": "contains",
                "values": "abc"
            }
        ]
    }
]
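Because the config is plain data, it can be kept in a JSON file and loaded at run time. The following is a minimal sketch using the standard `json` module. The helper names `save_config` and `load_config` and the file name `whistler_config.json` are hypothetical; only the config structure comes from this README.

```python
import json
import tempfile
from pathlib import Path

# Same structure as the config shown above.
config = [
    {
        "name": "Age",
        "datatype": "number",
        "constraints": [
            {"name": "gt_eq", "values": 5},
            {"name": "is_in", "values": [1, 23]},
        ],
    },
    {
        "name": "Description",
        "datatype": "string",
        "constraints": [
            {"name": "regex", "values": "([A-Za-z]+)"},
            {"name": "contains", "values": "abc"},
        ],
    },
]


def save_config(cfg, path):
    """Serialize the config to a JSON file."""
    Path(path).write_text(json.dumps(cfg, indent=2))


def load_config(path):
    """Read a config back from a JSON file as a list of dicts."""
    return json.loads(Path(path).read_text())


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "whistler_config.json"
    save_config(config, config_path)
    loaded = load_config(config_path)
    print([column["name"] for column in loaded])
```

The loaded list has the same shape as the inline config, so it can be passed wherever the inline version is used.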
4. Build an instance of Data Quality Analyzer and execute the checks
from dq_whistler import DataQualityAnalyzer
output = DataQualityAnalyzer(df, config).analyze()
print(output)
[
    {
        "col_name": "Age",
        "total_count": 9,
        "null_count": 0,
        "unique_count": 9,
        "topn_values": {
            "1": 1,
            "2": 1,
            "3": 1,
            "4": 1,
            "10": 1,
            "12": 1,
            "17": 1,
            "20": 1,
            "23": 1
        },
        "min": 1,
        "max": 23,
        "mean": 10.222222222222221,
        "stddev": 8.303279138054101,
        "quality_score": 0,
        "constraints": [
            {
                "name": "gt_eq",
                "values": 5,
                "constraint_status": "failed",
                "invalid_count": 4,
                "invalid_values": [
                    "1",
                    "2",
                    "3",
                    "4"
                ]
            },
            {
                "name": "is_in",
                "values": [
                    1,
                    23
                ],
                "constraint_status": "failed",
                "invalid_count": 7,
                "invalid_values": [
                    "2",
                    "3",
                    "4",
                    "10",
                    "12",
                    "17",
                    "20"
                ]
            }
        ]
    },
    {
        "col_name": "Description",
        "total_count": 9,
        "null_count": 2,
        "unique_count": 7,
        "topn_values": {
            "abc": 2,
            "abc1": 1,
            "xyz": 1,
            "abc4": 1,
            "abc3": 1
        },
        "quality_score": 0,
        "constraints": [
            {
                "name": "regex",
                "values": "([A-Za-z]+)",
                "constraint_status": "success",
                "invalid_count": 0,
                "invalid_values": []
            },
            {
                "name": "contains",
                "values": "abc",
                "constraint_status": "failed",
                "invalid_count": 2,
                "invalid_values": [
                    "xyz",
                    "null"
                ]
            }
        ]
    }
]
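To help read the report, the sketch below re-derives the `invalid_values` lists for the sample data in plain Python. It is not Whistler's implementation, which runs on Spark, and all function names are hypothetical. Two assumptions are inferred from the sample output only: null values are skipped by constraints (the two empty Description cells do not appear in `invalid_values`), and `regex` matches anywhere in the string (otherwise "abc1" would be invalid).

```python
import re

# Sample columns from this README; None marks an empty cell.
ages = [1, 2, 3, 4, 10, 12, 17, 20, 23]
descriptions = ["abc", "abc1", None, "abc4", "xyz", "null", "abc", "abc3", None]


def invalid_gt_eq(values, threshold):
    """Non-null values that are not >= threshold."""
    return [v for v in values if v is not None and not v >= threshold]


def invalid_is_in(values, allowed):
    """Non-null values that are not in the allowed list."""
    allowed_set = set(allowed)
    return [v for v in values if v is not None and v not in allowed_set]


def invalid_regex(values, pattern):
    """Non-null values in which the pattern matches nowhere (assumed semantics)."""
    compiled = re.compile(pattern)
    return [v for v in values if v is not None and compiled.search(v) is None]


def invalid_contains(values, substring):
    """Non-null values that do not contain the substring."""
    return [v for v in values if v is not None and substring not in v]


def summarize(name, expected, invalid):
    """Shape a result like one entry of the report's "constraints" list."""
    return {
        "name": name,
        "values": expected,
        "constraint_status": "failed" if invalid else "success",
        "invalid_count": len(invalid),
        "invalid_values": [str(v) for v in invalid],
    }


print(summarize("gt_eq", 5, invalid_gt_eq(ages, 5)))
print(summarize("contains", "abc", invalid_contains(descriptions, "abc")))
```

Under these assumptions the helpers give the same invalid values as the report: four ages below 5, seven ages outside [1, 23], no regex failures, and "xyz" and "null" failing the contains check.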
📦 Roadmap
The list below contains the functionality that contributors plan to develop for this module:
- Visualization
  - Visualization of profiling output
🎓 Important Resources
👋 Contributing
✨ Contributors