A simple Data Quality Assessment Tool based on SQL
Project description
dataviper
dataviper is a SQL-based tool that gets basic data preparation done the easy way. It can:
- Create a "Data Profile" report of a table
- One-hot encode "Categorical Columns" and create a "one-hot" table
- Check "Joinability" between 2 tables
- // TODO: and more
Example
pip install dataviper
Your main.py will look like this:

from dataviper import Client
from dataviper.source import MySQL

client = Client(source=MySQL({
    'host': 'localhost',
    'user': 'root',
    'password': 'xxxxxx',
    'database': 'your_database'
}))

with client.connect():
    profile = client.profile('Your_Table')
    profile.to_excel()

python3 main.py
# Then you will get 'profile_Your_Table.xlsx' 🤗
Why?
"Data Profiling" requires scanning every row in a table. If you do this naively with pandas, or with any library that uses pandas internally, it will inevitably eat a large chunk of your local machine's memory and can freeze your work.
dataviper is a SQL-based data profiling tool: it dynamically generates SQL and lets the database server do the heavy calculation. With dataviper, you don't need a powerful local machine. All you need is a stable network connection and a reachable SQL database.
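To make that concrete, a single column can be profiled with one aggregate statement, so only a tiny result row travels back over the network. The function below is a minimal illustration of that idea, not dataviper's actual generated SQL:

def profile_column_sql(table, column):
    # Illustration only: table and column are assumed to be trusted
    # identifiers; a real tool would quote/escape them per SQL dialect.
    return (
        f"SELECT COUNT(*) AS total_count, "
        f"COUNT(*) - COUNT({column}) AS null_count, "
        f"COUNT(DISTINCT {column}) AS unique_count, "
        f"MIN({column}) AS min_value, "
        f"MAX({column}) AS max_value "
        f"FROM {table}"
    )

print(profile_column_sql('Sales', 'price'))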
Data Sources
You can choose your data source from the following (a short usage sketch follows the list):
- SQL Server
  - profile
  - pivot
  - joinability
  - histogram
- MySQL
  - profile
  - pivot
  - joinability
  - histogram
- PostgreSQL
- CSV
  - profile
  - pivot
  - joinability
  - histogram
- Excel
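Switching between these sources should only change the source object passed to Client. The sketch below reuses the MySQL source and only the methods shown elsewhere in this README (the histogram API is not covered here):

from dataviper import Client
from dataviper.source import MySQL

client = Client(source=MySQL({
    'host': 'localhost',
    'user': 'root',
    'password': 'xxxxxx',
    'database': 'your_database'
}))

# Profile and joinability reports from a single connection.
with client.connect():
    client.profile('Sales', example_count=3).to_excel()
    client.joinability(on={'Sales': 'rep_id', 'Reps': 'id'}).to_excel()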
APIs
profile
Create "Data Profile" excel file of a specified table.
When you have Sales
table like this
id | region | sales_type | price | rep_id |
---|---|---|---|---|
1 | jp | phone | 240 | 115723 |
2 | us | web | 90 | 125901 |
3 | jp | web | 560 | 8003 |
4 | us | shop | 920 | 182234 |
5 | jp | NULL | 90 | 92231 |
6 | us | shop | 180 | 100425 |
7 | us | shop | 70 | 52934 |
do
with client.connect() as conn:
    table_name = 'Sales'
    profile = client.profile(table_name, example_count=3)
    profile.to_excel()
then you will get a profile_Sales.xlsx file with:
column_name | data_type | null_count | null_% | unique_count | unique_% | min | max | avg | std | example_top_3 | example_last_3 |
---|---|---|---|---|---|---|---|---|---|---|---|
id | bigint | 0 | 0 | 7 | 100.00 | 1 | 7 | 4.0 | 2.0 | [1,2,3] | [5,6,7] |
region | varchar | 0 | 0 | 2 | 28.57 | | | | | [jp,us,jp] | [jp,us,us] |
sales_type | varchar | 1 | 14.28 | 3 | 42.85 | | | | | [phone,web,web] | [None,shop,shop] |
price | int | 0 | 0 | 6 | 85.71 | 70 | 920 | 307.1428 | 295.379 | [240,90,560] | [90,180,70] |
rep_id | int | 0 | 0 | 7 | 100.00 | 8003 | 182234 | 96778.7142 | 51195.79065 | [115723,125901,8003] | [92231,100425,52934] |
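For reference, the null_% and unique_% columns are just the counts divided by the total row count (7 rows in Sales); the report appears to truncate to two decimal places. A quick check against the sales_type row above:

total_rows = 7     # rows in Sales
null_count = 1     # one NULL sales_type (row 5)
unique_count = 3   # phone, web, shop

null_pct = 100.0 * null_count / total_rows      # 14.2857... shown as 14.28
unique_pct = 100.0 * unique_count / total_rows  # 42.8571... shown as 42.85
print(null_pct, unique_pct)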
pivot
Spread categorical columns into N binary columns.
When you have the Sales table shown above, do
with client.connect() as conn:
    table_name = 'Sales'
    key = 'id'
    categorical_columns = ['region', 'sales_type']
    profile = client.get_schema(table_name)
    client.pivot(profile, key, categorical_columns)
then you will get a Sales_PIVOT_YYYYmmddHHMM table with:
id | region_jp | region_us | sales_type_phone | sales_type_web | sales_type_shop |
---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 1 | 0 |
4 | 0 | 1 | 0 | 0 | 1 |
5 | 1 | 0 | 0 | 0 | 0 |
6 | 0 | 1 | 0 | 0 | 1 |
7 | 0 | 1 | 0 | 0 | 1 |
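A one-hot table like this can be produced entirely on the database side with one CASE WHEN expression per category value. The function below only illustrates that idea and is not dataviper's actual generated SQL:

def pivot_column_sql(table, key, column, values):
    # Illustration only: identifiers and category values are assumed to be
    # trusted strings here; a real tool would escape them per SQL dialect.
    cases = ", ".join(
        f"CASE WHEN {column} = '{v}' THEN 1 ELSE 0 END AS {column}_{v}"
        for v in values
    )
    return f"SELECT {key}, {cases} FROM {table}"

# A NULL value (like the missing sales_type in row 5) matches no category,
# which is why that row is all zeros in the sales_type_* columns above.
print(pivot_column_sql('Sales', 'id', 'region', ['jp', 'us']))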
joinability
Count how well 2 tables can be joined.
When you have the Sales table above and a Reps table like this:
id | name | tenure |
---|---|---|
8003 | Hiromu | 9 |
8972 | Ochiai | 6 |
52934 | Taro | 1 |
92231 | otiai10 | 2 |
100425 | Hanako | 7 |
125901 | Chika | 3 |
182234 | Mary | 5 |
199621 | Jack | 1 |
do
with client.connect() as conn:
    report = client.joinability(on={'Sales': 'rep_id', 'Reps': 'id'})
    report.to_excel()
then you will get a join_Sales_Reps.xlsx file with:
table | key | total | match | match_% | drop | drop_% |
---|---|---|---|---|---|---|
Sales | [rep_id] | 7 | 6 | 85.714 | 1 | 14.285 |
Reps | [id] | 8 | 6 | 75.00 | 2 | 25.00 |
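The match and drop counts can be reproduced by hand from the sample keys above; every key appears exactly once in this sample, so counting distinct keys is the same as counting rows. A quick check:

sales_rep_ids = {115723, 125901, 8003, 182234, 92231, 100425, 52934}
reps_ids = {8003, 8972, 52934, 92231, 100425, 125901, 182234, 199621}

matched = sales_rep_ids & reps_ids     # 6 keys exist on both sides
sales_only = sales_rep_ids - reps_ids  # {115723}: 1 of 7 Sales rows dropped
reps_only = reps_ids - sales_rep_ids   # {8972, 199621}: 2 of 8 Reps rows dropped
print(len(matched), len(sales_only), len(reps_only))  # 6 1 2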
Issues and TODOs