
A simple Data Quality Assessment Tool based on SQL

Project description


dataviper

Badges: PyPI version, GitHub Action, codecov, Python 3.6

dataviper is a SQL-based tool that gets basic data preparation done the easy way. It can:

  • Create "Data Profile" report of a table
  • One-hot encode for "Categorical Columns" and create a "one-hot" table
  • Check "Joinability" between 2 tables
  • // TODO: and more

Example

pip install dataviper

Your main.py will look like this

from dataviper import Client
from dataviper.source import MySQL

client = Client(source=MySQL({
    'host': 'localhost',
    'user': 'root',
    'password': 'xxxxxx',
    'database': 'your_database'
}))

with client.connect():
    profile = client.profile('Your_Table')
    profile.to_excel()
python3 main.py
# Then you will get 'profile_Your_Table.xlsx' 🤗

(Screenshot: profile_Your_Table.xlsx)

Why?

Data profiling requires scanning every row in a table. If you do this naively with pandas, or with any library that uses pandas internally, you inevitably consume a large amount of memory on your local machine and freeze your work.

dataviper is a SQL-based data profiling tool: it dynamically generates SQL queries and lets the database server do the heavy calculation.

With dataviper, you don't need a powerful local machine. All you need is a stable network connection and a reachable SQL database.
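
To make that concrete, here is a rough sketch of the idea. It is not dataviper's actual generated SQL, only an illustration: the profile of a single column reduces to a handful of aggregates that run entirely on the database server, so only a one-row summary travels over the network.

# A minimal sketch of the idea, not dataviper's actual generated SQL:
# profile one column with server-side aggregates so that only a
# one-row summary is sent back over the network.
profile_sql = """
SELECT
    COUNT(*)                AS total_count,
    COUNT(*) - COUNT(price) AS null_count,
    COUNT(DISTINCT price)   AS unique_count,
    MIN(price)              AS min_price,
    MAX(price)              AS max_price,
    AVG(price)              AS avg_price
FROM Sales
"""

# The naive route would be something like:
#   df = pandas.read_sql("SELECT * FROM Sales", conn)
# which copies every row into local memory before any statistic is computed.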

Data Sources

You can choose your data source from

  • SQL Server
    • profile
    • pivot
    • joinability
    • histogram
  • MySQL
    • profile
    • pivot
    • joinability
    • histogram
  • PostgreSQL
  • CSV
    • profile
    • pivot
    • joinability
    • histogram
  • Excel

APIs

profile

Create "Data Profile" excel file of a specified table.

When you have a Sales table like this

id  region  sales_type  price  rep_id
1   jp      phone       240    115723
2   us      web         90     125901
3   jp      web         560    8003
4   us      shop        920    182234
5   jp      NULL        90     92231
6   us      shop        180    100425
7   us      shop        70     52934

do

with client.connect() as conn:
    table_name = 'Sales'
    profile = client.profile(table_name, example_count=3)
    profile.to_excel()

then you will get a profile_Sales.xlsx file with

column_name  data_type  null_count  null_%  unique_count  unique_%  min   max     avg         std          example_top_3         example_last_3
id           bigint     0           0       7             100.00    1     7       4.0         2.0          [1,2,3]               [5,6,7]
region       varchar    0           0       2             28.57                                            [jp,us,jp]            [jp,us,us]
sales_type   varchar    1           14.28   3             42.85                                            [phone,web,web]       [None,shop,shop]
price        int        0           0       6             85.71     70    920     307.1428    295.379      [240,90,560]          [90,180,70]
rep_id       int        0           0       7             100.00    8003  182234  96778.7142  51195.79065  [115723,125901,8003]  [92231,100425,52934]
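
The percentage columns follow directly from the 7 rows of Sales. As a quick sanity check in plain Python (arithmetic only, not a dataviper call):

total = 7
print(1 / total * 100)  # sales_type null_%   -> 14.285... (shown as 14.28)
print(3 / total * 100)  # sales_type unique_% -> 42.857... (shown as 42.85)
print(6 / total * 100)  # price unique_%      -> 85.714... (shown as 85.71; the price 90 appears twice)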

pivot

Spread categorical columns into N binary (one-hot) columns.

When you have the Sales table above, do

with client.connect() as conn:
    table_name = 'Sales'
    key = 'id'
    categorical_columns = ['region', 'sales_type']
    profile = client.get_schema(table_name)
    client.pivot(profile, key, categorical_columns)

then you will get a Sales_PIVOT_YYYYmmddHHMM table with

id  region_jp  region_us  sales_type_phone  sales_type_web  sales_type_shop
1   1          0          1                 0               0
2   0          1          0                 1               0
3   1          0          0                 1               0
4   0          1          0                 0               1
5   1          0          0                 0               0
6   0          1          0                 0               1
7   0          1          0                 0               1
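
For comparison, the same one-hot encoding done locally with pandas looks like the sketch below. dataviper builds the pivot table inside the database instead, which matters once the table no longer fits in local memory; the pandas version only illustrates the expected shape of the result.

import pandas as pd

sales = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'region': ['jp', 'us', 'jp', 'us', 'jp', 'us', 'us'],
    'sales_type': ['phone', 'web', 'web', 'shop', None, 'shop', 'shop'],
})

# One indicator column per category; the row with a NULL sales_type (id=5)
# gets all zeros in the sales_type_* columns, just like the table above.
one_hot = pd.get_dummies(sales, columns=['region', 'sales_type'], dtype=int)
print(one_hot)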

joinability

Count how many rows of two tables can be joined.

When you have the Sales table above and a Reps table like this

id      name     tenure
8003    Hiromu   9
8972    Ochiai   6
52934   Taro     1
92231   otiai10  2
100425  Hanako   7
125901  Chika    3
182234  Mary     5
199621  Jack     1

do

with client.connect() as conn:
    report = client.joinability(on={'Sales': 'rep_id', 'Reps': 'id'})
    report.to_excel()

then you will get a join_Sales_Reps.xlsx file with

table  key       total  match  match_%  drop  drop_%
Sales  [rep_id]  7      6      85.714   1     14.285
Reps   [id]      8      6      75.00    2     25.00
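
The counts can be reproduced from the two example tables with plain Python set arithmetic (again, not a dataviper call):

sales_rep_ids = {115723, 125901, 8003, 182234, 92231, 100425, 52934}   # Sales.rep_id (7 rows)
reps_ids = {8003, 8972, 52934, 92231, 100425, 125901, 182234, 199621}  # Reps.id (8 rows)

matched = sales_rep_ids & reps_ids
print(len(matched))                              # 6
print(len(matched) / len(sales_rep_ids) * 100)   # 85.714... -> Sales match_%
print(len(matched) / len(reps_ids) * 100)        # 75.0      -> Reps match_%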

Issues and TODOs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataviper-0.1.7.tar.gz (14.9 kB)


File details

Details for the file dataviper-0.1.7.tar.gz.

File metadata

  • Download URL: dataviper-0.1.7.tar.gz
  • Upload date:
  • Size: 14.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.0

File hashes

Hashes for dataviper-0.1.7.tar.gz
Algorithm    Hash digest
SHA256       e124f9b64e754f43b5aef7b4c5c04633dd7fe0a8be52ff96889e4b25f87cec3a
MD5          40afb0831bfecba043f5099001bb6c7d
BLAKE2b-256  008d9adb64b7157d2e742ccdd125f7c00dec54b00881cb1a01669fba2fc41b25

See the PyPI documentation for more details on using these hashes.
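
For example, a downloaded tarball can be checked against the published SHA256 with the standard library alone (the expected digest below is copied from the table above):

import hashlib

expected = 'e124f9b64e754f43b5aef7b4c5c04633dd7fe0a8be52ff96889e4b25f87cec3a'

with open('dataviper-0.1.7.tar.gz', 'rb') as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print(digest == expected)  # True if the file matches the published release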
