A simple and easy to use Data Quality (DQ) tool built with Python.

Tiny Timmy

A dead-simple, easy-to-use Data Quality (DQ) tool for dataframes and files, built with Python.

Tiny Timmy uses the Python bindings for Polars, a Rust-based DataFrame library.

Support includes ...

  • polars
  • pandas
  • pyspark
  • csv files
  • parquet files

Both dataframes and files are supported. Simply "point and shoot."

Installation

Install Tiny Timmy with pip

pip install tinytimmy

Usage

Create an instance of Tiny Timmy.

  • specify source_type
    • polars
    • pandas
    • pyspark
    • csv
    • parquet
  • specify either file_path or dataframe

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")

Then call either the default checks or a custom check.

results = tm.default_checks()
results = tm.run_custom_check(["{SQL filter}", "{SQL filter}"])

You can pass Tiny Timmy a dataframe, specifying its type (pandas, polars, or pyspark), and ask for default_checks. Alternatively, you can pass a file URI pointing to a csv or parquet file.

You can also pass custom DQ checks as a list of SQL expressions, each written as it would appear in a normal WHERE clause.

Tiny Timmy returns check results as a Polars dataframe by default. You can request the results as a pandas or pyspark dataframe instead.

results = tm.default_checks(return_as='pandas')

For example, the returned results look like this:

┌───────────────────────────────────┬─────────────┐
│ check_type                        ┆ check_value │
│ ---                               ┆ ---         │
│ str                               ┆ i64         │
╞═══════════════════════════════════╪═════════════╡
│ null_check_start_station_name     ┆ 978         │
│ null_check_start_station_id       ┆ 978         │
│ …                                 ┆ …           │
│ started_at_whitespace_count       ┆ 1000        │
│ ended_at_whitespace_count         ┆ 1000        │
│ start_station_name_whitespace_co… ┆ 22          │
│ end_station_name_whitespace_coun… ┆ 22          │
└───────────────────────────────────┴─────────────┘
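
Because the results are plain (check_type, check_value) pairs, they are easy to act on in a pipeline. The following is a minimal sketch, not part of Tiny Timmy: `failed_checks` and `max_allowed` are hypothetical names, and the rows are the untruncated entries from the table above, written out by hand as tuples. With a Polars result, `results.rows()` should produce tuples of this shape.

```python
def failed_checks(result_rows, max_allowed):
    """Return the checks whose value exceeds its allowed maximum.

    Hypothetical helper for illustration; not a Tiny Timmy function.
    """
    failures = {}
    for check_type, check_value in result_rows:
        # Checks without an explicit limit default to zero tolerance.
        limit = max_allowed.get(check_type, 0)
        if check_value > limit:
            failures[check_type] = check_value
    return failures


# Rows copied from the example output above.
result_rows = [
    ("null_check_start_station_name", 978),
    ("null_check_start_station_id", 978),
    ("started_at_whitespace_count", 1000),
    ("ended_at_whitespace_count", 1000),
]

# Timestamps contain a space between date and time, so whitespace
# in those columns is expected and is allowed here.
max_allowed = {
    "started_at_whitespace_count": 1000,
    "ended_at_whitespace_count": 1000,
}

failures = failed_checks(result_rows, max_allowed)
for check_type, check_value in failures.items():
    print(f"FAILED {check_type}: {check_value}")
```

Defaulting unknown checks to a limit of zero keeps the gate strict: a new check fails loudly until someone decides what is acceptable for it.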

Current functionality ...

  • default_checks()
    • check all columns for null values
    • check if dataset is distinct or contains duplicates
    • check if columns have whitespace
    • check for leading or trailing whitespace
  • run_custom_check(["{some SQL WHERE clause}"])
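
To make the default checks concrete, here is a rough standard-library sketch of what each one counts. It is not Tiny Timmy's implementation (which is built on Polars); `sketch_default_checks` is a hypothetical name, the sample data is invented, empty strings stand in for nulls, and the `duplicate_rows` and `..._leading_trailing_whitespace_count` keys are made up for illustration.

```python
import csv
import io
from collections import Counter


def sketch_default_checks(rows):
    """Return {check_name: count} for a list of dict rows (illustrative only)."""
    results = {}
    columns = list(rows[0].keys()) if rows else []

    for col in columns:
        values = [row[col] for row in rows]
        # Treat empty strings as nulls, since the csv module has no null type.
        results[f"null_check_{col}"] = sum(1 for v in values if v == "")
        # Count values that contain any whitespace character.
        results[f"{col}_whitespace_count"] = sum(
            1 for v in values if any(ch.isspace() for ch in v)
        )
        # Count values with leading or trailing whitespace.
        results[f"{col}_leading_trailing_whitespace_count"] = sum(
            1 for v in values if v != v.strip()
        )

    # A row counts as a duplicate each time it repeats an earlier row.
    row_counts = Counter(tuple(row[c] for c in columns) for row in rows)
    results["duplicate_rows"] = sum(n - 1 for n in row_counts.values())
    return results


# Invented sample data, loosely shaped like a trip-data file.
sample_csv = """ride_id,start_station_name,started_at
A1,Clark St,2023-06-01 08:00:00
A2,,2023-06-01 09:30:00
A2,,2023-06-01 09:30:00
A3, Wells St,2023-06-02 10:15:00
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
checks = sketch_default_checks(rows)
for name, value in checks.items():
    print(name, value)
```

Note that a timestamp column such as started_at registers whitespace on every row, because the date and time are separated by a space. This matches the high whitespace counts for started_at and ended_at in the example output.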

Example Usage

CSV support.

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
results = tm.default_checks()
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has 45 duplicates

Pandas support.

import pandas as pd
from tinytimmy.tinytim import TinyTim
df = pd.read_csv("202306-divvy-tripdata.csv")
tm = TinyTim(source_type="pandas", dataframe=df)
results = tm.default_checks()
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has no duplicates

Custom Data Quality checks are supported as a list of SQL expressions. Each one is written as it would appear in a WHERE clause, and you can pass one or more checks in the list.

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
tm.default_checks()
results = tm.run_custom_check(["start_station_name IS NULL", "end_station_name IS NULL"])
Column start_station_name has 978 null values
Column start_station_id has 978 null values
Column end_station_name has 978 null values
Column end_station_id has 978 null values
Your dataset has no duplicates
Column started_at has 1000 whitespace values
Column ended_at has 1000 whitespace values
Column start_station_name has 22 whitespace values
Column end_station_name has 22 whitespace values
No leading or trailing whitespace values found
shape: (10, 2)
┌───────────────────────────────────┬─────────────┐
│ check_type                        ┆ check_value │
│ ---                               ┆ ---         │
│ str                               ┆ i64         │
╞═══════════════════════════════════╪═════════════╡
│ null_check_start_station_name     ┆ 978         │
│ null_check_start_station_id       ┆ 978         │
│ null_check_end_station_name       ┆ 978         │
│ null_check_end_station_id         ┆ 978         │
│ …                                 ┆ …           │
└───────────────────────────────────┴─────────────┘
Your custom check start_station_name IS NULL found 978 records that match your filter statement
Your custom check end_station_name IS NULL found 978 records that match your filter statement
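
The idea behind a custom check, counting the records that match a WHERE-style filter, can be sketched with Python's built-in sqlite3 module. This is only an illustration of the concept under stated assumptions, not how Tiny Timmy evaluates filters: `count_matches` is a hypothetical helper, the table name `data` is arbitrary, and the rows are invented.

```python
import sqlite3


def count_matches(rows, columns, where_clauses):
    """Count rows matching each WHERE-style filter (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f"CREATE TABLE data ({col_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)

    counts = {}
    for clause in where_clauses:
        # The clause is interpolated as-is, so it must come from a trusted source.
        (n,) = conn.execute(f"SELECT COUNT(*) FROM data WHERE {clause}").fetchone()
        counts[clause] = n
    conn.close()
    return counts


# Invented sample rows; None is stored as SQL NULL.
columns = ["ride_id", "start_station_name", "end_station_name"]
rows = [
    ("A1", "Clark St", "Wells St"),
    ("A2", None, "Wells St"),
    ("A3", None, None),
]

counts = count_matches(
    rows, columns, ["start_station_name IS NULL", "end_station_name IS NULL"]
)
for clause, n in counts.items():
    print(f"Your custom check {clause} found {n} records that match your filter statement")
```

Since each filter is inserted directly into the query text, filters of this kind should only ever come from trusted configuration, never from user input.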

Tests / Local Setup / Contributions.

To develop and work on Tiny Timmy locally, a Docker image and a docker-compose file are provided.

First, build the image ... docker build --tag=tinytimmy .

To run the local unit tests, run ... docker-compose up test

To simply work inside the Docker container, run ... docker run -it tinytimmy /bin/bash

