Tiny Timmy

A dead simple and easy to use Data Quality (DQ) tool built for dataframes and files with Python. Tiny Timmy uses the Python bindings for Polars, a Rust-based DataFrame tool.
Support includes ...

- polars
- pandas
- pyspark
- csv files
- parquet files

Both dataframe and file support. Simply "point and shoot."
Installation
Install Tiny Timmy with pip
pip install tinytimmy
Usage
Create an instance of Tiny Timmy.

- specify source_type: polars, pandas, pyspark, csv, or parquet
- specify either file_path or dataframe

tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
Then call either the default checks or a custom check.
results = tm.default_checks()
results = tm.run_custom_check(["{SQL filter}"])
You can pass Tiny Timmy a dataframe while specifying what type it is (pandas, polars, or pyspark) and ask for default_checks, or you can simply pass a file uri to a csv or parquet file.

You can also pass custom DQ checks in the form of a list of SQL statements that would be found in a normal WHERE clause.

The results of all Tiny Timmy checks are returned as a Polars dataframe.
Current functionality ...

default_checks()
- check all columns for null values
- check if dataset is distinct or contains duplicates

run_custom_check(["{some SQL WHERE clause}"])
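As a rough illustration (a hedged sketch, not Tiny Timmy's actual implementation, which runs on Polars), the default checks boil down to counting null values per column and spotting duplicate rows:

```python
# Sketch only: toy rows standing in for a dataframe.
rows = [
    {"station": "Clark & Lake", "rides": 12},
    {"station": None, "rides": 7},
    {"station": "Clark & Lake", "rides": 12},  # duplicate of the first row
]

# Count null values in each column.
null_counts = {
    col: sum(1 for row in rows if row[col] is None)
    for col in rows[0]
}

# Count duplicate rows (occurrences beyond the first).
seen = set()
duplicates = 0
for row in rows:
    key = tuple(sorted(row.items()))
    if key in seen:
        duplicates += 1
    else:
        seen.add(key)

print(null_counts)  # {'station': 1, 'rides': 0}
print(duplicates)   # 1
```

Tiny Timmy performs the same kind of bookkeeping per column and per row, then reports the counts as shown in the examples below.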
Example Usage
CSV
support.
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
results = tm.default_checks()
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has 45 duplicates
Pandas
support.
df = pd.read_csv("202306-divvy-tripdata.csv")
tm = TinyTim(source_type="pandas", dataframe=df)
results = tm.default_checks()
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has no duplicates
Custom Data Quality checks are supported as a list of SQL-based statements, given as they would appear in a WHERE clause. You can pass one or more checks in the list.
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
tm.default_checks()
results = tm.run_custom_check(["start_station_name IS NULL", "end_station_name IS NULL"])
>> Your custom check found 978 records that match your filter statement
┌───────────────────────────────────┬───────────────────────────────────┐
│ start_station_name IS NULL custo… ┆ end_station_name IS NULL custom_… │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════════════════════════════════╪═══════════════════════════════════╡
│ 978 ┆ 978 │
└───────────────────────────────────┴───────────────────────────────────┘
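Conceptually (again a hedged sketch in plain Python, not the library's code), each custom check just counts the rows that match its filter, the way a SQL WHERE clause such as "start_station_name IS NULL" would:

```python
# Toy rows standing in for a trip dataset.
rows = [
    {"start_station_name": "Clark & Lake"},
    {"start_station_name": None},
    {"start_station_name": None},
]

def count_matching(rows, predicate):
    """Return how many rows satisfy the filter predicate."""
    return sum(1 for row in rows if predicate(row))

# Equivalent of the filter "start_station_name IS NULL".
matches = count_matching(rows, lambda r: r["start_station_name"] is None)
print(matches)  # 2
```

Each filter string you pass becomes one such count, which is why the results dataframe above has one column per check.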