If your data is messy - Use Shmessy!

Shmessy

Shmessy is designed to deal with messy pandas DataFrames. As analysts and data engineers, we all know the frustration of having to clean up and analyze a messy DataFrame by hand.

The goal of this tiny tool is to identify the physical / logical data type of each DataFrame column. It is based on fast validators that check the data (using a sample) against regexes, pydantic types, or any additional validation function that you want to implement.
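To make the idea of sample-based validators concrete, here is a minimal sketch using only the standard library. It is not Shmessy's implementation: the `VALIDATORS` table, the regexes, and the `infer_type` helper are hypothetical names chosen for illustration, and the sample is simply the first `sample_size` values rather than a random one.

```python
import re
from typing import Callable, Dict, List

# Hypothetical validators: each returns True when a single value fits the type.
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "Integer": lambda v: re.fullmatch(r"[+-]?\d+", v) is not None,
    "Email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "IPv4": lambda v: re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", v) is not None,
}


def infer_type(values: List[str], sample_size: int = 1000) -> str:
    """Return the first type whose validator accepts every sampled value."""
    # Only the head of the column is inspected; empty strings are skipped.
    sample = [v for v in values[:sample_size] if v != ""]
    if not sample:
        return "String"
    for type_name, is_valid in VALIDATORS.items():
        if all(is_valid(v) for v in sample):
            return type_name
    # No validator accepted the whole sample, so keep the column as text.
    return "String"


print(infer_type(["1", "2", "3"]))
print(infer_type(["a@example.com", "not-an-email"]))
```

Because every sampled value must pass, a single stray value is enough to push a column back to the generic string type in this sketch.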

This tool is designed for dirty data. It is best suited to DataFrames generated from CSV / flat files or from any other source that doesn't carry a strict schema.

Installation

pip install shmessy

Usage

There are three ways to use this tool:

Identify the Dataframe schema

import pandas as pd
from shmessy import Shmessy

df = pd.read_csv('/tmp/file.csv')
inferred_schema = Shmessy().infer_schema(df)

Output (inferred_schema dump):

{
    "infer_duration_ms": 12,
    "columns": [
        {
            "field_name": "id",
            "source_type": "Integer",
            "inferred_type": "Integer"
        },
        {
            "field_name": "email_value",
            "source_type": "String",
            "inferred_type": "Email"
        },
        {
            "field_name": "date_value",
            "source_type": "String",
            "inferred_type": "Date",
            "inferred_pattern": "%d-%m-%Y"
        },
        {
            "field_name": "datetime_value",
            "source_type": "String",
            "inferred_type": "Datetime",
            "inferred_pattern": "%Y/%m/%d %H:%M:%S"
        },
        {
            "field_name": "yes_no_data",
            "source_type": "String",
            "inferred_type": "Boolean",
            "inferred_pattern": [
                "YES",
                "NO"
            ]
        },
        {
            "field_name": "unix_value",
            "source_type": "Integer",
            "inferred_type": "UnixTimestamp",
            "inferred_pattern": "ms"
        },
        {
            "field_name": "ip_value",
            "source_type": "String",
            "inferred_type": "IPv4"
        }
    ]
}
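The `inferred_pattern` values in the dump above look like `strptime` format strings for dates and a unit marker for Unix timestamps. The sketch below shows how such a dump could be consumed once it is available as a plain dictionary. The sample values and the `"s"` unit are assumptions made for illustration; only `"ms"` appears in the output above, and the dictionary is copied by hand rather than produced by a Shmessy call.

```python
from datetime import datetime, timezone

# Two columns copied by hand from the schema dump above.
schema = {
    "columns": [
        {"field_name": "date_value", "source_type": "String",
         "inferred_type": "Date", "inferred_pattern": "%d-%m-%Y"},
        {"field_name": "unix_value", "source_type": "Integer",
         "inferred_type": "UnixTimestamp", "inferred_pattern": "ms"},
    ]
}

# Map each column name to its inferred pattern (None when there is none).
patterns = {c["field_name"]: c.get("inferred_pattern") for c in schema["columns"]}

# Parse a hypothetical raw date with the inferred format string.
parsed_date = datetime.strptime("25-12-2023", patterns["date_value"]).date()

# Convert a hypothetical timestamp to seconds; "s" is an assumed second unit.
divisor = {"s": 1, "ms": 1_000}[patterns["unix_value"]]
parsed_ts = datetime.fromtimestamp(1_700_000_000_000 / divisor, tz=timezone.utc)

print(parsed_date.isoformat())
print(parsed_ts.isoformat())
```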

Identify and fix a pandas DataFrame

This piece of code changes the column types of the input DataFrame according to the schema that Shmessy infers.

import pandas as pd
from shmessy import Shmessy

df = pd.read_csv('/tmp/file.csv')
fixed_df = Shmessy().fix_schema(df)

Read Messy CSV file

from shmessy import Shmessy
df = Shmessy().read_csv('/tmp/file.csv')

API

Constructor

shmessy = Shmessy(
    sample_size: Optional[int] = 1000,
    reader_encoding: Optional[str] = "UTF-8",
    locale_formatter: Optional[str] = "en_US",
    use_random_sample: Optional[bool] = True,
    types_to_ignore: Optional[List[str]] = None,
    max_columns_num: Optional[int] = 500,
    fallback_to_string: Optional[bool] = False,  # Fallback to string in case of casting exception
    fallback_to_null: Optional[bool] = False,  # Fallback to null in case of casting exception
    use_csv_sniffer: Optional[bool] = True,  # Use python sniffer to identify the dialect (separator / quote-char / etc...)
    fix_column_names: Optional[bool] = False,  # Replace non-alphabetic/numeric chars with underscore
    numeric_types_max_length: Optional[int] = 20,  # Fallback to string for numeric values with many digits
)
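The `use_csv_sniffer` option refers to Python's dialect sniffer. As a rough illustration of what dialect detection does, the sketch below calls the standard library's `csv.Sniffer` directly on a small semicolon-separated sample. How Shmessy invokes the sniffer internally is not shown here; the sample text and the candidate delimiter list are assumptions.

```python
import csv
import io

# Hypothetical messy input that uses ';' instead of ','.
raw = "id;email\n1;a@example.com\n2;b@example.com\n"

# Ask the sniffer to pick a delimiter from a restricted candidate set.
dialect = csv.Sniffer().sniff(raw, delimiters=";,\t|")

# Read the same text with the detected dialect.
rows = list(csv.reader(io.StringIO(raw), dialect))

print(repr(dialect.delimiter))
print(rows)
```

Restricting the candidates keeps the sniffer from choosing characters such as `@` or `.` that also appear once per line in this sample.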

read_csv

shmessy.read_csv(filepath_or_buffer: Union[str, TextIO, BinaryIO]) -> DataFrame

infer_schema

shmessy.infer_schema(df: DataFrame) -> ShmessySchema

fix_schema

shmessy.fix_schema(df: DataFrame) -> DataFrame

get_inferred_schema

shmessy.get_inferred_schema() -> ShmessySchema
