If your data is messy - Use Shmessy!

Shmessy

Shmessy is designed to deal with messy pandas DataFrames. We all know the frustration of having to handle and analyze a messy DataFrame by hand as analysts or data engineers.

The goal of this tiny tool is to identify the physical / logical data type of each DataFrame column. It is based on fast validators that check the data (using a sample) against regexes, pydantic types, or any additional validation function that you want to implement.
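The idea of sample-based validation can be sketched in plain Python. This is a minimal illustration, not Shmessy's actual implementation: the validator functions, the regexes, the `VALIDATORS` registry and `infer_column_type` are all hypothetical names, and only the `sample_size` default mirrors the constructor documented further down.

```python
import random
import re
from typing import Callable, Dict, List, Optional

# Hypothetical validators: each returns True when a single value fits the type.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
IPV4_RE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")


def is_email(value: str) -> bool:
    return EMAIL_RE.match(value) is not None


def is_ipv4(value: str) -> bool:
    match = IPV4_RE.match(value)
    # The regex checks the shape; the range check rejects octets above 255.
    return match is not None and all(int(part) <= 255 for part in match.groups())


# Hypothetical registry mapping a logical type name to its validator.
VALIDATORS: Dict[str, Callable[[str], bool]] = {"Email": is_email, "IPv4": is_ipv4}


def infer_column_type(
    values: List[str], sample_size: int = 1000, seed: int = 0
) -> Optional[str]:
    """Return the first type whose validator accepts every sampled value."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    # Validate a random sample instead of the whole column to stay fast.
    rng = random.Random(seed)
    sample = rng.sample(non_empty, min(sample_size, len(non_empty)))
    for type_name, validator in VALIDATORS.items():
        if all(validator(v) for v in sample):
            return type_name
    return None
```

Because only a sample is checked, a column with a few bad values outside the sample could still be classified as a stricter type, which is presumably why the constructor offers fallback options for casting errors.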

As you can tell, this tool was designed to deal with dirty data. It is best suited to DataFrames generated from CSV / flat files or any other source that doesn't carry a strict schema.

Installation

pip install shmessy

Usage

There are several ways to use this tool:

Identify the DataFrame schema

import pandas as pd
from shmessy import Shmessy

df = pd.read_csv('/tmp/file.csv')
inferred_schema = Shmessy().infer_schema(df)

Output (inferred_schema dump):

{
    "infer_duration_ms": 12,
    "columns": [
        {
            "field_name": "id",
            "source_type": "Integer",
            "inferred_type": "Integer"
        },
        {
            "field_name": "email_value",
            "source_type": "String",
            "inferred_type": "Email"
        },
        {
            "field_name": "date_value",
            "source_type": "String",
            "inferred_type": "Date",
            "inferred_pattern": "%d-%m-%Y"
        },
        {
            "field_name": "datetime_value",
            "source_type": "String",
            "inferred_type": "Datetime",
            "inferred_pattern": "%Y/%m/%d %H:%M:%S"
        },
        {
            "field_name": "yes_no_data",
            "source_type": "String",
            "inferred_type": "Boolean",
            "inferred_pattern": [
                "YES",
                "NO"
            ]
        },
        {
            "field_name": "unix_value",
            "source_type": "Integer",
            "inferred_type": "UnixTimestamp",
            "inferred_pattern": "ms"
        },
        {
            "field_name": "ip_value",
            "source_type": "String",
            "inferred_type": "IPv4"
        }
    ]
}
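Once the schema is available in the JSON shape shown above, it can be processed with the standard library alone. The sketch below assumes the dump is already held as a JSON string (how to serialize a `ShmessySchema` object is not covered here) and uses a shortened copy of the output; the variable names are illustrative.

```python
import json

# A shortened copy of the dump shown above, held as a JSON string.
schema_dump = """
{
    "infer_duration_ms": 12,
    "columns": [
        {"field_name": "id", "source_type": "Integer", "inferred_type": "Integer"},
        {"field_name": "email_value", "source_type": "String", "inferred_type": "Email"},
        {"field_name": "date_value", "source_type": "String", "inferred_type": "Date",
         "inferred_pattern": "%d-%m-%Y"}
    ]
}
"""

schema = json.loads(schema_dump)

# Columns whose inferred type differs from the type they were loaded as.
changed = {
    col["field_name"]: (col["source_type"], col["inferred_type"])
    for col in schema["columns"]
    if col["source_type"] != col["inferred_type"]
}

# "inferred_pattern" is only present for some types, so use .get().
patterns = {
    col["field_name"]: col.get("inferred_pattern") for col in schema["columns"]
}
```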

Identify and fix a pandas DataFrame

This piece of code changes the column types of the input DataFrame according to the schema that Shmessy infers.

import pandas as pd
from shmessy import Shmessy

df = pd.read_csv('/tmp/file.csv')
fixed_df = Shmessy().fix_schema(df)
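To make the effect of such a fix concrete, here is a sketch of how single values could be converted once a pattern like the ones in the schema output is known. It uses only the standard library and does not show how Shmessy casts columns internally; the sample values are invented, and treating the first element of the boolean pattern as the true value is an assumption.

```python
from datetime import datetime, timezone

# Illustrative raw values, as they might look right after loading a CSV.
raw_date = "25-12-2023"               # pattern "%d-%m-%Y"
raw_datetime = "2023/12/25 08:30:00"  # pattern "%Y/%m/%d %H:%M:%S"
raw_unix_ms = 1703493000000           # pattern "ms" (milliseconds since epoch)
raw_bool = "YES"                      # pattern ["YES", "NO"]

# Date and datetime strings are parsed with the inferred strptime pattern.
parsed_date = datetime.strptime(raw_date, "%d-%m-%Y").date()
parsed_datetime = datetime.strptime(raw_datetime, "%Y/%m/%d %H:%M:%S")

# A millisecond timestamp is scaled to seconds before conversion.
parsed_unix = datetime.fromtimestamp(raw_unix_ms / 1000, tz=timezone.utc)

# Assumption: the first pattern element stands for True, the second for False.
true_value, false_value = ["YES", "NO"]
parsed_bool = raw_bool == true_value
```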

Read a messy CSV file

from shmessy import Shmessy
df = Shmessy().read_csv('/tmp/file.csv')

API

Constructor

shmessy = Shmessy(
    sample_size: Optional[int] = 1000,
    reader_encoding: Optional[str] = "UTF-8",
    locale_formatter: Optional[str] = "en_US",
    use_random_sample: Optional[bool] = True,
    types_to_ignore: Optional[List[str]] = None,
    max_columns_num: Optional[int] = 500,
    fallback_to_string: Optional[bool] = False,  # Fallback to string in case of casting exception
    fallback_to_null: Optional[bool] = False,  # Fallback to null in case of casting exception
    use_csv_sniffer: Optional[bool] = True,  # Use python sniffer to identify the dialect (separator / quote-char / etc...)
    fix_column_names: Optional[bool] = False,  # Replace non-alphabetic/numeric chars with underscore
    numeric_types_max_length: Optional[int] = 20,  # Fallback to string for numeric values with many digits
)
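The comment on `fix_column_names` describes replacing non-alphanumeric characters with an underscore. A minimal sketch of that rule follows; `sanitize_column_name` is a hypothetical helper, and restricting the match to ASCII letters and digits is an assumption, since the exact character set is not stated.

```python
import re


def sanitize_column_name(name: str) -> str:
    """Replace every character that is not an ASCII letter or digit with '_'."""
    return re.sub(r"[^0-9A-Za-z]", "_", name)


columns = ["user id", "e-mail", "total($)"]
sanitized = [sanitize_column_name(c) for c in columns]
# sanitized == ["user_id", "e_mail", "total___"]
```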

read_csv

shmessy.read_csv(filepath_or_buffer: Union[str, TextIO, BinaryIO]) -> DataFrame
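The `use_csv_sniffer` option refers to Python's CSV sniffer. The sketch below shows dialect detection with `csv.Sniffer` from the standard library on an invented semicolon-separated sample; how Shmessy passes the detected dialect on to pandas is not shown, and the candidate delimiter list is an assumption.

```python
import csv
import io

# Invented sample: a semicolon-separated file with a header row.
raw = "id;name;active\n1;alice;YES\n2;bob;NO\n"

# Restrict the candidates to common separators to keep detection predictable.
dialect = csv.Sniffer().sniff(raw, delimiters=",;\t|")

# The detected dialect can be handed straight to csv.reader.
rows = list(csv.reader(io.StringIO(raw), dialect))
```

Here `dialect.delimiter` is `";"`, and `rows` holds the header followed by the two data rows.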

infer_schema

shmessy.infer_schema(df: DataFrame) -> ShmessySchema

fix_schema

shmessy.fix_schema(df: DataFrame) -> DataFrame

get_inferred_schema

shmessy.get_inferred_schema() -> ShmessySchema

Release history

Version	Release date
2.0.3 (this version)	Aug 20, 2025
2.0.2	Jun 27, 2024
2.0.1	Jun 9, 2024
2.0.0	May 29, 2024
1.3.7	May 28, 2024
1.3.6	May 26, 2024
1.3.5	May 26, 2024
1.3.4	May 23, 2024
1.3.3	Apr 30, 2024
1.3.2	Apr 29, 2024
1.3.1	Apr 15, 2024
1.3.0	Apr 8, 2024
1.2.14	Mar 10, 2024
1.2.13	Feb 26, 2024
1.2.12	Feb 25, 2024
1.2.11	Feb 23, 2024
1.2.10	Feb 23, 2024
1.2.9	Feb 23, 2024
1.2.8	Feb 22, 2024
1.2.7	Feb 21, 2024
1.2.6	Feb 20, 2024
1.2.5	Feb 19, 2024
1.2.4	Feb 17, 2024
1.2.3	Feb 13, 2024
1.2.2	Feb 12, 2024
1.2.1	Feb 11, 2024
1.2.0	Feb 11, 2024
1.1.20	Feb 9, 2024
1.1.19	Feb 8, 2024
1.1.18	Feb 6, 2024
1.1.17	Feb 3, 2024
1.1.16	Jan 30, 2024
1.1.15	Jan 25, 2024
1.1.14	Jan 24, 2024
1.1.13	Jan 23, 2024
1.1.12	Jan 22, 2024
1.1.11	Jan 21, 2024
1.1.10	Jan 20, 2024
1.1.9	Jan 20, 2024
1.1.8	Jan 19, 2024
1.1.7	Jan 17, 2024
1.1.6	Jan 16, 2024
1.1.5	Jan 15, 2024
1.1.4	Jan 14, 2024
1.1.3	Jan 11, 2024
1.1.2	Jan 9, 2024
1.1.1	Jan 7, 2024
1.1.0	Jan 4, 2024
1.0.1	Jan 2, 2024
1.0.0	Jan 1, 2024
0.0.7	Dec 31, 2023
0.0.6	Dec 31, 2023
0.0.5	Dec 29, 2023
0.0.4	Dec 28, 2023
0.0.2	Dec 28, 2023
0.0.1	Dec 28, 2023

Download files

Source Distribution

shmessy-2.0.3.tar.gz (12.0 kB)

Uploaded Aug 20, 2025 Source

Built Distribution

shmessy-2.0.3-py3-none-any.whl (18.4 kB)

Uploaded Aug 20, 2025 Python 3

File details

Details for the file shmessy-2.0.3.tar.gz.

File metadata

Download URL: shmessy-2.0.3.tar.gz
Upload date: Aug 20, 2025
Size: 12.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.4 CPython/3.9.16 Linux/6.11.0-1018-azure

File hashes

Hashes for shmessy-2.0.3.tar.gz
Algorithm	Hash digest
SHA256	`c063f1c7c2bac5b9f18083ef09b87ce22fdc38427bd4ff7f89d88b5f44dd98b1`
MD5	`f8b17a6e4d3ddb69bdc6707eda656eb3`
BLAKE2b-256	`179f818bc1558839b0294843c508b7cb3936f3d4ecf31572aaa7e3bdc48f0923`

File details

Details for the file shmessy-2.0.3-py3-none-any.whl.

File metadata

Download URL: shmessy-2.0.3-py3-none-any.whl
Upload date: Aug 20, 2025
Size: 18.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.4 CPython/3.9.16 Linux/6.11.0-1018-azure

File hashes

Hashes for shmessy-2.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`86b781015486062c8f15f4b99abba40df80e58b9680e0d8c75962a9e61f93620`
MD5	`6641ebc5731b627c6bffa136684fcd8c`
BLAKE2b-256	`07963c28baa977c5a75a4d7cbd45793b3f41cd039809ea22f32b7f2cf922e081`
