DataGlitch: A Python Toolkit for Messy Data

DataGlitch is a Python package designed to address common data challenges in a pandas DataFrame, including handling mixed data types, non-ASCII values, and facilitating dataset exploration.

pip install DataGlitch

Usage

DataGlitch currently offers three functionalities:

dtype_detector: Find mixed data types in columns.
nonascii_handler: Detect and handle non-ASCII characters.
data_search: Search for the existence of specific columns or values.

dtype_detector

The dtype_detector uses regular expressions to detect different data types in a column through the find_numeric() function. This function takes a pandas Series of dtype object as its argument and returns three values: numeric, ambiguous and non_numeric. Each of these contains a subset of the original column.

numeric can contain any numeric value or any string value interpreted as numeric by the regular expression:

Integers: whole numbers, both positive and negative. Examples: -42, 123.
Floating-point numbers: decimal numbers, both positive and negative. Examples: -3.14, 0.5.
Numbers in scientific notation. Examples: 1.23e-4, -2E+10.
Fractions: fractions in the form of numerator/denominator. Examples: 1/2, -3/4.
Numbers with comma or dot as decimal separator. Examples: 1,000, 3.5.
Numbers with multiple decimal separators (comma or dot). Examples: 1.0.0, 2,000.5.6.

ambiguous can contain any value that the regular expression interprets as numeric but that also has alphabetic or special characters at its start or end. Examples: 50cents, $10.

non_numeric can contain any value that is not classified as numeric or ambiguous. Examples: 2000-01-01, DataGlitch6000!.

# Import and apply
from DataGlitch.dtype_detector import find_numeric
numeric, ambiguous, non_numeric = find_numeric(df["col"])
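
To make the three categories concrete, here is a minimal sketch of how such a regex-based split could work. The patterns and the names NUM, classify and split_by_kind are assumptions written for this illustration; they are not DataGlitch's actual regular expressions, which may classify edge cases differently. The sketch treats a value as ambiguous only when extra characters sit on one side of the number, which is one reading consistent with the examples above.

```python
import re
import pandas as pd

# Hypothetical pattern (not taken from DataGlitch): optional sign, digits,
# any number of '.'/',' separated digit groups, optional exponent,
# optional '/denominator'.
NUM = r"[+-]?\d+(?:[.,]\d+)*(?:[eE][+-]?\d+)?(?:/\d+)?"
NUMERIC = re.compile(NUM)
# Assumption: "ambiguous" means extra characters before OR after the number,
# but not both.
AMBIGUOUS = re.compile(rf"(?:[^\d\s]+{NUM}|{NUM}[^\d\s]+)")


def classify(value) -> str:
    """Label a single value as numeric, ambiguous or non_numeric."""
    text = str(value).strip()
    if NUMERIC.fullmatch(text):
        return "numeric"
    if AMBIGUOUS.fullmatch(text):
        return "ambiguous"
    return "non_numeric"


def split_by_kind(col: pd.Series):
    """Return three subsets of the column, one per label."""
    kinds = col.map(classify)
    return (
        col[kinds == "numeric"],
        col[kinds == "ambiguous"],
        col[kinds == "non_numeric"],
    )


sample = pd.Series(
    [-42, "3.14", "1.23e-4", "1/2", "2,000.5.6",
     "50cents", "$10", "2000-01-01", "DataGlitch6000!"],
    dtype=object,
)
numeric_s, ambiguous_s, non_numeric_s = split_by_kind(sample)
```

With this sample, the first five values land in the numeric subset, 50cents and $10 in the ambiguous subset, and the last two in the non_numeric subset.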

Example operations:

# Drop non_numeric subset from dataframe
df = df[~df["col"].isin(non_numeric)]

# Replace values in subset with NAs
import numpy as np
df.loc[df["col"].isin(non_numeric), "col"] = np.nan

Other operations can be applied directly to the DataFrame. For instance, to correct comma-separated floats that have been identified in the column:

df["col"] = df["col"].replace(",", ".", regex=True)
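
A natural follow-up, sketched here with plain pandas and made-up sample data, is to convert the cleaned column to a numeric dtype. Using pd.to_numeric with errors="coerce" is a choice made for this example rather than something the package prescribes; it turns any value that still cannot be parsed into NaN.

```python
import pandas as pd

# Made-up sample column with comma decimals, a stray string and an integer.
df = pd.DataFrame({"col": ["1,5", "2.25", "abc", 3]}, dtype=object)

# Swap commas for dots inside string values; non-string values are left alone.
df["col"] = df["col"].replace(",", ".", regex=True)

# Parse what can be parsed; anything else becomes NaN.
df["col"] = pd.to_numeric(df["col"], errors="coerce")
```

Note that a blanket comma-to-dot replacement is only safe when commas are decimal separators; a thousands separator such as 1,000 would be read as 1.0.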

nonascii_handler

The nonascii_handler uses the find_nonascii() function to locate rows or values containing non-ASCII characters in a DataFrame or Series of dtype object. The user is offered three options for handling them:

Drop all rows that contain non-ASCII values by setting the drop parameter to True.
Replace values containing non-ASCII characters with np.nan to indicate missingness by setting the remove parameter to True.
Translate values containing non-ASCII characters with the unidecode library by setting the translate parameter to True.

If none of the above options are selected, the data is returned as is (default).

from DataGlitch.nonascii_handler import find_nonascii
df_ascii = find_nonascii(df, drop=False, remove=False, translate=True)
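
The effect of the three options can be sketched on a single column with plain pandas. This is an illustration, not find_nonascii's implementation: the helper names and the mode argument are hypothetical, and the translation step uses the standard library's unicodedata as a stand-in for unidecode so that the example has no extra dependency.

```python
import unicodedata
import numpy as np
import pandas as pd


def has_nonascii(value) -> bool:
    """True for strings containing at least one non-ASCII character."""
    return isinstance(value, str) and not value.isascii()


def to_ascii(text: str) -> str:
    # Stand-in for unidecode: decompose accented characters and drop the
    # accents. Characters without an ASCII decomposition are simply lost.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def handle_nonascii(col: pd.Series, mode=None) -> pd.Series:
    """Hypothetical helper: mode is None, 'drop', 'remove' or 'translate'."""
    mask = col.map(has_nonascii)
    if mode == "drop":
        return col[~mask]                # discard affected rows
    if mode == "remove":
        return col.mask(mask, np.nan)    # mark affected values as missing
    if mode == "translate":
        return col.map(lambda v: to_ascii(v) if has_nonascii(v) else v)
    return col                           # default: data returned as is


names = pd.Series(["café", "plain", "Zürich"], dtype=object)
dropped = handle_nonascii(names, "drop")
removed = handle_nonascii(names, "remove")
translated = handle_nonascii(names, "translate")
```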

data_search

data_search performs fuzzy string matching through the rapidfuzz library. It looks for the existence of columns in a dataset or of particular values within a column.

from DataGlitch.data_search import column_search, value_search

Columns are identified through the column_search() function, which takes a pandas DataFrame, the name of the column as a string, and a cut-off score that defaults to 80. The output contains any matches and their similarity scores. For a less strict search, lower the cut-off score.

column_search(df, "column_name", score_cutoff=80)
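
The role of the cut-off score can be illustrated without the package. The sketch below is an assumption-laden stand-in: it scores names with difflib.SequenceMatcher from the standard library instead of rapidfuzz, scales the ratio to 0-100, and uses the hypothetical helper name fuzzy_columns. The actual scores returned by column_search() may differ.

```python
from difflib import SequenceMatcher
import pandas as pd


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib, not rapidfuzz)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_columns(df: pd.DataFrame, name: str, score_cutoff: float = 80):
    """Return (column, score) pairs at or above the cut-off, best match first."""
    scored = [(col, similarity(str(col), name)) for col in df.columns]
    kept = [(col, round(score, 1)) for col, score in scored if score >= score_cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up column names, one of them misspelled.
df = pd.DataFrame(columns=["customer_id", "custmer_name", "order_date"])

strict = fuzzy_columns(df, "customer_name")                   # default cut-off
loose = fuzzy_columns(df, "customer_name", score_cutoff=70)   # less strict
```

Here the default cut-off keeps only the misspelled custmer_name, while lowering it to 70 also lets customer_id through.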

The value_search() function looks for the existence of a value in a pandas Series. Even if the value under investigation is an integer or a float, it should still be passed to the function as a string. The output includes any matches and their similarity scores.

value_search(df["col"], "value", score_cutoff=80)
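
Passing the value as a string makes sense because fuzzy matching compares text. As a rough sketch of the idea (again with difflib standing in for rapidfuzz, and with the hypothetical helper name fuzzy_values), the column's values are cast to strings before being scored against the query:

```python
from difflib import SequenceMatcher
import pandas as pd


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (difflib, not rapidfuzz)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def fuzzy_values(col: pd.Series, value: str, score_cutoff: float = 80):
    """Return (value, score) pairs at or above the cut-off, best match first."""
    # Cast to str so that numbers and text are compared on the same footing.
    candidates = col.dropna().astype(str).unique()
    scored = [(cand, round(similarity(cand, value), 1)) for cand in candidates]
    kept = [pair for pair in scored if pair[1] >= score_cutoff]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up mixed column: integers, a string with trailing whitespace, a filler.
years = pd.Series([2023, 2024, "2023 ", "n/a"], dtype=object)
hits = fuzzy_values(years, "2023")
```

The integer 2023 matches exactly once cast to text, and the variant with trailing whitespace still clears the cut-off.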

Release history

0.0.2 (this version): May 22, 2023
0.0.1: May 22, 2023


Source Distribution

DataGlitch-0.0.2.tar.gz (5.7 kB view details)

Uploaded May 22, 2023 Source

File details

Details for the file DataGlitch-0.0.2.tar.gz.

File metadata

Download URL: DataGlitch-0.0.2.tar.gz
Upload date: May 22, 2023
Size: 5.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for DataGlitch-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`0a956fa1c32c32783ec5c986448ee6fc063fe9fe9c7303749a4d4c70949b0554`
MD5	`73edfec49f9164bbe1d03f68a6841b9c`
BLAKE2b-256	`da1ba8178486874cd4e7160706b8636cd4174b82f6849bcbf250439a51052087`

