DataGlitch: A Python Toolkit for Messy Data

DataGlitch is a Python package for common data-quality problems in a pandas DataFrame: detecting mixed data types, handling non-ASCII values, and exploring a dataset.

pip install DataGlitch

Usage

DataGlitch currently offers three tools:

  • dtype_detector: Find mixed data types in columns.
  • nonascii_handler: Detect and handle non-ASCII characters.
  • data_search: Search for the existence of specific columns or values.



dtype_detector

The dtype_detector uses regular expressions to detect different data types in a column through the find_numeric() function. This function takes a pandas Series of object dtype as its argument and returns three new variables: numeric, ambiguous and non_numeric. Each of these contains a subset of the original column.

numeric can contain any numeric value or any string value interpreted as numeric by the regular expression:

  • Integers: whole numbers, both positive and negative. Examples: -42, 123.
  • Floating-point numbers: decimal numbers, both positive and negative. Examples: -3.14, 0.5.
  • Numbers in scientific notation. Examples: 1.23e-4, -2E+10.
  • Fractions: fractions in the form of numerator/denominator. Examples: 1/2, -3/4.
  • Numbers with comma or dot as decimal separator. Examples: 1,000, 3.5.
  • Numbers with multiple decimal separators (comma or dot). Examples: 1.0.0, 2,000.5.6.

ambiguous can contain any value that the regular expression interprets as numeric but that also has alphabetic or special characters at its start or end. Examples: 50cents, $10.

non_numeric can contain any value that is not classified as numeric or ambiguous. Examples: 2000-01-01, DataGlitch6000!.
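
The three categories above can be sketched with plain regular expressions. This is only an illustration: the patterns and the classify() helper are hypothetical, not the ones DataGlitch uses, and "ambiguous" is read here as a number with extra characters on one side only, which is what the examples above suggest.

```python
import re

# Hypothetical pattern (not DataGlitch's): optional sign, digits with optional
# comma/dot groups, then an optional exponent or "/denominator".
NUMBER = r"[+-]?\d+(?:[.,]\d+)*(?:[eE][+-]?\d+|/\d+)?"
NUMERIC = re.compile(rf"^{NUMBER}$")
# A number with non-digit characters either in front of it or after it.
AMBIGUOUS = re.compile(rf"^(?:[^\d\s+-]+{NUMBER}|{NUMBER}[^\d\s]+)$")


def classify(value: str) -> str:
    """Return 'numeric', 'ambiguous' or 'non_numeric' for one string."""
    text = value.strip()
    if NUMERIC.match(text):
        return "numeric"
    if AMBIGUOUS.match(text):
        return "ambiguous"
    return "non_numeric"


samples = ["-42", "1.23e-4", "-3/4", "2,000.5.6", "50cents", "$10", "2000-01-01"]
labels = {s: classify(s) for s in samples}
```

Checking the numeric pattern first matters, because a plain number would otherwise also satisfy a looser pattern.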


# Import and apply
from DataGlitch.dtype_detector import find_numeric
numeric, ambiguous, non_numeric = find_numeric(df["col"])

Example operations:

import numpy as np

# Option 1: drop the rows whose value is in the non_numeric subset
df = df[~df["col"].isin(non_numeric)]
# Option 2: keep the rows but replace those values with NaN
df.loc[df["col"].isin(non_numeric), "col"] = np.nan

Other operations can be applied directly to the dataframe. For instance, to correct floats in the column that use a comma as the decimal separator:

df["col"] = df["col"].replace(",", ".", regex=True)
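
As a sketch of how these clean-up steps might be combined, the example below goes from a mixed column to a float column. The data is made up, and the non_numeric Series is built by hand as a stand-in for what find_numeric() would return.

```python
import numpy as np
import pandas as pd

# Toy data; "abc" plays the role of a value flagged as non-numeric.
df = pd.DataFrame({"col": ["1,5", "2.25", "abc", "3"]})
# Hand-built stand-in for the non_numeric subset returned by find_numeric().
non_numeric = df.loc[df["col"] == "abc", "col"]

# Blank out the flagged values, then normalise the decimal separator.
df.loc[df["col"].isin(non_numeric), "col"] = np.nan
df["col"] = df["col"].str.replace(",", ".", regex=False)

# With only numeric-looking strings and NaN left, the column converts to floats.
df["col"] = pd.to_numeric(df["col"])
```

Note that this treats every comma as a decimal separator, so a thousands-separated value such as 1,000 would become 1.000.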



nonascii_handler

The nonascii_handler uses the find_nonascii() function to locate rows or values with non-ASCII characters in a DataFrame or Series of object dtype. For handling non-ASCII values, the user is offered three options:

  1. Drop all rows that contain non-ASCII values by setting the drop parameter to True.
  2. Replace values that contain non-ASCII characters with np.nan to indicate missingness by setting the remove parameter to True.
  3. Transliterate values that contain non-ASCII characters using the unidecode library by setting the translate parameter to True.

If none of the above options are selected, the data is returned as is (default).
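
To make the three options concrete, here is a minimal standard-library sketch on a plain list of strings. It is not the package's implementation: has_nonascii() and to_ascii() are hypothetical helpers, None stands in for np.nan, and unicodedata is swapped in for unidecode, so it only strips accents and simply drops characters that have no ASCII decomposition.

```python
import unicodedata


def has_nonascii(text: str) -> bool:
    """True if the string contains at least one non-ASCII character."""
    return not text.isascii()


def to_ascii(text: str) -> str:
    """Rough stand-in for unidecode: decompose accents, drop the rest."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


values = ["caf\u00e9", "plain", "na\u00efve"]

# Option 1 (drop): keep only the entries that are pure ASCII.
dropped = [v for v in values if not has_nonascii(v)]
# Option 2 (remove): keep every entry but blank out the non-ASCII ones.
removed = [None if has_nonascii(v) else v for v in values]
# Option 3 (translate): map each entry to an ASCII approximation.
translated = [to_ascii(v) for v in values]
```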

from DataGlitch.nonascii_handler import find_nonascii
df_ascii = find_nonascii(df, drop=False, remove=False, translate=True)



data_search

data_search performs fuzzy string matching through the rapidfuzz library. It checks whether a column exists in a dataset or whether a particular value exists within a column.

from DataGlitch.data_search import column_search, value_search

Columns are identified through the column_search() function, which takes a pandas DataFrame, the name of the column as a string, and a cut-off score that defaults to 80. The output contains any matches and their similarity scores. For a less strict search, lower the cut-off score.

column_search(df, "column_name", score_cutoff=80)
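
The effect of the cut-off can be illustrated with a small standard-library sketch. It uses difflib instead of rapidfuzz, so the scores will differ from what column_search() reports; similarity() and search_names() are hypothetical helpers, and the column names are made up.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search_names(names, query, score_cutoff=80):
    """Return (name, score) pairs at or above the cut-off, best match first."""
    scored = [(name, round(similarity(name, query), 1)) for name in names]
    matches = [pair for pair in scored if pair[1] >= score_cutoff]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


columns = ["customer_id", "CustomerName", "order_date"]
strict = search_names(columns, "customer_name")                   # default cut-off
loose = search_names(columns, "customer_name", score_cutoff=70)   # less strict
```

With these names, the default cut-off keeps only CustomerName, while lowering it to 70 also lets customer_id through.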

The value_search() function checks whether a value exists in a pandas Series. Even if the value under investigation is an integer or a float, it should still be passed to the function as a string. The output includes any matches and their similarity scores.

value_search(df["col"], "value", score_cutoff=80)
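
Fuzzy matchers compare strings, which is why a numeric value has to be passed as text. The sketch below shows the idea with difflib rather than rapidfuzz; search_values() is a hypothetical helper, and casting the column entries with str() is an assumption about how such a search could work, not a description of value_search().

```python
from difflib import SequenceMatcher


def search_values(values, query: str, score_cutoff=80):
    """Compare each entry, cast to text, against a string query."""
    matches = []
    for value in values:
        score = 100 * SequenceMatcher(None, str(value), query).ratio()
        if score >= score_cutoff:
            matches.append((value, round(score, 1)))
    return matches


# The query is the string "2021" even though the column holds numbers.
found = search_values([2021, 2012, "2021a", 3.5], "2021")
```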
