DataGlitch is a Python package designed to address common data challenges, including handling mixed data types, non-ASCII values, and facilitating dataset exploration.
Project description
DataGlitch: A Python Toolkit for Messy Data
DataGlitch is a Python package designed to address common data challenges in a pandas DataFrame, including handling mixed data types, non-ASCII values, and facilitating dataset exploration.
pip install DataGlitch
Usage
DataGlitch currently offers three functionalities:
dtype_detector
: Find mixed data types in columns.nonascii_handler
: Detect and handle non-ASCII characters.data_search
: Search for the existence of specific columns or values.
dtype_detector
The dtype_detector
uses regular expressions to detect different data types in a column through the find_numeric()
function. This function takes a pandas Series with data type Object
as an argument and returns three new variables: numeric
, ambiguous
and non_numeric
. Each of these contain a subset of the original column.
numeric
can contain any numeric value or any string value interpreted as numeric by the regular expression:
- Integers: whole numbers, both positive and negative. Examples:
-42
,123
. - Floating-point numbers: decimal numbers, both positive and negative. Examples:
-3.14
,0.5
. - Numbers in scientific notation. Examples:
1.23e-4
,-2E+10
. - Fractions: fractions in the form of numerator/denominator. Examples:
1/2
,-3/4
. - Numbers with comma or dot as decimal separator. Examples:
1,000
,3.5
. - Numbers with multiple decimal separators (comma or dot). Examples:
1.0.0
,2,000.5.6
.
ambiguous
can contain any value that is interpreted as numeric by the regular expression if it also has alphabetic or special characters at its front or end. Examples: 50cents
, $10
.
non_numeric
can contain any value that is not classified as numeric
or ambiguous
. Examples: 2000-01-01
, DataGlitch6000!
.
# Import and apply
from DataGlitch.dtype_detector import find_numeric
numeric, ambiguous, non_numeric = find_numeric(df["col"])
Example operations:
# Drop non_numeric subset from dataframe
df = df[~df["col"].isin(non_numeric)]
# Replace values in subset with NAs
import numpy as np
df.loc[df["col"].isin(non_numeric), "col"] = np.nan
Other operations can occur directly on the dataframe. For instance, if wanting to correct comma-separated floats that have been identified in the column:
df["col"] = df["col"].replace(",", ".", regex=True)
nonascii_handler
The nonascii_handler
uses the find_nonascii()
function to locate rows/values with non-ascii characters in a DataFrame or Series with data type Object
. For handling non-ascii, the user is offerred three options:
- Drop all rows that contain non-ascii values by setting the
drop
parameter toTrue
. - Replace values with non-ascii with
np.nan
to indicate missigness by setting theremove
parameter toTrue
. - Translate values with non-ascii with the
unidecode
library by setting thetranslate
parameter toTrue
.
If none of the above options are selected, the data is returned as is (default).
from DataGlitch.nonascii_handler import find_nonascii
df_ascii = find_nonascii(df, drop=False, remove=False, translate=True)
data_search
data_search
performs fuzzy string matching through the rapidfuzz
library. It looks for the existance of columns in a dataset or particular values within a column.
from data_search import column_search, value_search
Columns are identified through the column_search()
function which takes a pandas DataFrame, the name of the column as a string, and a cut-off score which defaults to 80. The output contains any matches and their similarity score. For a less strict search, the cut-off score can be lowered.
column_search(df, "column_name", score_cutoff=80)
The value_search()
function looks for the existance of a value in a pandas Series. Even if the value under investigation is integer/float, it should still be passed as string to the function. The output includes any matches and their similarity score.
value_search(df["col"], "value", score_cutoff=80)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.