Cleans data, best to be used as a part of initial preprocessor
Project description
refineryframe
Under construction! Not ready for use yet! Currently experimenting and planning!
Initial plans
Goal of the package is to simplify life for data scientist that have to deal with imperfect raw data. The package suppose to detect and clean unexpected values, while doubling as safeguard in production code based on predifined conditions that arise from business assumptions or any other source. The package is well suited to be an initial preprocessing step in ml pipelines situated between data gathering and training/scoring steps.
Developed by Kyrylo Mordan (c) 2023
Installation
Install refineryframe
via pip with
pip install refineryframe
Package usage example
import os
import sys
sys.path.append(os.path.dirname(sys.path[0]))
from refineryframe.tns import *
Creating example data (exceptionally messy dataframe)
df = pd.DataFrame({
'num_id' : [1, 2, 3, 4, 5],
'NumericColumn': [1, -np.inf, np.inf,np.nan, None],
'NumericColumn_exepted': [1, -996, np.inf,np.nan, None],
'NumericColumn2': [None, None, 1,None, None],
'NumericColumn3': [1, 2, 3, 4, 5],
'DateColumn': pd.date_range(start='2022-01-01', periods=5),
'DateColumn2': [pd.NaT,pd.to_datetime('2022-01-01'),pd.NaT,pd.NaT,pd.NaT],
'DateColumn3': ['2122-05-01',
'2022-01-01',
'2021-01-01',
'1000-01-09',
'1850-01-09'],
'CharColumn': ['Fół', None, np.nan, 'nót eXpęćTęd', '']
})
df
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
num_id | NumericColumn | NumericColumn_exepted | NumericColumn2 | NumericColumn3 | DateColumn | DateColumn2 | DateColumn3 | CharColumn | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1.0 | 1.0 | NaN | 1 | 2022-01-01 | NaT | 2122-05-01 | Fół |
1 | 2 | -inf | -996.0 | NaN | 2 | 2022-01-02 | 2022-01-01 | 2022-01-01 | None |
2 | 3 | inf | inf | 1.0 | 3 | 2022-01-03 | NaT | 2021-01-01 | NaN |
3 | 4 | NaN | NaN | NaN | 4 | 2022-01-04 | NaT | 1000-01-09 | nót eXpęćTęd |
4 | 5 | NaN | NaN | NaN | 5 | 2022-01-05 | NaT | 1850-01-09 |
Defining specification for the dataframe
MISSING_TYPES = {'date_not_delivered': '1850-01-09',
'date_other_missing_type': '1850-01-08',
'numeric_not_delivered': -999,
'character_not_delivered': 'missing'}
unexpected_exceptions = {
"col_names_types": "NONE",
"missing_values": ["NumericColumn_exepted"],
"missing_types": "NONE",
"inf_values": "NONE",
"date_format": "NONE",
"duplicates": "ALL",
"date_range": "NONE",
"numeric_range": "NONE"
}
replace_dict = {-996 : -999,
"1000-01-09": "1850-01-09"}
Initializing TnS class
tns = Refiner(dataframe = df,
replace_dict = replace_dict,
unexpected_exceptions_duv = unexpected_exceptions)
function for detecting column types
tns.get_type_dict_from_dataframe()
{'num_id': 'int64',
'NumericColumn': 'float64',
'NumericColumn_exepted': 'float64',
'NumericColumn2': 'float64',
'NumericColumn3': 'int64',
'DateColumn': 'datetime64[ns]',
'DateColumn2': 'datetime64[ns]',
'DateColumn3': 'object',
'CharColumn': 'object'}
adding expected types
types_dict_str = {'num_id' : 'int64',
'NumericColumn' : 'float64',
'NumericColumn_exepted' : 'float64',
'NumericColumn2' : 'float64',
'NumericColumn3' : 'int64',
'DateColumn' : 'datetime64[ns]',
'DateColumn2' : 'datetime64[ns]',
'DateColumn3' : 'datetime64[ns]',
'CharColumn' : 'object'}
moulding types
tns.set_types(type_dict = types_dict_str)
tns.get_type_dict_from_dataframe()
{'num_id': 'int64',
'NumericColumn': 'float64',
'NumericColumn_exepted': 'float64',
'NumericColumn2': 'float64',
'NumericColumn3': 'int64',
'DateColumn': 'datetime64[ns]',
'DateColumn2': 'datetime64[ns]',
'DateColumn3': 'datetime64[ns]',
'CharColumn': 'object'}
Check independent conditions
tns.check_missing_types()
tns.check_missing_values()
tns.check_inf_values()
tns.check_col_names_types()
tns.check_date_format()
tns.check_duplicates()
tns.check_numeric_range()
WARNING:Refiner:Column NumericColumn_exepted: (-999) : 1 : 20.00%
WARNING:Refiner:Column DateColumn3: (1850-01-09) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column DateColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column CharColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn: (INF) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (INF) : 1 : 20.00%
WARNING:Refiner:Column DateColumn2 has non-date values or unexpected format.
Using the main function to detect unexpected values
tns.detect_unexpected_values(earliest_date = "1920-01-01",
latest_date = "DateColumn3")
DEBUG:Refiner:=== checking column names and types
DEBUG:Refiner:=== checking for presence of missing values
WARNING:root:Column CharColumn: (NA) : 2 : 40.00%
WARNING:root:Column DateColumn2: (NA) : 4 : 80.00%
WARNING:root:Column NumericColumn: (NA) : 2 : 40.00%
WARNING:root:Column NumericColumn2: (NA) : 4 : 80.00%
DEBUG:Refiner:=== checking for presence of missing types
WARNING:root:Column DateColumn3: (1850-01-09) : 2 : 40.00%
WARNING:root:Column NumericColumn_exepted: (-999) : 1 : 20.00%
DEBUG:Refiner:=== checking propper date format
WARNING:root:Column DateColumn2 has non-date values or unexpected format.
DEBUG:Refiner:=== checking expected date range
WARNING:root:** Not all dates in DateColumn are later than DateColumn3
WARNING:root:Column DateColumn : future date : 4 : 80.00%
DEBUG:Refiner:=== checking for presense of inf values in numeric colums
WARNING:root:Column NumericColumn: (INF) : 2 : 40.00%
WARNING:root:Column NumericColumn_exepted: (INF) : 1 : 20.00%
DEBUG:Refiner:=== checking expected numeric range
WARNING:Refiner:Percentage of passed tests: 50.00%
tns.duv_score
0.5
Using function to replace unexpected values with missing types
tns.replace_unexpected_values(numeric_lower_bound = "NumericColumn3",
numeric_upper_bound = 4,
earliest_date = "1920-01-02",
latest_date = "DateColumn2",
unexpected_exceptions = {"irregular_values": "NONE",
"date_range": "DateColumn",
"numeric_range": "NONE",
"capitalization": "NONE",
"unicode_character": "NONE"})
DEBUG:Refiner:=== replacing missing values in category cols with missing types
DEBUG:Refiner:=== replacing all upper case characters with lower case
DEBUG:Refiner:=== replacing character unicode to latin
DEBUG:Refiner:=== replacing missing values in date cols with missing types
DEBUG:Refiner:=== replacing missing values in numeric cols with missing types
DEBUG:Refiner:=== replacing values outside of expected date range
DEBUG:Refiner:=== replacing values outside of expected numeric range
DEBUG:Refiner:** Usable values in the dataframe: 44.44%
DEBUG:Refiner:** Uncorrected data quality score: 32.22%
DEBUG:Refiner:** Corrected data quality score: 52.57%
tns.dataframe
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
num_id | NumericColumn | NumericColumn_exepted | NumericColumn2 | NumericColumn3 | DateColumn | DateColumn2 | DateColumn3 | CharColumn | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1.0 | 1.0 | -999.0 | 1 | 2022-01-01 | 1850-01-09 | 1850-01-09 | fol |
1 | 2 | -999.0 | -999.0 | -999.0 | 2 | 2022-01-02 | 2022-01-01 | 2022-01-01 | missing |
2 | 3 | -999.0 | -999.0 | -999.0 | 3 | 2022-01-03 | 1850-01-09 | 1850-01-09 | missing |
3 | 4 | -999.0 | -999.0 | -999.0 | 4 | 2022-01-04 | 1850-01-09 | 1850-01-09 | not expected |
4 | 5 | -999.0 | -999.0 | -999.0 | -999 | 2022-01-05 | 1850-01-09 | 1850-01-09 | missing |
Use complex targeted conditions
unexpected_conditions = {
'1': {
'description': 'Replace numeric missing with with zero',
'group': 'regex_columns',
'features': r'^Numeric',
'query': "{col} < 0",
'warning': False,
'set': 0
},
'2': {
'description': "Clean text column from '-ing' endings and 'not ' beginings",
'group': 'regex clean',
'features': ['CharColumn'],
'query': [r'ing', r'^not.'],
'warning': False,
'set': ''
},
'3': {
'description': "Replace numeric values in certain column with zeros if > 2",
'group': 'multicol mapping',
'features': ['NumericColumn3'],
'query': '{col} > 2',
'warning': False,
'set': 0
},
'4': {
'description': "Replace strings with values if some part of the string is detected",
'group': 'string check',
'features': ['CharColumn'],
'query': f"CharColumn.str.contains('cted', regex = True)",
'warning': False,
'set': 'miss'
}
}
tns.replace_unexpected_values(unexpected_conditions = unexpected_conditions)
DEBUG:Refiner:=== replacing missing values in category cols with missing types
DEBUG:Refiner:=== replacing all upper case characters with lower case
DEBUG:Refiner:=== replacing character unicode to latin
DEBUG:Refiner:=== replacing with additional cons
DEBUG:root:Replace numeric missing with with zero
DEBUG:root:Clean text column from '-ing' endings and 'not ' beginings
DEBUG:root:Replace numeric values in certain column with zeros if > 2
DEBUG:root:Replace strings with values if some part of the string is detected
DEBUG:Refiner:=== replacing missing values in date cols with missing types
DEBUG:Refiner:=== replacing missing values in numeric cols with missing types
DEBUG:Refiner:=== replacing values outside of expected date range
DEBUG:Refiner:=== replacing values outside of expected numeric range
DEBUG:Refiner:** Usable values in the dataframe: 82.22%
DEBUG:Refiner:** Uncorrected data quality score: 88.89%
DEBUG:Refiner:** Corrected data quality score: 97.53%
tns.dataframe
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
num_id | NumericColumn | NumericColumn_exepted | NumericColumn2 | NumericColumn3 | DateColumn | DateColumn2 | DateColumn3 | CharColumn | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1.0 | 1.0 | 0.0 | 1 | 2022-01-01 | 1850-01-09 | 1850-01-09 | fol |
1 | 2 | 0.0 | 0.0 | 0.0 | 2 | 2022-01-02 | 2022-01-01 | 2022-01-01 | miss |
2 | 3 | 0.0 | 0.0 | 0.0 | 0 | 2022-01-03 | 1850-01-09 | 1850-01-09 | miss |
3 | 4 | 0.0 | 0.0 | 0.0 | 0 | 2022-01-04 | 1850-01-09 | 1850-01-09 | miss |
4 | 5 | 0.0 | 0.0 | 0.0 | 0 | 2022-01-05 | 1850-01-09 | 1850-01-09 | miss |
tns.detect_unexpected_values(unexpected_exceptions = {
"col_names_types": "NONE",
"missing_values": "NONE",
"missing_types": "ALL",
"inf_values": "NONE",
"date_format": "NONE",
"duplicates": "ALL",
"date_range": "NONE",
"numeric_range": "NONE"
})
DEBUG:Refiner:=== checking column names and types
DEBUG:Refiner:=== checking for presence of missing values
DEBUG:Refiner:=== checking propper date format
DEBUG:Refiner:=== checking expected date range
DEBUG:Refiner:=== checking for presense of inf values in numeric colums
DEBUG:Refiner:=== checking expected numeric range
Scores
print(f'duv_score: {tns.duv_score :.4}')
print(f'ruv_score0: {tns.ruv_score0 :.4}')
print(f'ruv_score1: {tns.ruv_score1 :.4}')
print(f'ruv_score2: {tns.ruv_score2 :.4}')
duv_score: 1.0
ruv_score0: 0.8222
ruv_score1: 0.8889
ruv_score2: 0.9753
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for refineryframe-0.0.4-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 3793ce9696fac0d9c858ae9e9f459cc20f6011fb4c7698f26be78bfdd4835da1 |
|
MD5 | a6231290afd4c777d929b3c10ce79424 |
|
BLAKE2b-256 | b37a0f4ebc105e39d0dfd4ed874ed2ff98f3e71e9f32e9ad620783086618cddd |