
Cleans data; best used as part of an initial preprocessing step.

Project description


refineryframe

The goal of the package is to simplify life for data scientists who have to deal with imperfect raw data. The package is designed to detect and clean unexpected values, while doubling as a safeguard in production code based on predefined conditions that arise from business assumptions or any other source. It is well suited to be an initial preprocessing step in ML pipelines, situated between the data gathering and training/scoring steps.

Developed by Kyrylo Mordan (c) 2023

Installation

Install refineryframe via pip with

pip install refineryframe

Feature List

  • refineryframe.refiner.Refiner.check_col_names_types - checks if a given dataframe has the same column names as keys in a given dictionary and those columns have the same types as items in the dictionary.
  • refineryframe.refiner.Refiner.check_date_format - checks if the values in the datetime columns of the input dataframe have the expected 'YYYY-MM-DD' format.
  • refineryframe.refiner.Refiner.check_date_range - checks if dates are in expected ranges.
  • refineryframe.refiner.Refiner.check_duplicates - checks for duplicates in a pandas DataFrame.
  • refineryframe.refiner.Refiner.check_inf_values - counts the inf values in each column of a pandas DataFrame.
  • refineryframe.refiner.Refiner.check_missing_types - takes a DataFrame and a dictionary of missing types as input, and searches for any instances of these missing types in each column of the DataFrame.
  • refineryframe.refiner.Refiner.check_missing_values - counts the number of NaN, None, and NaT values in each column of a pandas DataFrame.
  • refineryframe.refiner.Refiner.check_numeric_range - checks if numeric values are in expected ranges.
  • refineryframe.refiner.Refiner.detect_unexpected_values - detects unexpected values in a pandas DataFrame.
  • refineryframe.refiner.Refiner.get_refiner_settings - extracts parameter values from the Refiner instance and saves them in a dictionary for later use.
  • refineryframe.refiner.Refiner.get_type_dict_from_dataframe - returns a dictionary or string representation of a dictionary containing the data types of each column in the given pandas DataFrame.
  • refineryframe.refiner.Refiner.get_unexpected_exceptions_scaned - returns an unexpected_exceptions dictionary with settings adjusted to the values in the dataframe.
  • refineryframe.refiner.Refiner.replace_unexpected_values - replaces unexpected values in a pandas DataFrame with missing types.
  • refineryframe.refiner.Refiner.set_refiner_settings - updates input parameters with values from provided settings dict.
  • refineryframe.refiner.Refiner.set_type_dict - changes the data types of the columns in the given DataFrame based on a dictionary of intended data types.
  • refineryframe.refiner.Refiner.set_types - changes the data types of the columns in the given DataFrame based on a dictionary of intended data types.

Package usage example


Creating example data (exceptionally messy dataframe)

import os 
import sys 
import numpy as np
import pandas as pd
import logging
sys.path.append(os.path.dirname(sys.path[0])) 
from refineryframe.refiner import Refiner
df = pd.DataFrame({
    'num_id' : [1, 2, 3, 4, 5],
    'NumericColumn': [1, -np.inf, np.inf,np.nan, None],
    'NumericColumn_exepted': [1, -996, np.inf,np.nan, None],
    'NumericColumn2': [None, None, 1,None, None],
    'NumericColumn3': [1, 2, 3, 4, 5],
    'DateColumn': pd.date_range(start='2022-01-01', periods=5),
    'DateColumn2': [pd.NaT,pd.to_datetime('2022-01-01'),pd.NaT,pd.NaT,pd.NaT],
    'DateColumn3': ['2122-05-01',
                    '2022-01-01',
                    '2021-01-01',
                    '1000-01-09',
                    '1850-01-09'],
    'CharColumn': ['Fół', None, np.nan, 'nót eXpęćTęd', '']
})

df
   num_id  NumericColumn  NumericColumn_exepted  NumericColumn2  NumericColumn3  DateColumn  DateColumn2  DateColumn3    CharColumn
0       1            1.0                    1.0             NaN               1  2022-01-01          NaT   2122-05-01           Fół
1       2           -inf                 -996.0             NaN               2  2022-01-02   2022-01-01   2022-01-01          None
2       3            inf                    inf             1.0               3  2022-01-03          NaT   2021-01-01           NaN
3       4            NaN                    NaN             NaN               4  2022-01-04          NaT   1000-01-09  nót eXpęćTęd
4       5            NaN                    NaN             NaN               5  2022-01-05          NaT   1850-01-09

Defining specification for the dataframe

MISSING_TYPES = {'date_not_delivered': '1850-01-09',
                 'date_other_missing_type': '1850-01-08',
                 'numeric_not_delivered': -999,
                 'character_not_delivered': 'missing'}
unexpected_exceptions = {
    "col_names_types": "NONE",
    "missing_values": ["NumericColumn_exepted"],
    "missing_types": "NONE",
    "inf_values": "NONE",
    "date_format": "NONE",
    "duplicates": "ALL",
    "date_range": "NONE",
    "numeric_range": "NONE"
}
replace_dict = {-996 : -999,
                "1000-01-09": "1850-01-09"}

Initializing Refiner class

tns = Refiner(dataframe = df,
              replace_dict = replace_dict,
              loggerLvl = logging.DEBUG,
              unexpected_exceptions_duv = unexpected_exceptions)
Function for detecting column types
tns.get_type_dict_from_dataframe()
{'num_id': 'int64',
 'NumericColumn': 'float64',
 'NumericColumn_exepted': 'float64',
 'NumericColumn2': 'float64',
 'NumericColumn3': 'int64',
 'DateColumn': 'datetime64[ns]',
 'DateColumn2': 'datetime64[ns]',
 'DateColumn3': 'object',
 'CharColumn': 'object'}
Adding expected types
types_dict_str = {'num_id' : 'int64', 
                   'NumericColumn' : 'float64', 
                   'NumericColumn_exepted' : 'float64', 
                   'NumericColumn2' : 'float64', 
                   'NumericColumn3' : 'int64', 
                   'DateColumn' : 'datetime64[ns]', 
                   'DateColumn2' : 'datetime64[ns]', 
                   'DateColumn3' : 'datetime64[ns]', 
                   'CharColumn' : 'object'}

Use of simple general conditions

Check independent conditions

tns.check_missing_types()
tns.check_missing_values()
tns.check_inf_values()
tns.check_col_names_types()
tns.check_date_format()
tns.check_duplicates()
tns.check_numeric_range()
WARNING:Refiner:Column DateColumn3: (1850-01-09) : 1 : 20.00%
WARNING:Refiner:Column NumericColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column DateColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column CharColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn: (INF) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (INF) : 1 : 20.00%
WARNING:Refiner:Column DateColumn2 has non-date values or unexpected format.
Moulding types
tns.set_types(type_dict = types_dict_str)
tns.get_type_dict_from_dataframe() 
{'num_id': 'int64',
 'NumericColumn': 'float64',
 'NumericColumn_exepted': 'float64',
 'NumericColumn2': 'float64',
 'NumericColumn3': 'int64',
 'DateColumn': 'datetime64[ns]',
 'DateColumn2': 'datetime64[ns]',
 'DateColumn3': 'datetime64[ns]',
 'CharColumn': 'object'}

Using the main function to detect unexpected values

tns.detect_unexpected_values(earliest_date = "1920-01-01",
                             latest_date = "DateColumn3")
DEBUG:Refiner:=== checking column names and types
DEBUG:Refiner:=== checking for presence of missing values
WARNING:Refiner:Column CharColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column DateColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column NumericColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn2: (NA) : 4 : 80.00%
DEBUG:Refiner:=== checking for presence of missing types
WARNING:Refiner:Column DateColumn3: (1850-01-09) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (-999) : 1 : 20.00%
DEBUG:Refiner:=== checking propper date format
WARNING:Refiner:Column DateColumn2 has non-date values or unexpected format.
DEBUG:Refiner:=== checking expected date range
WARNING:Refiner:** Not all dates in DateColumn are later than DateColumn3
WARNING:Refiner:Column DateColumn : future date : 4 : 80.00%
DEBUG:Refiner:=== checking for presense of inf values in numeric colums
WARNING:Refiner:Column NumericColumn: (INF) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (INF) : 1 : 20.00%
DEBUG:Refiner:=== checking expected numeric range
WARNING:Refiner:Percentage of passed tests: 50.00%
tns.duv_score
0.5

Using function to replace unexpected values with missing types

tns.replace_unexpected_values(numeric_lower_bound = "NumericColumn3",
                              numeric_upper_bound = 4,
                              earliest_date = "1920-01-02",
                              latest_date = "DateColumn2",
                              unexpected_exceptions = {"irregular_values": "NONE",
                                                       "date_range": "DateColumn",
                                                       "numeric_range": "NONE",
                                                       "capitalization": "NONE",
                                                       "unicode_character": "NONE"})
DEBUG:Refiner:=== replacing missing values in category cols with missing types
DEBUG:Refiner:=== replacing all upper case characters with lower case
DEBUG:Refiner:=== replacing character unicode to latin
DEBUG:Refiner:=== replacing missing values in date cols with missing types
DEBUG:Refiner:=== replacing missing values in numeric cols with missing types
DEBUG:Refiner:=== replacing values outside of expected date range
DEBUG:Refiner:=== replacing values outside of expected numeric range
DEBUG:Refiner:** Usable values in the dataframe:  44.44%
DEBUG:Refiner:** Uncorrected data quality score:  32.22%
DEBUG:Refiner:** Corrected data quality score:  52.57%

Use of complex targeted conditions

unexpected_conditions = {
    '1': {
        'description': 'Replace numeric missing with zero',
        'group': 'regex_columns',
        'features': r'^Numeric',
        'query': "{col} < 0",
        'warning': True,
        'set': 0
    },
    '2': {
        'description': "Clean text column from '-ing' endings and 'not ' beginnings",
        'group': 'regex clean',
        'features': ['CharColumn'],
        'query': [r'ing', r'^not.'],
        'warning': False,
        'set': ''
    },
    '3': {
        'description': "Detect/Replace numeric values in certain column with zeros if > 2",
        'group': 'multicol mapping',
        'features': ['NumericColumn3'],
        'query': '{col} > 2',
        'warning': True,
        'set': 0
    },
    '4': {
        'description': "Replace strings with values if some part of the string is detected",
        'group': 'string check',
        'features': ['CharColumn'],
        'query': "CharColumn.str.contains('cted', regex = True)",
        'warning': False,
        'set': 'miss'
    }
}
To detect unexpected values:
tns.detect_unexpected_values(unexpected_conditions = unexpected_conditions)
DEBUG:Refiner:=== checking column names and types
WARNING:Refiner:Incorrect data types:
WARNING:Refiner:Column num_id: actual dtype is object, expected dtype is int64
DEBUG:Refiner:=== checking for presence of missing values
DEBUG:Refiner:=== checking for presence of missing types
WARNING:Refiner:Column CharColumn: (missing) : 3 : 60.00%
WARNING:Refiner:Column DateColumn2: (1850-01-09) : 4 : 80.00%
WARNING:Refiner:Column DateColumn3: (1850-01-09) : 4 : 80.00%
WARNING:Refiner:Column NumericColumn: (-999) : 4 : 80.00%
WARNING:Refiner:Column NumericColumn_exepted: (-999) : 4 : 80.00%
WARNING:Refiner:Column NumericColumn2: (-999) : 5 : 100.00%
WARNING:Refiner:Column NumericColumn3: (-999) : 1 : 20.00%
DEBUG:Refiner:=== checking propper date format
DEBUG:Refiner:=== checking expected date range
DEBUG:Refiner:=== checking for presense of inf values in numeric colums
DEBUG:Refiner:=== checking expected numeric range
DEBUG:Refiner:=== checking additional cons
DEBUG:Refiner:Replace numeric missing with zero
WARNING:Refiner:Replace numeric missing with zero :: 1
DEBUG:Refiner:Detect/Replace numeric values in certain column with zeros if > 2
WARNING:Refiner:Detect/Replace numeric values in certain column with zeros if > 2 :: 2
WARNING:Refiner:Percentage of passed tests: 66.67%
To replace unexpected values:
tns.replace_unexpected_values(unexpected_conditions = unexpected_conditions)
DEBUG:Refiner:=== replacing missing values in category cols with missing types
DEBUG:Refiner:=== replacing all upper case characters with lower case
DEBUG:Refiner:=== replacing character unicode to latin
DEBUG:Refiner:=== replacing with additional cons
DEBUG:Refiner:Replace numeric missing with zero
DEBUG:Refiner:Clean text column from '-ing' endings and 'not ' beginnings
DEBUG:Refiner:Detect/Replace numeric values in certain column with zeros if > 2
DEBUG:Refiner:Replace strings with values if some part of the string is detected
DEBUG:Refiner:=== replacing missing values in date cols with missing types
DEBUG:Refiner:=== replacing missing values in numeric cols with missing types
DEBUG:Refiner:=== replacing values outside of expected date range
DEBUG:Refiner:=== replacing values outside of expected numeric range
DEBUG:Refiner:** Usable values in the dataframe:  82.22%
DEBUG:Refiner:** Uncorrected data quality score:  88.89%
DEBUG:Refiner:** Corrected data quality score:  97.53%
tns.dataframe
   num_id  NumericColumn  NumericColumn_exepted  NumericColumn2  NumericColumn3  DateColumn DateColumn2 DateColumn3 CharColumn
0       1            1.0                    1.0             0.0               1  2022-01-01  1850-01-09  1850-01-09        fol
1       2            0.0                    0.0             0.0               2  2022-01-02  2022-01-01  2022-01-01       miss
2       3            0.0                    0.0             0.0               0  2022-01-03  1850-01-09  1850-01-09       miss
3       4            0.0                    0.0             0.0               0  2022-01-04  1850-01-09  1850-01-09       miss
4       5            0.0                    0.0             0.0               0  2022-01-05  1850-01-09  1850-01-09       miss
tns.detect_unexpected_values(unexpected_exceptions = {
    "col_names_types": "NONE",
    "missing_values": "NONE",
    "missing_types": "ALL",
    "inf_values": "NONE",
    "date_format": "NONE",
    "duplicates": "ALL",
    "date_range": "NONE",
    "numeric_range": "NONE"
})
DEBUG:Refiner:=== checking column names and types
WARNING:Refiner:Incorrect data types:
WARNING:Refiner:Column num_id: actual dtype is object, expected dtype is int64
DEBUG:Refiner:=== checking for presence of missing values
DEBUG:Refiner:=== checking propper date format
DEBUG:Refiner:=== checking expected date range
DEBUG:Refiner:=== checking for presense of inf values in numeric colums
DEBUG:Refiner:=== checking expected numeric range
WARNING:Refiner:Percentage of passed tests: 88.89%

Scores

print(f'duv_score: {tns.duv_score :.4}')
print(f'ruv_score0: {tns.ruv_score0 :.4}')
print(f'ruv_score1: {tns.ruv_score1 :.4}')
print(f'ruv_score2: {tns.ruv_score2 :.4}')
duv_score: 0.8889
ruv_score0: 0.8222
ruv_score1: 0.8889
ruv_score2: 0.9753

Refiner class settings

import os 
import sys 
import numpy as np
import pandas as pd
import logging
sys.path.append(os.path.dirname(sys.path[0])) 
from refineryframe.refiner import Refiner
df = pd.DataFrame({
    'num_id' : [1, 2, 3, 4, 5],
    'NumericColumn': [1, -np.inf, np.inf,np.nan, None],
    'NumericColumn_exepted': [1, -996, np.inf,np.nan, None],
    'NumericColumn2': [None, None, 1,None, None],
    'NumericColumn3': [1, 2, 3, 4, 5],
    'DateColumn': pd.date_range(start='2022-01-01', periods=5),
    'DateColumn2': [pd.NaT,pd.to_datetime('2022-01-01'),pd.NaT,pd.NaT,pd.NaT],
    'DateColumn3': ['2122-05-01',
                    '2022-01-01',
                    '2021-01-01',
                    '1000-01-09',
                    '1850-01-09'],
    'CharColumn': ['Fół', None, np.nan, 'nót eXpęćTęd', '']
})

df
   num_id  NumericColumn  NumericColumn_exepted  NumericColumn2  NumericColumn3  DateColumn  DateColumn2  DateColumn3    CharColumn
0       1            1.0                    1.0             NaN               1  2022-01-01          NaT   2122-05-01           Fół
1       2           -inf                 -996.0             NaN               2  2022-01-02   2022-01-01   2022-01-01          None
2       3            inf                    inf             1.0               3  2022-01-03          NaT   2021-01-01           NaN
3       4            NaN                    NaN             NaN               4  2022-01-04          NaT   1000-01-09  nót eXpęćTęd
4       5            NaN                    NaN             NaN               5  2022-01-05          NaT   1850-01-09

Defining specification for the dataframe

MISSING_TYPES = {'date_not_delivered': '1850-01-09',
                 'date_other_missing_type': '1850-01-08',
                 'numeric_not_delivered': -999,
                 'character_not_delivered': 'missing'}
unexpected_exceptions = {
    "col_names_types": "NONE",
    "missing_values": "ALL",
    "missing_types": "ALL",
    "inf_values": "NONE",
    "date_format": "NONE",
    "duplicates": "ALL",
    "date_range": "NONE",
    "numeric_range": "ALL"
}
replace_dict = {-996 : -999,
                "1000-01-09": "1850-01-09"}

Initializing Refiner class

tns = Refiner(dataframe = df,
              replace_dict = replace_dict,
              loggerLvl = logging.DEBUG,
              unexpected_exceptions_duv = unexpected_exceptions)

Using the main function to detect unexpected values

tns.detect_unexpected_values()
DEBUG:Refiner:=== checking column names and types
DEBUG:Refiner:=== checking propper date format
WARNING:Refiner:Column DateColumn2 has non-date values or unexpected format.
DEBUG:Refiner:=== checking expected date range
DEBUG:Refiner:=== checking for presense of inf values in numeric colums
WARNING:Refiner:Column NumericColumn: (INF) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (INF) : 1 : 20.00%
WARNING:Refiner:Percentage of passed tests: 66.67%

Extracting Refiner settings

refiner_settings = tns.get_refiner_settings()
refiner_settings
{'replace_dict': {-996: -999, '1000-01-09': '1850-01-09'},
 'MISSING_TYPES': {'date_not_delivered': '1850-01-09',
  'numeric_not_delivered': -999,
  'character_not_delivered': 'missing'},
 'expected_date_format': '%Y-%m-%d',
 'mess': 'INITIAL PREPROCESSING',
 'shout_type': 'HEAD2',
 'logger_name': 'Refiner',
 'loggerLvl': 10,
 'dotline_length': 50,
 'lower_bound': -inf,
 'upper_bound': inf,
 'earliest_date': '1900-08-25',
 'latest_date': '2100-01-01',
 'ids_for_dedup': 'ALL',
 'unexpected_exceptions_duv': {'col_names_types': 'NONE',
  'missing_values': 'ALL',
  'missing_types': 'ALL',
  'inf_values': 'NONE',
  'date_format': 'NONE',
  'duplicates': 'ALL',
  'date_range': 'NONE',
  'numeric_range': 'ALL'},
 'unexpected_exceptions_ruv': {'irregular_values': 'NONE',
  'date_range': 'NONE',
  'numeric_range': 'NONE',
  'capitalization': 'NONE',
  'unicode_character': 'NONE'},
 'unexpected_conditions': None,
 'ignore_values': [],
 'ignore_dates': [],
 'type_dict': {}}

Initializing new clean Refiner

tns2 = Refiner(dataframe = df)

Scanning dataframe for unexpected conditions

scanned_unexpected_exceptions = tns2.get_unexpected_exceptions_scaned()
scanned_unexpected_exceptions
WARNING:Refiner:Column CharColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column DateColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column NumericColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column DateColumn3: (1850-01-09) : 1 : 20.00%
WARNING:Refiner:Column DateColumn2 has non-date values or unexpected format.
WARNING:Refiner:Column NumericColumn: (INF) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (INF) : 1 : 20.00%
WARNING:Refiner:Percentage of passed tests: 71.43%

{'col_names_types': 'NONE',
 'missing_values': 'ALL',
 'missing_types': 'ALL',
 'inf_values': 'ALL',
 'date_format': 'ALL',
 'duplicates': 'NONE',
 'date_range': 'NONE',
 'numeric_range': 'NONE'}

Detection before applying settings

tns2.detect_unexpected_values()
WARNING:Refiner:Column CharColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column DateColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column NumericColumn: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (NA) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn2: (NA) : 4 : 80.00%
WARNING:Refiner:Column DateColumn3: (1850-01-09) : 1 : 20.00%
WARNING:Refiner:Column DateColumn2 has non-date values or unexpected format.
WARNING:Refiner:Column NumericColumn: (INF) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (INF) : 1 : 20.00%
WARNING:Refiner:Percentage of passed tests: 71.43%

Using saved Refiner settings for a new instance

tns2.set_refiner_settings(refiner_settings)
tns2.detect_unexpected_values()
DEBUG:Refiner:=== checking column names and types
DEBUG:Refiner:=== checking propper date format
WARNING:Refiner:Column DateColumn2 has non-date values or unexpected format.
DEBUG:Refiner:=== checking expected date range
DEBUG:Refiner:=== checking for presense of inf values in numeric colums
WARNING:Refiner:Column NumericColumn: (INF) : 2 : 40.00%
WARNING:Refiner:Column NumericColumn_exepted: (INF) : 1 : 20.00%
WARNING:Refiner:Percentage of passed tests: 66.67%
tns3 = Refiner(dataframe = df, 
               unexpected_exceptions_duv = scanned_unexpected_exceptions)
tns3.detect_unexpected_values()
print(f'duv score: {tns3.duv_score}')
duv score: 1.0

Data quality scores

DUV score

The score is the result of checking general conditions with .detect_unexpected_values().

It is the percentage of checks that passed. Ideally the score is 1; in the worst case it is 0.

$$ score_{duv} = \frac{\sum^{n}_{i=1} \text{check}_{i}}{n} $$
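
As a rough illustration, the formula can be reproduced with a few lines of plain Python (a minimal sketch using hypothetical check outcomes, not the package internals):

# minimal sketch, not the package internals: duv_score as the fraction of passed checks
checks = [True, True, False, True, False, True]   # hypothetical outcomes of individual checks
duv_score = sum(checks) / len(checks)
print(f'duv_score: {duv_score:.4f}')              # 0.6667
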
RUV scores

The scores are the result of cleaning data with .replace_unexpected_values().

The goal of these scores originated in determining whether a dataset can be used to train a model, based on missing or effectively missing values. The most advanced one, RUV_score2, aims to accurately classify whether the dataset is hopeless, which can be used as a safeguard to signal a critical drop in the quality of collected data in production. These scores are experimental, so use them with caution. A minimal sketch of how they could be computed follows the formulas below.

  • RUV_score0 is the simplest one: it is just the difference between 1 and the proportion of missing values among all values in the dataframe. It can be understood as the usable portion of the dataframe.
$$ score_{ruv0} = 1 - \frac{\sum^{n}_{i=1}\sum^{m}_{j=1} \text{unusable}_{ij}}{n \cdot m} $$
  • RUV_score1 is a simplified version of RUV_score2: the median proportions are not squared, which makes the score worse for classification but better for tracking data quality over time.

$$ med_{col} = \text{med}_{j}\left( \frac{\sum^{n}_{i=1} \text{unusable}_{ij}}{n} \right) $$

count the number of unexpected values in each column, divide it by the number of rows, and take the median of that proportion across columns

$$ med_{row} = \text{med}_{i}\left( \frac{\sum^{m}_{j=1} \text{unusable}_{ij}}{m} \right) $$

count the number of unexpected values in each row, divide it by the number of columns, and take the median of that proportion across rows

$$ score_{ruv1} = 1 - \frac{med_{col} + med_{row}}{2} $$

  • RUV_score2 takes advantage of the fact that if too many rows or columns are completely unusable, or some mix of those situations occurs, the dataset as a whole becomes unusable. Values below 0.5 supposedly indicate a completely unusable dataset.

$$ score_{ruv2} = 1 - \frac{med_{col}^2 + med_{row}^2}{2} $$
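
As a rough illustration of the formulas above, the three scores can be computed with numpy from a hypothetical 0/1 indicator matrix marking unusable cells (a minimal sketch, not the package internals):

import numpy as np

# hypothetical indicator matrix: 1 marks a missing or effectively missing cell
unusable = np.array([[0, 0, 1],
                     [0, 1, 1],
                     [0, 0, 0],
                     [0, 1, 1]])
n_rows, n_cols = unusable.shape

# ruv0: share of usable cells in the whole dataframe
ruv_score0 = 1 - unusable.sum() / (n_rows * n_cols)

# median proportion of unusable cells per column and per row
med_col = np.median(unusable.sum(axis=0) / n_rows)
med_row = np.median(unusable.sum(axis=1) / n_cols)

ruv_score1 = 1 - (med_col + med_row) / 2
ruv_score2 = 1 - (med_col ** 2 + med_row ** 2) / 2

print(ruv_score0, ruv_score1, ruv_score2)   # 0.5833..., 0.5, 0.75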
