Skip to main content

Data Balance Analysis and Error mitigation steps on python

Project description

Project

This repo consists of a python library that aims to help users including data scientists debug and mitigate errors in their data so that they can build more fair and unbiased models starting from the data cleaning stage.

There are two main functions of this library: Data Balance Analysis and Error Mitigation

The goal of Data Balance Analysis to provide metrics that help to determine how balanced the data that is being trained on is.

Notebook Examples

Maintainers

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Data Balance Analysis:

FeatureBalanceMeasure

label_col : name of the column that contains the label for the data sensitive_cols : a list of the columns of interest to analyze for data imbalances measures(df) Parameters: df : Pandas Data Frame to calculate the Feature Balance Measures on

DistributionBalanceMeasure

sensitive_cols : a list of the columns of interest to analyze for data imbalances measures(df) Parameters: df : Pandas Data Frame to calculate the Distribution Balance Measures on

AggregateBalanceMeasure

sensitive_cols : a list of the columns of interest to analyze for data imbalances measures(df) Parameters: df : Pandas Data Frame to calculate the Aggregate Balance Measures on

Data Processing: Preprocessing data component to help with splitting and transforming a dataset.

RandomSample

sample (dataset, target, sample_size, stratify = False)*

Return a data random sample or random stratify sample. We use Sklearn to enable this functionality.

Parameters:

dataset : Pandas Data Frame.

target : str, int

  • When str, it is representing the name of the label column

  • Wnen int, it corresponds the label column integer index (zero base)

sample_size :

  • The training data split size. The default is 0.9, which split the dataset to 90% training and 10% testing. Training and Test split values add up to 1.

stratify : bool, default is False.

  • If not None, data is split in a stratified fashion, using this as the class labels.

Return: A Pandas Frame dataset.

Split

split (dataset, target, train_size = 0.9, random_state = None, categorical_features = True, drop_null = True, drop_duplicates = False, stratify = False)

sklearn.model_selection.train_test_split

Split the dataset into training and testing sets. In the process, we handle null values, duplicate records, and transform all categorical features.

Parameters:

dataset : Pandas Data Frame

target : str, int

  • When str, it is representing the name of the label column

  • Wnen int, it corresponds to the label column integer index (zero base) of the target feature "

    train_size :

  • The training data split size. The default is 0.9, which split the dataset to 90% training and 10% testing. Training and Test split values add up to 1.

random_state :

  • Control the randomization of the algorithm.

  • ‘None’: the random number generator is the RandomState instance used by np.random.

categorical_features : bool, default=True

  • A flag to indicates the presence of categorical features.

drop_null : bool, default=True

  • If flag is set to True, records with null values are dropped, otherwise they are replaced by the mean.

drop_duplicates : bool, default=False

  • If flag is set to True, duplicate records are dropped.

stratify : bool, default=False

  • If not None, data is split in a stratified fashion, using this as the class labels.

Return: A NumPy array

Rebalance

rebalance (dataset, target, sampling_strategy = ‘auto’, random_state = None, smote_tomek = None, smote = None, tomek = None)*

Combine over- and under-sampling using SMOTE Tomek. Over-sampling using SMOTE and under-sampling using Tomek links.

Parameters:

dataset :

  • A Pandas Data Frame representing the data to rebalance.

target : str, int

  • For using as the classes for rebalancing the data.

  • When str, it is representing the name of the label column

  • Wnen int, it corresponds to the label column integer index (zero base) of the target feature

sampling_strategy : str

  • 'minority': resample only the minority class.

  • 'not minority': resample all classes but the minority class.

  • 'not majority': resample all classes but the majority class.

  • 'all': resample all classes.

  • 'auto': equivalent to 'not majority'.

random_state :

Control the randomization of the algorithm.
  • ‘None’: the random number generator is the RandomState instance used by np.random.

  • ‘If Int’: random_state is the seed used by the random number generator.

smote_tomek : The SMOTETomek object to use.

smote : The SMOTE object to use.

tomek : The TomekLinks object to use.

Return: A rebalanced NumPy array.

Note: The DataRebalance call with SMOTETomek object to use could be failing with following message: Expected n_neighbors <= n_samples, but n_samples = 3, n_neighbors = 6 when the data are not perfectly balanced and there are not enough samples (3 in the shown above error) and the number of neighbors is 6. The workaround solution could be rebalance with SMOTE and Tomek objects instead of SMOTETomek

Transform

transform (dataset, target, random_state = None, transformer_type, transform_features= None, method=None, output_distribution=None)*

Transform the data into a standardized or a normalized form.

sklearn.preprocessing

Parameters:

dataset :

  • A Pandas Data Frame representing the data to transform.

target : str, int

  • When str, it is representing the name of the label column

  • Wnen int, it corresponds to the label column integer index (zero base) of the target feature

transformer_type : enum Enum object for available transformations.

  • StandardScaler: sklearn.preprocessing.StandardScaler

    Standardize features by removing the mean and scaling to unit variance.
    z = (x - u) / s (where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False).

  • MinMaxScaler: sklearn.preprocessing.MinMaxScaler

    Transform features by scaling each feature to a given range. This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

  • RobustScaler: sklearn.preprocessing.RobustScaler

    Scale features using statistics that are robust to outliers. This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

  • PowerTransformer: sklearn.preprocessing.PowerTransformer

    Apply a power transform feature-wise to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
    Box-Cox transform requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.

  • QuantileTransformer: sklearn.preprocessing.QuantileTransformer

    Transform features using quantiles information. This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

  • Normaliser: sklearn.preprocessing.Normalizer

    Normalize samples individually to unit norm. Each sample (i.e. each row of the data matrix) with at least one nonzero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

transform_features : List of the features to transform. The list could

be the indexes or the names of the features.

random_state : Control the randomization of the algorithm.

‘None’: the random number generator is the RandomState instance used by
np.random.

method : str, default=’yeo-johnson’ Possible choices are: ‘yeo-johnson’

‘box-cox’

output_distribution : str, default=‘uniform’ Possible choices are:

‘uniform’ ‘normal’ Marginal distribution for the transformed data.

Return: A NumPy array.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

raimitigations-0.0.3.tar.gz (18.7 kB view hashes)

Uploaded Source

Built Distribution

raimitigations-0.0.3-py3-none-any.whl (21.8 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page