Data Balance Analysis and Error Mitigation in Python
Project
This repo contains a Python library that helps users, including data scientists, debug and mitigate errors in their data so that they can build fairer and less biased models starting from the data cleaning stage.
The library has two main components: Data Balance Analysis and Error Mitigation.
The goal of Data Balance Analysis is to provide metrics that help determine how balanced the training data is.
Notebook Examples
- Data Balance Analysis Walk Through
- Data Balance Analysis Adult Census Example
- Random Sample Mitigation Example
- Data Rebalance Mitigation Example
- Data Split Example
- Data Transformer Example
- End to End Notebook
Maintainers
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
Data Balance Analysis
FeatureBalanceMeasure
Constructor parameters:
- label_col : the name of the column that contains the label for the data.
- sensitive_cols : a list of the columns of interest to analyze for data imbalances.
measures(df)
Parameters:
- df : the Pandas DataFrame to calculate the Feature Balance Measures on.
DistributionBalanceMeasure
Constructor parameters:
- sensitive_cols : a list of the columns of interest to analyze for data imbalances.
measures(df)
Parameters:
- df : the Pandas DataFrame to calculate the Distribution Balance Measures on.
AggregateBalanceMeasure
Constructor parameters:
- sensitive_cols : a list of the columns of interest to analyze for data imbalances.
measures(df)
Parameters:
- df : the Pandas DataFrame to calculate the Aggregate Balance Measures on.
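As a quick illustration, here is a minimal sketch of how the three measure classes fit together, assuming they are importable from raimitigations.databalanceanalysis (adjust the import path to your installed version):

```python
import pandas as pd
from raimitigations.databalanceanalysis import (  # assumed import path
    FeatureBalanceMeasure,
    DistributionBalanceMeasure,
    AggregateBalanceMeasure,
)

# Toy dataset with two sensitive columns and a binary label.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "race":   ["A", "A", "B", "B", "A", "B", "A", "B"],
    "label":  [1, 0, 1, 1, 0, 0, 1, 0],
})
sensitive_cols = ["gender", "race"]

# Feature balance: compares pairs of values within each sensitive column
# with respect to the label.
print(FeatureBalanceMeasure(sensitive_cols, "label").measures(df))

# Distribution balance: compares each column's value distribution to a
# reference distribution.
print(DistributionBalanceMeasure(sensitive_cols).measures(df))

# Aggregate balance: summarizes balance across all sensitive columns combined.
print(AggregateBalanceMeasure(sensitive_cols).measures(df))
```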
Data Processing: a preprocessing component that helps with sampling, splitting, rebalancing, and transforming a dataset.
RandomSample
sample(dataset, target, sample_size=0.9, stratify=False)
Returns a random sample or a random stratified sample of the data. Scikit-learn is used to enable this functionality.
Parameters:
dataset : Pandas DataFrame.
target : str, int
- When str, it is the name of the label column.
- When int, it is the zero-based integer index of the label column.
sample_size :
- The sample size as a fraction of the dataset. The default is 0.9, which splits the dataset 90%/10% and returns the 90% portion as the sample.
stratify : bool, default=False
- If True, the data is sampled in a stratified fashion, using the target column as the class labels.
Return: A Pandas DataFrame.
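A minimal usage sketch follows; the import path raimitigations.dataprocessing is an assumption, so adjust it to match your installed version:

```python
import pandas as pd
from raimitigations.dataprocessing import sample  # assumed import path

df = pd.DataFrame({
    "age":   [22, 35, 58, 41, 30, 27, 64, 50],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Plain random sample covering 50% of the rows.
subset = sample(df, "label", sample_size=0.5)

# Stratified sample: the label proportions of df are preserved in the sample.
stratified = sample(df, "label", sample_size=0.5, stratify=True)
print(stratified)
```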
Split
split(dataset, target, train_size=0.9, random_state=None, categorical_features=True, drop_null=True, drop_duplicates=False, stratify=False)
Built on sklearn.model_selection.train_test_split.
Split the dataset into training and testing sets. In the process, we handle null values and duplicate records, and transform all categorical features.
Parameters:
dataset : Pandas DataFrame
target : str, int
- When str, it is the name of the label column.
- When int, it is the zero-based integer index of the label column.
train_size :
- The training data split size. The default is 0.9, which splits the dataset into 90% training and 10% testing. The training and test split values add up to 1.
random_state :
- Controls the randomization of the algorithm.
- None: the random number generator is the RandomState instance used by np.random.
categorical_features : bool, default=True
- A flag indicating the presence of categorical features.
drop_null : bool, default=True
- If True, records with null values are dropped; otherwise they are replaced by the mean.
drop_duplicates : bool, default=False
- If True, duplicate records are dropped.
stratify : bool, default=False
- If True, the data is split in a stratified fashion, using the target column as the class labels.
Return: A NumPy array.
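A minimal sketch of a split call under the same import-path assumption; note the documented return type is a NumPy array, so the exact shape of the result may vary by version:

```python
import pandas as pd
from raimitigations.dataprocessing import split  # assumed import path

df = pd.DataFrame({
    "age":   [22, 35, 58, 41, 30, 27, 64, 50],
    "city":  ["NY", "LA", "NY", "SF", "LA", "SF", "NY", "LA"],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# 80%/20% split with a fixed seed; "city" is encoded because
# categorical_features=True, and rows with nulls would be dropped.
result = split(
    df,
    "label",
    train_size=0.8,
    random_state=42,
    categorical_features=True,
    drop_null=True,
)
# `result` is documented as a NumPy array; inspect it to see how the
# train/test pieces are packed in your version.
print(result)
```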
Rebalance
rebalance(dataset, target, sampling_strategy='auto', random_state=None, smote_tomek=None, smote=None, tomek=None)
Combine over- and under-sampling using SMOTETomek: over-sampling with SMOTE and under-sampling with Tomek links.
Parameters:
dataset :
- A Pandas DataFrame representing the data to rebalance.
target : str, int
- Used as the classes for rebalancing the data.
- When str, it is the name of the label column.
- When int, it is the zero-based integer index of the label column.
sampling_strategy : str
- 'minority': resample only the minority class.
- 'not minority': resample all classes but the minority class.
- 'not majority': resample all classes but the majority class.
- 'all': resample all classes.
- 'auto': equivalent to 'not majority'.
random_state :
- Controls the randomization of the algorithm.
- None: the random number generator is the RandomState instance used by np.random.
- If int, random_state is the seed used by the random number generator.
smote_tomek : The SMOTETomek object to use.
- If not given by the caller, a SMOTETomek object with default parameters will be used.
smote : The SMOTE object to use.
- If not given by the caller, a SMOTE object with default parameters will be used.
tomek : The TomekLinks object to use.
- If not given by the caller, a TomekLinks object with sampling_strategy='all' will be used.
Return: A rebalanced NumPy array.
Note: A rebalance call using a SMOTETomek object can fail with the message: Expected n_neighbors <= n_samples, but n_samples = 3, n_neighbors = 6. This happens when the data is imbalanced and a class has fewer samples (3 in the error above) than the number of neighbors SMOTE requires (6 above). A workaround is to rebalance with explicit SMOTE and Tomek objects instead of SMOTETomek, as sketched below.
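Here is a hedged sketch of that workaround, passing explicit SMOTE and TomekLinks objects (real imbalanced-learn classes) instead of relying on the combined SMOTETomek path; the raimitigations import path is again an assumption:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks
from raimitigations.dataprocessing import rebalance  # assumed import path

# Imbalanced toy data: 4 positive samples vs. 6 negative samples.
df = pd.DataFrame({
    "x1":    [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2, 5.3, 4.8],
    "x2":    [0.5, 0.4, 0.6, 0.5, 2.0, 2.1, 1.9, 2.2, 2.0, 1.8],
    "label": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

# SMOTE's default k_neighbors=5 would fail here (only 4 minority samples),
# so lower k_neighbors and pass SMOTE and TomekLinks separately.
balanced = rebalance(
    df,
    "label",
    smote=SMOTE(k_neighbors=2, random_state=42),
    tomek=TomekLinks(sampling_strategy="all"),
)
print(balanced)
```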
Transform
transform(dataset, target, transformer_type, transform_features=None, random_state=None, method=None, output_distribution=None)
Transform the data into a standardized or a normalized form.
Parameters:
dataset :
- A Pandas DataFrame representing the data to transform.
target : str, int
- When str, it is the name of the label column.
- When int, it is the zero-based integer index of the label column.
transformer_type : enum
An Enum of the available transformations:
- StandardScaler: sklearn.preprocessing.StandardScaler
Standardize features by removing the mean and scaling to unit variance: z = (x - u) / s, where u is the mean of the training samples (or zero if with_mean=False) and s is the standard deviation of the training samples (or one if with_std=False).
- MinMaxScaler: sklearn.preprocessing.MinMaxScaler
Transform features by scaling each feature to a given range. This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
- RobustScaler: sklearn.preprocessing.RobustScaler
Scale features using statistics that are robust to outliers. This scaler removes the median and scales the data according to the quantile range (defaults to the IQR: interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
- PowerTransformer: sklearn.preprocessing.PowerTransformer
Apply a power transform feature-wise to make the data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance) or other situations where normality is desired. The Box-Cox transform requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
- QuantileTransformer: sklearn.preprocessing.QuantileTransformer
Transform features using quantile information. This method transforms the features to follow a uniform or a normal distribution. For a given feature, this transformation tends to spread out the most frequent values and reduces the impact of (marginal) outliers; it is therefore a robust preprocessing scheme.
- Normalizer: sklearn.preprocessing.Normalizer
Normalize samples individually to unit norm. Each sample (i.e., each row of the data matrix) with at least one nonzero component is rescaled independently of other samples so that its norm (l1, l2, or inf) equals one.
transform_features : A list of the features to transform. The list can contain either the indexes or the names of the features.
random_state :
- Controls the randomization of the algorithm.
- None: the random number generator is the RandomState instance used by np.random.
method : str, default='yeo-johnson'
- Possible choices are 'yeo-johnson' and 'box-cox' (the methods of PowerTransformer).
output_distribution : str, default='uniform'
- Possible choices are 'uniform' and 'normal': the marginal distribution for the transformed data (used by QuantileTransformer).
Return: A NumPy array.
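A minimal sketch of a transform call; both the import path and the TransformerType enum spelling are assumptions inferred from the parameter description above, so check the names against your installed version:

```python
import pandas as pd
from raimitigations.dataprocessing import (  # assumed import path
    transform,
    TransformerType,  # assumed name of the transformations enum
)

df = pd.DataFrame({
    "income": [30000, 45000, 120000, 52000, 38000],
    "age":    [25, 40, 55, 33, 29],
    "label":  [0, 1, 1, 0, 0],
})

# Standardize the two numeric features: z = (x - u) / s.
result = transform(
    df,
    "label",
    transformer_type=TransformerType.StandardScaler,  # assumed member name
    transform_features=["income", "age"],
)
print(result)
```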
File details
Details for the file raimitigations-0.0.3.tar.gz
File metadata
- Download URL: raimitigations-0.0.3.tar.gz
- Upload date:
- Size: 18.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | db4929071330c954d762dd1f33c1c420a5dec8f20956d4869b3e92dd177fbe65
MD5 | 6215267fda27ebd393383f8911eca41e
BLAKE2b-256 | 1a5d544b4e8e96c2303953d6113cb6ceefd5dd39a5cf14289b3e185fe5874a0c
File details
Details for the file raimitigations-0.0.3-py3-none-any.whl
File metadata
- Download URL: raimitigations-0.0.3-py3-none-any.whl
- Upload date:
- Size: 21.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.0 CPython/3.9.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 133b9faf9b2c3682bc4313e24e38b39e9fc830e602f3490f51a9320130769005
MD5 | b867d3252a3ec1d175af1e76eb3b5f81
BLAKE2b-256 | 2f32ddf1bd380007874f6995eba92afec1a23eb92b8777c4e5b8a07b54a99a50