Fairness-Agnostic Data Optimization
FairDo is a Python package for mitigating bias in data. It can be used to create datasets to comply with the AI Act:
> [...] data sets should also have the appropriate statistical properties, including as regards the persons or groups of persons in relation to whom the high-risk AI system is intended to be used, with specific attention to the mitigation of possible biases in the data sets. [...]
- Official repository of: Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes
- Documentation: https://fairdo.readthedocs.io/en/latest/
- Source Code: https://github.com/mkduong-ai/fairdo/tree/main
Why FairDo?
- Fairness: Minimizes discrimination in datasets
- Interpretability and integrity: Uses under- and oversampling, so original data points are never altered
- Works with tabular data: `pandas.DataFrame`
- Simplicity: Follows the `.fit_transform()` convention and includes many examples
- Handles a variety of cases: non-binary groups, multiple protected attributes, individual fairness
- Customizable: custom fairness definition, solver, objective (see the sketch below)
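As a sketch of the customization point above: the metric below is our own example, and the `disc_measure` keyword is an assumption about how a custom fairness definition is passed in; check the fairdo documentation for the exact parameter name.

```python
import numpy as np

# fairdo package
from fairdo.utils.dataset import load_data
from fairdo.preprocessing import DefaultPreprocessing

# Hypothetical custom metric: mean absolute difference of positive rates
# over all pairs of protected groups. It mirrors the (labels, groups)
# call signature of fairdo.metrics.statistical_parity_abs_diff_max.
def statistical_parity_abs_diff_mean(y, z, **kwargs):
    y, z = np.asarray(y), np.asarray(z)
    rates = [y[z == g].mean() for g in np.unique(z)]
    diffs = [abs(a - b) for i, a in enumerate(rates) for b in rates[i + 1:]]
    return float(np.mean(diffs)) if diffs else 0.0

data, label, protected_attributes = load_data('compas', print_info=False)

# `disc_measure` is assumed to be the keyword for a custom fairness
# definition; the exact name may differ in your fairdo version.
preprocessor = DefaultPreprocessing(protected_attribute=protected_attributes[0],
                                    label=label,
                                    disc_measure=statistical_parity_abs_diff_mean)
data_fair = preprocessor.fit_transform(dataset=data)
```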
How does it work?
The pre-processing methods (`fairdo.preprocessing.HeuristicWrapper` and `fairdo.preprocessing.DefaultPreprocessing`) work by removing discriminatory data points. By doing so, the dataset becomes much more balanced and less biased towards a particular social group. We approach this task as a combinatorial optimization problem: selecting the subset of the dataset that minimizes the discrimination score. Because there are exponentially many subsets, our approach uses genetic algorithms.
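Formally (our notation, summarizing the idea above rather than quoting the paper), the task is to find

$$\min_{S \subseteq D} \psi(S),$$

where $D$ is the dataset and $\psi$ is the chosen discrimination measure, e.g., the maximum absolute statistical disparity between protected groups. Since there are $2^{|D|}$ candidate subsets, exhaustive search is infeasible; the genetic algorithm instead evolves binary inclusion masks over $D$.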
Quick Example
In the following example, we use the COMPAS [1] dataset. The protected attribute is race and the label is recidivism. Here, we deploy the default pre-processor, which internally uses a genetic algorithm, to remove discriminatory samples from the given dataset. The default pre-processor prevents removing all individuals of a single group.
```python
# fairdo package
from fairdo.utils.dataset import load_data
from fairdo.preprocessing import DefaultPreprocessing
# fairdo metrics
from fairdo.metrics import statistical_parity_abs_diff_max

# Loading a sample dataset with all required information
# data is a pandas.DataFrame
data, label, protected_attributes = load_data('compas', print_info=False)

# Initialize DefaultPreprocessing object
preprocessor = DefaultPreprocessing(protected_attribute=protected_attributes[0],
                                    label=label)

# Fit and transform the data
data_fair = preprocessor.fit_transform(dataset=data)

# Print no. of samples and discrimination before and after
disc_before = statistical_parity_abs_diff_max(data[label],
                                              data[protected_attributes[0]].to_numpy())
disc_after = statistical_parity_abs_diff_max(data_fair[label],
                                             data_fair[protected_attributes[0]].to_numpy())
print(len(data), disc_before)
print(len(data_fair), disc_after)
```
By running this example, the resulting dataset usually exhibits a maximum statistical disparity below 1% (the maximum taken over all five race groups), while the original dataset exhibits a statistical disparity of 27%.
Advice
:rocket: For a quick start, use the `DefaultPreprocessing` class with its default settings. An example is given in `tutorials/1. Default Preprocessor`.

:white_check_mark: For data quality, you want to keep the original data $D$ and only add fair synthetic data $G$ on top of it. (We include examples in `tutorials/` where we use the SDV package to generate synthetic data.) For this, specify `.fit_transform(approach='add')` for the pre-processors `fairdo.preprocessing.HeuristicWrapper` and `fairdo.preprocessing.DefaultPreprocessing`. This will only add the fair pre-processed synthetic data to the original data.
:dash: When data is limited, we advise additionally generating synthetic data $G$ and merging it with the original data $D$, i.e., $D \cup G$. The pre-processor can then be applied to the merged data $D \cup G$ to ensure fairness (see the sketch after this list). It is also possible to use the data quality methodology described above.
:briefcase: When anonymity is required, only use synthetic data $G$ and do not merge it with the original data $D$. The generated data $G$ can then be pre-processed with our methods to ensure fairness.
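The limited-data workflow above can be sketched as follows. The SDV calls (`SingleTableMetadata`, `GaussianCopulaSynthesizer`) follow SDV's 1.x single-table API as we understand it, and the fairdo calls mirror the quick example; treat this as a sketch rather than a canonical recipe.

```python
import pandas as pd

# fairdo package
from fairdo.utils.dataset import load_data
from fairdo.preprocessing import DefaultPreprocessing

# SDV package (optional dependency, see Installation)
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Original (limited) data D
data, label, protected_attributes = load_data('compas', print_info=False)

# Generate synthetic data G with SDV
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=len(data))

# Merge D and G, then remove discriminatory samples from the union
merged_data = pd.concat([data, synthetic_data], ignore_index=True)
preprocessor = DefaultPreprocessing(protected_attribute=protected_attributes[0],
                                    label=label)
merged_fair = preprocessor.fit_transform(dataset=merged_data)
```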
Installation
First, set up a Python environment; we recommend Miniconda. Afterwards, activate the created environment and install our package. A detailed guide follows.
Dependencies
Python (==3.8), `numpy`, `scipy`, `pandas`, `sklearn`
Setup Conda Environment
Download Miniconda here.
```bash
# Create a conda virtual environment
conda create -n "venv" python=3.8
# Activate conda environment
conda activate venv
```
OR
Setup Python Environment
```bash
# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
# On Windows:
.venv\Scripts\activate
# On macOS and Linux:
source .venv/bin/activate
```
PyPI Distribution (recommended)
The package is distributed via PyPI and can be directly installed with:
```bash
pip install fairdo
```
Manual Installation (latest version)
To install the latest (development) version, execute the following commands:
```bash
# Clone repo
git clone https://github.com/mkduong-ai/fairdo.git

# Move to repo folder
cd fairdo

# Install from source
python setup.py install
```
Install Optional Dependencies
To use synthetic data generation, you can install the SDV package by executing the following command:
```bash
pip install sdv==1.10.0
```
We did not include the SDV package as a dependency because it is not required for the core functionality of the FairDo package. Using any other synthetic data generation package is also possible. Still, some examples in the `tutorials/` folder require the SDV package.
Citation
When using FairDo in your work, cite our paper:
```bibtex
@inproceedings{duong2023framework,
  title={Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes},
  author={Duong, Manh Khoi and Conrad, Stefan},
  booktitle={Data Science and Machine Learning},
  publisher={Springer Nature Singapore},
  number={CCIS 1943},
  series={AusDM: Australasian Conference on Data Science and Machine Learning},
  year={2023},
  pages={105--120},
  isbn={978-981-99-8696-5},
}
```
Notes
We credit OpenMoji for the emojis used in our logo.