
Fairness-Agnostic Data Optimization


FairDo is a Python package for mitigating bias in data. It can be used to create datasets that comply with the AI Act:

[...] data sets should also have the appropriate statistical properties, including as regards the persons or groups of persons in relation to whom the high-risk AI system is intended to be used, with specific attention to the mitigation of possible biases in the data sets. [...]

Why FairDo?

  • Fairness: Minimizes discrimination in datasets
  • Interpretability and integrity: Under- and oversampling technique
  • Works with tabular data: pandas.DataFrame
  • Simplicity: Follows .fit_transform() convention and includes many examples
  • Handles a variety of cases: non-binary groups, multiple protected attributes, individual fairness
  • Customizable: custom fairness definition, solver, objective

How does it work?

The pre-processing methods (fairdo.preprocessing.HeuristicWrapper and fairdo.preprocessing.DefaultPreprocessing) work by removing discriminatory data points, which makes the dataset more balanced and less biased towards any particular social group. We treat this as a combinatorial optimization problem: select the subset of the dataset that minimizes the discrimination score. Because a dataset with $n$ samples has $2^n$ subsets, exhaustive search is infeasible, so our approach uses genetic algorithms to search heuristically.
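The subset-selection idea can be illustrated with a small, self-contained sketch. This is not FairDo's implementation: the function names (`parity_gap`, `genetic_subset_selection`), the toy data, and all hyperparameters are assumptions made for illustration, and the discrimination score is assumed to be the largest difference in positive-label rate between groups.

```python
import numpy as np


def parity_gap(labels, groups, mask):
    """Largest difference in positive-label rate between any two groups,
    computed only on the rows kept by the boolean mask."""
    rates = []
    for g in np.unique(groups):
        selected = mask & (groups == g)
        if not selected.any():
            return 1.0  # worst score: this subset removed an entire group
        rates.append(labels[selected].mean())
    return max(rates) - min(rates)


def genetic_subset_selection(labels, groups, pop_size=40, generations=60,
                             mutation_rate=0.02, seed=0):
    """Toy genetic algorithm over boolean masks (True = keep the row)."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    population = rng.random((pop_size, n)) < 0.5
    population[0] = True  # the full dataset is one of the initial candidates
    n_elite = pop_size // 2
    for _ in range(generations):
        fitness = np.array([parity_gap(labels, groups, m) for m in population])
        # Elitism: the better half survives unchanged
        elite = population[np.argsort(fitness)[:n_elite]]
        # Uniform crossover between randomly chosen elite parents
        n_children = pop_size - n_elite
        parents_a = elite[rng.integers(n_elite, size=n_children)]
        parents_b = elite[rng.integers(n_elite, size=n_children)]
        children = np.where(rng.random((n_children, n)) < 0.5, parents_a, parents_b)
        # Bit-flip mutation
        children ^= rng.random((n_children, n)) < mutation_rate
        population = np.vstack([elite, children])
    fitness = np.array([parity_gap(labels, groups, m) for m in population])
    best = int(np.argmin(fitness))
    return population[best], float(fitness[best])


# Toy data: three groups with different positive-label rates
data_rng = np.random.default_rng(1)
groups = data_rng.integers(0, 3, size=200)
labels = (data_rng.random(200) < np.array([0.2, 0.5, 0.7])[groups]).astype(int)

full_gap = parity_gap(labels, groups, np.ones(200, dtype=bool))
best_mask, best_gap = genetic_subset_selection(labels, groups)
print(f"gap on full data: {full_gap:.3f}, rows: 200")
print(f"gap on selected subset: {best_gap:.3f}, rows kept: {int(best_mask.sum())}")
```

Because the full dataset is part of the initial population and the best candidates always survive, the selected subset can never score worse than the full dataset. The sketch only minimizes the gap and does not penalize discarding many rows, which a practical objective would likely need to account for.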

Quick Example

In the following example, we use the COMPAS [1] dataset. The protected attribute is race and the label is recidivism. Here, we deploy the default pre-processor, which internally uses a genetic algorithm, to remove discriminatory samples from the given dataset. The default pre-processor never removes all individuals of a single group.

# fairdo package
from fairdo.utils.dataset import load_data
from fairdo.preprocessing import DefaultPreprocessing
# fairdo metrics
from fairdo.metrics import statistical_parity_abs_diff_max

# Loading a sample dataset with all required information
# data is a pandas.DataFrame
data, label, protected_attributes = load_data('compas', print_info=False)

# Initialize DefaultPreprocessing object
preprocessor = DefaultPreprocessing(protected_attribute=protected_attributes[0],
                                    label=label)

# Fit and transform the data
data_fair = preprocessor.fit_transform(dataset=data)

# Print no. samples and discrimination before and after
disc_before = statistical_parity_abs_diff_max(data[label],
                                              data[protected_attributes[0]].to_numpy())
disc_after = statistical_parity_abs_diff_max(data_fair[label],
                                             data_fair[protected_attributes[0]].to_numpy())
print(len(data), disc_before)
print(len(data_fair), disc_after)

Running this example usually yields a dataset with a statistical disparity score below 1% (the maximum score across all five race groups), whereas the original dataset exhibits 27% statistical disparity.
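To make the reported score concrete, the following sketch computes a maximum statistical parity gap by hand with pandas. It assumes, based on the metric's name, that the score is the largest absolute difference in positive-label rate between any two groups; the helper `max_statistical_parity_gap` and the toy values are hypothetical and are not part of FairDo.

```python
import pandas as pd


def max_statistical_parity_gap(df, label, protected):
    """Largest absolute difference in positive-label rate between any two groups."""
    rates = df.groupby(protected)[label].mean()
    # The largest pairwise absolute difference equals max rate minus min rate
    return float(rates.max() - rates.min())


# Toy data (not COMPAS): positive rates are a = 0.75, b = 0.25, c = 0.5
toy = pd.DataFrame({
    "race": ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c"],
    "recidivism": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

gap = max_statistical_parity_gap(toy, "recidivism", "race")
print(gap)  # 0.5
```

A score of 0 would mean every group has the same positive-label rate, so a score below 0.01 corresponds to the "<1%" figure quoted above under this assumed definition.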

Advice

:rocket: For a quick start, use the DefaultPreprocessing class with the default settings. An example is given in tutorials/1. Default Preprocessor.

:white_check_mark: For data quality, you want to keep the original data $D$ and only add fair synthetic data $G$ on top of it. (We include examples in tutorials/ where we use the SDV package to generate synthetic data.) For this, you need to specify .fit_transform(approach='add') for the pre-processors fairdo.preprocessing.HeuristicWrapper and fairdo.preprocessing.DefaultPreprocessing. This will only add the fair pre-processed synthetic data to the original data.

:dash: When data is limited, we advise generating synthetic data $G$ and merging it with the original data $D$, i.e., $D \cup G$. The pre-processor can then be applied to the merged data $D \cup G$ to ensure fairness. The data-quality approach described above can also be used.

:briefcase: When anonymity is required, only use synthetic data $G$ and do not merge it with the original data $D$. The generated data $G$ can then be pre-processed with our methods to ensure fairness.
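The data configurations described above differ only in which rows are handed to the pre-processor. A minimal pandas sketch, assuming toy values and a hypothetical `source` column for tracking provenance (neither comes from FairDo), could look like this:

```python
import pandas as pd

# D: original data (toy values, not COMPAS)
original = pd.DataFrame({"race": ["a", "b", "b"], "recidivism": [1, 0, 1]})
# G: stand-in for synthetic data; in practice it would come from a generator such as SDV
synthetic = pd.DataFrame({"race": ["a", "a", "b"], "recidivism": [0, 0, 1]})

# Tag provenance so original rows can be told apart from synthetic ones later
original = original.assign(source="original")
synthetic = synthetic.assign(source="synthetic")

# Limited data: the candidate pool is the merged data D ∪ G
merged = pd.concat([original, synthetic], ignore_index=True)

# Anonymity: the candidate pool is the synthetic data G only
anonymous_pool = synthetic.drop(columns="source")

print(merged["source"].value_counts().to_dict())
print(len(anonymous_pool))
```

Note that `pd.concat` stacks rows and keeps duplicates, so the merged frame is a multiset rather than a strict set union. The helper column would presumably have to be dropped before the frame is passed to a pre-processor.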

Installation

First, set up a Python environment. We recommend using Miniconda. Then activate the environment and install our package. A detailed guide follows.

Dependencies

Python (==3.8), numpy, scipy, pandas, scikit-learn

Set Up Conda Environment

Download Miniconda here.

# Create a conda virtual environment
conda create -n "venv" python=3.8

# Activate conda environment
conda activate venv

OR

Set Up Python Environment

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
# On Windows:
.venv\Scripts\activate

# On macOS and Linux:
source .venv/bin/activate

PyPI Distribution (recommended)

The package is distributed via PyPI and can be directly installed with:

pip install fairdo

Manual Installation (latest version)

To install the latest (development) version, execute the following commands:

# Clone repo
git clone https://github.com/mkduong-ai/fairdo.git

# Move to repo folder
cd fairdo

# Install from source
python setup.py install

Install Optional Dependencies

To use synthetic data generation, you can install the SDV package by executing the following command:

pip install sdv==1.10.0

We did not include the SDV package as a dependency, because it is not required for the core functionality of the FairDo package. Using any other synthetic data generation package is also possible. Still, some examples in the tutorials/ folder require the SDV package.

Citation

When using FairDo in your work, cite our paper:

@inproceedings{duong2023framework,
  title={Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes},
  author={Duong, Manh Khoi and Conrad, Stefan},
  booktitle={Data Science and Machine Learning},
  publisher={Springer Nature Singapore},
  number={CCIS 1943},
  series={AusDM: Australasian Conference on Data Science and Machine Learning},
  year={2023},
  pages={105--120},
  isbn={978-981-99-8696-5},
}

Notes

We credit OpenMoji for the emojis used in our logo.
