Fairness-Agnostic Data Optimization

FairDo is a Python package for mitigating bias in data.

Official repository of: Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes
Documentation: https://fairdo.readthedocs.io/en/latest/
Source Code: https://github.com/mkduong-ai/fairdo/tree/main

FairDo operates on tabular data (pandas.DataFrame) and pre-processes it so that it becomes fair according to a user-specified fairness metric. The approach is fairness-agnostic: different fairness criteria can be optimized with the same framework. It also handles non-binary protected attributes, such as nationality, race, and gender, that arise in real-world datasets. Because any of the available fairness metrics can be chosen, the optimization can target the least fortunate group (Rawls' A Theory of Justice [2]) or the general utility of all groups (utilitarianism).
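
To make the difference between the two objectives concrete, here is a minimal sketch in plain NumPy. It is not FairDo's implementation: the helper `parity_gaps` and the toy data are assumptions made for illustration. It computes the absolute difference in positive-outcome rates for every pair of groups and then aggregates these gaps either by their maximum (focusing on the worst-off group) or by their mean (a utilitarian view).

```python
import itertools
import numpy as np

def parity_gaps(y, z):
    """Absolute differences in positive-outcome rates for every pair of groups.

    y: binary labels (1 = favourable outcome), z: group membership per sample.
    Hypothetical helper for illustration only, not part of FairDo.
    """
    y, z = np.asarray(y), np.asarray(z)
    rates = {g: y[z == g].mean() for g in np.unique(z)}
    return [abs(rates[a] - rates[b]) for a, b in itertools.combinations(rates, 2)]

# Toy data: three groups with positive rates 0.75, 0.50 and 0.25
y = np.array([1, 1, 1, 0,  1, 1, 0, 0,  1, 0, 0, 0])
z = np.array(["a"] * 4 + ["b"] * 4 + ["c"] * 4)

gaps = parity_gaps(y, z)
worst_case = max(gaps)          # largest gap between any two groups
average = float(np.mean(gaps))  # mean gap over all pairs of groups
print(worst_case, average)
```

Minimizing `worst_case` only rewards improvements for the most disadvantaged pair of groups, whereas minimizing `average` rewards any reduction across all pairs.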

How does it work?

The pre-processing methods (fairdo.preprocessing.HeuristicWrapper and fairdo.preprocessing.DefaultPreprocessing) work by removing discriminatory data points. Doing so makes the dataset more balanced and less biased towards a particular social group. We treat this task as a combinatorial optimization problem: select the subset of the dataset that minimizes the discrimination score. Because a dataset with $n$ samples has $2^n$ possible subsets, exhaustive search is infeasible, so our approach uses genetic algorithms to find a fair subset.
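
The idea of subset selection with a genetic algorithm can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not FairDo's actual algorithm: the data is randomly generated, the function names `discrimination` and `genetic_algorithm` are hypothetical, and the operators (elitist selection, uniform crossover, bit-flip mutation) are one common choice among many. Each candidate solution is a boolean mask over the samples, and its score is the largest difference in positive rates between groups in the selected subset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: binary label y and a protected attribute z with three groups,
# where the positive rate depends on the group (i.e. the data is biased).
n = 200
z = rng.integers(0, 3, size=n)
y = (rng.random(n) < np.array([0.7, 0.5, 0.3])[z]).astype(int)

def discrimination(mask):
    """Max. difference in positive rates between groups within the subset."""
    rates = []
    for g in range(3):
        selected = mask & (z == g)
        if not selected.any():  # penalise subsets that drop a whole group
            return 1.0
        rates.append(y[selected].mean())
    return max(rates) - min(rates)

def genetic_algorithm(pop_size=50, generations=100, mutation_rate=0.02):
    # Each individual is a boolean mask: True keeps a sample, False removes it.
    population = rng.random((pop_size, n)) < 0.5
    population[0] = True  # the full dataset is one of the initial candidates
    for _ in range(generations):
        scores = np.array([discrimination(ind) for ind in population])
        elite = population[np.argsort(scores)[: pop_size // 2]]  # better half survives
        n_children = pop_size - len(elite)
        # Uniform crossover between randomly paired elite parents
        parents_a = elite[rng.integers(0, len(elite), size=n_children)]
        parents_b = elite[rng.integers(0, len(elite), size=n_children)]
        children = np.where(rng.random(parents_a.shape) < 0.5, parents_a, parents_b)
        # Bit-flip mutation
        children ^= rng.random(children.shape) < mutation_rate
        population = np.vstack([elite, children])
    scores = np.array([discrimination(ind) for ind in population])
    return population[np.argmin(scores)], float(scores.min())

before = discrimination(np.ones(n, dtype=bool))
best_mask, after = genetic_algorithm()
print(f"samples: {n} -> {best_mask.sum()}, discrimination: {before:.3f} -> {after:.3f}")
```

Because the elite individuals are carried over unchanged, the best score never gets worse from one generation to the next. Note that this sketch only minimizes the discrimination score and does not otherwise limit how many samples are removed.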

Advice

:rocket: For a quick start, use the DefaultPreprocessing class with the default settings. An example is given in tutorials/1. Default Preprocessor and below.

:white_check_mark: For data integrity, keep the original data $D$ unchanged and only add fair synthetic data $G$ on top of it. (The examples in tutorials/ use the [SDV](https://github.com/sdv-dev/SDV) package to generate synthetic data.) To do so, specify .fit_transform(approach='add') for the pre-processors fairdo.preprocessing.HeuristicWrapper and fairdo.preprocessing.DefaultPreprocessing. Only the fair, pre-processed synthetic data is then added to the original data.

:dash: When data is limited, we advise generating synthetic data $G$ and merging it with the original data $D$, i.e., $D \cup G$. The pre-processor can then be applied to the merged data $D \cup G$ to ensure fairness. The data-integrity approach described above can also be used in this setting.

:briefcase: When anonymity is required, only use synthetic data $G$ and do not merge it with the original data $D$. The generated data $G$ can then be pre-processed with our methods to ensure fairness.
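
The data regimes above differ mainly in which rows may be removed and which are always kept. The following pandas sketch illustrates the "add" idea on toy data. It is an assumption-laden stand-in, not FairDo's `approach='add'` implementation: `gap` and `add_fair_rows` are hypothetical helpers, the selection is a simple greedy rule rather than a genetic algorithm, and `G` is hand-written rather than generated with SDV.

```python
import pandas as pd

def gap(df: pd.DataFrame) -> float:
    """Largest difference in positive-label rate between any two groups."""
    rates = df.groupby("group")["label"].mean()
    return float(rates.max() - rates.min())

def add_fair_rows(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Greedy toy stand-in for an 'add'-style pre-processor (hypothetical).

    Original rows are never removed; a synthetic row is appended only if it
    lowers the gap of the combined data.
    """
    result = original.copy()
    for i in range(len(synthetic)):
        candidate = pd.concat([result, synthetic.iloc[[i]]], ignore_index=True)
        if gap(candidate) < gap(result):
            result = candidate
    return result

# D: original data (biased towards group "a"); G: synthetic data
D = pd.DataFrame({"group": list("aaaabbbb"), "label": [1, 1, 1, 0, 1, 0, 0, 0]})
G = pd.DataFrame({"group": list("abab"), "label": [0, 1, 1, 1]})

# Data integrity: keep all of D and add only helpful rows from G
fair = add_fair_rows(D, G)

# Limited data: D ∪ G would instead be pre-processed as a whole
merged = pd.concat([D, G], ignore_index=True)

print(gap(D), gap(fair), len(fair))
```

In the anonymity setting, only `G` would be passed to the pre-processor and `D` would never be part of the released data.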

Installation

Dependencies

Python (==3.8), numpy, scipy, pandas, scikit-learn

Setup Python Environment

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
# On Windows:
.venv\Scripts\activate

# On macOS and Linux:
source .venv/bin/activate

PyPI Distribution (recommended)

The package is distributed via PyPI and can be directly installed with:

pip install fairdo

Manual Installation (latest version)

To install the latest version, execute the following commands:

# Clone repo
git clone https://github.com/mkduong-ai/fairdo.git

# Move to repo folder
cd fairdo

# Install from source (pip is preferred over the deprecated `python setup.py install`)
pip install .

Development Installation

Installing in development mode makes changes to the source code take effect immediately, without the need to reinstall the package. This can be done as follows:

# Clone repo
git clone https://github.com/mkduong-ai/fairdo.git

# Move to repo folder
cd fairdo

# Development installation
pip install -e .

Install Optional Dependencies

To use the synthetic data generation, you can install the SDV package by executing the following command:

pip install sdv==1.10.0

We did not include the SDV package as a dependency, because it is not required for the core functionality of the FairDo package. Using any other synthetic data generation package is also possible. Still, some examples in the tutorials/ folder use the SDV package.

Example Usage

In the following example, we use the COMPAS [1] dataset. The protected attribute is race and the label is recidivism. Here, we deploy the default pre-processor, which internally uses a genetic algorithm, to remove discriminatory samples of the given dataset. The default pre-processor prevents removing all individuals of a single group.

# fairdo package
from fairdo.utils.dataset import load_data
from fairdo.preprocessing import DefaultPreprocessing
# fairdo metrics
from fairdo.metrics import statistical_parity_abs_diff_max

# Loading a sample dataset with all required information
# data is a pandas.DataFrame
data, label, protected_attributes = load_data('compas', print_info=False)

# Initialize DefaultPreprocessing object
preprocessor = DefaultPreprocessing(protected_attribute=protected_attributes[0],
                                    label=label)

# Fit and transform the data
data_fair = preprocessor.fit_transform(dataset=data)

# Print no. samples and discrimination before and after
disc_before = statistical_parity_abs_diff_max(data[label],
                                              data[protected_attributes[0]].to_numpy())
disc_after = statistical_parity_abs_diff_max(data_fair[label],
                                             data_fair[protected_attributes[0]].to_numpy())
print(len(data), disc_before)
print(len(data_fair), disc_after)

Running this example usually yields a dataset with a statistical disparity score below 1% (the maximum score between all five races), whereas the original dataset exhibits 27% statistical disparity. Results can vary between runs because the genetic algorithm is stochastic.

Documentation

The documentation is available at https://fairdo.readthedocs.io/en/latest/.

The package follows the PEP8 style guide and is documented with NumPy-style docstrings. To build the HTML pages from the documentation manually, follow these instructions:

Activate the virtual environment and install Sphinx:

# Activate the virtual environment
# On Windows:
.venv\Scripts\activate

# On macOS and Linux:
source .venv/bin/activate

# Install Sphinx and a required theme
pip install sphinx furo

Run the documentation generation script (on Unix-like systems):

# Move to /docs
cd docs

# Run script to generate documentation
bash generate_docs.sh

The HTML pages are then located in docs/_build/html. Open docs/_build/html/index.html to view the front page.

Citation

When using FairDo in your work, cite our paper:

@inproceedings{duong2023framework,
  title={Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes},
  author={Duong, Manh Khoi and Conrad, Stefan},
  booktitle={The 21st Australasian Data Mining Conference 2023},
  year={2023},
  organization={Springer Nature}
}

References

[1] Larson, J., Angwin, J., Mattu, S., Kirchner, L.: Machine bias (May 2016), https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

[2] Rawls, J.: A Theory of Justice (1971), Belknap Press, ISBN: 978-0-674-00078-0
