Toolkit for evaluating and enforcing ML model fairness
Project description
OxonFair: An Algorithmic Fairness Toolkit for High-Capacity Models
OxonFair is an expressive toolkit designed to enforce a wide-range of fairness definitions and to customize binary classifier behavior. The toolkit is designed to overcome a range of shortcomings in existing fairness toolkits for high-capacity models that overfit to the training data. Unlike other toolkits it is designed and works for computer vision and NLP problems alongside tabular data.
Check out the colab demo here or read the preprint.
For low-capacity models (e.g., logistic regression over a small number of variables, and decision-trees of limited depth), we recommend fairlearn.
We support a range of complex classifiers including pytorch, scikit learn, and ensembles provided by autogluon.
OxonFair is a continuation of early work done at autogluon.fair and actively maintained.
Source install
Standard install
In the terminal type:
pip install 'oxonfair[full]'
This will download and install enough code to run any notebooks except those comparing with fairlearn. This includes autogluon, pytorch, and XGBoost. If this is too many dependencies, try a minimal install.
Minimal install
In the terminal type:
pip install oxonfair
By default, this will only install the necessary dependencies sklearn; pandas; and numpy. You will have to install matplotlib
to plot.
Compare with Fairlearn
In the terminal type:
pip install 'oxonfair[notebooks]'
This will download enough supporting libraries, including fairlearn, to run all the notebooks.
Full install for running the test suite
Download the source of OxonFair and in the source directory run:
pip install -e .\[tests\]
You probably don't want to install this unless you're looking to modify the codebase.
Examples
Now run the Example Notebook or try some of the example below.
For scikit/XGBoost, see sklearn.md and the Example Notebook
For pytorch, see a toy example on adult and for computer vision, this Example Notebook
More demo notebooks are present in the examples folder.
Demo using XGBoost
# Load libraries
from oxonfair import dataset_loader, FairPredictor
from oxonfair import group_metrics as gm
import xgboost
# Download and partition the adult dataset into training and test datta
train_data, _, test_data = dataset_loader.adult('sex', train_proportion=0.7, test_proportion=0.3)
# Train an XGBoost classifier on the training set
predictor = xgboost.XGBClassifier().fit(X=train_data['data'], y=train_data['target'])
# Specify that we want to create a fair predictor by reusing the training set
# (at the risk of overfitting) to enforce fairness with respect to sex.
fpredict = FairPredictor(predictor, train_data, 'sex')
# Enforce demographic parity to within 2%
fpredict.fit(gm.accuracy, gm.demographic_parity, 0.02)
# Evaluate per_group on the test set
fpredict.evaluate_groups(test_data)
Overview
OxonFair is a postprocessing approach for enforcing fairness, with support for a wide range of performance metrics and fairness criteria, and support for inferred attributes, i.e., it does not require access to protected attributes at test time. Under the hood, FairPredictor works by adjusting the decision boundary for each group individually. Where groups are not available, it makes use of inferred group membership to adjust decision boundaries.
The key idea underlying this toolkit is that for a wide range of use cases, the most suitable classifier should do more than maximize some form of accuracy. We offer a general toolkit that allows different measures to be optimized and additional constraints to be imposed by tuning the behavior of a binary predictor on validation data.
For example, classifiers can be tuned to maximize performance for a wide range of metrics such as:
- Accuracy
- Balanced Accuracy
- F1 score
- MCC
- Custom utility functions
While also approximately satisfying a wide range of group constraints such as:
- Demographic Parity (The idea that positive decisions should occur at the same rates for all protected groups, for example for men at the same rate as for women)
- Equal Opportunity (The recall should be the same for all protected groups)
- Minimum Recall Constraints (The recall should be above a particular level for all groups)
- Minimum Precision Constraints (The precision should be above a particular level for all groups)
- Custom Fairness Metrics
The full set of constraints and objectives can be seen in the list of measures in measures.md.
Why Another Fairness Library?
Fundamentally, most existing fairness methods are not appropriate for use with complex classifiers on high-dimensional data. These classifiers are prone to overfitting on the training data, which means that trying to balance error rates (e.g., when using equal opportunity) on the training data, is unlikely to transfer well to new unseen data. This is a particular problem when using computer vision (see Zietlow et al.), but can also occur with tabular data. Moreover, iteratively retraining complex models (a common requirement of many methods for enforcing fairness) is punitively slow when training the model once might take days, or even weeks, if you are trying to maximize performance.
At the same time, postprocessing methods which allow you to train once, and then improve fairness on held-out validation data generally requires the protected attributes to be available at test time, which is often infeasible, particularly with computer vision.
OxonFair is build from the ground up to avoid these issues. It is a postprocessing approach, explicitly designed to use inferred attributes where protected attributes are not available to enforce fairness. Fairness can be enforced both on validation, or on the train set, when you are short of data and overfitting is not a concern. When enforcing fairness in deep networks or using provided attributes, a classifier is only trained once, for non network-based approaches, e.g., scikit-learn or xgboost, with inferred attributes we require the training of two classifier (one to predict the original task, and a second to estimate groups membership).
That said, we make several additional design decisions which we believe make for a better experience for data scientists:
Direct support for pytorch including NLP and Computer Vision
See here for an example. In brief, the steps are:
- Train your network.
- Call
DeepFairPredictor
with the network output on validation data. - Call
fit
to enforce fairness. Useevaluate_groups
andplot_frontier
to explore trade-offs. - Use
merge_heads_pytorch
to generate a fair network.
Fine-grained control of behavior
Wide Choice of performance measure
Unlike other approaches to fairness, FairPredictor allows the optimization of arbitrary performance measures such as F1 or MCC, subject to fairness constraints. This can substantially improve the fairness/performance trade-off with, for example, F1 scores frequently being 3-4% higher when directly optimized for rather than accuracy.
Wide Choice of Fairness Measures
Rather than offering a range of different fairness methods that enforce a small number of fairness definitions through a variety of different methods, we offer one method that can enforce a much wider range of fairness definitions out of the box, alongside support for custom fairness definitions.
Of the set of decision-based group-metrics discussed in Verma and Rubin, and the metrics measured by Sagemaker Clarify, out of the box FairPredictor offers the ability to both measure and enforce all of the 8 group metrics used to evaluate classifier decision measured in Verma and Rubin, and all 12 group measures used to evaluate decisions in Clarify.
Direct Remedy of Harms
See this paper for the problems this addresses.
An Ipython notebook generating many of the figures in the paper can be found here: Levelling up notebook
Many fairness measures can be understood as identifying a harm, and then equalizing this harm across the population as a whole. For example, the use demographic parity of identifies people as being harmed by a low selection rate, which is then set to be the same for all groups, while equal opportunity identifies people as being harmed by low recall, and balances this harm across all groups. However, these fairness formulations often explicitly
As an alternative to equalizing the harm across the population, we allow data scientists to specify minimum rates of e.g., recall, precision, or selection rate for every group, with one line of code. E.g.
fpredictor.fit(gm.accuracy, gm.precision.min, 0.5)
finds a classifier that maximizes accuracy while having a precision of at least 0.5 for every group.
These constraints have wider uses outside of fairness. For example, a classifier can be trained to identify common defects across multiple factory lines, and its behavior can be altered to enforce a recall of 90% of all defects for each line (at a cost of increased false positives).
fpredictor.fit(gm.accuracy, gm.recall.min, 0.9)
Support for Utility based approaches
We provide support for the utility based approach set out in Fairness On The Ground: Applying Algorithmic Fairness Approaches To Production Systems, whereby different thresholds can be selected per group to optimize a utility-based objective.
Utility functions can be defined in one line.
For example, consider a situation where an ML system identifies potential problems that require intervening. Every intervention may have a cost of 1, regardless of if it was needed, but a missed intervention that was needed has a cost of 5. Finally, not making an intervention when one was unneeded has a cost of 0. This can be written as:
my_utility=gm.Utility([1, 1, 5, 0], 'Testing Costs')
and optimized alongside fairness or performance constraints. For example,
fpredictor.fit(my_utility)
optimizes the utility, while
fpredictor.fit(my_utility, gm.accuracy, 0.5)
optimizes the utility subject to the requirement that the classifier accuracy can not drop below 0.5.
Support for user-specified performance and fairness measures
As well as providing support for enforcing a wide range of performance and fairness measures, we allow users to define their own metrics and fairness measures.
For example, a custom implementation of recall can be defined as:
my_recall = gm.GroupMetric(lambda TP, FP, FN, TN: (TP) / (TP + FN), 'Recall')
and then the average difference in recall between groups (corresponding to the fairness definition of Equal Opportunity) is provided by calling my_recall.diff
, and the minimum recall over any group (which can be used to ensure that the recall is above a particular value for every group) is given by my_recall.min
.
Altering Behavior
The behavior of classifiers can be altered through the fit function. Given a pretrained binary predictor, we define a fair classifier that will allow us to alter the existing behavior on a validation dataset, by using a labeled attribute 'sex'.
fpredictor = FairPredictor(predictor, validation_data, 'sex')
the fit function takes three arguments that describe an objective such as accuracy or F1 that should be optimized, and a constraint such as the demographic parity violation should be below 2%.
This takes the form:
fpredictor.fit(gm.f1, gm.demographic_parity, 0.02)
The constraint and objective can be swapped so the code:
fpredictor.fit(gm.demographic_parity, gm.f1, 0.75)
will find the method with the lowest demographic parity violation such that F1 is above 0.75. If, for example, you wish to optimize F1 without any additional constraints, you can just call:
fpredictor.fit(gm.f1)
Note that the default behavior (we should minimize the demographic parity violation, but maximize F1) is inferred from standard usage but can be overwritten by setting the optional parameters greater_is_better_obj
and greater_is_better_const
to True
or False
.
Where constraints cannot be satisfied, for example, if we require that the F1 score must be above 1.1, fit
returns the solution closest to satisfying it.
For more examples using fit see using_fit.md.
Measuring Behavior
A key challenge in deciding how to alter the behavior of classifiers is that these decisions have knock-on effects elsewhere. For example, increasing the precision of classifiers, will often decrease their recall and vice versa.
In the same way, many fairness definitions may be at odds with one another, and increasing the fairness with respect to one definition can decrease it with respect to other definitions.
As such, we offer a range of methods for evaluating the performance of classifiers.
fpredictor.evaluate(data (optional), groups (optional), dictionary_of_methods (optional), verbose=False)
By default, this method reports the standard binary evaluation criteria of autogluon for both the original and updated predictor, over the data used by fit. The behavior can be altered by providing either alternate data or a new dictionary of methods. Where groups are not provided, it will use the same groups as passed to fit
, but this can be altered. If verbose is set to true, the table contains the long names of methods, otherwise it reports the dictionary keys.
fpredictor.evaluate_fairness(data (optional), groups (optional), dictionary_of_methods (optional), verbose=False)
By default, this method reports the standard fairness metrics of SageMaker Clarify for both the original and updated predictor, over the data used by fit. The behavior can be altered by providing either alternate data or a new dictionary of methods. Where groups is not provided, it will use the same groups as passed to fit
, but this can be altered. If verbose is set to true, the table contains the long names of methods, otherwise it reports the dictionary keys.
fpredictor.evaluate_groups(data (optional), groups (optional), dictionary_of_methods (optional), return_original=False, verbose=False)
By default, this method reports, per group, the standard binary evaluation criteria of autogluon for both the updated predictor only, over the data used by fit. The behavior can be altered by providing either alternate data or a new dictionary of methods. Where groups is not provided, it will use the same groups as passed to fit
, but this can be altered. If you wish to also see the per group performance of the original classifier, use return_original=True
to receive a dict containing the per_group performance of the original and updated classifier. If verbose is set to true, the table contains the long names of methods, otherwise it reports the dictionary keys.
Fairness using Inferred Attributes
In many cases, the attribute you wish to be fair with respect to such as sex
may not be available at test time. In this case you can make use of inferred attributes predicted by another classifier. This can be done by defining the fair predictor in the following way.
fpredictor = fair.FairPredictor(predictor, data, 'sex', inferred_groups=attribute_pred)
where attribute_pred
is another autogluon predictor trained to predict the attribute, such as sex
you wish to infer.
Then fpredictor
can be used in same way described above.
Note that the labeled attribute is used to evaluate fairness, and the use of the inferred attributes are tuned to optimize fairness with respect to the labeled attributes. This means that even if the inferred attributes are not that accurate, they can be still used to enforce fairness, albeit with a drop in performance.
To make it easier to use inferred attributes, we provide a helper function:
predictor, attribute_pred = fair.inferred_attribute_builder(train_data, 'class', 'sex')
This allows for the easy training of two tabular classifiers, one called predictor
to predict the target label class
without using the attribute sex
and one called attribute_pred
to predict sex
without using the the target label.
Fairness on COMPAS using Inferred Attributes
We demonstrate how to enforce a wide range of fairness definitions on the COMPAS dataset. This dataset records paroles caught violating the terms of parole. As it measures who was caught, it is strongly influenced by policing and environmental biases, and should not be confused with a measurement of who actually violated their terms of parole. See this paper for a discussion of its limitations and caveats.
See this notebook for details.
Best Practices
It is common for machine learning algorithms to overfit training data. Therefore, if you want your fairness constraints to carry over to unseen data we recommend that they are enforced on a large validation set, rather than the training set. For low-dimensional datasets, many classifiers, with a careful choice of hyperparameter, are robust to overfitting and fairness constraints enforced on training data can carry over to unseen test data. In fact, given the choice between enforcing fairness constraints on a large training set, vs. using a significantly smaller validation set, reusing the training set may result in better generalization of the desired behavior to unseen data. However, this behavior is not guaranteed, and should always be empirically validated.
Challenges with unbalanced data
Many datasets are unbalanced both in the size of protected groups and in the prevalence of positive or negatively labeled data. When a rare group rarely receives positives outcomes, large datasets are needed to correctly estimate the rate of failure per group on positive data. This can make it very hard to reliably enforce or evaluate measures such as equal opportunity or minimum recall on unbalanced datasets, particularly where the baseline classifier has relatively high accuracy. The size and nature of the dataset needs to be carefully considered when choosing a fairness metric.
For example, on the historic dataset 'adult'; African Americans, despite being the second largest ethnicity after Caucasian, only make up around 10% of the dataset, with only 10% of them earning over 50k (N.B. adult is based on census data from 1994, in the same dataset, around 20% of white people earned over 50k). If an algorithm had a 10% error rate on this subset of the data, we are concerned with the behavior of around 10%^3 i.e., 0.1% of the data. This problem becomes even greater when looking at less prevalent groups.
For this reason, reliably guaranteeing high-accuracy across all groups, or that fairness measures are satisfied, requires access to rebalanced datasets, or much larger datasets than are needed for guaranteeing accuracy at the population level.
quickstart_autogluon.ipynb has an example on the adult dataset where demographic parity is only weakly enforced on test data for the smaller groups American-Indian-Eskimo
, and Asian-Pacific-Islander
due to limited sample size.
List of Measures
See measures.md for a full list of fairness and performance measures supported by OxonFair.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
File details
Details for the file oxonfair-0.2.1.8-py3-none-any.whl
.
File metadata
- Download URL: oxonfair-0.2.1.8-py3-none-any.whl
- Upload date:
- Size: 52.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.12.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 08c271f4ea70b4221d725989608637de2cac68b8bdf458bf9a0f92b409048e78 |
|
MD5 | 6d54b89275a42cd73c5b58f60511b0f3 |
|
BLAKE2b-256 | cba1a3b89ec8f44b429998f404d9fc3a2d983c2afc7fc28bdcfa2771e01bfe04 |