Sample Implementation of Gradient Feature Auditing (GFA)
# Black Box Auditing and Certifying and Removing Disparate Impact
This repository contains a sample implementation of Gradient Feature Auditing (GFA) meant to be generalizable to most datasets. For more information on the repair process, see our paper on [Certifying and Removing Disparate Impact](http://arxiv.org/abs/1412.3756). For information on the full auditing process, see our paper on [Auditing Black-box Models for Indirect Influence](http://arxiv.org/abs/1602.07043).
# License
This code is licensed under an [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) license.
# Certifying and Removing Disparate Impact
After installing BlackBoxAuditing, you can run the data repair described in [Certifying and Removing Disparate Impact](http://arxiv.org/abs/1412.3756) with the command `BlackBoxAuditing-repair` in a terminal. The command will tell you which arguments the script takes.
# Black Box Auditing
GFA can be run on a dataset (as in [Auditing Black-box Models for Indirect Influence](http://arxiv.org/abs/1602.07043)) using a Python script, as described in the following subsection.
## Running as a Python Script
After installing BlackBoxAuditing, GFA can be run on a dataset (as in [Auditing Black-box Models for Indirect Influence](http://arxiv.org/abs/1602.07043)) using a simple Python script. The following sample code is provided for reference:
```python
# import BlackBoxAuditing
import BlackBoxAuditing as BBA
# import machine learning technique
from BlackBoxAuditing.model_factories import Weka_SVM, Weka_DecisionTree
"""
Using a preloaded dataset
"""
# load in preloaded dataset
data = BBA.load_data("german")
# initialize the auditor and set parameters
auditor = BBA.Auditor()
auditor.model = Weka_SVM
# call the auditor with the data
auditor(data)
"""
Using your own dataset
"""
# load your own data
datafile = 'path/to/datafile'
data = BBA.load_from_file(datafile)
# initialize the auditor and set parameters
auditor = BBA.Auditor()
auditor.model = Weka_DecisionTree
# call the auditor
auditor(data)
```
### More Advanced Script Options
#### Using a preloaded dataset
The BlackBoxAuditing package ships with a few preloaded datasets that are ready for auditing. In a script, they are available via the function `load_data`, which takes the name of the dataset as input and returns formatted data ready for auditing. The following preloaded datasets are available:
* adult
* diabetes
* ricci
* german
* glass
* sample
* DRP
Refer to the Sources section below for more information about the datasets.
#### Using your own dataset
To use your own data for auditing, call the function `load_from_file`. In its simplest form, it takes the path to your dataset as input and returns formatted data ready for auditing. `load_from_file` also accepts other parameters, which should be set to ensure that your data is processed correctly. The full signature and its defaults are:
```
load_from_file(datafile, testdata=None, correct_types=None, train_percentage=2.0/3.0,
response_header=None, features_to_ignore=None, missing_data_symbol="")
```
* *datafile*: path to your dataset
* *testdata*: path to the dataset used for testing a model. If given, *datafile* is assumed to be the training data
* *correct_types*: list of the types (str, int, or float) of the features in the data. If not given, the types are inferred by inspecting the values of each feature
* *train_percentage*: fraction of the data used for training, given as a float; the remainder is used for testing
* *response_header*: name of the response column in the data. If not given, the last column in the data is assumed to be the response
* *features_to_ignore*: list of the names of any features that you wish the model to ignore
* *missing_data_symbol*: symbol that marks missing or unknown values in the data
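To make the roles of `correct_types`, `train_percentage`, and `missing_data_symbol` more concrete, here is a minimal standalone sketch of the kind of preprocessing these parameters describe. It is not the package's implementation: the helper names `infer_type` and `split_rows` are hypothetical, and the simple ordered split (first rows for training, remaining rows for testing) is an assumption made for illustration.

```python
# Illustrative only: hypothetical helpers, not part of BlackBoxAuditing.

def infer_type(values, missing_data_symbol=""):
    """Guess int, float, or str for one column by inspecting its values."""
    # Missing entries carry no type information, so skip them.
    present = [v for v in values if v != missing_data_symbol]
    for candidate in (int, float):
        try:
            for v in present:
                candidate(v)
            return candidate
        except ValueError:
            continue
    return str


def split_rows(rows, train_percentage=2.0 / 3.0):
    """Split rows into (train, test); assumes a simple ordered split."""
    cutoff = int(round(len(rows) * train_percentage))
    return rows[:cutoff], rows[cutoff:]


rows = [["25", "1.5", "yes"], ["40", "?", "no"], ["31", "2.0", "yes"]]
columns = list(zip(*rows))
types = [infer_type(col, missing_data_symbol="?") for col in columns]
train, test = split_rows(rows)

print([t.__name__ for t in types])  # ['int', 'float', 'str']
print(len(train), len(test))        # 2 1
```

If automatic inference guesses wrong for your data (for example, a numeric code that should be treated as a category), passing `correct_types` explicitly avoids the problem.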
#### Auditor options
After initializing the auditor with `auditor = BlackBoxAuditing.Auditor()`, the following options can be set to tune it:
* `auditor.measurers`: (*default = [accuracy, BCR]*) list of measurers to use for GFA
* `auditor.model_options`: (*default = {}*) options for the machine learning model
* `auditor.verbose`: (*default = True*) set to `True` for more detailed status updates
* `auditor.REPAIR_STEPS`: (*default = 10*) number of repair steps to take
* `auditor.RETRAIN_MODEL_PER_REPAIR`: (*default = False*)
* `auditor.WRITE_ORIGINAL_PREDICTIONS`: (*default = True*)
* `auditor.ModelFactory`: (*default = Weka_SVM*) machine learning model to use; available options: Weka_SVM, Weka_DecisionTree, TensorFlow
* `auditor.kdd`: (*default = False*)
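The default measurers are accuracy and BCR. As a rough illustration of what these two numbers capture, the sketch below computes both from lists of actual and predicted labels. It assumes that BCR stands for balanced classification rate, taken here as the unweighted mean of per-class recall. The function names are hypothetical, and this is not the package's measurer code.

```python
from collections import defaultdict

# Illustrative only: hypothetical functions, not part of BlackBoxAuditing.

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def bcr(actual, predicted):
    """Mean of per-class recall (assumed meaning of BCR)."""
    totals = defaultdict(int)  # number of rows per actual class
    hits = defaultdict(int)    # correctly predicted rows per actual class
    for a, p in zip(actual, predicted):
        totals[a] += 1
        if a == p:
            hits[a] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


actual = ["good", "good", "good", "good", "bad"]
predicted = ["good", "good", "good", "good", "good"]

print(accuracy(actual, predicted))  # 0.8
print(bcr(actual, predicted))       # 0.5
```

On imbalanced data like this, a model that always predicts the majority class scores well on accuracy but only 0.5 on the balanced measure, which is one reason to report both.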
## Testing Code Changes
After BlackBoxAuditing has been installed, you can run the test suite by running the command `BlackBoxAuditing-test` in a terminal.
Every Python file should include test functions at the bottom that run when the file itself is run. This can be done by including the line `if __name__=="__main__": test()`, as long as a function named `test` is defined.
These tests should use print statements with `True` or `False` readouts indicating success or failure (where `True` should always mean success). Having several of these readouts per file is fine and encouraged.
Note: if a test requires reading data from the `test_data` directory, it should import the appropriate `load_data` file from the `experiments` directory.
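As a sketch of the testing convention described above, a file might end like this. The `normalize` function and the wording of the readouts are made up for illustration; only the `test` function and the `__main__` guard follow the convention stated in this section.

```python
def normalize(values):
    """Scale a list of numbers so that they sum to 1."""
    total = float(sum(values))
    return [v / total for v in values]


def test():
    result = normalize([1, 1, 2])
    # Each readout prints True on success and False on failure.
    print("normalize sums to 1 -- correct?", abs(sum(result) - 1.0) < 1e-9)
    print("normalize keeps proportions -- correct?", result == [0.25, 0.25, 0.5])


if __name__ == "__main__":
    test()
```

Running the file directly (for example `python my_module.py`, where the file name is hypothetical) prints one `True`/`False` line per check.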
## Implementing a New Machine-Learning Method
The best way to add a model is to use a ModelFactory and ModelVisitors. A ModelVisitor is a wrapper that knows how to load a machine-learning model of a given type and communicate with that model file in order to output predicted values for a test dataset. A ModelFactory simply knows how to "build" a ModelVisitor from the provided training data. See the "Abstract" files in the `sample_experiment` directory for outlines of what these two classes should do, and the "SVM_ModelFactory" files in the same directory for examples that use WEKA to create model files and produce predictions.
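The division of labor between the two classes can be sketched with a deliberately trivial model that always predicts the most common training label. Everything here is an assumption made for illustration: the class names, the `build` and `test` method names, and the row format are hypothetical and may not match the abstract classes in the repository.

```python
from collections import Counter

# Illustrative only: hypothetical classes, not the repository's abstract classes.

class MajorityModelVisitor:
    """Wraps a 'trained model' (here just one label) and produces predictions."""

    def __init__(self, majority_label, response_index):
        self.majority_label = majority_label
        self.response_index = response_index

    def test(self, test_rows):
        # Return an (actual, predicted) pair for each test row.
        return [(row[self.response_index], self.majority_label) for row in test_rows]


class MajorityModelFactory:
    """Knows how to build a MajorityModelVisitor from training data."""

    def __init__(self, train_rows, response_index=-1):
        self.train_rows = train_rows
        self.response_index = response_index

    def build(self):
        # "Training" is just finding the most common response value.
        labels = [row[self.response_index] for row in self.train_rows]
        majority_label = Counter(labels).most_common(1)[0][0]
        return MajorityModelVisitor(majority_label, self.response_index)


train = [[1, "yes"], [2, "yes"], [3, "no"]]
test_rows = [[4, "no"], [5, "yes"]]

visitor = MajorityModelFactory(train).build()
print(visitor.test(test_rows))  # [('no', 'yes'), ('yes', 'yes')]
```

Keeping training in the factory and prediction in the visitor means a model can be built once and then queried repeatedly on different versions of the test data.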
## Setup and Installation
1. Install WEKA and/or TensorFlow (see below).
2. Update the WEKA path in `model_factories/AbstractWekaModelFactory.py`.
3. Install the Python dependencies listed in the `requirements.txt` file.
4. Install python-matplotlib if you do not already have it (`sudo apt-get install python-matplotlib`).
5. Install BlackBoxAuditing (`pip install BlackBoxAuditing`).
Many of the ModelVisitors rely on [Weka](http://www.cs.waikato.ac.nz/ml/weka/). Similarly, we use [TensorFlow](https://www.tensorflow.org/) for network-based machine learning. Any Python libraries that need to be installed are included in the `requirements.txt` file.
- Weka 3.6.13 [download](http://www.cs.waikato.ac.nz/ml/weka/downloading.html)
- TensorFlow [download](https://www.tensorflow.org/versions/master/get_started/os_setup.html) (original experiments run with version 0.6.0)
# Sources
Dataset Sources:
- adult.csv [link](https://archive.ics.uci.edu/ml/datasets/Adult)
- german_categorical.csv (Modified from [link](https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)))
- RicciDataMod.csv (Modified from [link](http://www.amstat.org/publications/jse/v18n3/RicciData.csv))
- DRP Datasets (Source and data-files coming soon.)
- Arrests/Recidivism Datasets [link](http://www.icpsr.umich.edu/icpsrweb/RCMD/studies/3355)
- Linear Datasets ("sample_2" Experiment) [link](https://github.com/jasonbaldridge/try-tf)
More information on DRP can be found at the [Dark Reactions Project](http://darkreactions.haverford.edu/) official site.
# Bug Reports and Feature Requests
All bug reports and feature requests should be submitted through the [Issue Tracker](https://github.com/cfalk/BlackBoxAuditing/issues).