Generates complex, nonlinear datasets for use with deep learning/black box models

synthetic-data

Inspired by sklearn.datasets.make_classification, which in turn is based on work for the NIPS 2003 feature selection challenge [1] and targets linear classifiers. Here the focus is on generating more complex, nonlinear datasets appropriate for deep learning/black box models, which 'need' nonlinearity; otherwise you would (and should) use a simpler model.

Approach

Ideally, the method would provide a concise specification to generate tabular data with sensible defaults. The specification should provide knobs that the end user can dial up or down to see their downstream impact.

Copulas are a model for specifying the joint probability p(x1, x2, ..., xn) given a correlation structure along with specifications for the marginal distribution of each feature. The current implementation uses a multivariate normal distribution with specified covariance matrix. Future work can expand this choice to other multivariate distributions.
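The copula idea described above can be sketched in a few lines. This is a minimal illustration using NumPy and SciPy, not the library's own implementation; the function name `sample_gaussian_copula` and its arguments are hypothetical. It draws from a multivariate normal with the given covariance, maps each column to a uniform with the normal CDF (the probability integral transform), then applies the inverse CDF of each requested marginal.

```python
import numpy as np
from scipy import stats


def sample_gaussian_copula(n_samples, cov, marginals, seed=None):
    """Hypothetical helper: sample features coupled by a Gaussian copula.

    cov       : (d, d) covariance matrix of the latent normal vector
    marginals : list of d frozen scipy.stats distributions, one per column
    """
    rng = np.random.default_rng(seed)
    cov = np.asarray(cov, dtype=float)
    d = cov.shape[0]
    if cov.shape != (d, d) or len(marginals) != d:
        raise ValueError("cov must be (d, d) and marginals must have length d")

    # 1. Sample the latent multivariate normal.
    z = rng.multivariate_normal(mean=np.zeros(d), cov=cov, size=n_samples)

    # 2. Standardize each column and map it to (0, 1) with the normal CDF.
    u = stats.norm.cdf(z / np.sqrt(np.diag(cov)))

    # 3. Push each uniform column through the inverse CDF of its marginal.
    return np.column_stack([m.ppf(u[:, j]) for j, m in enumerate(marginals)])


# Two correlated features: one uniform on [0, 1], one exponential.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
marginals = [stats.uniform(loc=0, scale=1), stats.expon(scale=2.0)]
X = sample_gaussian_copula(1000, cov, marginals, seed=0)
```

The marginals change the shape of each column but not the rank ordering of the samples, so the dependence set by the covariance matrix carries over to the transformed features.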

Parameters

| name | type | default | description |
| --- | --- | --- | --- |
| n_samples | int | 100 | The number of samples. |
| n_informative | int | 2 | The number of informative features. These should all be represented in the symbolic expression used to generate y_reg. |
| n_nuisance | int | 0 | The number of nuisance features. These should not be included in the symbolic expression and hence have no role in the DGP (data-generating process). |
| n_classes | int | 2 | The number of classes. |
| dist | list | | A list of the marginal distributions to apply to the features/columns. |
| cov | matrix | | A square numpy array with dimensions (n_total x n_total), where n_total = n_informative + n_nuisance. |
| expr | str | | An expression providing y = f(X). |
| sig_k | float | 1.0 | The steepness of the sigmoid used in mapping y_reg to y_prob. |
| sig_x0 | float | None | The center point of the sigmoid used in mapping y_reg to y_prob. |
| p_thresh | float | 0.5 | Threshold probability that determines the boundary between classes. |
| noise_level_x | float | 0.0 | Level of Gaussian white noise to apply to X. |
| noise_level_y | float | 0.0 | Level of Gaussian white noise to apply to y_label (like flip_y). |
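To make the roles of sig_k, sig_x0, p_thresh and noise_level_y more concrete, here is a rough sketch of how a continuous y_reg could be turned into class probabilities and binary labels. The helper `regression_to_labels` is hypothetical and is not the library's code. It rests on three assumptions: the sigmoid is the standard logistic function, sig_x0=None means "center on the mean of y_reg", and noise_level_y is the fraction of labels flipped at random (in the spirit of flip_y).

```python
import numpy as np


def regression_to_labels(y_reg, sig_k=1.0, sig_x0=None, p_thresh=0.5,
                         noise_level_y=0.0, seed=None):
    """Hypothetical helper: map y_reg -> (y_prob, y_label) for two classes."""
    rng = np.random.default_rng(seed)
    y_reg = np.asarray(y_reg, dtype=float)

    # Assumption: with no explicit center, use the mean of y_reg.
    x0 = y_reg.mean() if sig_x0 is None else sig_x0

    # Logistic sigmoid: sig_k sets the steepness, x0 the midpoint.
    y_prob = 1.0 / (1.0 + np.exp(-sig_k * (y_reg - x0)))

    # Threshold the probability to obtain a binary label.
    y_label = (y_prob > p_thresh).astype(int)

    # Assumption: flip each label independently with probability noise_level_y.
    flip = rng.random(y_label.shape) < noise_level_y
    y_label = np.where(flip, 1 - y_label, y_label)
    return y_prob, y_label


y_reg = np.array([-2.0, -0.5, 0.5, 2.0])
y_prob, y_label = regression_to_labels(y_reg, sig_k=2.0)
print(y_label)  # [0 0 1 1]
```

A larger sig_k pushes y_prob toward 0 or 1 and sharpens the class boundary. A smaller value leaves more samples with probabilities near p_thresh.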

Getting Started

Local Installation

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Tests

Test/Lint Dependencies

pip install -r requirements-test.txt

To run tests:

make test_local

Pre-Commit

To install pre-commit hooks, run the following commands:

pre-commit install
pre-commit run

Referencing this library

If you use this library in your work, please cite our paper:

@inproceedings{barr:2020,
  author    = {Brian Barr and Ke Xu and Claudio Silva and Enrico Bertini and Robert Reilly and  C. Bayan Bruss and Jason D. Wittenbach},
  title     = {{Towards Ground Truth Explainability on Tabular Data}},
  year      = {2020},
  maintitle = {International Conference on Machine Learning},
  booktitle = {2020 ICML Workshop on Human Interpretability in Machine Learning (WHI 2020)},
  date = {2020-07-17},
  pages = {362-367},
}

Notes

If you have tabular data and want to fit a copula to it, consider the Python library copulas. See also the quick visual tutorial of copulas and the probability integral transform.

To run the examples, you should run:

$ python -m pip install pandas pytest pytest-cov seaborn shap tensorflow "DataProfiler[full]"

References

[1] Guyon, “Design of experiments for the NIPS 2003 variable selection benchmark”, 2003.
