
Generates complex, nonlinear datasets for use with deep learning/black box models


synthetic-data

Inspired by sklearn.datasets.make_classification, which in turn is based on work for the NIPS 2003 feature selection challenge [1] and targets linear classifiers. Here the focus is on generating more complex, nonlinear datasets appropriate for deep learning/black-box models that 'need' the nonlinearity - otherwise you could, and should, use a simpler model.

Approach

Ideally, the method provides a concise specification for generating tabular data with sensible defaults. The specification should expose knobs that the end user can dial up or down to see their downstream impact.

Copulas are a model for specifying the joint probability p(x1, x2, ..., xn) given a correlation structure along with a specification of the marginal distribution of each feature. The current implementation uses a multivariate normal distribution with a user-specified covariance matrix. Future work could expand this choice to other multivariate distributions.
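
As an illustration of the idea (not the package's internal code), a Gaussian copula can be sampled by drawing from a multivariate normal with the desired covariance, pushing each column through the normal CDF to obtain uniform scores, and then applying the inverse CDF of each marginal. A minimal numpy/scipy sketch:

import numpy as np
from scipy import stats

# desired correlation structure for two informative features
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

rng = np.random.default_rng(0)
z = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=1000)

# probability integral transform: normal scores -> uniforms on [0, 1]
u = stats.norm.cdf(z)

# apply the inverse CDF (ppf) of each desired marginal
x0 = stats.uniform(loc=-1, scale=2).ppf(u[:, 0])  # uniform on [-1, 1]
x1 = stats.expon(scale=1.0).ppf(u[:, 1])          # exponential
X = np.column_stack([x0, x1])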

Parameters

| name | type | default | description |
| --- | --- | --- | --- |
| n_samples | int | 100 | The number of samples. |
| n_informative | int | 2 | The number of informative features - these should all be represented in the symbolic expression used to generate y_reg. |
| n_nuisance | int | 0 | The number of nuisance features - these should not be included in the symbolic expression and hence have no role in the DGP. |
| n_classes | int | 2 | The number of classes. |
| dist | list | | A list of the marginal distributions to apply to the features/columns. |
| cov | matrix | | A square numpy array with dimensions (n_total x n_total), where n_total = n_informative + n_nuisance. |
| expr | str | | An expression providing y = f(X). |
| sig_k | float | 1.0 | The steepness of the sigmoid used in mapping y_reg to y_prob. |
| sig_x0 | float | None | The center point of the sigmoid used in mapping y_reg to y_prob. |
| p_thresh | float | 0.5 | The threshold probability that determines the boundary between classes. |
| noise_level_x | float | 0.0 | The level of Gaussian white noise to apply to X. |
| noise_level_y | float | 0.0 | The level of Gaussian white noise to apply to y_label (like flip_y). |
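
The sigmoid parameters above control how the continuous target y_reg is mapped to class probabilities and labels. A minimal sketch of that mapping, assuming the standard logistic form (parameter names mirror the table; this is an illustration, not the package's exact code):

import numpy as np

def labels_from_regression(y_reg, sig_k=1.0, sig_x0=None, p_thresh=0.5):
    """Map a continuous target y_reg to class probabilities and binary labels."""
    if sig_x0 is None:
        sig_x0 = np.median(y_reg)  # assumption: center the sigmoid on the data
    # logistic squashing - steeper for larger sig_k
    y_prob = 1.0 / (1.0 + np.exp(-sig_k * (y_reg - sig_x0)))
    # threshold the probabilities to obtain class labels
    y_label = (y_prob >= p_thresh).astype(int)
    return y_prob, y_label

y_prob, y_label = labels_from_regression(np.linspace(-3.0, 3.0, 7), sig_k=2.0)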

Getting Started

Local Installation

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Tests

Test/Lint Dependencies

pip install -r requirements-test.txt

To run tests:

make test_local

Pre-Commit

To install pre-commit hooks, run the following commands:

pre-commit install
pre-commit run

Referencing this library

If you use this library in your work, please cite our paper:

@inproceedings{barr:2020,
  author    = {Brian Barr and Ke Xu and Claudio Silva and Enrico Bertini and Robert Reilly and  C. Bayan Bruss and Jason D. Wittenbach},
  title     = {{Towards Ground Truth Explainability on Tabular Data}},
  year      = {2020},
  maintitle = {International Conference on Machine Learning},
  booktitle = {2020 ICML Workshop on Human Interpretability in Machine Learning (WHI 2020)},
  date = {2020-07-17},
  pages = {362-367},
}

Notes

If you have tabular data and want to fit a copula to it, consider the copulas Python library. There is also a quick visual tutorial of copulas and the probability integral transform.
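
As a rough sketch of that workflow (assuming the sdv-dev copulas package and its GaussianMultivariate model; the input file name here is hypothetical), fitting a copula to existing data and sampling new rows looks like:

import pandas as pd
from copulas.multivariate import GaussianMultivariate

# fit a Gaussian copula to existing tabular data
real_data = pd.read_csv("my_table.csv")  # hypothetical input file
model = GaussianMultivariate()
model.fit(real_data)

# draw synthetic rows with a similar correlation structure and marginals
synthetic_rows = model.sample(1000)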

To run the examples, install the additional dependencies:

python -m pip install pandas pytest pytest-cov seaborn shap tensorflow "DataProfiler[full]"

References

[1] Guyon, “Design of experiments for the NIPS 2003 variable selection benchmark”, 2003.
