DIGEN: Diverse Generative ML Benchmark

What is DIGEN?

The Diverse and Generative ML Benchmark (DIGEN) is a machine learning benchmark that includes:

  • 40 datasets in tabular numeric format specially designed to differentiate the performance of some of the leading Machine Learning (ML) methods, and
  • a package to perform reproducible benchmarking that simplifies comparison of performance of the methods.

DIGEN provides comprehensive information on the datasets, including:

  • ground truth: a mathematical formula showing how the endpoint was generated for each dataset,
  • the results of exploratory analysis, including feature correlations and a histogram showing how the binary endpoint was calculated,
  • multiple statistics on the datasets, including the AUROC, AUPRC and F1 scores,
  • Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) charts for the tuned ML methods on each dataset, and
  • a boxplot with the projected performance of the leading methods after hyper-parameter tuning (100 runs of each method, each started with a different random seed).
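
To make the ideas of a ground-truth formula, a binary endpoint, and an AUROC score concrete, here is a minimal sketch in plain Python. It is not DIGEN's code: the formula, the sample and feature counts, the Gaussian sampling, and the median split are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def hypothetical_formula(row):
    """Illustrative ground truth; NOT one of DIGEN's actual formulas."""
    return row[0] * row[1] - row[2] ** 2


def make_dataset(formula, n_samples=1000, n_features=10, seed=0):
    """Sample Gaussian features and binarize formula(row) at its median.

    The sizes, the distribution, and the median split are assumptions.
    """
    rng = random.Random(seed)  # fixed seed makes the dataset reproducible
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    raw = [formula(row) for row in X]
    threshold = sorted(raw)[n_samples // 2]
    y = [1 if value >= threshold else 0 for value in raw]
    return X, y


def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity, averaging tied ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, label in zip(ranks, y_true) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    X, y = make_dataset(hypothetical_formula)
    # Scoring with the generating formula separates the classes perfectly.
    print("formula as score:", auroc(y, [hypothetical_formula(r) for r in X]))
    # A feature the formula never uses should score close to chance.
    print("unused feature:  ", auroc(y, [r[5] for r in X]))
```

Because the endpoint is derived from a known formula, a method that recovers the formula can reach an AUROC of 1.0, while a method that relies on irrelevant features stays near 0.5. That gap is what lets a generated dataset separate stronger methods from weaker ones.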

Apart from providing a collection of datasets and tuned ML methods, DIGEN provides tools to tune the hyper-parameters of any novel ML method and to visualize its performance in comparison with the leading ones. DIGEN also offers tools for reproducibility.

Dependencies

The following packages are required to use DIGEN:

pandas>=1.05
numpy>=1.19.5
optuna>=2.4.0
scikit-learn>=0.22.2
importlib_resources

Installing DIGEN

The best way to install DIGEN is using pip, e.g. as a user:

pip install -U digen
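
After installation, the minimum versions listed under Dependencies can be checked with a short script. This is a sketch that uses only the standard library (`importlib.metadata`, Python 3.8+). The helper names are hypothetical, and the version comparison looks only at the leading numeric release components, so it is simpler than pip's full version handling.

```python
import re
from importlib import metadata

# Minimum versions copied from the Dependencies list; None means no minimum
# is stated. Note: PEP 440 reads "1.05" as 1.5 because leading zeros are
# dropped; whether 1.0.5 was intended is not stated in the source.
MINIMUMS = {
    "pandas": "1.05",
    "numpy": "1.19.5",
    "optuna": "2.4.0",
    "scikit-learn": "0.22.2",
    "importlib_resources": None,
}


def release_tuple(version):
    """Leading numeric components of a version, e.g. '1.19.5' -> (1, 19, 5)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def at_least(installed, minimum):
    """True if installed >= minimum, padding the shorter version with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_dependencies(minimums=MINIMUMS):
    """Return a {package: status} report for the current environment."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is None or at_least(installed, minimum):
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for name, status in check_dependencies().items():
        print(f"{name}: {status}")
```

In normal use pip resolves these requirements automatically, so the script is mainly useful for diagnosing an existing environment.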

Using DIGEN

A non-peer-reviewed preprint describing DIGEN is available at https://arxiv.org/pdf/2107.06475.pdf

Apart from the datasets, DIGEN provides a comprehensive toolbox for analyzing the performance of a chosen ML method. DIGEN uses Optuna, a state-of-the-art framework for optimizing hyper-parameters.

Please refer to our online documentation at https://epistasislab.github.io/digen

Citing DIGEN

If you found this resource helpful, please cite it as follows:

@article{orzechowski2021generative,
  title={Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers},
  author={Orzechowski, Patryk and Moore, Jason H},
  journal={arXiv preprint arXiv:2107.06475},
  year={2021}
}

Tutorials

The DIGEN Tutorial is a great place to start exploring the package. For advanced use, such as customization, manipulating the charts, or computing additional statistics on the collection, please see the Advanced Tutorial.

Included ML classifiers

The following methods were included in our benchmark:

  • Decision Tree
  • Gradient Boosting
  • K-Nearest Neighbors
  • LightGBM
  • Logistic Regression
  • Random Forest
  • SVC
  • XGBoost
