
Calvin's Data Science Toolbox


CDST (Calvin's Data Science Toolbox)

CDST is a collection of data science Python libraries developed by Calvin Chan at DSAA, Bayer Pharmaceutical. It contains several data science toolsets, mostly based on deep learning techniques:

  • General Scalable Deep Learning Fully Connected Network (DNN)
  • Calvin's Scalable Parallel Downsampler (CSPD)
  • Ordinal Hyperplane Loss Classifier (OHPL)

The above algorithms are written to handle positive output data. Updates to accommodate real-valued outputs will be made in the future upon request.

This package lets users sample the network architecture from a sampling parameter; the architecture sampling function is included in the package. The architecture sampling parameter is treated as a hyperparameter, and the network architecture can be sampled based on either (1) a given number of neurons or (2) a given number of model parameters. When a number of model parameters is given, the sample is computed with a Mixed-Integer Nonlinear Programming model using the GEKKO package. The accuracy/error of a given set of hyperparameters is estimated using k-fold cross-validation, and the accuracy/error of each of the k folds is returned for statistical analysis.
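As a rough illustration of the k-fold step described above, the sketch below splits sample indices into k folds and collects one error value per fold, which can then be summarized statistically. The helper names `kfold_indices` and `cross_validate` and the toy mean-predictor model are hypothetical and are not part of the CDST API.

```python
import random
import statistics


def kfold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, error):
    """Return one validation error per fold (hypothetical helper, not the CDST API)."""
    errors = []
    for held_out in kfold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(
            error(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return errors


# Toy model for illustration: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return statistics.mean(train_y)


def mean_abs_error(model, val_x, val_y):
    return statistics.mean(abs(y - model) for y in val_y)


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_errors = cross_validate(xs, ys, k=5, fit=fit_mean, error=mean_abs_error)
print(fold_errors)
print(statistics.mean(fold_errors), statistics.stdev(fold_errors))
```

Returning the per-fold values rather than a single average keeps the spread across folds available, so two hyperparameter settings can be compared on more than their mean error.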

All deep learning modules in this package are designed around the Ray Tune hyperparameter tuning package. Users can sample the neuron distribution of a multi-layer network using the provided architecture sampling function, together with ranges for other hyperparameters, including learning rate, batch size, and dropout probability.
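To make the idea of sampling from hyperparameter ranges concrete, here is a minimal standard-library sketch that draws a few configurations. It is a stand-in, not Ray Tune code: in Ray Tune the corresponding search-space primitives would be `tune.loguniform`, `tune.choice`, and `tune.uniform`. The function name `sample_config` and the ranges shown are assumptions chosen for illustration only.

```python
import random


def sample_config(rng):
    """Draw one hyperparameter configuration (illustrative ranges only)."""
    return {
        # Log-uniform draw: each decade between 1e-4 and 1e-1 is equally likely.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        # Batch size is picked from a small discrete set.
        "batch_size": rng.choice([32, 64, 128, 256]),
        # Dropout probability is drawn uniformly from [0, 0.5].
        "dropout": rng.uniform(0.0, 0.5),
    }


rng = random.Random(42)
configs = [sample_config(rng) for _ in range(3)]
for config in configs:
    print(config)
```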

Design examples are provided in the "example" folder, with the detailed structure and a graphical illustration of each module. Users can follow these examples and adjust them to suit their own use case and to better understand the mechanics behind the package.

Hyperparameter Tuning

DNN

  • Use a custom sampling function to describe the hierarchical neuron distribution between:
    • total neurons
    • neurons per layer

CSPD

  • Use a custom sampling function to describe the hierarchical neuron distribution between:
    • total neurons
    • neurons per subgroup
    • neurons per layer

OHPL

  • Use a custom sampling function to describe the hierarchical neuron distribution between:
    • total neurons
    • neurons per layer

Custom Sampling Function

split_sampling(num_ele, num_layers=None, n_min=1, n_max=None, n_samples=1, prepend=[], postpend=[], single_sample=False)

   num_ele: Total number of elements to be distributed
   n_min: Minimum number of elements per output dimension
   n_max: Maximum number of elements per output dimension
   num_layers: Number of layers over which to distribute the elements; a random number of dimensions is used when None is given
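The parameters above describe splitting a total number of elements across layers within per-layer bounds. The sketch below shows one simple way such a split could be drawn. The function `split_total` is a hypothetical stand-in written for illustration; it is not the implementation of `split_sampling`, and it does not sample uniformly over all valid splits.

```python
import random


def split_total(num_ele, num_layers, n_min=1, n_max=None, rng=None):
    """Split num_ele into num_layers integers, each within [n_min, n_max].

    Hypothetical stand-in for illustration; not the CDST implementation.
    """
    rng = rng or random.Random()
    if n_max is None:
        n_max = num_ele
    if not n_min * num_layers <= num_ele <= n_max * num_layers:
        raise ValueError("no split satisfies the given bounds")
    # Start every layer at the minimum, then hand out the remainder one
    # element at a time to a random layer that still has room.
    sizes = [n_min] * num_layers
    for _ in range(num_ele - n_min * num_layers):
        open_layers = [i for i, size in enumerate(sizes) if size < n_max]
        sizes[rng.choice(open_layers)] += 1
    return sizes


layers = split_total(64, num_layers=4, n_min=4, n_max=32, rng=random.Random(0))
print(layers, sum(layers))
```

The feasibility check up front guarantees that some layer always has room while elements remain to be handed out, so the loop cannot get stuck.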

parameters_sampling(num_params, num_layers, in_dim, out_dim=1, n_min=1, n_max=None, n_samples=1, include_inout=True, single_sample=False, max_trials=1000)

   num_params: Total number of parameters to be distributed
   num_layers: Total number of layers
   n_min:      Minimum number of neurons per layer
   n_max:      Maximum number of neurons per layer
   in_dim:     Number of neurons at the input layer
   out_dim:    Number of neurons at the output layer
   n_samples:  Number of architecture samples to return (if fewer samples are found than requested, all found samples are returned)
   include_inout: Flag indicating whether to include the input and output layer neurons in the samples
   max_trials: Maximum number of randomized trials for solution sampling if not enough samples are found
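To show what distributing a parameter budget over layers involves, the sketch below counts the parameters of a fully connected network and runs a plain random search for hidden-layer widths whose count comes closest to a target. It is a simplified stand-in: CDST computes its samples with a Mixed-Integer Nonlinear Programming model in GEKKO, not with random search. The names `count_parameters`, `sample_architecture`, and `num_hidden` are hypothetical, and the count assumes every layer has a bias term.

```python
import random


def count_parameters(widths):
    """Weights plus biases of a fully connected network with the given layer widths.

    Assumes one bias per neuron in every non-input layer.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(widths, widths[1:]))


def sample_architecture(num_params, num_hidden, in_dim, out_dim=1,
                        n_min=1, n_max=64, max_trials=1000, rng=None):
    """Random search for hidden widths whose parameter count is closest to num_params.

    Hypothetical stand-in; CDST itself uses a GEKKO MINLP model for this step.
    """
    rng = rng or random.Random()
    best, best_gap = None, None
    for _ in range(max_trials):
        hidden = [rng.randint(n_min, n_max) for _ in range(num_hidden)]
        gap = abs(count_parameters([in_dim, *hidden, out_dim]) - num_params)
        if best_gap is None or gap < best_gap:
            best, best_gap = hidden, gap
    # Return the widths including the input and output layers.
    return [in_dim, *best, out_dim]


# (10 + 1) * 8 + (8 + 1) * 1 = 97 parameters
print(count_parameters([10, 8, 1]))

arch = sample_architecture(500, num_hidden=2, in_dim=10, rng=random.Random(1))
print(arch, count_parameters(arch))
```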

Installation

Use the package manager pip to install CDST.

pip install git+https://github.com/Bayer-Group/cdst.git

Usage

import cdst

Contributing

For major changes, please open an issue first to discuss what you would like to change. For collaborative development, please create a development branch in the git repository and submit it for approval prior to merging into the master branch.

Please make sure to update tests as appropriate.

License

BSD-3-Clause License

Written by Calvin W.Y. Chan calvin.chan@bayer.com, March 2022 (Github: https://github.com/calvinwy, Linkedin: https://www.linkedin.com/in/calchan/)

