Calvin's Data Science Toolbox

CDST (Calvin's Data Science Toolbox)

CDST is a collection of Python data science libraries developed by Calvin Chan at DSAA, Bayer Pharmaceutical. It contains several data science toolsets, mostly based on deep learning techniques:

  • General Scalable Deep Learning Fully Connected Network (DNN)
  • Calvin's Scalable Parallel Downsampler (CSPD)
  • Ordinal Hyperplane Loss Classifier (OHPL)

The above algorithms are written to handle positive output data. Updates to accommodate real-valued outputs will be made in the future upon request.

This package lets users sample the network architecture from a sampling parameter, and the architecture sampling function is included in the package. The architecture sampling parameter is treated as a hyperparameter, and the user can sample the network architecture based on (1) a given number of neurons or (2) a given number of model parameters. When a number of model parameters is given, the sample is computed with a Mixed-Integer Nonlinear Programming model using the GEKKO package. The accuracy/error of a given set of hyperparameters is estimated using k-fold cross-validation, and the accuracy/error of each of the k folds is returned for statistical analysis.
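The per-fold reporting described above can be illustrated with a minimal, standard-library sketch. It is not CDST code: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mean_abs_error`) and the toy data are hypothetical and only show why returning one score per fold is useful for statistical analysis.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, score, seed=0):
    """Return one score per fold; each fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model,
                            [xs[i] for i in held_out],
                            [ys[i] for i in held_out]))
    return scores

# Toy stand-in for a network: predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return statistics.mean(train_y)

# Toy error metric: mean absolute error of the constant prediction.
def mean_abs_error(model, test_x, test_y):
    return statistics.mean(abs(y - model) for y in test_y)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_errors = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_abs_error)

# The per-fold values allow a spread estimate, not just a single average.
print(fold_errors)
print(statistics.mean(fold_errors), statistics.stdev(fold_errors))
```

Keeping the individual fold errors, rather than only their mean, makes it possible to compare two hyperparameter settings with a measure of variability.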

All deep learning modules in this package are designed around the Ray Tune hyperparameter tuning package. Users can sample the multi-layer neuron distribution using the provided architecture sampling function, together with ranges for other hyperparameters, including learning rate, batch size, and dropout probability.
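To make the idea of a combined search space concrete, here is a small sketch that draws one configuration with Python's standard `random` module. It deliberately does not use Ray Tune or any CDST function; the function name `sample_config`, the key names, and all ranges are illustrative assumptions.

```python
import random

def sample_config(rng):
    """Draw one hypothetical configuration; every range here is an assumption."""
    return {
        # Log-uniform learning rate between 1e-5 and 1e-2.
        "learning_rate": 10 ** rng.uniform(-5, -2),
        # Batch size chosen from a fixed set of candidates.
        "batch_size": rng.choice([32, 64, 128, 256]),
        # Dropout probability drawn uniformly.
        "dropout": rng.uniform(0.0, 0.5),
        # Neuron budget that an architecture sampler would then distribute.
        "total_neurons": rng.randint(32, 256),
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(3)]
for cfg in configs:
    print(cfg)
```

In an actual tuning run, the tuning framework would own this sampling step and the architecture sampling function would turn the neuron budget into per-layer sizes.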

Design examples are provided in the "example" folder, with the detailed structure and a graphical illustration of each module. Users can follow these examples and adapt them to their own use case and to better understand the mechanics behind the package.

Hyperparameter Tuning

DNN

  • Use the custom sampling function to describe the hierarchical neuron distribution between:
    • total neuron:

    • neuron per layer:

CSPD

  • Use the custom sampling function to describe the hierarchical neuron distribution between:
    • total neuron:

    • neuron per subgroup:

    • neuron per layer:

OHPL

  • Use the custom sampling function to describe the hierarchical neuron distribution between:
    • total neuron:

    • neuron per layer:

Custom Sampling Function

split_sampling(num_ele, num_layers=None, n_min=1, n_max=None, n_samples=1, prepend=[], postpend=[], single_sample=False)

   num_ele: Total number of elements to be distributed
   n_min: Minimum number of elements per output dimension
   n_max: Maximum number of elements per output dimension
   num_layers: Number of layers over which to distribute the elements; a random number of dimensions is used when None is given
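As a rough illustration of what distributing `num_ele` elements across layers within `n_min` and `n_max` can look like, the sketch below implements a simple bounded random split. It is not the package's `split_sampling` implementation: the function `random_split` and its `rng` argument are hypothetical, it requires an explicit `num_layers`, and it makes no claim about matching the sampling distribution CDST uses.

```python
import random

def random_split(num_ele, num_layers, n_min=1, n_max=None, rng=None):
    """Randomly split num_ele into num_layers integers, each in [n_min, n_max]."""
    rng = rng or random.Random()
    if n_max is None:
        n_max = num_ele
    if not (num_layers * n_min <= num_ele <= num_layers * n_max):
        raise ValueError("no split satisfies the bounds")
    # Start every layer at the minimum, then hand out the rest one by one.
    sizes = [n_min] * num_layers
    for _ in range(num_ele - num_layers * n_min):
        # Only layers still below n_max can take another element.
        open_layers = [i for i, s in enumerate(sizes) if s < n_max]
        sizes[rng.choice(open_layers)] += 1
    return sizes

layers = random_split(64, num_layers=4, n_min=4, n_max=32, rng=random.Random(0))
print(layers, sum(layers))
```

The feasibility check up front guarantees that at least one layer can always accept the next element, so the loop never runs out of candidates.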

parameters_sampling(num_params, num_layers, in_dim, out_dim=1, n_min=1, n_max=None, n_samples=1, include_inout=True, single_sample=False, max_trials=1000)

   num_params: Total number of parameters to be distributed
   num_layers: Total number of layers
   n_min:      Minimum number of neurons per layer
   n_max:      Maximum number of neurons per layer
   in_dim:     Number of neurons at the input layer
   out_dim:    Number of neurons at the output layer
   n_samples:  Number of architecture samples to return (if fewer are found than requested, all found samples are returned)
   include_inout: Flag indicating whether to include the input and output layer neurons in the samples
   max_trials: Maximum number of randomized trials for solution sampling if not enough samples are found
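The relationship between layer sizes and parameter count can be sketched as follows. This is a simplified random-trial search, not the GEKKO-based Mixed-Integer Nonlinear Programming model that the package uses. The names `count_parameters`, `sample_architectures`, `num_hidden`, and `tolerance` are hypothetical, and the count assumes a fully connected network with one bias per neuron.

```python
import random

def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with these layer widths."""
    return sum((a + 1) * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def sample_architectures(num_params, num_hidden, in_dim, out_dim=1, n_min=1,
                         n_max=64, n_samples=1, tolerance=0.05,
                         max_trials=1000, seed=0):
    """Randomly search for hidden-layer sizes whose parameter count is
    within `tolerance` (relative) of num_params. May return fewer than
    n_samples if max_trials is exhausted."""
    rng = random.Random(seed)
    found = []
    for _ in range(max_trials):
        hidden = [rng.randint(n_min, n_max) for _ in range(num_hidden)]
        total = count_parameters([in_dim, *hidden, out_dim])
        close_enough = abs(total - num_params) <= tolerance * num_params
        if close_enough and hidden not in found:
            found.append(hidden)
            if len(found) == n_samples:
                break
    return found

# (10 + 1) * 8 weights and biases into the hidden layer, (8 + 1) * 1 into the output.
print(count_parameters([10, 8, 1]))  # 97

architectures = sample_architectures(num_params=1000, num_hidden=2, in_dim=10,
                                     n_min=4, n_max=64, n_samples=3,
                                     max_trials=5000)
for hidden in architectures:
    print(hidden, count_parameters([10, *hidden, 1]))
```

A solver-based approach, as the package describes, avoids the wasted trials of this rejection-style search, which is why `max_trials` only acts as a fallback budget there.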

Installation

Use the package manager pip to install CDST.

pip install git+https://github.com/Bayer-Group/cdst.git

Usage

import cdst

Contributing

For major changes, please open an issue first to discuss what you would like to change. For collaborative development, please create a development branch in the git repository and submit it for approval before merging into the master branch.

Please make sure to update tests as appropriate.

License

BSD-3-Clause License

Written by Calvin W.Y. Chan (calvin.chan@bayer.com), March 2022 (GitHub: https://github.com/calvinwy, LinkedIn: https://www.linkedin.com/in/calchan/)
