
PineBioML is an easy-to-use ML toolkit.


Overview

This package aims to help analyze biomedical data with machine learning (ML) methods in Python.


System requirements

  1. Python 3.10+
  2. The following Python packages are required:

pandas openpyxl xlrd tqdm seaborn gprofiler-official jupyter jupyterlab optuna scikit-learn umap-learn pacmap statsmodels mljar-supervised joblib
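To see which of these requirements are already present before installing anything, a small check like the one below can help. It is only a sketch: the helper names `installed_versions` and `python_is_supported` are made up for this illustration and are not part of PineBioML.

```python
import sys
from importlib import metadata

# Distribution names copied from the requirement list above.
REQUIRED = [
    "pandas", "openpyxl", "xlrd", "tqdm", "seaborn", "gprofiler-official",
    "jupyter", "jupyterlab", "optuna", "scikit-learn", "umap-learn",
    "pacmap", "statsmodels", "mljar-supervised", "joblib",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def python_is_supported(version_info=sys.version_info):
    """True when the interpreter is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


if __name__ == "__main__":
    print("Python 3.10+:", python_is_supported())
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:20s} {version or 'MISSING'}")
```

Packages reported as MISSING are normally pulled in automatically when PineBioML is installed with pip.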

Installation

1. Install Python

Please follow this tutorial to install Python (the sections "Visual Studio Code" and "Git" are optional):

https://learn.microsoft.com/en-us/windows/python/beginners 

Skip this step if Python 3.10+ is already installed on your PC.

2. Install dependencies and execute the scripts

Step 1. Download the examples from the Releases page and unzip them.

https://github.com/ICMOL/undetermined/releases

Step 2. Open Windows PowerShell and run the following command.

> pip install PineBioML

Step 3. Change to the directory of the unzipped examples and open the Jupyter interface.

Run the following command to start Jupyter. If it starts correctly, you will see the interface shown in the figure.

> python -m notebook

image

Input Table Format

The input data should be tabular and placed in the ./input folder. Supported formats are .csv, .tsv, .xlsx, and R tables saved as .txt.
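As a rough illustration of this layout, the sketch below lists the supported files in an input folder and reads the two plain delimited formats with the standard library. The function names are hypothetical and this is not how PineBioML itself loads data. The .xlsx and R-table cases are left out because they need a dedicated reader such as pandas.

```python
import csv
from pathlib import Path

# Extensions listed in the text above; ".txt" is where R tables are expected.
SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".txt"}


def find_input_tables(folder="./input"):
    """Return the supported table files in `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def read_delimited(path):
    """Read a .csv or .tsv file into a list of row dictionaries (string values)."""
    path = Path(path)
    delimiter = {".csv": ",", ".tsv": "\t"}[path.suffix.lower()]
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

Calling `find_input_tables()` from the examples directory gives a quick check that the data sits where the notebooks expect it.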

Process

0. Document

API

1. Missing value preprocess

ID | Option | Definition
1 | Deletion | Remove features that have too many missing values.
2 | Imputation with a constant value | Impute missing values with a constant, such as 0 or the feature mean.
3 | Imputation using the K-NN algorithm | Impute missing values with the mean or median of the k nearest samples.
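The three options can be illustrated with a short pure-Python sketch. It assumes a table stored as a list of rows with `None` marking a missing value. The function names, the 0.5 missing-ratio threshold, and the distance used for the nearest-neighbour search are choices made for this example only, not PineBioML's implementation.

```python
import math


def drop_sparse_features(rows, max_missing_ratio=0.5):
    """Option 1: keep only columns whose share of missing values is acceptable."""
    n_rows = len(rows)
    keep = [
        j for j in range(len(rows[0]))
        if sum(row[j] is None for row in rows) / n_rows <= max_missing_ratio
    ]
    return [[row[j] for j in keep] for row in rows], keep


def impute_constant(rows, strategy="mean", fill_value=0.0):
    """Option 2: fill gaps with the column mean, or with `fill_value` otherwise."""
    filled = [list(row) for row in rows]
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if strategy == "mean" and observed:
            value = sum(observed) / len(observed)
        else:
            value = fill_value
        for row in filled:
            if row[j] is None:
                row[j] = value
    return filled


def impute_knn(rows, k=2):
    """Option 3: fill each gap with the column mean over the k nearest rows."""

    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Only rows that have column j observed can donate a value.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


table = [
    [1.0, 10.0, None],
    [1.1, None, None],
    [5.0, 50.0, None],
    [5.2, 52.0, 7.0],
]
reduced, kept = drop_sparse_features(table)
print(kept)                              # -> [0, 1]
print(impute_knn(reduced, k=1)[1])       # -> [1.1, 10.0]
```

In this toy table the third column is 75% empty and is removed, after which the remaining gap is filled from the closest sample.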

2. Data transformation

ID | Option | Definition
1 | PCA | Principal component transform.
2 | Power transform | Make the data more Gaussian-like with either the Box-Cox or the Yeo-Johnson transform.
3 | Feature clustering | Group similar features into clusters.
4 | Feature expansion | Generate new features from the sum, product, or ratio of randomly paired existing features.
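To make options 2 and 4 concrete, the sketch below writes out the two power transforms for a single value and a toy version of feature expansion. The parameter `lam` is passed in by hand here, whereas real tools usually estimate it from the data. The function names and the naming scheme for the new columns are assumptions of this example, not PineBioML's API.

```python
import math
import random


def box_cox(x, lam):
    """Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox needs strictly positive input")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


def yeo_johnson(x, lam):
    """Yeo-Johnson transform, which also accepts zero and negative values."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -(((1 - x) ** (2 - lam) - 1) / (2 - lam))


def expand_features(columns, n_new=3, seed=0):
    """Build new columns from the sum, product, or ratio of random column pairs."""
    rng = random.Random(seed)
    names = sorted(columns)
    operations = {
        "add": lambda a, b: a + b,
        "product": lambda a, b: a * b,
        # A zero denominator yields NaN instead of raising.
        "ratio": lambda a, b: a / b if b != 0 else math.nan,
    }
    # Never ask for more distinct (operation, ordered pair) combinations than exist.
    n_new = min(n_new, len(names) * (len(names) - 1) * len(operations))
    new_columns = {}
    while len(new_columns) < n_new:
        left, right = rng.sample(names, 2)
        op = rng.choice(sorted(operations))
        key = f"{op}({left},{right})"
        new_columns[key] = [
            operations[op](a, b) for a, b in zip(columns[left], columns[right])
        ]
    return new_columns
```

With `lam = 1` both transforms leave the data unchanged up to a shift, which is a handy sanity check when trying other values.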

3. Feature selection

ID | Option | Definition
1 | Volcano plot | Select by group p-value and fold change.
2 | Lasso regression | Select by linear models with an L1 penalty.
3 | Decision stump | Select by a 1-layer decision tree.
4 | Random Forest | Select by Gini impurity or permutation importance over a Random Forest.
5 | AdaBoost | Select by Gini impurity over an AdaBoost model.
6 | Gradient boosting | Select by Gini impurity over a gradient boosting model, such as XGBoost or LightGBM.
7 | Linear SVM | Select by the support vectors of a support vector machine.
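As one example of how such a selector works, the decision stump option can be sketched in a few lines: each feature is scored by the largest drop in Gini impurity that a single threshold on that feature achieves, and features are ranked by that score. This is a minimal illustration with hypothetical function names, not PineBioML's code.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_stump_gain(values, labels):
    """Largest impurity decrease achievable with one threshold on one feature."""
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def rank_features(columns, labels, top_k=None):
    """Sort feature names by the impurity decrease of their best stump."""
    scores = {name: best_stump_gain(vals, labels) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores


labels = [0, 0, 0, 1, 1, 1]
columns = {
    "marker": [0.1, 0.2, 0.3, 0.9, 1.0, 1.1],  # separates the groups perfectly
    "noise": [0.5, 0.9, 0.1, 0.6, 0.2, 0.8],
}
ranked, scores = rank_features(columns, labels)
print(ranked)  # -> ['marker', 'noise']
```

The tree-based options in the table follow the same idea, but aggregate impurity decreases over many splits and many trees.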

4. Model building

ID | Option | Definition
1 | ElasticNet | Use Optuna to search for reasonably good hyperparameters on the given dataset.
2 | SVM | Use Optuna to search for reasonably good hyperparameters on the given dataset.
3 | Decision Tree | Use Optuna to search for reasonably good hyperparameters on the given dataset.
4 | Random Forest | Use Optuna to search for reasonably good hyperparameters on the given dataset.
5 | AdaBoost | Use Optuna to search for reasonably good hyperparameters on the given dataset.
6 | XGBoost | Use Optuna to search for reasonably good hyperparameters on the given dataset.
7 | LightGBM | Use Optuna to search for reasonably good hyperparameters on the given dataset.
8 | CatBoost | Use Optuna to search for reasonably good hyperparameters on the given dataset.

5. Report and visualization

ID | Option | Definition
1 | data_overview | Give a quick overview of the input data.
2 | classification_summary | Summarize a classification task.
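The kind of numbers a classification summary typically reports can be computed as in the sketch below. The function `summarize_classification` is a hypothetical helper written for this illustration. It is not the `classification_summary` function listed above, whose exact output is described in the API documentation.

```python
from collections import Counter


def summarize_classification(y_true, y_pred):
    """Return accuracy plus per-class precision and recall."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    classes = sorted(set(y_true) | set(y_pred))
    correct = sum(confusion[(c, c)] for c in classes)
    summary = {"accuracy": correct / len(y_true), "per_class": {}}
    for c in classes:
        predicted_as_c = sum(confusion[(t, c)] for t in classes)
        truly_c = sum(confusion[(c, p)] for p in classes)
        summary["per_class"][c] = {
            "precision": confusion[(c, c)] / predicted_as_c if predicted_as_c else 0.0,
            "recall": confusion[(c, c)] / truly_c if truly_c else 0.0,
        }
    return summary


y_true = ["tumor", "tumor", "normal", "normal", "normal"]
y_pred = ["tumor", "normal", "normal", "normal", "tumor"]
print(summarize_classification(y_true, y_pred)["accuracy"])  # -> 0.6
```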

Examples for Program Demonstration

Choose one of the following examples and double-click it in the Jupyter interface:

ID | Name | Description
1 | example_BasicUsage.ipynb | Demonstrate the basic features of PineBioML.
2 | example_Proteomics.ipynb | An example of proteomics data analysis.
3 | example_PipeLine.ipynb | Demonstrate how to use the pipeline to store the whole data processing flow.
4 | example_Pine.ipynb | Demonstrate how to use Pine ml to find the best data processing flow efficiently.
5 | example_UsingExistingModel.ipynb | An example of using existing models or pipelines obtained from 3., 4. or 5.

Click the button and the script should start.
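The idea behind storing a processing flow and reusing it later can be shown without any PineBioML code. In the sketch below, every class is a toy stand-in invented for this example. The fitted steps are chained, serialized with the standard `pickle` module, and restored to transform new data with the parameters learned earlier.

```python
import pickle


class MeanImputer:
    """Learns a fill value from training data and reuses it later."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.fill_ = sum(observed) / len(observed)
        return self

    def transform(self, values):
        return [self.fill_ if v is None else v for v in values]


class RangeScaler:
    """Rescales values using the minimum and maximum seen during fit."""

    def fit(self, values):
        self.low_, self.high_ = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.high_ - self.low_) or 1.0
        return [(v - self.low_) / span for v in values]


class SimplePipeline:
    """Applies its steps in order, so the whole flow is one storable object."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values


pipeline = SimplePipeline([MeanImputer(), RangeScaler()]).fit([2.0, None, 4.0, 6.0])
blob = pickle.dumps(pipeline)   # the same bytes could be written to a file
restored = pickle.loads(blob)
print(restored.transform([None, 8.0]))  # -> [0.5, 1.5]
```

The restored object fills the gap with the training mean (4.0) and scales with the training range (2.0 to 6.0), which is why new data does not need to be refitted.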

Citation

The example data is from LinkedOmicsKB

A proteogenomics data-driven knowledge base of human cancer, Yuxing Liao, Sara R. Savage, Yongchao Dou, Zhiao Shi, Xinpei Yi, Wen Jiang, Jonathan T. Lei, Bing Zhang, Cell Systems, 2023.

