PineBioML is an easy-to-use ML toolkit.
Overview
This package aims to help analyze biomedical data using ML methods in Python.
System requirements
- Python 3.10+
- The following Python packages are required:
pandas openpyxl xlrd tqdm seaborn gprofiler-official jupyter jupyterlab optuna scikit-learn umap-learn pacmap statsmodels mljar-supervised joblib
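Before installing, it can help to see which of these packages are already present. The following is a minimal sketch using only the standard library; the helper names `missing_distributions` and `python_is_supported` are hypothetical and are not part of PineBioML.

```python
import sys
from importlib import metadata

# Distribution names as listed above (pip names, not import names).
REQUIRED = [
    "pandas", "openpyxl", "xlrd", "tqdm", "seaborn", "gprofiler-official",
    "jupyter", "jupyterlab", "optuna", "scikit-learn", "umap-learn",
    "pacmap", "statsmodels", "mljar-supervised", "joblib",
]


def missing_distributions(names):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def python_is_supported(version_info=sys.version_info, minimum=(3, 10)):
    """Check the interpreter version against the stated minimum (3.10)."""
    return tuple(version_info[:2]) >= minimum


print("Python 3.10+ :", python_is_supported())
print("Missing      :", missing_distributions(REQUIRED) or "none")
```

Anything reported as missing is installed by the `requirements.txt` step described in the Installation section.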
Installation
1. Install Python
Please follow this tutorial to install Python (the sections "Visual Studio Code" and "Git" are optional):
https://learn.microsoft.com/en-us/windows/python/beginners
Skip this step if you already have Python 3.10+ installed on your PC.
2. Install dependencies and execute the scripts
Step 1. Download our scripts from the Releases page and unzip them.
https://github.com/ICMOL/undetermined/releases
Step 2. Install dependencies: Please open Windows PowerShell, move to the directory of our scripts, and execute the following command.
pip install -r ./requirements.txt
Step 3. Open the jupyter interface
Please execute one of the following commands to open Jupyter. If the command runs correctly, the Jupyter interface will open in your browser.
> jupyter lab
or
> python -m notebook
Input Table Format
The input data should be tabular and placed in the ./input folder. We accept .csv, .tsv, .xlsx and R-table in .txt formats.
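As an illustration of how such files can be loaded, here is a minimal sketch with pandas. The `read_table` helper is hypothetical and is not PineBioML's own reader; it also assumes that R tables in `.txt` are tab-separated, which depends on how they were exported.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_table(path):
    """Load a tabular file, choosing the pandas reader by file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".tsv", ".txt"}:
        # Assumption: .txt R tables were written with a tab separator.
        return pd.read_csv(path, sep="\t")
    if suffix == ".xlsx":
        # pandas uses openpyxl to read .xlsx files.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix}")


# Small round-trip demo in a temporary folder.
demo = pd.DataFrame({"sample": ["s1", "s2"], "f1": [1.0, 2.5], "f2": [3, 4]})
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "demo.csv"
    tsv_path = Path(folder) / "demo.tsv"
    demo.to_csv(csv_path, index=False)
    demo.to_csv(tsv_path, sep="\t", index=False)
    loaded_csv = read_table(csv_path)
    loaded_tsv = read_table(tsv_path)

print(loaded_csv)
```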
Process
0. Document
1. Missing value preprocessing
ID | Option | Definition |
---|---|---|
1 | Deletion | Remove features that have too many missing values. |
2 | Imputation with a constant value | Impute missing values with a constant value, such as 0 or the feature mean. |
3 | Imputation using K-NN algorithm | Impute missing values with the mean or median of the k nearest samples. |
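The three options above can be sketched with pandas and scikit-learn directly. This is an illustration of the techniques, not PineBioML's own interface; the toy data and the 0.5 missing-ratio threshold are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

X = pd.DataFrame({
    "f1": [1.0, 2.0, np.nan, 4.0],
    "f2": [np.nan, np.nan, np.nan, 1.0],
    "f3": [0.5, np.nan, 1.5, 2.0],
})

# 1. Deletion: drop features whose fraction of missing values is too high.
max_missing_ratio = 0.5  # hypothetical threshold
kept = X.loc[:, X.isna().mean() <= max_missing_ratio]

# 2. Constant imputation: fill with 0, or with each feature's mean.
zero_filled = pd.DataFrame(
    SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(kept),
    columns=kept.columns, index=kept.index,
)
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(kept),
    columns=kept.columns, index=kept.index,
)

# 3. K-NN imputation: fill with the mean of the k nearest samples.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(kept)

print(kept.columns.tolist())
print(mean_filled)
```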
2. Data transformation
ID | Option | Definition |
---|---|---|
1 | PCA | Principal component transform. |
2 | Power transform | To make data more Gaussian-like, you can use either Box-Cox transform or Yeo-Johnson transform. |
3 | Feature clustering | Group similar features into a cluster. |
4 | Feature expansion | Generate new features from the sum, product, or ratio of random pairs of existing features. |
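A minimal sketch of the power transform, PCA, and a ratio-based feature expansion, written against scikit-learn and NumPy rather than PineBioML's own classes. The synthetic log-normal data is an assumption chosen so that Box-Cox, which requires strictly positive values, is applicable.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Right-skewed, strictly positive toy data: 100 samples, 5 features.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 5))

# Power transforms; PowerTransformer also standardizes the output by default.
X_yj = PowerTransformer(method="yeo-johnson").fit_transform(X)
X_bc = PowerTransformer(method="box-cox").fit_transform(X)  # positive data only

# PCA on the transformed data, keeping the first two components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_yj)

# Feature expansion: ratio of one randomly chosen pair of distinct features.
i, j = rng.choice(X.shape[1], size=2, replace=False)
ratio_feature = X[:, i] / X[:, j]

print(X_pca.shape, pca.explained_variance_ratio_)
```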
3. Feature selection
ID | Option | Definition |
---|---|---|
1 | Volcano plot | Select by group p-value and fold change. |
2 | Lasso regression | Select by linear models with an L1 penalty. |
3 | Decision stump | Select by a 1-layer decision tree. |
4 | Random Forest | Select by Gini impurity or permutation importance over a Random Forest. |
5 | AdaBoost | Select by Gini impurity over an AdaBoost model. |
6 | Gradient boosting | Select by Gini impurity over a gradient boosting model, such as XGBoost or LightGBM. |
7 | Linear SVM | Select by the support vectors of a support vector machine. |
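To make two of these criteria concrete, the sketch below applies a volcano-style filter (Welch t-test p-value plus log2 fold change) and a Random Forest impurity ranking to synthetic data. It uses SciPy and scikit-learn directly; the cutoffs (p < 0.05, |log2 fold change| > 1) and the data are assumptions, not PineBioML defaults.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 30)
X = rng.lognormal(mean=0.0, sigma=0.3, size=(60, 6))
X[y == 1, 0] *= 4.0  # feature 0 is about 4x higher in group 1

# Volcano-style selection: significant p-value AND large fold change.
_, p_values = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=False)
log2_fc = np.log2(X[y == 1].mean(axis=0) / X[y == 0].mean(axis=0))
volcano_selected = np.where((p_values < 0.05) & (np.abs(log2_fc) > 1.0))[0]

# Random Forest selection: rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
forest_ranking = np.argsort(forest.feature_importances_)[::-1]
forest_top = int(forest_ranking[0])

print("volcano:", volcano_selected.tolist())
print("forest ranking:", forest_ranking.tolist())
```

Both criteria should pick out feature 0 here, because it is the only feature that differs between the two groups by construction.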
4. Model building
ID | Option | Definition |
---|---|---|
1 | ElasticNet | Uses Optuna to find reasonably good hyperparameters for the given dataset. |
2 | SVM | Uses Optuna to find reasonably good hyperparameters for the given dataset. |
3 | Decision Tree | Uses Optuna to find reasonably good hyperparameters for the given dataset. |
4 | Random Forest | Uses Optuna to find reasonably good hyperparameters for the given dataset. |
5 | AdaBoost | Uses Optuna to find reasonably good hyperparameters for the given dataset. |
6 | XGBoost | Uses Optuna to find reasonably good hyperparameters for the given dataset. |
7 | LightGBM | Uses Optuna to find reasonably good hyperparameters for the given dataset. |
8 | CatBoost | Uses Optuna to find reasonably good hyperparameters for the given dataset. |
5. Report and visualization
ID | Option | Definition |
---|---|---|
1 | data_overview | Gives a quick overview of the input data. |
2 | classification_summary | Summarizes the results of a classification task. |
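For orientation, the kind of information such reports typically contain can be produced with pandas and scikit-learn. This sketch does not call `data_overview` or `classification_summary`, whose signatures are not shown here; the labels and predictions are made-up toy values.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# A glance at the data: summary statistics per feature.
data = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "f2": [10.0, 20.0, 30.0, 40.0]})
overview = data.describe()

# A classification summary from true and predicted labels.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
confusion = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
report = classification_report(y_true, y_pred, output_dict=True)

print(overview)
print(confusion)
print("accuracy:", report["accuracy"])
```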
Examples for Program Demonstration
Choose one of the following examples and double-click it in the Jupyter interface:
ID | Name | Description |
---|---|---|
1 | example_BasicUsage.ipynb | Demonstrate the basic features of PineBioML |
2 | example_Proteomics.ipynb | An example on proteomics data analysis |
3 | example_PipeLine.ipynb | Demonstrate how to use the pipeline to store the whole data processing flow |
4 | example_Pine.ipynb | Demonstrate how to use Pine ml to find the best data processing flow efficiently |
5 | example_UsingExistingModel.ipynb | An example of using existing models/pipelines obtained from 3., 4. or 5. |
Click the button and the script should start.
Citation
The example data is from LinkedOmicsKB:
A proteogenomics data-driven knowledge base of human cancer, Yuxing Liao, Sara R. Savage, Yongchao Dou, Zhiao Shi, Xinpei Yi, Wen Jiang, Jonathan T. Lei, Bing Zhang, Cell Systems, 2023.