Dual-extraction method for phenotypic prediction and functional gene mining of complex traits.

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

DEM

Dual-extraction method for phenotypic prediction and functional gene mining

The DEM software comprises 4 functional modules: data preprocessing, dual-extraction modeling, phenotypic prediction, and functional gene mining.

modules of biodem

Installation

System requirements

Python 3.11.
Graphics: GPU with PyTorch support.

Recommended: NVIDIA graphics card with 12GB memory or larger.

Install `biodem` package

Conda / Mamba or virtualenv is recommended for installation.

Create a conda environment:

conda create -n dem python=3.11
conda activate dem

# Install PyTorch with CUDA support
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Install biodem package:
- Simple installation from PyPI
```
pip install biodem
```

Test installation

A pipeline for testing is provided in ./test/test_biodem.py. It's also a simple usage example of biodem.

cd test
python test_biodem.py

DEM (biodem) includes 4 functional modules:

1. Data preprocessing

Nested cross-validation is recommended for data preprocessing, before these steps.

Steps:
1. Split the data into nested cross-validation folds.
2. Imputation & standardization.
3. Feature selection using the variance threshold method, PCA, and RF for multi-omics data.
4. SNP2Gene transformation.

Related functions:

Function	Command line tool
`filter_na_pheno`	`filter-pheno`
`KFoldSplitter`	-
`data_prep_ncv_regre`	`ncv-prep-r`
`data_prep_ncv_class`	`ncv-prep-c`
`impute_omics`	`dem-impute`
`select_varpca`	`dem-select-varpca`
`select_rf`	`dem-select-rf`
`build_snp_dict`	- Instruction (in Julia)
`encode_vcf2matrix`	- Instruction (in Julia)
`process_avail_snp`	-
`SNPDataModule`	-
`SNP2GeneTrain`	-
`SNP2GenePipeline`	`dem-s2g-pipe`
`snp_to_gene`	-
`train_snp2gene`	-

2. Dual-extraction modeling

It takes preprocessed multi-omics data and phenotypic data as inputs for nested cross-validation, based on which the DEM model is constructed. It is capable of performing both classification and regression tasks.
Related function:

Function Command line tool

DEMLTNDataModule -

DEMTrain -

DEMTrainPipeline dem-train-pipe

Function	Command line tool
`DEMLTNDataModule`	-
`DEMTrain`	-
`DEMTrainPipeline`	`dem-train-pipe`

3. Phenotypic prediction

It takes a pretrained model and omics data of new samples as inputs and returns the predicted phenotypes.
Related function:

Function Command line tool

DEMPredict -

predict_pheno dem-predict

Function	Command line tool
`DEMPredict`	-
`predict_pheno`	`dem-predict`

4. Functional gene mining

It performs functional gene mining based on the trained DEM model through feature permutation.
Related function:

Function Command line tool

DEMFeatureRanking -

rank_feat dem-rank

Function	Command line tool
`DEMFeatureRanking`	-
`rank_feat`	`dem-rank`

How to use `biodem`

We provide both command line tools and importable functions for biodem.

DEM is mainly implemented in Python. We also provide two Julia scripts for VCF&GFF file processing and SNP encoding.

Input and output file formats

Please refer to the directory ./data for the file formats that will be used in following methods.

Import `biodem` in your Python project

The main functions' purposes and parameters are described in related sections of Use command line tools. We provide a simple example analysis pipeline in ./test/test_biodem.py to demonstrate how to use biodem in Python.

# An example:

# Use a pretrained DEM model to predict phenotypes of given omics data files
from biodem import predict_pheno

# Define the paths to your model and omics data files
model_path = 'dem_model.ckpt'
omics_file_paths = ['omics_type_1.csv', 'omics_type_2.csv']
output_dir = 'dir_predictions'

# Run the DEM model to predict phenotypes
predict_pheno(
    model_path = model_path,
    omics_paths = omics_file_paths,
    result_dir = output_dir,
    batch_size = 16,
    map_location = None,
)

Use command line tools

In terminal, run the command like dem-impute --help or dem-impute -h for help.

`ncv-prep-r`

The processing steps are the same as "1. Data preprocessing".

# Usage:
ncv-prep-r -K 5 -k 5 -i <input.csv> -o <output_dir> -t 0.01 -n 1000 -x <trait> --raw-label <labels.csv> --n-trees 2000 --na 0.1

Parameters	Type	Required	Description
`-K` `--loop-outer`	int	*	Number of outer loops for nested cross-validation.
`-k` `--loop-inner`	int	*	Number of inner loops for nested cross-validation.
`-i` `--input`	str	*	Path to the input omics/phenotypes data.
`-o` `--dir-out`	str	*	Path to the output directory.
`-t` `--threshold-var`	float	*	Threshold for variance selection.
`-n` `--n-rf-selected`	int	*	Number of selected features by random forest.
`-x` `--trait`	str	*	Name of the phenotype column in the input data.
`--raw-label`	str	*	Path to the raw labels data.
`--n-trees`	int	*	Number of trees in random forest.
`--na`	float	*	Threshold for missing value.

`ncv-prep-c`

The processing steps are the same as "1. Data preprocessing"

# Usage:
ncv-prep-c -K 5 -k 5 -i <input.csv> -o <output_dir> -t 0.01 -r 0.9 -x <trait> --raw-label <labels.csv> --na 0.1

Parameters	Type	Required	Description
`-K` `--loop-outer`	int	*	Number of outer loops for nested cross-validation.
`-k` `--loop-inner`	int	*	Number of inner loops for nested cross-validation.
`-i` `--input`	str	*	Path to the input omics/phenotypes data.
`-o` `--dir-out`	str	*	Path to the output directory.
`-t` `--threshold-var`	float	*	Threshold for variance selection.
`-r` `--target-var-ratio`	float	*	Target variance ratio for PCA.
`-x` `--trait`	str	*	Name of the phenotype column in the input data.
`--raw-label`	str	*	Path to the raw labels data.
`--na`	float	*	Threshold for missing value.

`dem-impute`

The tool contains 3 steps:

Delete omics features with missing values exceeding 25%.
Impute missing values with mean value of each feature.
Min-max scaling.

# Usage:
dem-impute -I <an_omics_file_in> -O <an_omics_file_out> -i <a_phenotypes_file_in> -o <phenotypes_out> -p <NA_threshold> -m <is_minmax_omics> -z <is_zscore_phenotypes>

Parameters	Type	Required	Descriptions
`-I` `--inom`	string	optional	Input a path to an omics file
`-O` `--outom`	string	optional	Define your output omics file path
`-i` `--inph`	string	optional	Input a path to a trait's phenotypes
`-o` `--outph`	string	optional	Define your output phenotypes path
`-p` `--propna`	float	optional	The allowed max proportion of missing values in a feature (DEFAULT: 0.25)
`-m` `--minmax`	int, 0 or 1	optional	Whether min-max scaling for omics is required (0 denotes False, 1 denotes True, DEFAULT: 1)
`-z` `--zscore`	int, 0 or 1	optional	Whether z-score transformation for phenotypes is required (0 denotes False, 1 denotes True)

`dem-select-varpca`

To analyze the significance of each feature across different phenotype classes and save important features, the following steps were taken:

Remove features with variance less than the given threshold.
The ANOVA F value for each feature was calculated.
Features with the lowest F values were sequentially removed.
This process continued until the first principal component of the remaining features explained less than the given percent of the variance.

# Usage:
dem-select-varpca -I <an_omics_file_in> -i <a_phenotypes_file_in> -O <omics_file_output> -V <var_threshold> -P <target_variance_ratio>

Parameters	Type	Required	Descriptions
`-I` `--inom`	string	*	Input a path to an omics file
`-i` `--inph`	string	*	Input a path to a trait's phenotypes
`-O` `--outom`	string	*	Define your output omics file path
`-V` `--minvar`	float		The allowed minimum variance of a feature (DEFAULT: 0.0)
`-P` `--varpc`	float		Target variance of PC1 (DEFAULT: 0.5)

`dem-select-rf`

The random forest (RF) algorithm, based on an ensemble of the given number of decision trees, was employed to screen out representative omics features for subsequent DEM construction.

# Usage:
dem-select-rf -I <an_omics_file_in> -i <a_phenotypes_file_in> -O <omics_file_output> -n <num_features_to_save> -p <validation_data_proportion> -N <num_trees> -S <random_seed_rf> -s <random_seed_sp>

Parameters	Type	Required	Descriptions
`-I` `--inom`	string	*	Input a path to an omics file
`-i` `--inph`	string	*	Input a path to a trait's phenotypes
`-O` `--outom`	string	*	Define your output omics file path
`-n` `--nfeat`	int	*	Number of features to save
`-N` `--ntree`	int		Number of trees in the random forest (DEFAULT: 2500) (larger number of trees will result in more accurate results, but will take longer to run)
`-S` `--seedrf`	list[int]		Random seeds for RF (DEFAULT: 1000, 1001, ..., 1009)

`dem-s2g-pipe`

The pipeline for SNP2Gene modeling and transformation based on nested cross-validation.

# Usaage example:
dem-s2g-pipe -t <trait_name> -l <n_label_class> --inner-lbl <dir_label_inner> --outer-lbl <dir_label_outer> --h5 <path_h5_processed> --json <path_json_genes_snps> --log-dir <log_dir> --o-s2g-dir <dir_to_save_converted>

Parameters	Type	Required	Descriptions
`-t` `--trait`	string	*	Trait name
`-l` `--lbl-class`	int	*	Number of classes for trait
`--inner-lbl`	string	*	Directory to inner label files
`--outer-lbl`	string	*	Directory to outer label files
`--h5`	string	*	Path to h5 file containing genotype data
`--json`	string	*	Path to json file containing SNP-gene mapping information
`--log-dir`	string	*	Directory to save log files and models
`--o-s2g-dir`	string	*	Directory to save SNP2Gene results

`dem-train-pipe`

The pipeline for DEM modeling and prediction on nested cross-validation datasets with hyperparameter optimization. DEM aims to achieve high phenotypic prediction accuracy and identify the most informative omics features for the given trait. This tool is used to construct a dual-extraction model based on the given omics data and phenotype, through cross validation or random sampling.

# Usage example:
dem-train-pipe -o <log_dir> -t <trait_name> -l <n_label_class> --inner-lbl <dir_label_inner> --outer-lbl <dir_label_outer> --inner-om <dir_omics_inner> --outer-om <dir_omics_outer>

Parameters	Type	Required	Descriptions
`-o` `--log-dir`	string	*	Directory to save log files and models
`-t` `--trait`	string	*	Trait name
`-l` `--lbl-class`	integer	*	Number of classes for trait
`--inner-lbl`	string	*	Directory to inner label files
`--outer-lbl`	string	*	Directory to outer label files
`--inner-om`	string	*	Directory to inner omics files
`--outer-om`	string	*	Directory to outer omics files

`dem-predict`

Predicts the phenotype of given omics data files using a trained DEM model.

# Usage example:
dem-predict -m <model_file_path> -o <phenotype_output> -I <omics_files>

Parameters	Type	Required	Descriptions
`-I` `--inom`	list[string]	*	Input path(s) to omics file(s)
`-m` `--inmd`	string	*	The path to a pretrained DEM model
`-o` `--outdir`	string	*	The directory to save predicted phenotypes

`dem-rank`

Assess the importance of features by contrasting the DEM model’s test performance on the actual feature values against those where the feature values have been randomly shuffled. Features that are ranked highly are then identified.

# Usage example:
dem-rank -I <omics_file_path> -i <pheno_file_path> -t <trait_name> -l <n_label_class> -m <model_path> -o <output_dir> -b <batch_size> -s <seed1> <seed2>...

Parameters	Type	Required	Descriptions
`-I` `--inom`	string	*	Input path(s) to omics file(s)
`-i` `--inph`	string	*	Input a path to a trait's phenotypes
`-t` `--trait`	string	*	The name of the trait to be predicted
`-l` `--lbl-class`	int	*	Number of classes for the trait (1 for regression)
`-m` `--inmd`	string	*	The path to a pretrained DEM model
`-o` `--outdir`	string	*	The directory to save ranked features
`-b` `--batch-size`	int		Batch size for feature ranking (default: 16)
`-s` `--seeds`	integer		Random seeds for ranking repeats (default: 0-9)

Transform SNPs to Genomic Embedding

Build a dict of SNPs and their corresponding genomic regions
- It accepts a VCF file and a GFF file (reference genome) as input, and returns a dictionary of SNPs and their corresponding genomic regions.
One-hot encode SNPs
- It accepts a dictionary of SNPs and their corresponding genomic regions, and returns one-hot encoded SNPs based on the actual di-nucleotide composition of each SNP.
Transform encoded SNPs to genomic embedding
- The one-hot encoded SNPs are transformed into dense and continuous features that represent genomic variation (each feature corresponds to a gene).

Requirements

Please install biodem first.
Install Julia 1.10 .

Install the following Julia packages:

# Open Julia REPL
julia

# In Julia REPL
using Pkg
Pkg.add("JLD2")
Pkg.add("HDF5")
Pkg.add("JSON")
Pkg.add("GFF3")
Pkg.add("GeneticVariation")
Pkg.add("CSV")
Pkg.add("DataFrames")

Input and output file formats

Please refer to the example files in ./data/ for the file formats that will be used in following functions.

Example usage

Build a dictionary of SNPs and their corresponding genomic regions from a VCF file and a GFF file. The one-hot encoded SNPs are saved to a H5 file.

# In shell, enter Julia REPL
julia

# In Julia REPL

# Import functions from git repository
include("./src/biodem/utils_vcf_gff.jl")
include("./src/biodem/utils_encode_snp_vcf.jl")

# Build a dict of SNPs and their corresponding genomic regions
path_gff = "example.gff"
path_vcf = "example.vcf"
build_snp_dict(path_vcf, path_gff, true, true)

# One-hot encode SNPs
path_snp_dict = "snp_dict.jld2"# The output of build_snp_dict function
encode_vcf2matrix(path_snp_dict, true)

Pick up available SNPs and their corresponding genomic regions from the H5 file produced by encode_vcf2matrix function.

# In Python

# Import function for processing available SNPs
from biodem import process_avail_snp

path_json_snp_gene_relation = "gene4snp.json"
path_h5_snp_matrix = "snp_and_gene4snp.h5"

process_avail_snp(path_h5_snp_matrix, path_json_snp_gene_relation)
# The output will be a new H5 file containing only available SNPs and their corresponding genomic regions.

Train SNP2Gene models Please refer to the dem-s2g-pipe and snp_to_gene function for more details.

Directory structure

.
├── src                        # Python and Julia implementation for DEM
├── data                       # Data files and example files
├── test                       # Easy tests for packages and pipelines
├── pyproject.toml             # Python package metadata
├── LICENSE                    # License file
└── README.md                  # This file

./src/biodem
├── __init__.py                # Initialize the Python package
├── cli_dem.py                 # Command line interface for DEM
├── utils.py                   # Utilities for modeling and prediction
├── module_data_prep.py        # Data preprocessing
├── module_data_prep_ncv.py    # Data preprocessing for nested cross-validation
├── module_snp.py              # SNP2Gene modeling pipeline for SNP preprocessing
├── module_dem.py              # DEM modeling pipeline, Phenotypic prediction, and Functional gene mining
├── model_dem.py               # DEM model definition
├── model_snp2gene.py          # SNP2Gene model definition
├── utils_vcf_gff.jl           # Utilities for VCF and GFF file processing
└── utils_encode_snp_vcf.jl    # Utilities for one-hot encoding SNPs

Asking for help

If you have any questions please:

Contact us via GitHub.
Email us.

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.6.0

May 6, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

biodem-0.6.0.tar.gz (90.9 MB view hashes)

Uploaded May 6, 2024 Source

Built Distribution

biodem-0.6.0-py3-none-any.whl (41.3 kB view hashes)

Uploaded May 6, 2024 Python 3

Hashes for biodem-0.6.0.tar.gz

Hashes for biodem-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`a1993651d87f08e8f05ff33b3bad66d3e5e02f956f7e83941423f20fc821031d`
MD5	`3934c1c2042367d43d9d6b0c7e64f893`
BLAKE2b-256	`bbd9049507a2c3c02ba5ab29b264248ebd51a76dd1eae56f065da9181bb0d360`

Hashes for biodem-0.6.0-py3-none-any.whl

Hashes for biodem-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a9301034504f208f5fb309cf6779c1bdf3e1e45dcde16792214f080e5e508a5a`
MD5	`c3b5aa969d19043d2b4a389d29938fda`
BLAKE2b-256	`f72a6b65336de9c357a12e23b99f6b73ff697ec5b82292f954d011905676e1cd`

biodem 0.6.0

Navigation

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Project description

DEM

Dual-extraction method for phenotypic prediction and functional gene mining

Installation

System requirements

Install biodem package

Test installation

DEM (biodem) includes 4 functional modules:

1. Data preprocessing

2. Dual-extraction modeling

3. Phenotypic prediction

4. Functional gene mining

How to use biodem

Input and output file formats

Import biodem in your Python project

Use command line tools

ncv-prep-r

ncv-prep-c

dem-impute

dem-select-varpca

dem-select-rf

dem-s2g-pipe

dem-train-pipe

dem-predict

dem-rank

Transform SNPs to Genomic Embedding

Requirements

Input and output file formats

Example usage

Directory structure

Asking for help

Project details

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

Install `biodem` package

How to use `biodem`

Import `biodem` in your Python project

`ncv-prep-r`

`ncv-prep-c`

`dem-impute`

`dem-select-varpca`

`dem-select-rf`

`dem-s2g-pipe`

`dem-train-pipe`

`dem-predict`

`dem-rank`