EmbedPVP: Integration of Genomics and Phenotypes for Variant Prioritization using Deep Learning
EmbedPVP: Embedding-based Phenotype Variant Predictor
Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning.
Annotation data sources (integrated in the candidate SNP prediction workflow)
We integrate annotations from the following ontologies:
- Gene Ontology (GO)
- Mammalian Phenotype Ontology (MP)
- Human Phenotype Ontology (HPO)
- Uber-anatomy Ontology (UBERON)
Dependencies
- The code was developed and tested using Python 3.9.
- We used the mOWL library to process the input dataset and to generate the embedding representations using different embedding-based methods.
- You need to have Java and a JDK installed on your machine.
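A quick way to check these prerequisites before installing is sketched below. This is only an illustration and not part of EmbedPVP: the function name `check_prerequisites` is hypothetical, and treating Python 3.9 as a minimum version is an assumption (the source only says the code was tested with 3.9). It looks for `java` and `javac` on the `PATH`, which is a common but not universal sign that a JDK is installed.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9)):
    """Report whether the interpreter and the Java tools look usable.

    min_python is an assumption: the README states the code was tested
    with Python 3.9, not that 3.9 is a hard lower bound.
    """
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        # `java` is the runtime launcher, `javac` ships with a JDK.
        "java_found": shutil.which("java") is not None,
        "javac_found": shutil.which("javac") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```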
Get the data
- Download all the files from data and place the uncompressed files in the folder named /data.
- Download the required database using CADD and follow the instructions to generate the TSV file with CADD scores for the input VCF file.
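To sanity-check the generated CADD file before running the tool, a small reader such as the following may help. It is a sketch under assumptions, not EmbedPVP code: it assumes the usual CADD output layout (comment lines starting with `##`, then a `#Chrom Pos Ref Alt RawScore PHRED` header), and the function name `read_cadd_scores` is hypothetical. Whether EmbedPVP itself uses the PHRED or the raw score is not stated in this document.

```python
import csv
import gzip


def read_cadd_scores(path):
    """Map (chrom, pos, ref, alt) to the PHRED score from a CADD TSV file.

    Assumes the usual CADD layout: '##' comment lines followed by a
    '#Chrom ...' header line that names the columns.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    scores = {}
    with opener(path, "rt", newline="") as handle:
        header = None
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("##"):
                continue  # skip blank lines and version comments
            if row[0].startswith("#"):
                header = [name.lstrip("#") for name in row]
                continue
            if header is None:
                raise ValueError("no '#Chrom ...' header line found")
            record = dict(zip(header, row))
            key = (record["Chrom"], int(record["Pos"]), record["Ref"], record["Alt"])
            scores[key] = float(record["PHRED"])
    return scores
```

Calling `read_cadd_scores("example_cadd.tsv.gz")` on the example file should then give one entry per scored variant, provided the file follows the layout assumed above.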
Use the tool
You can install the tool either from source or from PyPI as follows:
Install from source
git clone https://github.com/bio-ontology-research-group/EmbedPVP.git
cd EmbedPVP/
pip install -r requirements.txt
mkdir output
cd embedpvp
python main.py [args]
- Run the command python main.py --help to display help and parameters:
Usage: main.py [OPTIONS]
Options:
-d, --data-root TEXT Data root folder [required]
-i, --in_file TEXT Annotated Input VCF file [required]
-p, --pathogenicity TEXT Path to the pathogenicity prediction file (CADD) [required]
-hpo, --hpo TEXT List of phenotype codes separated by commas [required]
-m, --model_type TEXT Ontology model, one of the following (go , mp , hp, uberon, union)
-e, --embedding TEXT Preferred embedding model (e.g. TransD, TransE, TranR, ConvE ,DistMult, DL2vec, OWL2vc, EL, ELBox)
-dir, --outdir TEXT Path to the output directory
-o, --outfile TEXT Path to the results output file
--help Show this message and exit.
- Run the example:
python main.py -d ../data/ -i example_annotation.vcf.hg38_multianno.txt -p example_cadd.tsv.gz -hpo HP:0004791,HP:0002020,HP:0100580,HP:0001428,HP:0011459 -m hp -e TransE -dir ../output/ -o example_output1.tsv
Annotate VCF file (example.vcf) with the phenotypes (HP:0003701,HP:0001324,HP:0010628,HP:0003388,HP:0000774,HP:0002093,HP:0000508,HP:0000218,HP:0000007)...
|======== | 25% Annotated files generated successfully.
|================ | 50% Phenotype prediction...
|======================== | 75% Variants prediction...
|================================| 100%
The analysis is Done. You can find the priortize list in the output file: ../output/example_output.txt
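When running the tool for several embedding models, it can be convenient to validate the phenotype codes and assemble the command line programmatically. The sketch below is illustrative only: `parse_hpo_codes` and `build_command` are hypothetical helpers, the `HP:` plus seven digits pattern is the standard HPO identifier format, and the per-model output file names are made up for the example. Only options listed in the help text above are used.

```python
import re

# HPO identifiers have the form HP: followed by seven digits.
HPO_ID = re.compile(r"HP:\d{7}")


def parse_hpo_codes(text):
    """Split a comma-separated string and reject malformed HPO identifiers."""
    codes = [code.strip() for code in text.split(",") if code.strip()]
    bad = [code for code in codes if not HPO_ID.fullmatch(code)]
    if bad:
        raise ValueError(f"malformed HPO identifiers: {bad}")
    return codes


def build_command(data_root, in_file, cadd_file, hpo_codes, model, embedding,
                  outdir, outfile):
    """Assemble the argument list for main.py from the documented options."""
    return [
        "python", "main.py",
        "-d", data_root,
        "-i", in_file,
        "-p", cadd_file,
        "-hpo", ",".join(hpo_codes),
        "-m", model,
        "-e", embedding,
        "-dir", outdir,
        "-o", outfile,
    ]


if __name__ == "__main__":
    codes = parse_hpo_codes("HP:0004791,HP:0002020,HP:0100580,HP:0001428,HP:0011459")
    for embedding in ("TransE", "DistMult"):
        # Output file names here are hypothetical, one per embedding model.
        cmd = build_command("../data/", "example_annotation.vcf.hg38_multianno.txt",
                            "example_cadd.tsv.gz", codes, "hp", embedding,
                            "../output/", f"example_{embedding}.tsv")
        print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` from inside the embedpvp directory. Building it as a list rather than a single string avoids shell quoting problems.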
Install from PyPI
Output:
The script outputs a ranked list of candidate causative variants, each with a prediction score.
Reference
For further details, or if you use EmbedPVP in your work, please refer to this article:
@article{althagafi2023prioritizing,
title={Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning},
author={Althagafi, Azza and Zhapa-Camacho, Fernando and Hoehndorf, Robert},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
Note
For any questions or comments please contact azza.althagafi@kaust.edu.sa