Welcome to Open Biopipeline, an open source gene exploration tool!
Project description
This project was inspired by my introductory biomedical engineering lab course. It automates and streamlines the analysis of unknown sequence data. See the Python notebook below for a working version.
Demo
Please note that this is not the full pipeline, only the blastn search portion.
Packages
- Biopython
- NCBI BLAST 2.10.0+
- KEGG
- UniProt
- Protein Atlas
Installation
Windows and MacOS
Go to the NCBI website here and download the installer for your platform. On Windows, install it as you would any .exe program; on macOS, run the downloaded installer.
Linux dependencies
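# check the hardware platform so you can pick the matching build
uname -i
sudo apt-get install lftp
# list the files in the latest BLAST+ release
lftp -e "cd blast/executables/LATEST; dir; quit" ftp.ncbi.nlm.nih.gov | awk '{print $NF}'
# if the latest release is newer than 2.10.0, use the listed version in the file names below
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.10.0+-x64-linux.tar.gz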
tar -xzvf ncbi-blast-2.10.0+-x64-linux.tar.gz
# remove tar file
rm ncbi-blast-2.10.0+-x64-linux.tar.gz
cd ncbi-blast-2.10.0+
# so you can run the bin commands without specifying directory
export PATH=$PATH:$PWD/bin
# or
export PATH=$PATH:$HOME/content/ncbi-blast-2.10.0+/bin
BLAST commands should now work. The following command should print version information rather than an error such as "Unknown command".
blastn -version
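The same check can be done from Python before the pipeline starts. This is a minimal sketch using only the standard library; the helper name `blast_version` is made up for illustration and is not part of this project.

```python
import shutil
import subprocess


def blast_version(executable="blastn"):
    """Return the first line of `<executable> -version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = blast_version()
    print(version if version is not None else "blastn not found on PATH")
```

If this prints "blastn not found on PATH", the `export PATH` step above most likely did not take effect in the current shell.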
Python Setup
pip install virtualenv
virtualenv bio_pipeline
# Windows
bio_pipeline\Scripts\activate
# Linux and macOS
source bio_pipeline/bin/activate
pip install -r requirements.txt
Results
Starting with an unknown nucleotide sequence, the program feeds it into the BLAST algorithm, which aligns it pairwise against known sequences to identify the matching gene. Refer to gene_items.md for the gene name and accession number of the example genes given.
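As a rough illustration of the BLAST step, the sketch below runs a local `blastn` search and parses its tabular output (`-outfmt 6`). It is not the code used in the notebook. The function names are made up for this example, and the query file and database you pass in are assumptions you would replace with your own.

```python
import csv
import io
import subprocess

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_outfmt6(text):
    """Turn tabular BLAST output into a list of dicts, one per hit."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(OUTFMT6_FIELDS, row))
        for field in ("pident", "evalue", "bitscore"):
            hit[field] = float(hit[field])
        hits.append(hit)
    return hits


def run_blastn(query_fasta, database, max_hits=5):
    """Run a local blastn search (hypothetical inputs) and return parsed hits."""
    cmd = [
        "blastn", "-query", query_fasta, "-db", database,
        "-outfmt", "6", "-max_target_seqs", str(max_hits),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_outfmt6(result.stdout)


def best_hit(hits):
    """Pick the hit with the highest bit score, or None when there are no hits."""
    return max(hits, key=lambda h: h["bitscore"], default=None)
```

The `sseqid` of the best hit is the subject identifier, which can then be matched against the accession numbers in gene_items.md.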
The next step is to understand the gene/protein function using the UniProt database. Given a gene name, we can look up both the specific amino acid sequence and the protein function. Note that we don't use Gene Ontology (GO) directly in our analysis, since UniProt already provides part of the functional annotation and GO information can be retrieved through UniProt links.
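One way to look up a gene name in UniProt is through its REST search endpoint. The sketch below is an assumption about how such a lookup could be written, not the project's own code. The query fields (`gene_exact`, `organism_id`, `reviewed`) and return fields (`cc_function`, `go`, `sequence`) should be checked against the current UniProt documentation, and the default organism ID 9606 (human) is chosen only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_query_url(gene, organism_id=9606, reviewed=True):
    """Build a search URL for one gene name (field names are assumptions)."""
    query = f"gene_exact:{gene} AND organism_id:{organism_id}"
    if reviewed:
        query += " AND reviewed:true"
    params = {
        "query": query,
        "format": "tsv",
        "fields": "accession,gene_names,cc_function,go,sequence",
        "size": 1,
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


def fetch_uniprot(gene):
    """Return the first matching entry as a dict, or None if nothing matched."""
    with urlopen(uniprot_query_url(gene), timeout=30) as response:
        lines = response.read().decode("utf-8").splitlines()
    if len(lines) < 2:
        return None  # header only, no entry found
    return dict(zip(lines[0].split("\t"), lines[1].split("\t")))
```

Requesting the `go` column alongside the function comment is what makes a separate GO lookup unnecessary in this sketch.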
Next, we perform pathway analysis on the targeted gene to identify the signaling cascade it takes part in and the resulting cellular response. This helps us understand the upstream and downstream molecules, which in turn can aid therapeutic drug research. Refer to Figure 1 for the pathway interactions of the vascular endothelial growth factor A (VEGFA) gene.
Figure 1. Cell signaling cascade of the VEGFA protein (Note the specific link to angiogenesis, making it a target for cancer research)
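The pathways linked to a gene can be listed through the KEGG REST service. The following is a minimal sketch under a few assumptions: it uses the `link` operation of `rest.kegg.jp`, it expects a KEGG gene identifier such as `hsa:7422` (assumed here to be the entry for human VEGFA), and the function names are invented for this example.

```python
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"


def parse_kegg_links(text):
    """Map each KEGG gene ID to the list of pathway IDs linked to it."""
    pathways = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        gene_id, pathway_id = line.split("\t")
        # Entries look like "path:hsa04370"; keep only the pathway ID.
        pathways.setdefault(gene_id, []).append(pathway_id.replace("path:", "", 1))
    return pathways


def pathways_for_gene(kegg_gene_id):
    """Fetch the pathways linked to one KEGG gene ID, e.g. 'hsa:7422'."""
    url = f"{KEGG_REST}/link/pathway/{kegg_gene_id}"
    with urlopen(url, timeout=30) as response:
        return parse_kegg_links(response.read().decode("utf-8"))
```

Each returned pathway ID can then be opened on the KEGG website to view a map like the one in Figure 1.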
Lastly, we pass the gene name to Protein Atlas, which gives us relevant information on protein expression in patients. This is valuable because one can then look at RNA cancer specificity or at the description of the protein in a clinical context. Below is an example of the Protein Atlas output:
0 Protein Name: VEGFA
Tissue expression summary: Most cancers showed strong cytoplasmic immunoreactivity. Lymphomas were in general moderately stained.
Description: Antibody staining mainly consistent with RNA expression data. At least one protein variant secreted, tissue location of RNA and protein might differ and correlation is complex.
RNA Cancer Specificity: Low cancer specificity
...
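If later steps need individual fields from a report like the one above, the "Key: value" lines can be read into a dictionary. This is a small sketch that assumes the layout shown in the example output; `parse_report` is a hypothetical helper, not part of the package.

```python
def parse_report(text):
    """Parse 'Key: value' lines into a dict; lines without a colon are skipped."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # e.g. the "..." continuation marker
        # Drop a leading row index such as the "0" in "0 Protein Name".
        key = key.strip().lstrip("0123456789").strip()
        record[key] = value.strip()
    return record


if __name__ == "__main__":
    example = "0 Protein Name: VEGFA\nRNA Cancer Specificity: Low cancer specificity\n..."
    print(parse_report(example))
```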
Pipeline flow
Figure 2. Flow chart of the bioinformatics pipeline, showing the flow of inputs and outputs
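The flow described in the Results section (sequence in, gene name out, then one lookup per database) can be sketched as a small driver function. This only illustrates the data flow: `run_pipeline` and the stand-in callables are hypothetical, and the real steps would wrap BLAST, UniProt, KEGG and Protein Atlas.

```python
def run_pipeline(sequence, identify_gene, annotators):
    """Identify the gene for a sequence, then pass the gene name to each annotator."""
    gene = identify_gene(sequence)
    report = {"gene": gene}
    for name, annotate in annotators.items():
        report[name] = annotate(gene)
    return report


if __name__ == "__main__":
    # Placeholder sequence and stand-in steps, for illustration only.
    demo = run_pipeline(
        "ACGTACGT",
        identify_gene=lambda seq: "VEGFA",
        annotators={
            "uniprot": lambda gene: f"function lookup for {gene}",
            "kegg": lambda gene: f"pathway lookup for {gene}",
            "protein_atlas": lambda gene: f"expression lookup for {gene}",
        },
    )
    print(demo)
```

Passing the steps in as callables keeps each database lookup independent, so one can be swapped or skipped without touching the others.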
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
License