API-enabled Gene Annotation
annoPipeline - an API-enabled gene annotation pipeline
Generates a pandas DataFrame with gene symbol, gene name, Entrez ID, and bibliographic info for up to 5 PubMed publications per gene for which a functional reference was provided (more about functional references at GeneRIF).
Designed to be useful for tasks such as:
- identifying relevant publications for a given function
- analyzing publication trends for genes belonging to a common pathway
- identifying influential PIs for a given gene network.
Written for use with Python 3.7; not tested on other versions. Requires:
- numpy >= 1.16.2
- pandas >= 0.24.2
- Biopython >= 1.73
- openpyxl >= 2.6.1
- requests >= 2.21.0
pip install annoPipeline
Or clone the repo from GitHub. Then, in the annoPipeline directory, run:
python setup.py install
Required dependencies will be installed if missing, which may take a few seconds.
Execute the full annotation pipeline on a list of gene symbols like this:
    import annoPipeline as ap

    # define a list of genes you would like annotated
    geneList = ['CDK2', 'FGFR1', 'SLC6A4']

    # annoPipeline will execute the full annotation pipeline (see individual functions below)
    df = ap.annoPipeline(geneList)  # returns pandas df with annotations for gene and bibliographic info

- By default, ap.annoPipeline saves the annotation output to an Excel file named after the geneList symbols joined by '_'.
If querying a single gene, still pass as a list. For example:
    import annoPipeline as ap

    # single-gene queries must still be wrapped in a list; will be fixed in a later version
    df = ap.annoPipeline(['CDK2'])
- From the MyGeneInfo API, use the "Gene query service" GET method to return details on a given list of human gene symbols.
- From the returned JSON, parse out the "name", "symbol" and "entrezgene" values and print them to the screen.

    import annoPipeline as ap

    geneList = ['CDK2', 'FGFR1', 'SLC6A4']
    # returns list of dicts where keys are default mygene fields (symbol, name, taxid, entrezgene, ensemblgene)
    l1 = ap.queryGenes(geneList)
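The query step above can be sketched without the package. This is a minimal illustration, not annoPipeline's own implementation: it assumes the public MyGene.info v3 `query` endpoint, uses the standard library instead of `requests` to stay dependency-free, and the helper names (`build_query_url`, `parse_hits`, `query_gene`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public MyGene.info v3 "Gene query service" endpoint.
MYGENE_QUERY_URL = "https://mygene.info/v3/query"
WANTED_FIELDS = ("name", "symbol", "entrezgene")


def build_query_url(symbol, fields=WANTED_FIELDS):
    """Build a GET URL asking for one human gene symbol (hypothetical helper)."""
    params = {
        "q": f"symbol:{symbol}",
        "species": "human",
        "fields": ",".join(fields),
    }
    return f"{MYGENE_QUERY_URL}?{urllib.parse.urlencode(params)}"


def parse_hits(payload):
    """Keep only name, symbol and entrezgene from each hit in the returned JSON."""
    return [
        {key: hit.get(key) for key in WANTED_FIELDS}
        for hit in payload.get("hits", [])
    ]


def query_gene(symbol):
    """Send the request (needs network access) and parse the response."""
    with urllib.request.urlopen(build_query_url(symbol), timeout=30) as resp:
        return parse_hits(json.load(resp))
```

Looping `query_gene` over `geneList` and printing each result would reproduce the "print to screen" step; `parse_hits` can also be exercised offline on a saved response.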
- Using the appropriate identifier from the above result, send a query to the MyGeneInfo "Gene annotation services" method for each gene.
- From the resulting JSON, collate up to 5 generif descriptions per gene.
- Write the results to an Excel spreadsheet with columns: gene_symbol, gene_name, entrez_id, generifs.

    import annoPipeline as ap

    geneList = ['CDK2', 'FGFR1', 'SLC6A4']
    l1 = ap.queryGenes(geneList)
    l2 = ap.getAnno(l1, saveExcel=True)  # saveExcel defaults to False
- Returns a pandas df with genes and up to 5 generifs from mygene.info.
- saveExcel defaults to False; pass saveExcel=True to save the output to Excel.
- If True, the Excel file is named after the geneList symbols joined by '_'.
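The collation step can be sketched as follows. This is an illustration under stated assumptions, not the package's code: it assumes the MyGene.info v3 `gene/<id>` endpoint returns a `generif` field whose entries carry `text` and `pubmed` keys, the helper names are hypothetical, and the extra `pubmed_id` column is added here only because the next step needs it.

```python
# Assumption: public MyGene.info v3 "Gene annotation service" endpoint.
MYGENE_GENE_URL = "https://mygene.info/v3/gene"
MAX_GENERIFS = 5  # the pipeline keeps up to 5 generifs per gene


def build_annotation_url(entrez_id):
    """URL requesting only the generif field for one Entrez ID (hypothetical helper)."""
    return f"{MYGENE_GENE_URL}/{entrez_id}?fields=generif"


def collate_generifs(gene, annotation, limit=MAX_GENERIFS):
    """Turn one gene record plus its annotation JSON into spreadsheet rows."""
    generifs = annotation.get("generif", [])
    if isinstance(generifs, dict):
        # Defensive assumption: a single entry may arrive as a bare object.
        generifs = [generifs]
    return [
        {
            "gene_symbol": gene["symbol"],
            "gene_name": gene["name"],
            "entrez_id": gene["entrezgene"],
            "generifs": rif.get("text"),
            "pubmed_id": rif.get("pubmed"),  # extra column, not in the original list
        }
        for rif in generifs[:limit]
    ]


def save_rows(rows, path):
    """Write rows to an Excel file; needs pandas and openpyxl installed."""
    import pandas as pd

    pd.DataFrame(rows).to_excel(path, index=False)
```

Each row carries one generif, so a gene with five generifs occupies five spreadsheet rows; whether annoPipeline lays its output out this way is not stated in this README.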
- Use the PubMed IDs associated with the above generif content to extract additional bibliographic information.

    import annoPipeline as ap

    geneList = ['CDK2', 'FGFR1', 'SLC6A4']
    l1 = ap.queryGenes(geneList)
    l2 = ap.getAnno(l1)
    # returns df with genes, up to 5 generifs from mygene.info, and the added bibliographic info
    l3 = ap.addBibs(l2)
- Currently returns the following bibliographic information when available:
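
As a separate illustration of the PubMed lookup step (not the package's own field list or implementation), the sketch below builds an NCBI E-utilities `esummary` request and pulls a few fields from its JSON reply. The package lists Biopython as a dependency and may use it for this step; this sketch swaps in a plain URL call instead. The helper names and the choice of fields are assumptions.

```python
import urllib.parse

# NCBI E-utilities ESummary endpoint.
ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_esummary_url(pubmed_ids):
    """URL requesting JSON summaries for a batch of PubMed IDs (hypothetical helper)."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(pmid) for pmid in pubmed_ids),
        "retmode": "json",
    }
    return f"{ESUMMARY_URL}?{urllib.parse.urlencode(params)}"


def parse_esummary(payload):
    """Map each PubMed ID to a small dict of bibliographic fields.

    The fields kept here are an illustrative choice; missing ones become None.
    """
    result = payload.get("result", {})
    records = {}
    for uid in result.get("uids", []):
        doc = result.get(uid, {})
        records[uid] = {
            "title": doc.get("title"),
            "journal": doc.get("fulljournalname"),
            "pubdate": doc.get("pubdate"),
            "last_author": doc.get("lastauthor"),
        }
    return records
```

The parsed records could then be joined back onto the generif rows by PubMed ID, for example to group publications by last author when looking for PIs active in a gene network.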