A python package to quickly get and visualise the gene order/synteny around a target gene from one or more species.

EZgenesynteny

Query GenBank to obtain gene orders from genomic records and simply visualise synteny between different species.

getgenes.py: query the GenBank database with species and gene name to obtain the order of surrounding protein-coding genes
genevis.py: functions to graphically visualise the gene order/synteny and save to image file
emailaddress.py: check and update the email address used to make GenBank Entrez queries

Author

Jake Leyhr (@jakeleyhr)
https://github.com/jakeleyhr/EZgenesynteny

Dependencies

Python 3.11
requests
packaging
biopython
configparser
matplotlib

Getting started

Install and open Miniconda
Create an environment with Python 3.11, e.g.:

conda create -n ezgenesyntenyenv python=3.11

Activate (enter) the environment:

conda activate ezgenesyntenyenv

Install the package (this automatically installs all dependencies as well):

pip install ezgenesynteny

Navigate to the folder you want to deposit the output files in:

cd \path\to\working\directory

Then you're ready to begin!

ezgenesynteny usage

$ ezgenesynteny -h
usage: ezgenesynteny [-h] [-s SPECIES [SPECIES ...]] [-g GENE_NAME] -up UPGENES -down DOWNGENES
                   [-plot PLOTNAME] [-csv CSVNAME] [-f INPUT_FILE]

Query the GenBank database with species and gene names to obtain a list of genes upstream 
and downstream of target gene.

options:
  -h, --help            show this help message and exit
  -s SPECIES [SPECIES ...], --species SPECIES [SPECIES ...]
                        Species name(s) (e.g., 'Homo_sapiens' or 'Human')
  -g GENE_NAME, --gene_name GENE_NAME
                        Gene name (e.g. BRCA1 or brca1)
  -up UPGENES, --upgenes UPGENES
                        Number of upstream genes to search for
  -down DOWNGENES, --downgenes DOWNGENES
                        Number of downstream genes to search for
  -plot PLOTNAME, --plotname PLOTNAME
                        Output file name for the gene order plot
  -csv CSVNAME, --csvname CSVNAME
                        Output file name for the gene order CSV
  -f INPUT_FILE, --input_file INPUT_FILE
                        Path to a text file containing a list of species and genes

ezgenesynteny inputs:

-s, -g, -up, -down

The basic command-line inputs are the species name (-s), the gene name (-g), and the number of upstream (-up) and downstream (-down) genes to retrieve. For example, to get the human gdf5 gene with 5 upstream and 5 downstream genes:

$ ezgenesynteny -s human -g gdf5 -up 5 -down 5

This produces the following output in the terminal:

▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒

▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ ezgenesynteny: human gdf5 ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒

▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒

Gene info:
Name: GDF5
Description: growth differentiation factor 5
Synonyms: ['OS5', 'LAP4', 'BDA1C', 'BMP14', 'CDMP1', 'LAP-4', 'SYM1B', 'SYNS2', 'BMP-14', 
'DUPANS']
Locus: 20q11.22
Strand: reverse

Using record 0:
Organism: human (Homo sapiens)
Assembly: Chromosome 20 Reference GRCh38.p14 Primary Assembly
Accession: NC_000020
Location: 35433347:35454749
Length: 21403bp

Parsing upstream genomic record (35454750:40454749)...
Record file parsed in 2.8 seconds
Genes in region: {'CEP250': '+', 'C20orf173': '-', 'ERGIC3': '+', 'SPAG4': '+', 'CPNE1': 
'-', 'RBM12': '-', 'NFS1': '-', 'ROMO1': '+', 'RBM39': '-', 'PHF20': '+', 'SCAND1': '-', 
'CNBD2': '+', 'EPB41L1': '+', 'AAR2': '+', 'DLGAP4': '+', 'MYL9': '+', 'TGIF2': '+', 
'TGIF2-RAB5IF': '+', 'RAB5IF': '+', 'SLA2': '-', 'NDRG3': '-', 'DSN1': '-', 'MTCL2': '-', 
'TLDC2': '+', 'SAMHD1': '-', 'RBL1': '-', 'MROH8': '-', 'RPN2': '+', 'GHRH': '-', 
'MANBAL': '+', 'SRC': '+', 'BLCAP': '-', 'NNAT': '+', 'CTNNBL1': '+', 'VSTM2L': '+', 
'TTI1': '-', 'RPRD1B': '+', 'TGM2': '-', 'KIAA1755': '-', 'BPI': '+', 'LBP': '+', 
'LOC124904958': '+', 'RALGAPB': '+', 'ADIG': '+', 'ARHGAP40': '+', 'SLC32A1': '+', 
'ACTR5': '+', 'PPP1R16B': '+', 'FAM83D': '+', 'DHX35': '+'}

Parsing downstream genomic record (30433347:35433346)...
Record file parsed in 2.6 seconds
Genes in region: {'LOC124904970': '+', 'DEFB115': '+', 'DEFB116': '-', 'DEFB118': '+', 
'DEFB119': '-', 'DEFB121': '-', 'DEFB123': '+', 'DEFB124': '-', 'REM1': '+', 'HM13': '+', 
'MCTS2': '+', 'ID1': '+', 'COX4I2': '+', 'BCL2L1': '-', 'TPX2': '+', 'MYLK2': '+', 
'FOXS1': '-', 'DUSP15': '-', 'TTLL9': '+', 'PDRG1': '-', 'XKR7': '+', 'CCM2L': '+', 
'HCK': '+', 'TM9SF4': '+', 'PLAGL2': '-', 'POFUT1': '+', 'KIF3B': '+', 'ASXL1': '+', 
'NOL4L': '-', 'C20orf203': '-', 'COMMD7': '-', 'DNMT3B': '+', 'MAPRE1': '+', 'EFCAB8': 
'+', 'SUN5': '-', 'BPIFB2': '+', 'BPIFB6': '+', 'BPIFB3': '+', 'BPIFB4': '+', 'BPIFA2': 
'+', 'BPIFA3': '+', 'BPIFA1': '+', 'BPIFB1': '+', 'CDK5RAP1': '-', 'SNTA1': '-', 
'CBFA2T2': '+', 'NECAB3': '-', 'C20orf144': '+', 'ACTL10': '+', 'E2F1': '-', 'PXMP4': 
'-', 'ZNF341': '+', 'CHMP4B': '+', 'RALY': '+', 'EIF2S2': '-', 'ASIP': '+', 'AHCY': '-', 
'ITCH': '+', 'DYNLRB1': '+', 'MAP1LC3A': '+', 'PIGU': '-', 'TP53INP2': '+', 'NCOA6': '-', 
'GGT7': '-', 'ACSS2': '+', 'GSS': '-', 'MYH7B': '+', 'TRPC4AP': '-', 'MMP24-AS1-EDEM2': 
'-', 'EDEM2': '-', 'PROCR': '+', 'MMP24': '+', 'MMP24OS': '-', 'EIF6': '-', 'FAM83C': '-', 
'UQCC1': '-'}


Trimming to max. 5 upstream genes and max. 5 downstream genes with GDF5 on forward strand:
[('CPNE1', '+'), ('SPAG4', '-'), ('ERGIC3', '-'), ('C20orf173', '+'), ('CEP250', '-'), 
('GDF5', '+'), ('UQCC1', '+'), ('FAM83C', '+'), ('EIF6', '+'), ('MMP24OS', '+'), ('MMP24', 
'-')]

Completed in 19.3 seconds

The species name can be entered either as the common name (human) or the Latin name (homo_sapiens). In Latin names or common names with multiple words, underscores must be used to separate the words (e.g. carcharodon_carcharias or great_white_shark).

A wide variety of gene synonyms can be searched for, not only the canonical gene symbol. It's important to be aware of this, for example when searching for the zebrafish shh gene. As a result of the teleost whole-genome duplication, zebrafish possess two paralogs of shh (shha and shhb), but the search returns only shha, because shh is listed as a synonym of shha and not of shhb.

You can also search for genes using RefSeq transcript or protein IDs (e.g. NM_008109.4 or NP_032135.2), but be aware that the species associated with these IDs takes precedence over the species given with -s. For example, NM_008109.4 is from the mouse genome, so even if the command given is -s human -g NM_008109.4, the result will be the mouse gene. Check the "Organism" line printed in the terminal to be sure. The GenBank search works essentially the same way as typing e.g. "human gdf5" into the search bar on the NCBI Gene website, so you can test a query there if you encounter a problem.

Details about the target gene are printed in the terminal, followed by lists of upstream and downstream protein-coding genes (from the perspective of the target gene's orientation). The searched regions in this case are 5 Mb upstream and 5 Mb downstream, because each region length is 1 Mb multiplied by the number of genes given by -up or -down (in this example 5). Finally, these long lists of genes are trimmed to the specified number, producing in this example an ordered list of 11 genes: the 5 upstream genes, the target gene, and the 5 downstream genes, each with its strand orientation.
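The window arithmetic and the trimming step can be sketched in a few lines of Python. This is an illustration reconstructed from the terminal output above, not the code in getgenes.py: the function names are hypothetical, the forward-strand branch is an assumption (the example only shows a reverse-strand gene), and the gene lists are assumed to be in ascending genomic order.

```python
MB = 1_000_000  # 1 Mb of flanking sequence per requested gene
FLIP = {"+": "-", "-": "+"}


def flanking_windows(start, end, strand, n_up, n_down):
    """Return (first, last) coordinates of the upstream and downstream windows."""
    def higher(length):
        return (end + 1, end + length)

    def lower(length):
        return (max(1, start - length), start - 1)

    if strand == "-":
        # For a reverse-strand gene, "upstream" lies at higher coordinates.
        return {"upstream": higher(n_up * MB), "downstream": lower(n_down * MB)}
    # Assumption: a forward-strand gene simply swaps the two windows.
    return {"upstream": lower(n_up * MB), "downstream": higher(n_down * MB)}


def trim_and_orient(lower_genes, higher_genes, target, strand, n_up, n_down):
    """Keep the nearest genes and show them with the target on the forward strand.

    lower_genes / higher_genes are lists of (name, strand) tuples in ascending
    genomic order, located below and above the target gene respectively.
    """
    def nearest_lower(n):
        return lower_genes[max(0, len(lower_genes) - n):]

    if strand == "-":
        genomic = nearest_lower(n_down) + [(target, "-")] + higher_genes[:n_up]
        # Reverse the order and flip every strand so the target reads as "+".
        return [(name, FLIP[s]) for name, s in reversed(genomic)]
    return nearest_lower(n_up) + [(target, "+")] + higher_genes[:n_down]


# Human GDF5 (reverse strand) with -up 5 -down 5, coordinates from the output above.
print(flanking_windows(35433347, 35454749, "-", 5, 5))
```

For the GDF5 example this gives the same two windows that appear in the "Parsing upstream/downstream genomic record" lines, and applying trim_and_orient to the neighbouring genes reproduces the order and strand flips of the trimmed list.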

You can enter multiple species names to get an easy comparison of the local gene synteny around homologous target genes. For example:

$ ezgenesynteny -s human mouse chicken -g gdf5 -up 5 -down 5

The program loops through the different species and prints each output as above to the terminal.
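If you want to launch runs like this from a script, the command can be assembled programmatically. The sketch below only uses flags shown in the help text; the helper names are hypothetical, and replacing spaces with underscores is a convenience added here, not something the package is known to do for you.

```python
import shlex


def normalise_species(name):
    """Join the words of a species name with underscores, e.g. great_white_shark."""
    return "_".join(name.strip().split())


def build_command(species, gene, up, down, plot=None, csv=None):
    """Build the ezgenesynteny argument list for one gene and one or more species."""
    cmd = ["ezgenesynteny", "-s", *[normalise_species(s) for s in species]]
    cmd += ["-g", gene, "-up", str(up), "-down", str(down)]
    if plot:
        cmd += ["-plot", plot]  # optional output image name
    if csv:
        cmd += ["-csv", csv]  # optional output CSV name
    return cmd


cmd = build_command(["human", "mouse", "chicken"], "gdf5", 5, 5)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems when the command is passed to subprocess.run.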

-f

Alternatively, a set of species and gene names can be read from a plain text file in the working directory, where the text file looks like this:

human, nkx2-5
mouse, nkx2-5
stegostoma_tigrinum, LOC125458387

This input method is particularly useful for cases like the one above, where the homologous target gene has very different names in different species. Searching for the nkx2-5 gene in human and mouse returns the gene, but returns no results in the zebra shark (stegostoma_tigrinum), because the gene is not completely annotated in the genome assembly of this species. When running the program with an input text file like this, it loops through each species with its corresponding gene name and prints the results to the terminal, e.g.:

$ ezgenesynteny -f inputfile.txt -up 5 -down 5
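It can be handy to check an input file before a long run. The following is a minimal sketch of a validator for the "species, gene" layout shown above; the function name is hypothetical, and the package's own parser may be more or less strict than this.

```python
def parse_species_gene_lines(text):
    """Parse lines of the form 'species, gene' into (species, gene) tuples."""
    pairs = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        species, sep, gene = line.partition(",")
        if not sep or not species.strip() or not gene.strip():
            raise ValueError(f"line {line_number}: expected 'species, gene', got {raw!r}")
        pairs.append((species.strip(), gene.strip()))
    return pairs


example = """human, nkx2-5
mouse, nkx2-5
stegostoma_tigrinum, LOC125458387
"""
for species, gene in parse_species_gene_lines(example):
    print(species, gene)
```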

-plot

To visualise the gene order/synteny data as a gene arrow map (saved to an image file), add the -plot command with the desired output file name with format suffix, for example:

$ ezgenesynteny -f inputfile.txt -up 5 -down 5 -plot nkx2-5plot.png

When this command is run, the program prints the results to the terminal as usual, but also saves the plot to the working directory as nkx2-5plot.png.

The target gene is coloured red, forward-strand genes are coloured yellow, and reverse-strand genes are coloured blue. A wide range of image formats are supported, including PDF and EPS, which can be imported and edited in a vector graphics editor such as Adobe Illustrator or Inkscape. If no format suffix is specified (e.g. '-plot nkx2-5plot'), PDF is selected as default.
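The colour scheme and the default-format rule described above can be expressed as two small helpers. This is only a sketch of the described behaviour: the function names are hypothetical, and how genevis.py actually detects a missing suffix is an assumption here.

```python
from pathlib import Path


def resolve_plot_name(name, default_suffix=".pdf"):
    """Append the default suffix when the file name has none (assumed rule)."""
    return name if Path(name).suffix else name + default_suffix


def gene_colour(name, strand, target):
    """Red for the target gene, yellow for forward strand, blue for reverse strand."""
    if name.upper() == target.upper():
        return "red"
    return "yellow" if strand == "+" else "blue"


print(resolve_plot_name("nkx2-5plot"))
print(gene_colour("UQCC1", "+", "GDF5"))
```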

-csv

To save the gene order/synteny data to a CSV file, add the -csv command with the desired CSV filename (without suffix), for example:

$ ezgenesynteny -f inputfile.txt -up 5 -down 5 -plot nkx2-5plot.png -csv nkx2-5csv

This can be particularly useful for manually editing the spacing of the genes in a table to align the homologous genes.

changeemail usage

$ changeemail -h
usage: changeemail [-h] [-check] [-update]

Manage email address for GenBank Entrez queries. Only stored locally and sent with 
queries to NCBI, nowhere else.

options:
  -h, --help  show this help message and exit
  -check      Check the current saved email address
  -update     Update the email address

The first time you run ezgenesynteny, you will be prompted for an email address, as NCBI requires one to send queries via Entrez. After you enter this email address the first time, you won't need to enter it again, as it is saved locally to a config file in the package directory. It is never shared with anyone except NCBI when you send queries via Entrez. This accessory module lets you check or update (change) this email address if you decide to use a different one at a later date. Any dummy email address or word (e.g. 'dummy@gmail.com' or 'dummy') can also be used if preferred.
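Storing a setting like this in a local config file is straightforward with configparser, which is one of the listed dependencies. The sketch below shows the general pattern only: the section name, key name, and file name are hypothetical and are not taken from emailaddress.py.

```python
import configparser


def save_email(config_path, email):
    """Write the email address to a small INI-style config file."""
    config = configparser.ConfigParser(interpolation=None)
    config["user"] = {"email": email}  # hypothetical section and key names
    with open(config_path, "w") as handle:
        config.write(handle)


def load_email(config_path):
    """Return the saved email address, or None if nothing has been saved yet."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(config_path)  # a missing file is silently ignored
    return config.get("user", "email", fallback=None)
```

With Biopython, the loaded value would then typically be assigned to Bio.Entrez.email before any query is sent.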

Bugs

Please submit via the GitHub issues page.

Software Licence

GPLv3

Release history

0.0.5 (Mar 18, 2024)
0.0.4 (Mar 14, 2024)
0.0.3 (Mar 14, 2024)
0.0.2 (Mar 14, 2024)
0.0.1 (Mar 10, 2024), this version
