Skip to main content

Find and cluster genomic regions containing a seed gene

Project description

GeneGrouper

GeneGrouper is a command-line tool that finds gene clusters in a set of genomes and bins them into groups of similar gene clusters.

GeneGrouper overview figure

Quick Overview

Inputs

  1. A translated gene of interest (.faa/.fasta/.txt)
  2. A set of genomes from RefSeq (.gbff)

Outputs

  1. A database of the genomes to be searched needs to be built once
  2. Afterwards, each individual search for a gene of interest will return a new folder containing all gene clusters and their groupings

Visualizations and data

  1. GeneGrouper produces 4 different visualizations to understand the pan-genomic context of gene cluster bins.

  2. Several datasets are outputted for further inspection by the user.

Performance

For 1,130 genomes and using a 2.2Ghz quad-core MacBook Pro, GeneGrouper:

  1. Builds a database in 8 minutes (this step is only done once)

  2. Neatly bins 2,300 separate gene clusters in ~4 minutes

Quick Start

Use build_database to make a database of your RefSeq .gbff genomes

GeneGrouper -g /path/to/gbff -d /path/to/output_directory \
build_database

Use find_regions to search for gene clusters and output to a search-specific directory, 'gene_name'

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta 

Visualize gene clusters and their distribution among genomes and taxa

GeneGrouper -d /path/to/output_directory -n gene_name \
visualize -vt main

Additional usage cases

Search for gene clusters and define the genomic window

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -us 2000 -ds 18000

Search for gene clusters containing a seed gene with >=70% identity and >=90% coverage to the query gene

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -i 70 -c 90

Allow for up to one gene cluster found per genome

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -hk 1

Have two gene cluster re-clustering iterations

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -re 2

Do it all at once

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -us 2000 -ds 18000 -i 70 -c 90 -hk 1 -re 2 

Visualize all subclusters within cluster label 'c0'

GeneGrouper -d /path/to/output_directory -n gene_name \
region_cluster -vt region_cluster -clab 0

Commands

Top-level commands

GeneGrouper [-h] [-d PROJECT_DIRECTORY] [-n SEARCH_NAME]
                     [-g GENOMES_DIRECTORY] [-t THREADS]
                     {build_database,find_regions,visualize} ...

optional arguments:
  -h, --help            show this help message and exit
  -d PROJECT_DIRECTORY, --project_directory PROJECT_DIRECTORY
                        main directory to contain the base files used for
                        region searching and clustering
  -n SEARCH_NAME, --search_name SEARCH_NAME
                        name of the directory to contain search-specific
                        results
  -g GENOMES_DIRECTORY, --genomes_directory GENOMES_DIRECTORY
                        directory containing genbank-file format genomes with
                        the suffix .gbff
  -t THREADS, --threads THREADS
                        number of threads to use

subcommands:
  valid subcommands

  {build_database,find_regions,visualize}
                        sub-command help
    build_database      convert a set of genomes into a useuable format for
                        GeneToRegions
    find_regions        find regions given a translated gene and a set of
                        genomes
    visualize           visualize region clusters

find_regions commands

GeneGrouper find_regions 
  -h, --help            show this help message and exit
  -f SEED_FILE, --seed_file SEED_FILE
                        provide the absolute path to a fasta file containing a
                        translated gene sequence
  -us UPSTREAM_SEARCH, --upstream_search UPSTREAM_SEARCH
                        upstream search length in basepairs
  -ds DOWNSTREAM_SEARCH, --downstream_search DOWNSTREAM_SEARCH
                        downstream search length in basepairs
  -i SEED_IDENTITY, --seed_identity SEED_IDENTITY
                        identity cutoff for initial blast search
  -c SEED_COVERAGE, --seed_coverage SEED_COVERAGE
                        coverage cutoff for initial blast search
  -hk SEED_HITS_KEPT, --seed_hits_kept SEED_HITS_KEPT
                        number of blast hits to keep
  -re RECLUSTER_ITERATIONS, --recluster_iterations RECLUSTER_ITERATIONS
                        number of region re-clustering attempts after the
                        initial clustering

visualize commands

GeneGrouper visualize 
optional arguments:
  -h, --help            show this help message and exit
  -vt {main,region_cluster}, --visual_type {main,region_cluster}
  -clab CLUSTER_LABEL, --cluster_label CLUSTER_LABEL

Installation

Simple:

Simple installation assuming you already have dependencies installed.

check me out

Detailed:

Instructions for creating a self-contained conda environment for GeneGrouper with all required dependencies.

conda create -n GeneGrouper_env python=3.9

source activate GeneGrouper_env

conda config --add channels defaults

conda config --add channels bioconda

conda config --add channels conda-forge

pip install biopython scipy scikit-learn pandas matplotlib

conda install -c conda-forge -c bioconda mmseqs2

conda install -c bioconda mcl

conda install -c bioconda blast

If you do not have R in your path (i.e. the command which R does not print a /path/to/R), you can install R using conda:

conda install -c conda-forge r-base

If you already have R installed, or after installing R, install the following packages from the CRAN repository:

(This might take a while if you have a fresh installation of R!)

r

packages <- c("reshape", "ggplot2", "cowplot", "dplyr", "gggenes", "groupdata2")
install.packages(setdiff(packages, rownames(installed.packages()))) 

q()
  1. Download GeneGrouper
pip install GeneGrouper

Dependencies:

conda list

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
_r-mutex                  1.0.1               anacondar_1    conda-forge
binutils_impl_linux-64    2.35.1               h193b22a_2    conda-forge
binutils_linux-64         2.35                h67ddf6f_30    conda-forge
biopython                 1.78                     pypi_0    pypi
blast                     2.5.0                hc0b0e79_3    bioconda
boost                     1.76.0           py39h5472131_0    conda-forge
boost-cpp                 1.76.0               h312852a_1    conda-forge
bwidget                   1.9.14               ha770c72_0    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h7f98852_1    conda-forge
ca-certificates           2020.12.5            ha878542_0    conda-forge
cairo                     1.16.0            h6cf1ce9_1008    conda-forge
certifi                   2020.12.5        py39hf3d152e_1    conda-forge
curl                      7.76.1               hea6ffbf_2    conda-forge
cycler                    0.10.0                   pypi_0    pypi
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
font-ttf-inconsolata      3.000                h77eed37_0    conda-forge
font-ttf-source-code-pro  2.038                h77eed37_0    conda-forge
font-ttf-ubuntu           0.83                 hab24e00_0    conda-forge
fontconfig                2.13.1            hba837de_1005    conda-forge
fonts-conda-ecosystem     1                             0    conda-forge
fonts-conda-forge         1                             0    conda-forge
freetype                  2.10.4               h0708190_1    conda-forge
fribidi                   1.0.10               h36c2ea0_0    conda-forge
gawk                      5.1.0                h7f98852_0    conda-forge
gcc_impl_linux-64         9.3.0               h70c0ae5_19    conda-forge
gcc_linux-64              9.3.0               hf25ea35_30    conda-forge
genegrouper               0.0.1                    pypi_0    pypi
gettext                   0.19.8.1          h0b5b191_1005    conda-forge
gfortran_impl_linux-64    9.3.0               hc4a2995_19    conda-forge
gfortran_linux-64         9.3.0               hdc58fab_30    conda-forge
graphite2                 1.3.13            h58526e2_1001    conda-forge
gsl                       2.6                  he838d99_2    conda-forge
gxx_impl_linux-64         9.3.0               hd87eabc_19    conda-forge
gxx_linux-64              9.3.0               h3fbe746_30    conda-forge
harfbuzz                  2.8.1                h83ec7ef_0    conda-forge
icu                       68.1                 h58526e2_0    conda-forge
jbig                      2.1               h7f98852_2003    conda-forge
joblib                    1.0.1                    pypi_0    pypi
jpeg                      9d                   h36c2ea0_0    conda-forge
kernel-headers_linux-64   2.6.32              h77966d4_13    conda-forge
kiwisolver                1.3.1                    pypi_0    pypi
krb5                      1.19.1               hcc1bbae_0    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
lerc                      2.2.1                h9c3ff4c_0    conda-forge
libblas                   3.9.0                9_openblas    conda-forge
libcblas                  3.9.0                9_openblas    conda-forge
libcurl                   7.76.1               h2574ce0_2    conda-forge
libdeflate                1.7                  h7f98852_5    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-devel_linux-64     9.3.0               h7864c58_19    conda-forge
libgcc-ng                 9.3.0               h2828fa1_19    conda-forge
libgfortran-ng            9.3.0               hff62375_19    conda-forge
libgfortran5              9.3.0               hff62375_19    conda-forge
libglib                   2.68.2               h3e27bee_0    conda-forge
libgomp                   9.3.0               h2828fa1_19    conda-forge
libiconv                  1.16                 h516909a_0    conda-forge
libidn2                   2.3.1                h7f98852_0    conda-forge
liblapack                 3.9.0                9_openblas    conda-forge
libnghttp2                1.43.0               h812cca2_0    conda-forge
libopenblas               0.3.15          pthreads_h8fe5266_1    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libssh2                   1.9.0                ha56f1ee_6    conda-forge
libstdcxx-devel_linux-64  9.3.0               hb016644_19    conda-forge
libstdcxx-ng              9.3.0               h6de172a_19    conda-forge
libtiff                   4.3.0                hf544144_1    conda-forge
libunistring              0.9.10               h14c3975_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libwebp-base              1.2.0                h7f98852_2    conda-forge
libxcb                    1.13              h7f98852_1003    conda-forge
libxml2                   2.9.12               h72842e0_0    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
make                      4.3                  hd18ef5c_1    conda-forge
matplotlib                3.4.2                    pypi_0    pypi
mcl                       14.137          pl5262h779adbc_6    bioconda
mmseqs2                   13.45111             h95f258a_1    bioconda
ncurses                   6.2                  h58526e2_4    conda-forge
numpy                     1.20.3           py39hdbf815f_0    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
pandas                    1.2.4                    pypi_0    pypi
pango                     1.48.5               hb8ff022_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
pcre2                     10.36                h032f7d1_1    conda-forge
perl                      5.26.2            h36c2ea0_1008    conda-forge
pillow                    8.2.0                    pypi_0    pypi
pip                       21.1.2             pyhd8ed1ab_0    conda-forge
pixman                    0.40.0               h36c2ea0_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
pyparsing                 2.4.7                    pypi_0    pypi
python                    3.9.4           hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                    pypi_0    pypi
python_abi                3.9                      1_cp39    conda-forge
pytz                      2021.1                   pypi_0    pypi
r-base                    4.1.0                h9e01966_1    conda-forge
readline                  8.1                  h46c0cb4_0    conda-forge
scikit-learn              0.24.2                   pypi_0    pypi
scipy                     1.6.3                    pypi_0    pypi
sed                       4.8                  he412f7d_0    conda-forge
setuptools                49.6.0           py39hf3d152e_3    conda-forge
six                       1.16.0                   pypi_0    pypi
sqlite                    3.35.5               h74cdb3f_0    conda-forge
sysroot_linux-64          2.12                h77966d4_13    conda-forge
threadpoolctl             2.1.0                    pypi_0    pypi
tk                        8.6.10               h21135ba_1    conda-forge
tktable                   2.10                 hb7b940f_3    conda-forge
tzdata                    2021a                he74cb21_0    conda-forge
wget                      1.20.1               h22169c7_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xorg-kbproto              1.0.7             h7f98852_1002    conda-forge
xorg-libice               1.0.10               h7f98852_0    conda-forge
xorg-libsm                1.2.3             hd9c2040_1000    conda-forge
xorg-libx11               1.7.1                h7f98852_0    conda-forge
xorg-libxau               1.0.9                h7f98852_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xorg-libxext              1.3.4                h7f98852_1    conda-forge
xorg-libxrender           0.9.10            h7f98852_1003    conda-forge
xorg-libxt                1.2.1                h7f98852_2    conda-forge
xorg-renderproto          0.11.1            h7f98852_1002    conda-forge
xorg-xextproto            7.3.0             h7f98852_1002    conda-forge
xorg-xproto               7.0.31            h7f98852_1007    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.5.0                ha95c52a_0    conda-forge

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

GeneGrouper-0.0.1.tar.gz (34.5 kB view hashes)

Uploaded Source

Built Distribution

GeneGrouper-0.0.1-py3-none-any.whl (38.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page