Find and cluster genomic regions containing a seed gene



GeneGrouper is a command-line tool that finds gene clusters in a set of genomes and bins them into groups of similar gene clusters.

GeneGrouper overview figure

Quick Overview

GeneGrouper takes two inputs:

  1. A translated gene of interest (.faa/.fasta/.txt)
  2. A set of genomes from RefSeq (.gbff)

It runs in two steps:

  1. Build a database of the genomes to be searched. This only needs to be done once.
  2. Search for a gene of interest. Each search returns a new folder containing all gene clusters and their groupings.
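
Before building a database, it can help to confirm that the inputs look as expected. The following is a minimal sketch, not part of GeneGrouper: the helper names `count_gbff` and `count_fasta_records` are hypothetical, and it assumes the .gbff files sit directly inside one directory and that the seed file is a plain-text FASTA.

```shell
# Count GenBank flat files (*.gbff) directly inside a directory.
count_gbff() {
    find "$1" -maxdepth 1 -type f -name '*.gbff' | wc -l | tr -d ' '
}

# Count FASTA records (header lines starting with '>') in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example usage with placeholder paths (uncomment and adjust):
# count_gbff /path/to/gbff
# count_fasta_records /path/to/seed_gene.fasta
```

A genome count of 0, or a seed file with no FASTA header, usually points to a wrong path or file extension.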

Visualizations and data

  1. GeneGrouper produces four visualizations for exploring the pan-genomic context of gene cluster bins.

  2. Several datasets are written out for further inspection.


For 1,130 genomes on a 2.2 GHz quad-core MacBook Pro, GeneGrouper:

  1. Builds a database in 8 minutes (this step is only done once)

  2. Neatly bins 2,300 separate gene clusters in ~4 minutes

Quick Start

Use build_database to make a database of your RefSeq .gbff genomes

GeneGrouper -g /path/to/gbff -d /path/to/output_directory \
build_database

Use find_regions to search for gene clusters and output to a search-specific directory, 'gene_name'

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta 

Visualize gene clusters and their distribution among genomes and taxa

GeneGrouper -d /path/to/output_directory -n gene_name \
visualize -vt main

Additional usage cases

Search for gene clusters and define the genomic window

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -us 2000 -ds 18000
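
To make the window arithmetic concrete, here is a simplified sketch of how an upstream and downstream length could translate into coordinates around a seed gene. It is an illustration only: `window_bounds` is a hypothetical helper, strand is ignored, and clipping to the contig ends is an assumption. The source does not state how GeneGrouper handles these cases internally.

```shell
# Print the 1-based start and end of a search window around a seed gene,
# clipped to the contig. Strand is ignored in this simplified sketch.
# Arguments: seed_start seed_end upstream_bp downstream_bp contig_length
window_bounds() {
    start=$(( $1 - $3 ))
    end=$(( $2 + $4 ))
    if [ "$start" -lt 1 ]; then start=1; fi
    if [ "$end" -gt "$5" ]; then end=$5; fi
    echo "$start $end"
}

# A seed gene at 5000..6200 with -us 2000 -ds 18000 on a 100 kb contig.
window_bounds 5000 6200 2000 18000 100000   # prints "3000 24200"
```

With these numbers the window spans the 2,000 bp before the gene, the gene itself, and the 18,000 bp after it.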

Search for gene clusters containing a seed gene with >=70% identity and >=90% coverage to the query gene

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -i 70 -c 90
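
The effect of the two cutoffs can be illustrated on a small BLAST-style table. This sketch assumes tab-separated rows with the columns query id, subject id, percent identity, and query coverage, as BLAST+ writes with -outfmt "6 qseqid sseqid pident qcovs". The `filter_hits` helper is hypothetical, and whether GeneGrouper measures coverage in exactly this way is not stated in the source.

```shell
# Keep rows whose percent identity (column 3) and coverage (column 4)
# both meet the given cutoffs. Input: tab-separated rows on stdin.
# Arguments: min_identity min_coverage
filter_hits() {
    awk -F '\t' -v min_id="$1" -v min_cov="$2" \
        '$3 + 0 >= min_id + 0 && $4 + 0 >= min_cov + 0'
}

# Four made-up hits; only g1 and g3 pass identity >= 70 and coverage >= 90.
printf 'seed\tg1\t95.2\t100\nseed\tg2\t65.0\t99\nseed\tg3\t70.0\t90\nseed\tg4\t88.1\t80\n' |
    filter_hits 70 90
```

Both thresholds are inclusive here, matching the ">=" wording above.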

Keep at most one gene cluster per genome

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -hk 1
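
Conceptually, limiting hits per genome amounts to ranking each genome's hits and keeping the first few. The sketch below shows that idea on a made-up table of genome id, hit id, and bitscore. Ranking by bitscore is an assumption made for illustration, and `keep_top_hits` is a hypothetical helper; the source does not say which criterion GeneGrouper uses.

```shell
# Keep the n best-scoring rows per genome.
# Input (stdin): tab-separated genome_id, hit_id, bitscore.
# Argument: n (number of hits to keep per genome)
keep_top_hits() {
    sort -t "$(printf '\t')" -k1,1 -k3,3nr |
        awk -F '\t' -v n="$1" '++seen[$1] <= n'
}

# Genome gA has two hits; with n=1 only its higher-scoring hit (h2) is kept.
printf 'gA\th1\t200\ngA\th2\t350\ngB\th3\t120\n' | keep_top_hits 1
```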

Run two gene cluster re-clustering iterations

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -re 2

Do it all at once

GeneGrouper -d /path/to/output_directory -n gene_name \
find_regions -f /path/to/seed_gene.fasta -us 2000 -ds 18000 -i 70 -c 90 -hk 1 -re 2 
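
Because the database is built once and each search writes to its own directory, several seed genes can be searched in a loop. The following sketch only prints the commands it would run. The helper names `search_name` and `print_searches` are hypothetical, and it assumes one seed gene per .fasta file, with the file name (minus extension) used as the search name.

```shell
# Derive a search name from a seed file path: drop the directory and extension.
search_name() {
    base=$(basename "$1")
    echo "${base%.*}"
}

# Print (not run) one find_regions command per .fasta file in a directory.
# Arguments: seed_directory project_directory
# Remove the 'echo' in front of GeneGrouper to execute the commands instead.
print_searches() {
    for seed in "$1"/*.fasta; do
        [ -e "$seed" ] || continue   # skip if the glob matched nothing
        echo GeneGrouper -d "$2" -n "$(search_name "$seed")" \
            find_regions -f "$seed"
    done
}

# Example usage with placeholder paths (uncomment and adjust):
# print_searches /path/to/seed_genes /path/to/output_directory
```

Pass an absolute seed directory so that each -f argument is an absolute path, as the find_regions help text requests.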

Visualize all subclusters within cluster label 'c0'

GeneGrouper -d /path/to/output_directory -n gene_name \
visualize -vt region_cluster -clab 0


Top-level commands

GeneGrouper [-h] [-d PROJECT_DIRECTORY] [-n SEARCH_NAME]
                     [-g GENOMES_DIRECTORY] [-t THREADS]
                     {build_database,find_regions,visualize} ...

optional arguments:
  -h, --help            show this help message and exit
  -d PROJECT_DIRECTORY
                        main directory to contain the base files used for
                        region searching and clustering
  -n SEARCH_NAME, --search_name SEARCH_NAME
                        name of the directory to contain search-specific
                        results
  -g GENOMES_DIRECTORY
                        directory containing genbank-file format genomes with
                        the suffix .gbff
  -t THREADS, --threads THREADS
                        number of threads to use

  valid subcommands

  {build_database,find_regions,visualize}
                        sub-command help
    build_database      convert a set of genomes into a usable format
    find_regions        find regions given a translated gene and a set of
                        genomes
    visualize           visualize region clusters
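
The -t option takes a thread count. One way to choose a value is to ask the system how many cores are online. This is a small sketch under that assumption: `cpu_count` is a hypothetical helper, and using every core is a choice, not a recommendation from the GeneGrouper documentation.

```shell
# Print the number of online CPU cores, falling back to 1 if unknown.
cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;   # not a plain number: fall back to 1
    esac
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}

# Example usage with placeholder paths (uncomment and adjust):
# GeneGrouper -g /path/to/gbff -d /path/to/output_directory \
#     -t "$(cpu_count)" build_database
```

Note that -t is a top-level option, so it goes before the subcommand name.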

find_regions commands

GeneGrouper find_regions 
  -h, --help            show this help message and exit
  -f SEED_FILE, --seed_file SEED_FILE
                        provide the absolute path to a fasta file containing a
                        translated gene sequence
  -us UPSTREAM_SEARCH, --upstream_search UPSTREAM_SEARCH
                        upstream search length in basepairs
  -ds DOWNSTREAM_SEARCH
                        downstream search length in basepairs
  -i SEED_IDENTITY, --seed_identity SEED_IDENTITY
                        identity cutoff for initial blast search
  -c SEED_COVERAGE, --seed_coverage SEED_COVERAGE
                        coverage cutoff for initial blast search
  -hk SEED_HITS_KEPT, --seed_hits_kept SEED_HITS_KEPT
                        number of blast hits to keep
  -re RECLUSTER_ITERATIONS
                        number of region re-clustering attempts after the
                        initial clustering

visualize commands

GeneGrouper visualize 
optional arguments:
  -h, --help            show this help message and exit
  -vt {main,region_cluster}, --visual_type {main,region_cluster}
  -clab CLUSTER_LABEL, --cluster_label CLUSTER_LABEL



Installation

Simple installation, assuming you already have the dependencies installed:

pip install GeneGrouper


Instructions for creating a self-contained conda environment for GeneGrouper with all required dependencies.

conda create -n GeneGrouper_env python=3.9

conda activate GeneGrouper_env

conda config --add channels defaults

conda config --add channels bioconda

conda config --add channels conda-forge

pip install biopython scipy scikit-learn pandas matplotlib

conda install -c conda-forge -c bioconda mmseqs2

conda install -c bioconda mcl

conda install -c bioconda blast

If you do not have R in your path (i.e. the command which R does not print a /path/to/R), you can install R using conda:

conda install -c conda-forge r-base
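
After these installs, a quick check can confirm that the external tools are reachable from the active environment. This is a sketch, not part of GeneGrouper: `missing_tools` is a hypothetical helper, and the executable names in the usage example (`mmseqs`, `mcl`, `blastp`, `makeblastdb`, `R`) are the ones these packages normally provide. Which of them GeneGrouper actually calls is an assumption.

```shell
# Print the name of each given tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example usage: prints nothing if every tool is available.
# missing_tools mmseqs mcl blastp makeblastdb R
```

Any name printed indicates a tool that still needs to be installed or added to PATH.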

If you already have R installed, or after installing R, install the following packages from the CRAN repository:

(This might take a while if you have a fresh installation of R!)


packages <- c("reshape", "ggplot2", "cowplot", "dplyr", "gggenes", "groupdata2")
install.packages(setdiff(packages, rownames(installed.packages()))) 

Finally, install GeneGrouper from PyPI:

pip install GeneGrouper


