Find and cluster genomic regions containing a seed gene
GeneGrouper is a command-line tool that places gene clusters into groups according to how conserved their gene content is. Instead of providing all genes in a gene cluster, you only provide the sequence of one gene and the upstream and downstream coordinates that encompass at least the entire gene cluster. Several visualizations and detailed data tables are provided for further investigation.
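To make the idea of "upstream and downstream coordinates" concrete, the sketch below computes the window of a contig that would be searched around a seed gene. It is an illustration only, not GeneGrouper's own code: the names `Window` and `search_window` are hypothetical, and both the strand handling and the clipping to contig bounds are assumptions. The 10000 bp defaults mirror the find_regions defaults listed under Usage options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A half-open genomic interval [start, end) on one contig."""
    start: int
    end: int


def search_window(gene_start, gene_end, contig_length,
                  upstream=10000, downstream=10000, strand=1):
    """Return the region around a seed gene, clipped to the contig.

    Assumption: on the reverse strand (-1), "upstream" lies at higher
    coordinates, so the two distances are swapped.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    left, right = (upstream, downstream) if strand == 1 else (downstream, upstream)
    start = max(0, gene_start - left)            # do not run off the contig start
    end = min(contig_length, gene_end + right)   # do not run off the contig end
    return Window(start, end)


if __name__ == "__main__":
    print(search_window(12000, 13000, contig_length=50000))
```

A window that is too small can cut a gene cluster in half, which is why the description asks for coordinates that encompass at least the entire cluster.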
Installation
GeneGrouper can be installed using pip:
pip install GeneGrouper
GeneGrouper has multiple dependencies.
Follow the code below to create a self-contained conda environment for GeneGrouper (recommended).
Installing Python and bioinformatic dependencies for grouping
conda create -n GeneGrouper_env python=3.9
source activate GeneGrouper_env #or try: conda activate GeneGrouper_env
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
pip install biopython scipy scikit-learn pandas matplotlib GeneGrouper
conda install -c bioconda mcl blast mmseqs2 fasttree mafft
Installing R and required packages for visualizations
conda install -c conda-forge r-base=4.1.1 r-svglite r-reshape r-ggplot2 r-cowplot r-dplyr r-gggenes r-ape r-phytools r-BiocManager r-codetools
# enter R environment
R
# install additional packages from CRAN
install.packages('groupdata2',repos='https://cloud.r-project.org/', quiet=TRUE)
# install additional packages from Bioconductor
BiocManager::install("ggtree")
# quit
q(save="no")
For more information, see the installation wiki page
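After installing, it can help to confirm that the external programs are visible on your PATH before running a search. The following is a minimal sketch using only the Python standard library. The executable names in `REQUIRED_TOOLS` are assumptions about what the conda packages above install (for example `blastp` for blast and `mmseqs` for mmseqs2), and `find_tools` is a hypothetical helper, not part of GeneGrouper.

```python
import shutil

# Assumed executable name for each conda package listed above.
# Adjust the values if your installation uses different names.
REQUIRED_TOOLS = {
    "mcl": "mcl",
    "blast": "blastp",
    "mmseqs2": "mmseqs",
    "fasttree": "FastTree",
    "mafft": "mafft",
    "r-base": "Rscript",
}


def find_tools(tools=REQUIRED_TOOLS):
    """Map each package name to the executable's path, or None if not on PATH."""
    return {package: shutil.which(executable) for package, executable in tools.items()}


if __name__ == "__main__":
    for package, path in find_tools().items():
        status = f"found at {path}" if path else "NOT FOUND"
        print(f"{package:10s} {status}")
```

Run it inside the activated GeneGrouper_env environment; any line reporting NOT FOUND points at a package to reinstall or at an executable name that differs from the assumed one.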
Inputs
GeneGrouper has two required inputs:
- A translated gene sequence in fasta format (with file extension .fasta/.txt)
- A folder containing RefSeq GenBank-format genomes (with the file extension .gbff). See instructions to download many RefSeq genomes at a time.
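The two inputs can be sanity-checked before a run. The sketch below is one possible pre-flight check written with the standard library only. The helpers `read_fasta`, `looks_like_protein`, and `list_gbff` are hypothetical names, and the nucleotide test is a rough heuristic rather than anything GeneGrouper does itself.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a fasta file into {header: sequence}."""
    records, header = {}, None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def looks_like_protein(sequence):
    """Heuristic: a sequence made only of A/C/G/T/U/N is probably nucleotide."""
    return bool(sequence) and not set(sequence.upper()) <= set("ACGTUN")


def list_gbff(genomes_dir):
    """Return the .gbff files directly inside the genomes folder."""
    return sorted(Path(genomes_dir).glob("*.gbff"))
```

A query that fails `looks_like_protein` is likely an untranslated nucleotide sequence, and an empty list from `list_gbff` usually means the genomes still carry a different extension or are still compressed.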
Basic usage
Use build_database to make a GeneGrouper database of your RefSeq .gbff genomes
GeneGrouper -g /path/to/gbff -d /path/to/main_directory \
build_database
Use find_regions to search for regions containing a gene of interest and output to a search-specific directory
GeneGrouper -d /path/to/main_directory -n gene_search \
find_regions \
-f /path/to/query_gene.fasta
Use visualize --visual_type main to output visualizations of group gene architectures and their distribution within genomes and taxa
GeneGrouper -d /path/to/main_directory -n gene_search \
visualize \
--visual_type main
Use visualize --visual_type group to inspect a GeneGrouper group more closely. Replace <> with a group ID number.
GeneGrouper -d /path/to/main_directory -n gene_search \
visualize \
--visual_type group --group_label <>
Use visualize --visual_type tree to make a phylogenetic tree of each group's seed gene
GeneGrouper -d /path/to/main_directory -n gene_search \
visualize \
--visual_type tree
See tutorial with provided example data
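The steps above can also be chained from a script. The sketch below only assembles the argument lists from the flags shown in this section and prints them. `genegrouper_commands` is a hypothetical helper and the paths are placeholders; uncomment the `subprocess.run` line to execute each step.

```python
import shlex
import subprocess  # only needed if the run line below is uncommented


def genegrouper_commands(project_dir, search_name, genomes_dir, query_fasta):
    """Return the basic-usage commands, in order, as argument lists."""
    search = ["GeneGrouper", "-d", project_dir, "-n", search_name]
    return [
        ["GeneGrouper", "-g", genomes_dir, "-d", project_dir, "build_database"],
        search + ["find_regions", "-f", query_fasta],
        search + ["visualize", "--visual_type", "main"],
        search + ["visualize", "--visual_type", "tree"],
    ]


if __name__ == "__main__":
    commands = genegrouper_commands(
        project_dir="/path/to/main_directory",
        search_name="gene_search",
        genomes_dir="/path/to/gbff",
        query_fasta="/path/to/query_gene.fasta",
    )
    for command in commands:
        print(shlex.join(command))
        # subprocess.run(command, check=True)  # stop at the first failing step
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.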
Outputs
- For each search, find_regions outputs:
  - Four tabular files with quantitative and qualitative descriptions of grouping results.
  - One fasta file containing all genes used in the analysis.
- For each search, visualize --visual_type main outputs:
  - Three main visualizations.
- For each search, visualize --visual_type group --group_label <n> outputs:
  - One additional visualization per group, where <n> in --group_label <n> is replaced with the group number.
  - Two tabular files containing subgroup information for each --group_label <n> supplied.
- For each search, visualize --visual_type tree outputs:
  - One phylogenetic tree of each seed gene in each group.
See complete output file descriptions
Each search and visualization will have the following file structure. Files under visualizations may differ.
├── main_directory
│ ├── search_results
│ │ ├── group_statistics_summmary.csv
│ │ ├── representative_group_member_summary.csv
│ │ ├── group_taxa_summary.csv
│ │ ├── group_regions.csv
│ │ ├── group_region_seqs.faa
│ │ ├── visualizations
│ │ │ ├── group_summary.png
│ │ │ ├── groups_by_taxa.png
│ │ │ ├── taxa_searched.png
│ │ │ ├── inspect_group_-1.png
│ │ │ ├── representative_seed_phylogeny.png
│ │ ├── internal_data
│ │ ├── subgroups
│ │ ├── seed_results.db
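A quick way to confirm that a search finished is to check for the five top-level result files in the listing above. This is a minimal sketch: `missing_outputs` is a hypothetical helper, the file names are copied from the listing exactly as spelled there, and the search directory is assumed to sit directly inside the main directory.

```python
from pathlib import Path

# File names copied verbatim from the listing above (four tables, one fasta).
EXPECTED_FILES = [
    "group_statistics_summmary.csv",
    "representative_group_member_summary.csv",
    "group_taxa_summary.csv",
    "group_regions.csv",
    "group_region_seqs.faa",
]


def missing_outputs(search_dir):
    """Return the expected result files that are absent from search_dir."""
    search_dir = Path(search_dir)
    return [name for name in EXPECTED_FILES if not (search_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("/path/to/main_directory/gene_search")
    print("all result files present" if not missing else f"missing: {missing}")
```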
Usage options
Global flags
usage: GeneGrouper [-h] [-d] [-n] [-g] [-t]
{build_database,find_regions,visualize} ...
-h, --help show this help message and exit
-d , --project_directory
Main directory to contain the base files used for
region searching and clustering. Default=current
directory.
-n , --search_name Name of the directory to contain search-specific
results. Default=region_search
-g , --genomes_directory
Directory containing genbank-file format genomes with
the suffix .gbff. Default=./genomes.
-t , --threads Number of threads to use. Default=all threads.
Subcommands
build_database Convert a set of genomes into a useable format for
GeneGrouper
find_regions Find regions given a translated gene and a set of
genomes
visualize Visualize GeneGrouper outputs. Three visualization options are provided.
Check the --visual_type help description.
Subcommand flags
build_database
usage: GeneGrouper build_database [-h]
-h, --help show this help message and exit
find_regions
usage: GeneGrouper find_regions [-h] -f [-us] [-ds] [-i] [-c] [-hk] [--min_group_size] [-re] [--force]
-h, --help show this help message and exit
-f , --query_file Provide the absolute path to a fasta file containing a translated gene sequence.
-us , --upstream_search
Upstream search length in basepairs. Default=10000
-ds , --downstream_search
Downstream search length in basepairs. Default=10000
-i , --seed_identity
Identity cutoff for initial blast search. Default=60
-c , --seed_coverage
Coverage cutoff for initial blast search. Default=90
-hk , --seed_hits_kept
Number of blast hits to keep. Default=None
--min_group_size
The minimum number of gene regions to constitute a group. Default=ln(jaccard distance length)
-re , --recluster_iterations
Number of region re-clustering attempts after the initial clustering. Default=0
--force Flag to overwrite search name directory.
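The --min_group_size default refers to Jaccard distances between regions. As a rough illustration of both ideas, the sketch below computes a Jaccard distance between the gene content of two regions and a natural-log default. It is not GeneGrouper's implementation: the function names are hypothetical, "jaccard distance length" is assumed here to mean the number of regions being compared, and rounding to the nearest integer is also an assumption.

```python
import math


def jaccard_distance(genes_a, genes_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of gene (cluster) identifiers."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # two empty regions are treated as identical
    return 1.0 - len(a & b) / len(union)


def default_min_group_size(n_regions):
    """ln(n_regions), rounded, with a floor of 1 (assumed interpretation)."""
    if n_regions < 1:
        raise ValueError("n_regions must be at least 1")
    return max(1, round(math.log(n_regions)))


if __name__ == "__main__":
    region_1 = {"geneA", "geneB", "geneC"}
    region_2 = {"geneB", "geneC", "geneD"}
    print(jaccard_distance(region_1, region_2))  # two of four genes shared
    print(default_min_group_size(500))
```

Under these assumptions a search returning 500 regions would require about 6 regions per group, since ln(500) is roughly 6.2, so the threshold grows slowly as more regions are found.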
visualize
usage: GeneGrouper visualize [-h] [--visual_type] [--group_label]
--visual_type Choices: [main, group, tree]. Use main for main visualizations. Use group to
inspect a specific group. Use tree for a phylogenetic tree of representative
seed sequences. Default=main
--group_label The integer identifier of the group you wish to inspect. Default=-1
--image_format Choices: [png, svg]. Output image format. Use svg if you want to edit the
images. Default=png.
--tip_label_type Choices: [full, group]. Use full to include the sequence ID followed by group
ID. Use group to only have the group ID. Default=full
--tip_label_size Specify the tip label size in the output image. Default=2
Citation
Alexander G McFarland, Nolan W Kennedy, Carolyn E Mills, Danielle Tullman-Ercek, Curtis Huttenhower, Erica M Hartmann, Density-based binning of gene clusters to infer function or evolutionary history using GeneGrouper, Bioinformatics, 2021, btab752, https://doi.org/10.1093/bioinformatics/btab752
Contact
Please message me at alexandermcfarland2022@u.northwestern.edu
Follow me on Twitter @alexmcfarland_!