Find and cluster genomic regions containing a seed gene

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

GeneGrouper is a command-line tool that places gene clusters into groups according to how conserved their gene content is. Instead of providing all genes in a gene cluster, you only provide the sequence of one gene and the upstream and downstream coordinates that encompass at least the entire gene cluster. Several visualizations and detailed data tables are provided for further investigation.
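As a rough illustration of what "how conserved their gene content is" can mean, the sketch below computes a Jaccard distance between two regions, each described by the set of gene families it contains. The `--min_group_size` help text further down mentions Jaccard distance, but the function name, the family labels, and the way regions are represented here are assumptions made for illustration only, not GeneGrouper's internal code.

```python
def jaccard_distance(region_a, region_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of gene-family labels."""
    a, b = set(region_a), set(region_b)
    if not a and not b:
        return 0.0  # two empty regions are treated as identical
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical regions, each listed by the gene families it contains.
region_1 = ["seed", "famA", "famB", "famC"]
region_2 = ["seed", "famA", "famB", "famD"]
region_3 = ["seed", "famX"]

d12 = jaccard_distance(region_1, region_2)  # 3 shared of 5 total families
d13 = jaccard_distance(region_1, region_3)  # 1 shared of 5 total families

if __name__ == "__main__":
    print(f"region_1 vs region_2: {d12:.2f}")
    print(f"region_1 vs region_3: {d13:.2f}")
```

Regions with a smaller distance share more of their gene content, so they are more likely to end up in the same group.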

Why use GeneGrouper?

See GeneGrouper tutorial

See GeneGrouper outputs

See FAQs

Installation

GeneGrouper can be installed using pip

pip install GeneGrouper

GeneGrouper has multiple dependencies.

Recommended: use the commands below to create a self-contained conda environment for GeneGrouper.

Installing Python and bioinformatic dependencies for grouping

conda create -n GeneGrouper_env python=3.9

source activate GeneGrouper_env #or try: conda activate GeneGrouper_env

conda config --add channels defaults

conda config --add channels bioconda

conda config --add channels conda-forge

pip install biopython scipy scikit-learn pandas matplotlib GeneGrouper

conda install -c bioconda mcl blast mmseqs2 fasttree mafft

Installing R and required packages for visualizations

conda install -c conda-forge r-base=4.1.1 r-svglite r-reshape r-ggplot2 r-cowplot r-dplyr r-gggenes r-ape r-phytools r-BiocManager r-codetools

# enter R environment
R

# install additional packages from CRAN
install.packages('groupdata2',repos='https://cloud.r-project.org/', quiet=TRUE)

# install additional packages from Bioconductor
BiocManager::install("ggtree")

# quit
q(save="no")

For more information, see the installation wiki page
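After installing, it can help to confirm that the external tools are visible on your `PATH`. The following is a minimal sketch using only the Python standard library. The executable names (`blastp`, `mmseqs`, `FastTree`, `Rscript`, and so on) are assumptions based on what the corresponding conda packages usually install; adjust them if your environment differs.

```python
import shutil

# Package name -> executable name. The executable names are assumptions,
# not taken from the GeneGrouper documentation.
ASSUMED_EXECUTABLES = {
    "mcl": "mcl",
    "blast": "blastp",
    "mmseqs2": "mmseqs",
    "fasttree": "FastTree",
    "mafft": "mafft",
    "r-base": "Rscript",
}


def find_missing(executables=ASSUMED_EXECUTABLES):
    """Return the package names whose executable is not found on PATH."""
    return sorted(
        pkg for pkg, exe in executables.items() if shutil.which(exe) is None
    )


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All tools found.")
```

Run it inside the activated `GeneGrouper_env` environment so that the check sees the same `PATH` that GeneGrouper will use.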

Inputs

GeneGrouper has two required inputs:

- A translated gene sequence in fasta format (with file extension .fasta/.txt)
- A folder containing RefSeq GenBank-format genomes (with the file extension .gbff). See instructions to download many RefSeq genomes at a time.
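The two inputs can be sanity-checked before starting a long run. The sketch below is a hypothetical helper, not part of GeneGrouper: it checks the file extensions named above, looks for a FASTA header, and flags a query that looks like nucleotides rather than a translated sequence. The nucleotide test is a simple heuristic chosen for this example.

```python
from pathlib import Path


def check_inputs(query_file, genomes_dir):
    """Return a list of problems; an empty list means the inputs look plausible."""
    problems = []
    query = Path(query_file)
    if query.suffix not in {".fasta", ".txt"}:
        problems.append(f"{query.name}: expected a .fasta or .txt extension")
    if not query.is_file():
        problems.append(f"{query}: file not found")
    else:
        lines = [ln.strip() for ln in query.read_text().splitlines() if ln.strip()]
        if not lines or not lines[0].startswith(">"):
            problems.append(f"{query.name}: first line is not a FASTA header")
        else:
            residues = "".join(ln for ln in lines[1:] if not ln.startswith(">")).upper()
            # Heuristic: only A/C/G/T/N suggests an untranslated nucleotide sequence.
            if residues and set(residues) <= set("ACGTN"):
                problems.append(f"{query.name}: sequence looks like nucleotides, not protein")
    genomes = Path(genomes_dir)
    if not genomes.is_dir():
        problems.append(f"{genomes}: directory not found")
    elif not any(genomes.glob("*.gbff")):
        problems.append(f"{genomes}: no .gbff files found")
    return problems


if __name__ == "__main__":
    for problem in check_inputs("/path/to/query_gene.fasta", "/path/to/gbff"):
        print(problem)
```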

Basic usage

Use `build_database` to make a GeneGrouper database of your RefSeq .gbff genomes

GeneGrouper -g /path/to/gbff -d /path/to/main_directory \
build_database

Use `find_regions` to search for regions containing a gene of interest and output to a search-specific directory

GeneGrouper -d /path/to/main_directory -n gene_search \
find_regions \
-f /path/to/query_gene.fasta

Use `visualize --visual_type main` to output visualizations of group gene architectures and their distribution within genomes and taxa

GeneGrouper -d /path/to/main_directory -n gene_search \
visualize \
--visual_type main

Use `visualize --visual_type group` to inspect a GeneGrouper group more closely. Replace <> with a group ID number.

GeneGrouper -d /path/to/main_directory -n gene_search \
visualize \
--visual_type group --group_label <>

Use `visualize --visual_type tree` to make a phylogenetic tree of each group's seed gene

GeneGrouper -d /path/to/main_directory -n gene_search \
visualize \
--visual_type tree
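The steps above can also be scripted. The following sketch builds the argument list for each step using only the flags shown in this README, and prints the commands rather than running them. The helper names `genegrouper_cmd` and `workflow` are hypothetical, and the paths are the same placeholders used above.

```python
import shlex


def genegrouper_cmd(project_dir, search_name, subcommand, *sub_args, genomes_dir=None):
    """Build the argument list for one GeneGrouper call."""
    cmd = ["GeneGrouper"]
    if genomes_dir is not None:
        cmd += ["-g", str(genomes_dir)]
    cmd += ["-d", str(project_dir)]
    if search_name is not None:
        cmd += ["-n", str(search_name)]
    return cmd + [subcommand, *map(str, sub_args)]


def workflow(project_dir, genomes_dir, search_name, query_fasta, group_ids=()):
    """Return the commands for database build, search, and visualizations, in order."""
    steps = [
        genegrouper_cmd(project_dir, None, "build_database", genomes_dir=genomes_dir),
        genegrouper_cmd(project_dir, search_name, "find_regions", "-f", query_fasta),
        genegrouper_cmd(project_dir, search_name, "visualize", "--visual_type", "main"),
    ]
    for gid in group_ids:
        steps.append(
            genegrouper_cmd(
                project_dir, search_name, "visualize",
                "--visual_type", "group", "--group_label", gid,
            )
        )
    steps.append(
        genegrouper_cmd(project_dir, search_name, "visualize", "--visual_type", "tree")
    )
    return steps


if __name__ == "__main__":
    for step in workflow(
        "/path/to/main_directory", "/path/to/gbff", "gene_search",
        "/path/to/query_gene.fasta", group_ids=[0, 1],
    ):
        print(shlex.join(step))
```

To execute a step instead of printing it, pass the list to `subprocess.run(step, check=True)`, which stops at the first command that fails.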

See advanced usage examples

See tutorial with provided example data

Outputs

For each search, find_regions outputs:

- Four tabular files with quantitative and qualitative descriptions of grouping results.
- One fasta file containing all genes used in the analysis.

For each search, visualize --visual_type main outputs:

- Three main visualizations.

For each search, visualize --visual_type group --group_label <n> outputs:

- One additional visualization per group, where <n> in --group_label <n> is replaced with the group number.
- Two tabular files containing subgroup information for each --group_label <n> supplied.

For each search, visualize --visual_type tree outputs:

- One phylogenetic tree of each seed gene in each group.

See complete output file descriptions

Each search and visualization will have the following file structure. Files under visualizations may differ.

├── main_directory
│   ├── search_results
│   │   ├── group_statistics_summmary.csv
│   │   ├── representative_group_member_summary.csv
│   │   ├── group_taxa_summary.csv
│   │   ├── group_regions.csv
│   │   ├── group_region_seqs.faa
│   │   ├── visualizations
│   │   │   ├── group_summary.png
│   │   │   ├── groups_by_taxa.png
│   │   │   ├── taxa_searched.png
│   │   │   ├── inspect_group_-1.png
│   │   │   ├── representative_seed_phylogeny.png
│   │   ├── internal_data
│   │   ├── subgroups
│   │   ├── seed_results.db
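To check that a search finished and to take a first look at a table, a short script like the one below may be useful. The file names are copied from the directory tree above. The helper names are hypothetical, and the CSV preview deliberately makes no assumption about column names, since they are not listed here.

```python
import csv
import itertools
from pathlib import Path

# File names as they appear in the directory tree above.
EXPECTED_FILES = [
    "group_statistics_summmary.csv",
    "representative_group_member_summary.csv",
    "group_taxa_summary.csv",
    "group_regions.csv",
    "group_region_seqs.faa",
]


def missing_outputs(search_dir):
    """Return the expected result files that are absent from a search directory."""
    search_dir = Path(search_dir)
    return [name for name in EXPECTED_FILES if not (search_dir / name).is_file()]


def preview_csv(path, n=3):
    """Return the header row plus the first n data rows of a CSV file."""
    with open(path, newline="") as handle:
        return list(itertools.islice(csv.reader(handle), n + 1))


if __name__ == "__main__":
    results = Path("/path/to/main_directory") / "gene_search"
    print("Missing:", missing_outputs(results) or "none")
```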

Usage options

Global flags

usage: GeneGrouper [-h] [-d] [-n] [-g] [-t]
                   {build_database,find_regions,visualize} ...

  -h, --help            show this help message and exit
  -d , --project_directory
                        Main directory to contain the base files used for
                        region searching and clustering. Default=current
                        directory.
  -n , --search_name    Name of the directory to contain search-specific
                        results. Default=region_search
  -g , --genomes_directory
                        Directory containing genbank-file format genomes with
                        the suffix .gbff. Default=./genomes.
  -t , --threads        Number of threads to use. Default=all threads.

Subcommands

    build_database      Convert a set of genomes into a useable format for
                        GeneGrouper
    find_regions        Find regions given a translated gene and a set of
                        genomes
    visualize           Visualize GeneGrouper outputs. Three visualization options are provided.
                        Check the --visual_type help description.

Subcommand flags

build_database

usage: GeneGrouper build_database [-h]

  -h, --help  show this help message and exit

find_regions

usage: GeneGrouper find_regions [-h] -f  [-us] [-ds] [-i] [-c] [-hk] [--min_group_size] [-re] [--force]

  -h, --help            show this help message and exit
  -f , --query_file     Provide the absolute path to a fasta file containing a translated gene sequence.
  -us , --upstream_search
                        Upstream search length in basepairs. Default=10000
  -ds , --downstream_search
                        Downstream search length in basepairs. Default=10000
  -i , --seed_identity
                        Identity cutoff for initial blast search. Default=60
  -c , --seed_coverage
                        Coverage cutoff for initial blast search. Default=90
  -hk , --seed_hits_kept
                        Number of blast hits to keep. Default=None
  --min_group_size
                        The minimum number of gene regions to constitute a group. Default=ln(jaccard distance length)
  -re , --recluster_iterations
                        Number of region re-clustering attempts after the initial clustering. Default=0
  --force               Flag to overwrite search name directory.
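To make the seed cutoffs concrete, the sketch below applies `--seed_identity`, `--seed_coverage`, and `--seed_hits_kept` to a list of hypothetical blast hits. It is an illustration under stated assumptions, not GeneGrouper's implementation: coverage is taken as aligned length divided by query length, cutoffs are treated as inclusive, and hits are ranked by bitscore before truncation. The `Hit` class and its fields are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A hypothetical blast hit; field names are invented for this example."""
    subject_id: str
    percent_identity: float  # 0-100
    alignment_length: int    # aligned query residues
    query_length: int
    bitscore: float

    @property
    def query_coverage(self):
        # Assumed definition: percent of the query covered by the alignment.
        return 100.0 * self.alignment_length / self.query_length


def filter_hits(hits, seed_identity=60, seed_coverage=90, seed_hits_kept=None):
    """Keep hits meeting both cutoffs, best bitscore first, optionally truncated."""
    kept = [
        h for h in hits
        if h.percent_identity >= seed_identity and h.query_coverage >= seed_coverage
    ]
    kept.sort(key=lambda h: h.bitscore, reverse=True)
    return kept if seed_hits_kept is None else kept[:seed_hits_kept]


hits = [
    Hit("g1", 95.0, 300, 300, 600.0),  # passes both cutoffs
    Hit("g2", 62.0, 280, 300, 350.0),  # passes both cutoffs
    Hit("g3", 55.0, 300, 300, 300.0),  # identity below 60
    Hit("g4", 80.0, 200, 300, 310.0),  # coverage below 90
]

if __name__ == "__main__":
    print([h.subject_id for h in filter_hits(hits)])
    print([h.subject_id for h in filter_hits(hits, seed_hits_kept=1)])
```

Lowering `-i` or `-c` admits more distant homologs as seeds, which generally increases the number of regions that are extracted and clustered.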

visualize

usage: GeneGrouper visualize [-h] [--visual_type] [--group_label]

  --visual_type      Choices: [main, group, tree]. Use main for main visualizations. Use group to
                     inspect specific group. Use tree for a phylogenetic tree of representative
                     seed sequences. Default=main
  --group_label      The integer identifier of the group you wish to inspect. Default=-1
  --image_format     Choices: [png, svg]. Output image format. Use svg if you want to edit the
                     images. Default=png.
  --tip_label_type   Choices: [full, group]. Use full to include the sequence ID followed by group
                     ID. Use group to only have the group ID. Default=full
  --tip_label_size   Specify the tip label size in the output image. Default=2

Citation

Alexander G McFarland, Nolan W Kennedy, Carolyn E Mills, Danielle Tullman-Ercek, Curtis Huttenhower, Erica M Hartmann, Density-based binning of gene clusters to infer function or evolutionary history using GeneGrouper, Bioinformatics, 2021, btab752, https://doi.org/10.1093/bioinformatics/btab752

Contact

Please message me at alexandermcfarland2022@u.northwestern.edu

Follow me on twitter @alexmcfarland_!

Release history

- 1.0.3 (this version): Feb 12, 2022
- 1.0.2: Nov 18, 2021
- 1.0.1: Oct 13, 2021
- 1.0.0: Oct 11, 2021
- 0.0.3: Jun 24, 2021
- 0.0.2: Jun 18, 2021
- 0.0.1: May 28, 2021

Download files

Source Distribution

GeneGrouper-1.0.3.tar.gz (36.3 kB view details)

Uploaded Feb 12, 2022 Source

Built Distribution

GeneGrouper-1.0.3-py3-none-any.whl (42.0 kB view details)

Uploaded Feb 12, 2022 Python 3

File details

Details for the file GeneGrouper-1.0.3.tar.gz.

File metadata

Download URL: GeneGrouper-1.0.3.tar.gz
Upload date: Feb 12, 2022
Size: 36.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.0 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.0

File hashes

Hashes for GeneGrouper-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`f8eb0fb6f0bb19558ec49879612ee39a88aab9be68defab28d5b4ed6d6f90cec`
MD5	`335a17ebf09267c83ce8ca658b222657`
BLAKE2b-256	`269ba432e0124b851931e00c00871b667a06f318bc23c46edab6fb7eb24a6c64`

File details

Details for the file GeneGrouper-1.0.3-py3-none-any.whl.

File metadata

Download URL: GeneGrouper-1.0.3-py3-none-any.whl
Upload date: Feb 12, 2022
Size: 42.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.0 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.0

File hashes

Hashes for GeneGrouper-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6a9eaa60168ff2df05269566f13e00f1b72a0651a639ce00797999796dfd8ed`
MD5	`293242f621ea050d0f8178b86c11cc7b`
BLAKE2b-256	`1dc3fd3b2399781722a31bcb6b46f702d996e2fdc5f2b1ee8a096d3c89d2790d`
