Find and cluster genomic regions containing a seed gene

Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

GeneGrouper is a command-line tool that searches a set of genomes for gene clusters containing a gene of interest. All gene clusters are then binned into groups according to their similarity in gene content. Qualitative and quantitative outputs provide a population-level view of how gene cluster groups are distributed and how varied gene content is within a group.
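To make "similarity in gene content" more concrete, here is a minimal sketch that computes a Jaccard distance between the gene sets of two regions. It is an illustration only: the function name, the gene labels, and the use of plain Python sets are assumptions, and the code does not reproduce GeneGrouper's internal implementation.

```python
def jaccard_distance(genes_a, genes_b):
    """Return 1 - |A & B| / |A | B| for two collections of gene labels."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty regions are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical gene labels for two regions found around the same seed gene.
region_1 = ["seed", "geneA", "geneB", "geneC"]
region_2 = ["seed", "geneA", "geneD"]

distance = jaccard_distance(region_1, region_2)
print(f"Jaccard distance: {distance:.2f}")
```

A distance of 0 means the two regions share all of their genes and a distance of 1 means they share none, so regions with low pairwise distances are the natural candidates for the same group.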

See detailed explanation of overview figure

See an example application of GeneGrouper

See FAQs

Installation

pip install GeneGrouper

GeneGrouper has multiple dependencies. Either install them manually or follow our simple guide to create a self-contained conda environment for GeneGrouper.

See dependencies

See creating a conda environment with all GeneGrouper dependencies installed (recommended)

Inputs

GeneGrouper has two required inputs:

- A translated gene sequence in fasta format (with file extension .fasta/.txt)
- A folder containing RefSeq GenBank-format genomes (with the file extension .gbff). See options for how to download many RefSeq genomes at a time.
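Before building a database, it can help to confirm that both inputs look as described above. The following is a rough pre-flight check, not part of GeneGrouper: the function name `check_inputs` and the example paths are hypothetical, and the FASTA test only looks for a leading `>` header.

```python
from pathlib import Path


def check_inputs(query_file, genomes_dir):
    """Return a list of problems found with the two required inputs."""
    problems = []

    query = Path(query_file)
    if query.suffix not in {".fasta", ".txt"}:
        problems.append(f"query file should end in .fasta or .txt: {query.name}")
    elif not query.is_file():
        problems.append(f"query file not found: {query}")
    elif not query.read_text().lstrip().startswith(">"):
        problems.append("query file does not start with a FASTA header ('>')")

    genomes = Path(genomes_dir)
    if not genomes.is_dir():
        problems.append(f"genomes directory not found: {genomes}")
    elif not any(genomes.glob("*.gbff")):
        problems.append(f"no .gbff files found in: {genomes}")

    return problems


# Placeholder paths; replace them with your own query file and genomes folder.
for problem in check_inputs("/path/to/query_gene.fasta", "/path/to/gbff"):
    print("Problem:", problem)
```

An empty list means both inputs passed these basic checks; it does not guarantee that the genomes are valid RefSeq records.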

Basic usage

Use `build_database` to make a GeneGrouper database of your RefSeq .gbff genomes

GeneGrouper -g /path/to/gbff -d /path/to/main_directory \
build_database

Use `find_regions` to search for regions containing a gene of interest and output to a search-specific directory

GeneGrouper -d /path/to/main_directory -n search_results \
find_regions \
-f /path/to/query_gene.fasta

Use `visualize` to output visualizations of group gene architectures and their distribution within genomes and taxa

GeneGrouper -d /path/to/main_directory -n search_results \
visualize

Use `visualize --visual_type group` to inspect a GeneGrouper group more closely

GeneGrouper -d /path/to/main_directory -n search_results \
visualize \
--visual_type group
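The four commands above can also be chained from a script. The sketch below builds each command as an argument list using only the flags shown in this section, and runs them with `subprocess` when asked. The helper names and the placeholder paths are assumptions, and `run_pipeline` requires GeneGrouper to be installed and on your PATH.

```python
import subprocess


def genegrouper_commands(project_dir, genomes_dir, search_name, query_file):
    """Build the four basic-usage commands as argument lists."""
    base = ["GeneGrouper", "-d", project_dir, "-n", search_name]
    return [
        ["GeneGrouper", "-g", genomes_dir, "-d", project_dir, "build_database"],
        base + ["find_regions", "-f", query_file],
        base + ["visualize"],
        base + ["visualize", "--visual_type", "group"],
    ]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Placeholder paths; replace them with your own before calling run_pipeline().
commands = genegrouper_commands(
    project_dir="/path/to/main_directory",
    genomes_dir="/path/to/gbff",
    search_name="search_results",
    query_file="/path/to/query_gene.fasta",
)
for cmd in commands:
    print(" ".join(cmd))
```

Passing argument lists rather than a single shell string avoids quoting problems when paths contain spaces.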

See advanced usage examples

See tutorial with provided example data

Outputs

Each `find_regions` search outputs:

Four tabular files with quantitative and qualitative descriptions of grouping results.

One fasta file containing all genes used in the analysis.

visualize outputs:

Three main visualizations if `visualize --visual_type main` is used.

One additional visualization per group if `visualize --visual_type group --group_label <n>` is used, where <n> is the integer identifier of the group.

See complete output file descriptions

Each search and visualization will have the following file structure. Files under visualizations may differ.

├── main_directory
│   ├── search_results
│   │   ├── group_statistics_summmary.csv
│   │   ├── representative_group_member_summary.csv
│   │   ├── group_taxa_summary.csv
│   │   ├── group_regions.csv
│   │   ├── group_region_seqs.faa
│   │   ├── visualizations
│   │   │   ├── group_summary.png
│   │   │   ├── groups_by_taxa.png
│   │   │   ├── taxa_searched.png
│   │   │   ├── inspect_group_-1.png
│   │   │   ├── representative_seed_phylogeny.png
│   │   ├── internal_data
│   │   ├── seed_results.db
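As a quick sanity check after a search, a script can confirm that the tabular and fasta outputs listed in the tree above are present. This is a minimal sketch: the helper `missing_outputs` is hypothetical, the file names are copied from the tree as shown, and the visualization files are skipped because they may differ between runs.

```python
from pathlib import Path

# File names copied from the directory tree above.
EXPECTED_FILES = [
    "group_statistics_summmary.csv",
    "representative_group_member_summary.csv",
    "group_taxa_summary.csv",
    "group_regions.csv",
    "group_region_seqs.faa",
]


def missing_outputs(project_dir, search_name):
    """Return the expected output files that are absent from a search directory."""
    search_dir = Path(project_dir) / search_name
    return [name for name in EXPECTED_FILES if not (search_dir / name).is_file()]


# Placeholder paths; replace them with your own project and search names.
for name in missing_outputs("/path/to/main_directory", "search_results"):
    print("Missing:", name)
```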

Usage options

Global flags

usage: GeneGrouper [-h] [-d] [-n] [-g] [-t]
                   {build_database,find_regions,visualize} ...

  -h, --help            show this help message and exit
  -d , --project_directory
                        Main directory to contain the base files used for
                        region searching and clustering. Default=current
                        directory.
  -n , --search_name    Name of the directory to contain search-specific
                        results. Default=region_search
  -g , --genomes_directory
                        Directory containing genbank-file format genomes with
                        the suffix .gbff. Default=./genomes.
  -t , --threads        Number of threads to use. Default=all threads.

Subcommands

    build_database      Convert a set of genomes into a useable format for
                        GeneGrouper
    find_regions        Find regions given a translated gene and a set of
                        genomes
    visualize           Visualize GeneGrouper outputs. Three visualization options are provided.
                        Check the --visual_type help description.

Subcommand flags

build_database

usage: GeneGrouper build_database [-h]

  -h, --help  show this help message and exit

find_regions

usage: GeneGrouper find_regions [-h] -f  [-us] [-ds] [-i] [-c] [-hk] [--min_group_size] [-re] [--force]

  -h, --help            show this help message and exit
  -f , --query_file     Provide the absolute path to a fasta file containing a translated gene sequence.
  -us , --upstream_search
                        Upstream search length in basepairs. Default=10000
  -ds , --downstream_search
                        Downstream search length in basepairs. Default=10000
  -i , --seed_identity
                        Identity cutoff for initial blast search. Default=60
  -c , --seed_coverage
                        Coverage cutoff for initial blast search. Default=90
  -hk , --seed_hits_kept
                        Number of blast hits to keep. Default=None
  --min_group_size
                        The minimum number of gene regions to constitute a group. Default=ln(jaccard distance length)
  -re , --recluster_iterations
                        Number of region re-clustering attempts after the initial clustering. Default=0
  --force               Flag to overwrite search name directory.
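To illustrate how the search-length and cutoff flags interact, the sketch below computes a search window around a seed hit and applies identity and coverage cutoffs, using the defaults listed above. It is not GeneGrouper's code. The function names, the 0-based coordinates, the clipping to contig ends, the strand handling, and the inclusive (`>=`) comparisons are all assumptions made for the example.

```python
def search_window(hit_start, hit_end, contig_length,
                  upstream=10000, downstream=10000, strand=1):
    """Return (start, end) of the region to extract around a seed hit.

    Assumes 0-based coordinates with hit_start < hit_end. On the minus
    strand (strand=-1), upstream lies toward higher coordinates.
    """
    if strand == 1:
        left, right = hit_start - upstream, hit_end + downstream
    else:
        left, right = hit_start - downstream, hit_end + upstream
    # Clip the window so it never extends past the contig ends.
    return max(0, left), min(contig_length, right)


def passes_cutoffs(identity, coverage, min_identity=60, min_coverage=90):
    """Keep a blast hit only if it meets both percentage cutoffs."""
    return identity >= min_identity and coverage >= min_coverage


# Hypothetical hit near the end of a 20 kb contig: the window is clipped.
window = search_window(15000, 16200, contig_length=20000)
print(window)

# Hypothetical hits: one passes both cutoffs, one fails on coverage.
print(passes_cutoffs(identity=72.5, coverage=95.0))
print(passes_cutoffs(identity=72.5, coverage=80.0))
```

Under these assumptions, raising `-us` and `-ds` widens the window and pulls in more neighboring genes, while raising `-i` and `-c` reduces the number of seed hits that start a region.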

visualize

usage: GeneGrouper visualize [-h] [--visual_type] [--group_label]

  --visual_type      Choices: [main, group, tree]. Use main for main visualizations. Use group to
                     inspect a specific group. Use tree for a phylogenetic tree of representative
                     seed sequences. Default=main
  --group_label      The integer identifier of the group you wish to inspect. Default=-1
  --image_format     Choices: [png, svg]. Output image format. Use svg if you want to edit the
                     images. Default=png.
  --tip_label_type   Choices: [full, group]. Use full to include the sequence ID followed by group
                     ID. Use group to only have the group ID. Default=full
  --tip_label_size   Specify the tip label size in the output image. Default=2

Citation

Density-based binning of gene clusters to infer function or evolutionary history using GeneGrouper

Alexander G McFarland, Nolan W Kennedy, Carolyn E Mills, Danielle Tullman-Ercek, Curtis Huttenhower, Erica M Hartmann

bioRxiv 2021.05.27.446007; doi: https://doi.org/10.1101/2021.05.27.446007

Contact

Feel free to message me at alexandermcfarland2022@u.northwestern.edu or follow me on Twitter @alexmcfarland_!

Release history

- 1.0.3 (Feb 12, 2022)
- 1.0.2 (Nov 18, 2021)
- 1.0.1 (Oct 13, 2021), this version
- 1.0.0 (Oct 11, 2021)
- 0.0.3 (Jun 24, 2021)
- 0.0.2 (Jun 18, 2021)
- 0.0.1 (May 28, 2021)

Download files

Source Distribution

GeneGrouper-1.0.1.tar.gz (34.6 kB), uploaded Oct 13, 2021

Built Distribution

GeneGrouper-1.0.1-py3-none-any.whl (40.7 kB), uploaded Oct 13, 2021, Python 3

Hashes for GeneGrouper-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`e552120a5a4de618afac21ef5eada39bc23c199f8f25ddb2f18a93c37e8d20b4`
MD5	`15a3ede061afe313d9be970c4d7f2e5e`
BLAKE2b-256	`ba0437c0d741e4a0c26919727091d817d1dc07ef46e69523944c290e2cc2c6c2`

Hashes for GeneGrouper-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`046e1d0997718164459a6ff3c7a5ef8cfa8212343352cfafd05eb2bfd3da3e0d`
MD5	`a98643a0816f1d1eacfe5577db395683`
BLAKE2b-256	`8562f4fe7171fd1c8d4b1552a33c761ae59f8a0c0feae330f6a485f101039f34`
