PyamilySeq - A tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.

PyamilySeq - !BETA!

PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2. This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).

Features

  • End-to-End: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
  • Clustering: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
  • Reclustering: Allows for the addition of new sequences post-initial clustering.
  • Output: Generates a 'Roary/Panaroo' formatted gene presence-absence CSV file for downstream analysis.
    • Align representative sequences using MAFFT.
    • Output concatenated aligned sequences for downstream analysis.
    • Optionally output sequences of identified families.
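As an illustration of the CD-HIT input mentioned in the feature list, the sketch below reads the contents of a .clstr file into families. It is a minimal example and not PyamilySeq's own parser: the function name parse_clstr and the sequence IDs are hypothetical, and it assumes the usual CD-HIT layout in which each cluster starts with a ">Cluster N" line and the representative member line ends with "*".

```python
import re

# Matches the sequence ID in a member line such as:
#   "1\t298aa, >genomeB|gene_014... at 98.50%"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster_number: [sequence_ids]}, representative first."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
            continue
        match = MEMBER_RE.search(line)
        if match is None or current is None:
            continue  # ignore anything that is not a member line
        seq_id = match.group(1)
        if line.endswith("*"):
            clusters[current].insert(0, seq_id)  # representative goes first
        else:
            clusters[current].append(seq_id)
    return clusters


# Hypothetical .clstr content, used only for illustration.
example = """>Cluster 0
0\t300aa, >genomeA|gene_001... *
1\t298aa, >genomeB|gene_014... at 98.50%
>Cluster 1
0\t120aa, >genomeA|gene_002... *
""".splitlines()

families = parse_clstr(example)
print(families)
```

Each value in the returned dictionary is one candidate gene family; singletons appear as lists with a single ID.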

Installation

PyamilySeq requires Python 3.6 or higher. Install using pip:

pip install PyamilySeq

Usage - Menu

usage: PyamilySeq.py [-h] -run_mode {Full,Partial} -group_mode {Species,Genus}
                     -clust_tool {CD-HIT} -output_dir OUTPUT_DIR
                     [-input_type {separate,combined}] [-input_dir INPUT_DIR]
                     [-name_split NAME_SPLIT] [-pid PIDENT]
                     [-len_diff LEN_DIFF] [-cluster_file CLUSTER_FILE]
                     [-reclustered RECLUSTERED] [-seq_tag SEQUENCE_TAG]
                     [-groups CORE_GROUPS] [-w WRITE_FAMILIES] [-con CON_CORE]
                     [-original_fasta ORIGINAL_FASTA]
                     [-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}]
                     [-v]

PyamilySeq v0.5.1: PyamilySeq Run Parameters.

options:
  -h, --help            show this help message and exit

Required Arguments:
  -run_mode {Full,Partial}
                        Run Mode: Should PyamilySeq be run in "Full" or
                        "Partial" mode?
  -group_mode {Species}
                        Group Mode: Should PyamilySeq be run in "Species" or
                        "Genus" mode?  - Genus mode not currently functioning
  -clust_tool {CD-HIT}  Clustering tool to use: CD-HIT, DIAMOND, BLAST or
                        MMseqs2.
  -output_dir OUTPUT_DIR
                        Directory for all output files.

Full-Mode Arguments - Required when "-run_mode Full" is used:
  -input_type {separate,combined}
                        Type of input files: 'separate' for separate FASTA and
                        GFF files, 'combined' for GFF files with embedded
                        FASTA sequences.
  -input_dir INPUT_DIR  Directory containing GFF/FASTA files.
  -name_split NAME_SPLIT
                        substring used to split the filename and extract the
                        genome name ('_combined.gff3' or '.gff').
  -pid PIDENT           Default 0.95: Pident threshold for clustering.
  -len_diff LEN_DIFF    Default 0.80: Minimum length difference between
                        clustered sequences - (-s) threshold for CD-HIT
                        clustering.

Partial-Mode Arguments - Required when "-run_mode Partial" is used:
  -cluster_file CLUSTER_FILE
                        Clustering output file containing CD-HIT, TSV or CSV
                        Edge List

Grouping Arguments - Use to fine-tune grouping of genes after clustering:
  -reclustered RECLUSTERED
                        Clustering output file from secondary round of
                        clustering
  -seq_tag SEQUENCE_TAG
                        Default - "StORF": Unique identifier to be used to
                        distinguish the second of two rounds of clustered
                        sequences
  -groups CORE_GROUPS   Default - ('99,95,15'): Gene family groups to use

Output Parameters:
  -w WRITE_FAMILIES     Default - No output: Output sequences of identified
                        families (provide levels at which to output "-w 99,95")
                        - Must provide FASTA file with -original_fasta
  -con CON_CORE         Default - No output: Output aligned and concatenated
                        sequences of identified families - used for MSA
                        (provide levels at which to output "-w 99,95") - Must
                        provide FASTA file with -original_fasta
  -original_fasta ORIGINAL_FASTA
                        FASTA file to use in conjunction with "-w" or "-con"
                        when running in Partial Mode.
  -gpa GENE_PRESENCE_ABSENCE_OUT
                        Default - False: If selected, a Roary formatted
                        gene_presence_absence.csv will be created - Required
                        for Coinfinder and other downstream tools

Misc:
  -verbose {True,False}
                        Default - False: Print out runtime messages
  -v                    Default - False: Print out version number and exit
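The -cluster_file option above accepts a TSV or CSV edge list as well as CD-HIT output. One way to turn such an edge list into families is to treat each row as a link between its first two columns and take connected components, as sketched below. This is an assumption for illustration only: PyamilySeq may interpret edge lists differently (for example, by treating the first column as a cluster representative), and the function name families_from_edges and the sequence IDs are hypothetical.

```python
import csv
import io


def families_from_edges(handle, delimiter="\t"):
    """Group sequence IDs into families via connected components (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        root_a, root_b = find(row[0]), find(row[1])
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for seq_id in list(parent):
        families.setdefault(find(seq_id), set()).add(seq_id)
    return sorted(sorted(members) for members in families.values())


# Hypothetical rows in the style of BLAST/DIAMOND -outfmt 6 (query, subject, pident, ...).
edges = "g1\tg2\t99.0\ng2\tg3\t97.5\ng4\tg4\t100.0\n"
result = families_from_edges(io.StringIO(edges))
print(result)
```

For a CSV edge list, the same function can be called with delimiter=",".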

Examples: Below are two examples of running PyamilySeq in its two main modes.

'Full Mode': PyamilySeq runs the sequence clustering itself as part of the run

PyamilySeq -run_mode Full -group_mode Species -output_dir ../../test_data/testing -input_type combined -input_dir .../test_data/genomes -name_split _combined.gff3 -pid 0.99 -len_diff 0.99 -clust_tool CD-HIT -gpa True -con True -w 99 -verbose True
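To show how the -pid and -len_diff values in the command above could relate to a CD-HIT call, the sketch below assembles a cd-hit command line without running it. The help text states that -len_diff corresponds to CD-HIT's -s option; mapping -pid to CD-HIT's -c identity threshold is an assumption, as are the function name build_cd_hit_command and the file names. The command PyamilySeq actually issues may include other options.

```python
import shlex


def build_cd_hit_command(fasta_in, out_prefix, pident=0.95, len_diff=0.80):
    """Assemble (but do not execute) a cd-hit command as an argument list."""
    return [
        "cd-hit",
        "-i", fasta_in,           # input FASTA file
        "-o", out_prefix,         # output prefix; cd-hit also writes <prefix>.clstr
        "-c", f"{pident:.2f}",    # sequence identity threshold (assumed mapping of -pid)
        "-s", f"{len_diff:.2f}",  # length difference cutoff (stated mapping of -len_diff)
        "-d", "0",                # keep full sequence names in the .clstr file
    ]


# Hypothetical file names, matching the thresholds used in the Full Mode example.
cmd = build_cd_hit_command("combined_sequences.fasta", "clustering_out", 0.99, 0.99)
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps it safe to pass to subprocess.run without shell quoting issues.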

'Partial Mode': PyamilySeq takes the output of a sequence clustering that has already been run

PyamilySeq -run_mode Partial -group_mode Species -output_dir .../test_data/testing -cluster_file .../test_data/CD-HIT/combined_Ensmbl_pep_CD_90_60.clstr -clust_tool CD-HIT -original_fasta .../test_data/combined_Ensmbl_cds.fasta -gpa True -con True -w 99 -verbose True
Calculating Groups
Gene Groups:
first_core_99: 3103
first_core_95: 0
first_core_15: 3217
first_core_0: 4808
Total Number of Gene Groups (Including Singletons): 11128
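The group labels in the output above follow the -groups thresholds (99, 95, 15) plus a final catch-all at 0. The sketch below shows one plausible way such counts could be produced. It assumes that each threshold is the minimum percentage of genomes in which a family must be present, and that a family is counted only in the highest group it reaches. Both points are assumptions rather than documented behaviour, and the function name count_core_groups and the toy data are hypothetical.

```python
def count_core_groups(family_to_genomes, total_genomes, thresholds=(99, 95, 15)):
    """Count families per group, binning by percentage of genomes present."""
    cutoffs = sorted(set(thresholds) | {0}, reverse=True)  # e.g. [99, 95, 15, 0]
    counts = {f"first_core_{cutoff}": 0 for cutoff in cutoffs}
    for genomes in family_to_genomes.values():
        percent = 100.0 * len(set(genomes)) / total_genomes
        for cutoff in cutoffs:
            if percent >= cutoff:
                counts[f"first_core_{cutoff}"] += 1
                break  # count each family once, in its highest group
    return counts


# Toy data: 10 hypothetical genomes named g0..g9.
all_genomes = [f"g{i}" for i in range(10)]
toy_families = {
    "famA": all_genomes,      # present in 100% of genomes
    "famB": all_genomes[:5],  # present in 50%
    "famC": all_genomes[:1],  # present in 10%
}
group_counts = count_core_groups(toy_families, total_genomes=len(all_genomes))
print(group_counts)
```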

Seq-Combiner: This tool pre-processes multiple GFF/FASTA files into a single combined file, ready for the user to cluster

Example:

Seq-Combiner -input_dir .../test_data/genomes -name_split _combined.gff3 -output_dir .../test_data -output_name combine_fasta_seqs.fa -input_type combined
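The -name_split value in the command above is described as the substring used to split the filename and extract the genome name. A minimal sketch of that idea follows. It assumes the genome name is whatever precedes the first occurrence of the substring; the function name genome_name and the file names are hypothetical.

```python
import os


def genome_name(path, name_split):
    """Return the part of the filename that precedes the name_split substring."""
    filename = os.path.basename(path)
    return filename.split(name_split)[0]


# Hypothetical file names, using the two substrings given in the help text.
name_a = genome_name("genomes/Ecoli_K12_combined.gff3", "_combined.gff3")
name_b = genome_name("genomes/Ecoli_K12.gff", ".gff")
print(name_a, name_b)  # Ecoli_K12 Ecoli_K12
```

If the substring does not occur in the filename, this sketch returns the whole filename unchanged, so choosing a -name_split value that matches every input file matters.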

Seq-Combiner Menu:

usage: Seq_Combiner.py [-h] -input_dir INPUT_DIR -input_type {separate,combined} -name_split NAME_SPLIT -output_dir OUTPUT_DIR -output_name OUTPUT_FILE

Seq-Combiner v0.5.1: Seq-Combiner Run Parameters.

options:
  -h, --help            show this help message and exit

Required Arguments:
  -input_dir INPUT_DIR  Directory location where the files are located.
  -input_type {separate,combined}
                        Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
  -name_split NAME_SPLIT
                        substring used to split the filename and extract the genome name ('_combined.gff3' or '.gff').
  -output_dir OUTPUT_DIR
                        Directory for all output files.
  -output_name OUTPUT_FILE
                        Output file name.
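For the 'combined' input type, the GFF file carries its sequences after a ##FASTA directive, as defined by the GFF3 format. The sketch below separates the annotation rows from the embedded sequences. It is only an illustration of the file layout and not Seq-Combiner's implementation; the function name split_combined_gff and the example record are hypothetical.

```python
def split_combined_gff(text):
    """Split GFF3 text with an embedded ##FASTA section into (features, sequences)."""
    features = []
    sequences = {}
    in_fasta = False
    current = None
    for line in text.splitlines():
        if line.startswith("##FASTA"):
            in_fasta = True
            continue
        if in_fasta:
            if line.startswith(">"):
                header = line[1:].split()
                current = header[0] if header else None
                if current is not None:
                    sequences[current] = []
            elif current is not None and line.strip():
                sequences[current].append(line.strip())
        elif line and not line.startswith("#"):
            features.append(line.split("\t"))  # the nine GFF3 columns
    joined = {name: "".join(parts) for name, parts in sequences.items()}
    return features, joined


# Hypothetical single-gene example.
example_gff = (
    "##gff-version 3\n"
    "contig1\tsource\tCDS\t1\t9\t.\t+\t0\tID=gene_001\n"
    "##FASTA\n"
    ">contig1\n"
    "ATGAAATAA\n"
)
features, sequences = split_combined_gff(example_gff)
print(len(features), sequences)
```

Extracting each gene's sequence would then use the start, end and strand columns of a feature row to slice the matching contig.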
