PyamilySeq - A tool to look for sequence-based gene families identified by clustering methods such as CD-HIT, DIAMOND, BLAST or MMseqs2.
PyamilySeq - !BETA!
PyamilySeq (Family Seek) is a Python tool for clustering gene sequences into families based on sequence similarity identified by tools such as CD-HIT, BLAST, DIAMOND or MMseqs2. This work is an extension of the gene family / pangenome tool developed for the StORF-Reporter publication in NAR (https://doi.org/10.1093/nar/gkad814).
Features
- End-to-End: PyamilySeq can take a directory of GFF+FASTA files, run CD-HIT for clustering and process the results.
- Clustering: Supports input from CD-HIT formatted files as well as CSV and TSV edge lists (-outfmt 6 from BLAST/DIAMOND).
- Reclustering: Allows for the addition of new sequences post-initial clustering.
- Output: Generates a Roary/Panaroo-formatted gene presence-absence CSV file for downstream analysis.
- Align representative sequences using MAFFT.
- Output concatenated aligned sequences for downstream analysis.
- Optionally output sequences of identified families.
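To illustrate the kind of clustering input listed above, here is a minimal sketch of reading a CD-HIT `.clstr` file into clusters. It is not PyamilySeq's own parser; the function name `parse_clstr` and the sequence IDs in the example are hypothetical, and the sketch only assumes the standard CD-HIT layout (a `>Cluster N` header followed by member lines containing `>ID...`).

```python
import re
from collections import defaultdict

# Matches the sequence ID in a member line such as:
# "0\t2799aa, >genomeA|gene_1... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster_number: [sequence_id, ...]} from CD-HIT .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 0"
            current = int(line.split()[1])
        else:
            match = MEMBER_RE.search(line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)


# Hypothetical example content (IDs are made up for illustration).
example = """>Cluster 0
0\t2799aa, >genomeA|gene_1... *
1\t2214aa, >genomeB|gene_7... at 80.04%
>Cluster 1
0\t150aa, >genomeA|gene_2... *
""".splitlines()

clusters = parse_clstr(example)
print(clusters)
```

Each cluster maps to the list of sequence IDs it contains, which is the starting point for counting how many genomes a family appears in.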
Installation
PyamilySeq requires Python 3.6 or higher. Install using pip:
pip install PyamilySeq
Usage - Menu
usage: PyamilySeq.py [-h] -run_mode {Full,Partial} -group_mode {Species,Genus}
-clust_tool {CD-HIT} -output_dir OUTPUT_DIR
[-input_type {separate,combined}] [-input_dir INPUT_DIR]
[-name_split NAME_SPLIT] [-pid PIDENT]
[-len_diff LEN_DIFF] [-cluster_file CLUSTER_FILE]
[-reclustered RECLUSTERED] [-seq_tag SEQUENCE_TAG]
[-groups CORE_GROUPS] [-w WRITE_FAMILIES] [-con CON_CORE]
[-original_fasta ORIGINAL_FASTA]
[-gpa GENE_PRESENCE_ABSENCE_OUT] [-verbose {True,False}]
[-v]
PyamilySeq v0.5.0: PyamilySeq Run Parameters.
options:
-h, --help show this help message and exit
Required Arguments:
-run_mode {Full,Partial}
Run Mode: Should PyamilySeq be run in "Full" or
"Partial" mode?
-group_mode {Species,Genus}
Group Mode: Should PyamilySeq be run in "Species" or
"Genus" mode?
-clust_tool {CD-HIT} Clustering tool to use: CD-HIT, DIAMOND, BLAST or
MMseqs2.
-output_dir OUTPUT_DIR
Directory for all output files.
Full-Mode Arguments - Required when "-run_mode Full" is used:
-input_type {separate,combined}
Type of input files: 'separate' for separate FASTA and
GFF files, 'combined' for GFF files with embedded
FASTA sequences.
-input_dir INPUT_DIR Directory containing GFF/FASTA files.
-name_split NAME_SPLIT
substring used to split the filename and extract the
genome name ('_combined.gff3' or '.gff').
-pid PIDENT Default 0.95: Pident threshold for clustering.
-len_diff LEN_DIFF Default 0.80: Minimum length difference between
clustered sequences - (-s) threshold for CD-HIT
clustering.
Partial-Mode Arguments - Required when "-run_mode Partial" is used:
-cluster_file CLUSTER_FILE
Clustering output file containing CD-HIT, TSV or CSV
Edge List
Grouping Arguments - Use to fine-tune grouping of genes after clustering:
-reclustered RECLUSTERED
Clustering output file from secondary round of
clustering
-seq_tag SEQUENCE_TAG
Default - "StORF": Unique identifier to be used to
distinguish the second of two rounds of clustered
sequences
-groups CORE_GROUPS Default - ('99,95,15'): Gene family groups to use
Output Parameters:
-w WRITE_FAMILIES Default - No output: Output sequences of identified
families (provide levels at which to output "-w 99,95")
- Must provide FASTA file with -original_fasta
-con CON_CORE Default - No output: Output aligned and concatenated
sequences of identified families - used for MSA
(provide levels at which to output "-w 99,95") - Must
provide FASTA file with -original_fasta
-original_fasta ORIGINAL_FASTA
FASTA file to use in conjunction with "-w" or "-con"
when running in Partial Mode.
-gpa GENE_PRESENCE_ABSENCE_OUT
Default - False: If selected, a Roary formatted
gene_presence_absence.csv will be created - Required
for Coinfinder and other downstream tools
Misc:
-verbose {True,False}
Default - False: Print out runtime messages
-v Default - False: Print out version number and exit
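The `-cluster_file` option above accepts TSV or CSV edge lists such as BLAST/DIAMOND `-outfmt 6` output. As a rough sketch of how such an edge list could be turned into groups, the code below links sequences that share a hit above an identity threshold (single-linkage, via union-find). This is an assumption for illustration, not a description of how PyamilySeq groups sequences; `cluster_edges` and the sequence IDs are hypothetical. Note that `pident` in outfmt 6 is a percentage (0 to 100).

```python
import csv
import io


def cluster_edges(handle, min_pident=90.0):
    """Group sequence IDs connected by hits with pident >= min_pident.

    Expects tab-separated rows whose first three columns are
    qseqid, sseqid and pident (as in BLAST/DIAMOND outfmt 6).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue
        query, subject, pident = row[0], row[1], float(row[2])
        # Register both IDs so sequences with only weak hits stay as singletons.
        root_q, root_s = find(query), find(subject)
        if pident >= min_pident and root_q != root_s:
            parent[root_s] = root_q

    groups = {}
    for seq in parent:
        groups.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical outfmt 6 rows (12 columns; only the first three are used).
edges = io.StringIO(
    "gA_1\tgB_1\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550\n"
    "gA_2\tgB_2\t60.0\t120\t48\t0\t1\t120\t1\t120\t1e-20\t90\n"
    "gA_1\tgC_1\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-140\t520\n"
)
groups = cluster_edges(edges, min_pident=90.0)
print(groups)
```

Single-linkage grouping is the simplest choice and can chain loosely related sequences together; a stricter rule may be preferable for real data.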
Examples: Below are two examples of running PyamilySeq in its two main modes.
'Full Mode': PyamilySeq clusters the sequences itself as part of the run.
PyamilySeq -run_mode Full -group_mode Species -clust_tool CD-HIT -output_dir .../testing -input_type combined -input_dir .../genomes -name_split _combined.gff3 -pid 0.90 -len_diff 0.60
'Partial Mode': PyamilySeq takes the output of a sequence clustering run performed beforehand.
PyamilySeq -run_mode Partial -group_mode Species -output_dir .../test_data/testing -cluster_file .../test_data/CD-HIT/combined_Ensmbl_pep_CD_90_60.clstr -clust_tool CD-HIT -original_fasta .../test_data/combined_Ensmbl_cds.fasta -gpa True -con True -w 99 -verbose True
Calculating Groups
Gene Groups:
first_core_99: 3103
first_core_95: 0
first_core_15: 3217
first_core_0: 4808
Total Number of Gene Groups (Including Singletons): 11128
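The group counts above correspond to the `-groups` levels ('99,95,15') plus a remaining 0 level. The exact rule PyamilySeq applies is not stated here, so the following is only a sketch under one plausible assumption: each gene family is placed in the highest level whose percentage is met by the share of genomes containing that family. The function `bin_families` and the example data are hypothetical.

```python
def bin_families(families, n_genomes, groups="99,95,15"):
    """Count families per level, assuming levels are percentages of genomes.

    families: {family_id: iterable of genome names containing the family}
    Returns {level: number_of_families}, with an implicit level 0.
    """
    thresholds = sorted({int(g) for g in groups.split(",")} | {0}, reverse=True)
    counts = {t: 0 for t in thresholds}
    for genomes in families.values():
        percent = 100.0 * len(set(genomes)) / n_genomes
        # Assign to the first (highest) level the family satisfies.
        for t in thresholds:
            if percent >= t:
                counts[t] += 1
                break
    return counts


# Hypothetical data: 10 genomes named g0..g9.
all_genomes = [f"g{i}" for i in range(10)]
families = {
    "famA": all_genomes,       # present in 100% of genomes
    "famB": all_genomes[:5],   # present in 50%
    "famC": all_genomes[:1],   # present in 10%
}
counts = bin_families(families, n_genomes=10)
print(counts)
```

Under this assumption a level with no families, such as `first_core_95: 0` in the output above, simply means no family fell between 95% and 99% of genomes.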
Seq-Combiner: This tool pre-processes multiple GFF/FASTA files into a single combined file, ready to be clustered by the user.
Example:
Seq-Combiner -input_dir .../test_data/genomes -name_split _combined.gff3 -output_dir .../test_data -output_name combine_fasta_seqs.fa -input_type combined
Seq-Combiner Menu:
usage: Seq_Combiner.py [-h] -input_dir INPUT_DIR -input_type {separate,combined} -name_split NAME_SPLIT -output_dir OUTPUT_DIR -output_name OUTPUT_FILE
Seq-Combiner v0.5.0: Seq-Combiner Run Parameters.
options:
-h, --help show this help message and exit
Required Arguments:
-input_dir INPUT_DIR Directory location where the files are located.
-input_type {separate,combined}
Type of input files: 'separate' for separate FASTA and GFF files, 'combined' for GFF files with embedded FASTA sequences.
-name_split NAME_SPLIT
substring used to split the filename and extract the genome name ('_combined.gff3' or '.gff').
-output_dir OUTPUT_DIR
Directory for all output files.
-output_name OUTPUT_FILE
Output file name.
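As a rough illustration of two of the arguments above, the sketch below shows how a genome name could be derived by splitting a filename on the `-name_split` substring, and how the FASTA section of a 'combined' GFF file (everything after the `##FASTA` directive) could be read. This is not Seq-Combiner's code; `genome_name`, `embedded_fasta`, the filename and the GFF content are all hypothetical.

```python
from pathlib import Path


def genome_name(filename, name_split):
    """Return the part of the file's basename before the name_split substring."""
    return Path(filename).name.split(name_split)[0]


def embedded_fasta(gff_text):
    """Return {sequence_id: sequence} from the section after '##FASTA'."""
    records = {}
    in_fasta = False
    current = None
    for line in gff_text.splitlines():
        line = line.strip()
        if line == "##FASTA":
            in_fasta = True
        elif in_fasta and line.startswith(">"):
            # Keep only the ID, dropping any description after whitespace.
            current = line[1:].split()[0]
            records[current] = []
        elif in_fasta and line and current is not None:
            records[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Hypothetical combined GFF content with one feature and one contig.
gff = (
    "##gff-version 3\n"
    "contig1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=gene1\n"
    "##FASTA\n"
    ">contig1 example contig\n"
    "ATGAAA\n"
    "TAA\n"
)

name = genome_name("E_coli_combined.gff3", "_combined.gff3")
sequences = embedded_fasta(gff)
print(name, sequences)
```

The embedded sequences here are whole contigs; extracting individual gene sequences would additionally require the feature coordinates from the GFF lines.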