Species tree construction from marker gene phylogenies
SGTree
SGTree is an end-to-end workflow for building phylogenetic trees. Use the bundled HMM sets, or supply your own, to find the proteins of interest. SGTree then reconciles each gene tree against an approximate species tree to select the most likely correct copy of a protein when a genome carries duplicates (paralogs, contamination).
Setup
Install the Pixi environment:
pixi install
The environment is managed through pixi.toml only.
Run
Primary interface (Nextflow):
pixi run sgtree --help
Basic run:
pixi run sgtree \
--genomedir <path to dir with protein faa files, one faa file per genome> \
--modeldir <path to marker set .hmm>
Example run:
pixi run sgtree \
--genomedir testgenomes/Chloroflexi \
--modeldir resources/models/UNI56.hmm
Marker-selection run with references and singleton filtering:
pixi run sgtree \
--genomedir testgenomes/Chloroflexi \
--modeldir resources/models/UNI56.hmm \
--outdir runs/nextflow/manual_full \
--marker_selection true \
--ref testgenomes/chlorref \
--singles yes
pixi run sgtree writes logs automatically to runs/nextflow/logs/.
Marker searches and --aln hmmalign are run with pyhmmer (HMMER-compatible search output).
Example with IQ-TREE and explicit HMM threshold mode:
pixi run sgtree \
--genomedir testgenomes/Chloroflexi \
--modeldir resources/models/UNI56.hmm \
--tree_method iqtree \
--iqtree_fast true \
--hmmsearch_cutoff cut_ga
Alternative (Python implementation without Nextflow):
pixi run sgtree-python testgenomes/Chloroflexi resources/models/UNI56.hmm --num_cpus 8
Backward-compatible wrapper:
pixi run python ./bin/sgtree_wrapper.py testgenomes/Chloroflexi resources/models/UNI56.hmm --num_cpus 8
Settings
Core method controls:
- --aln: hmmalign, mafft, or mafft-linsi (default hmmalign).
- --tree_method: fasttree or iqtree (default fasttree) for both the species tree and per-marker trees.
- --iqtree_fast: apply -fast when --tree_method iqtree (default true).
- --iqtree_model: IQ-TREE model string (default LG+F+I+G4).
HMM search thresholds:
- --hmmsearch_cutoff cut_ga: use model gathering cutoffs (recommended for curated marker sets such as UNI56).
- --hmmsearch_cutoff cut_tc: use model trusted cutoffs.
- --hmmsearch_cutoff cut_nc: use model noise cutoffs.
- --hmmsearch_cutoff evalue --hmmsearch_evalue <float>: use a plain E-value threshold.
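Since marker searches run through pyhmmer, the four modes above can be pictured as a small translation table. The sketch below is not SGTree's own code: the function name is hypothetical, and it assumes pyhmmer's search pipeline accepts `bit_cutoffs` ("gathering", "trusted", "noise") and `E` as keyword options.

```python
from typing import Optional


def cutoff_to_search_options(cutoff: str, evalue: Optional[float] = None) -> dict:
    """Translate an --hmmsearch_cutoff value into search keyword options.

    Hypothetical helper for illustration; the option names are assumed
    to match pyhmmer's pipeline parameters.
    """
    # Left: SGTree flag values. Right: labels for the cutoffs stored in each HMM.
    model_cutoffs = {"cut_ga": "gathering", "cut_tc": "trusted", "cut_nc": "noise"}
    if cutoff in model_cutoffs:
        return {"bit_cutoffs": model_cutoffs[cutoff]}
    if cutoff == "evalue":
        if evalue is None:
            raise ValueError("--hmmsearch_cutoff evalue needs --hmmsearch_evalue")
        return {"E": evalue}
    raise ValueError(f"unknown cutoff mode: {cutoff!r}")


if __name__ == "__main__":
    print(cutoff_to_search_options("cut_ga"))
    print(cutoff_to_search_options("evalue", 1e-5))
```

Model-defined cutoffs only work when the HMM file actually carries GA/TC/NC lines, which is why the E-value mode is the fallback for custom marker sets.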
Genome inclusion/exclusion criteria:
- --percent_models (default 10): minimum percentage of markers detected per genome.
- --max_sdup (default -1): maximum allowed copies of any single marker in one genome; -1 disables.
- --max_dupl (default -1): maximum allowed fraction of markers present in multiple copies; -1 disables.
- --lflt (default 0): optional per-marker length filter (% of median hit length).
- --num_nei (default 0): optional singleton-removal neighbor count override (0 keeps auto mode).
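To make the first three criteria concrete, here is a minimal sketch of how they could be applied to one genome's marker counts. It is an illustration under assumptions, not SGTree's implementation: the function name is hypothetical, the duplication fraction is computed over detected markers, and values exactly at a threshold are treated as passing.

```python
def keep_genome(marker_counts, n_models, percent_models=10, max_sdup=-1, max_dupl=-1):
    """Return True if a genome passes the inclusion criteria.

    marker_counts maps marker name -> number of hits in this genome.
    n_models is the total number of markers in the HMM set.
    """
    detected = [count for count in marker_counts.values() if count > 0]
    # --percent_models: share of the marker set found at least once
    if 100.0 * len(detected) / n_models < percent_models:
        return False
    # --max_sdup: cap on copies of any single marker (-1 disables)
    if max_sdup >= 0 and any(count > max_sdup for count in detected):
        return False
    # --max_dupl: cap on the fraction of multi-copy markers (-1 disables);
    # dividing by detected markers rather than n_models is an assumption
    if max_dupl >= 0 and detected:
        dup_fraction = sum(count > 1 for count in detected) / len(detected)
        if dup_fraction > max_dupl:
            return False
    return True


if __name__ == "__main__":
    counts = {"m1": 1, "m2": 3, "m3": 1, "m4": 0}
    print(keep_genome(counts, n_models=4))              # defaults: only percent_models applies
    print(keep_genome(counts, n_models=4, max_sdup=2))  # m2 has three copies
```

With the defaults, only the marker-coverage check is active, so a genome is dropped solely for carrying too few markers.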
nsgtree-style mapping:
- minmarker -> --percent_models (fraction mapped to percent).
- maxsdup -> --max_sdup.
- maxdupl -> --max_dupl.
- hmmsearch_cutoff -> --hmmsearch_cutoff and --hmmsearch_evalue.
- tmethod -> --tree_method.
- iq_* model controls -> --iqtree_model (and --iqtree_fast).
- mafftv / mafft -> --aln mafft or --aln mafft-linsi (or --aln hmmalign).
Practical selection guide:
- Curated marker sets (for example UNI56): start with --hmmsearch_cutoff cut_ga.
- Less curated/custom marker sets: start with --hmmsearch_cutoff evalue --hmmsearch_evalue 1e-5, then tighten if false positives appear.
- --aln hmmalign is the fastest stable default and keeps alignment behavior tied to each profile HMM.
- --aln mafft-linsi is slower but can help when marker-specific profile alignment is not desired.
- --tree_method fasttree is the quick default; --tree_method iqtree --iqtree_fast true is a practical higher-accuracy option.
- Typical inclusion presets:
  - Balanced: --percent_models 10 --max_sdup 2 --max_dupl 0.25
  - Strict: --percent_models 30 --max_sdup 1 --max_dupl 0.10
  - Relaxed: --percent_models 5 --max_sdup -1 --max_dupl -1
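For scripted runs, the guide above can be folded into a small command builder. This is only a convenience sketch: `PRESETS` and `build_command` are hypothetical names, and the flags and values are copied from the lists above.

```python
PRESETS = {
    "balanced": {"percent_models": 10, "max_sdup": 2, "max_dupl": 0.25},
    "strict": {"percent_models": 30, "max_sdup": 1, "max_dupl": 0.10},
    "relaxed": {"percent_models": 5, "max_sdup": -1, "max_dupl": -1},
}


def build_command(genomedir, modeldir, preset="balanced", curated=True):
    """Assemble a `pixi run sgtree` argument list for one inclusion preset."""
    cmd = ["pixi", "run", "sgtree", "--genomedir", genomedir, "--modeldir", modeldir]
    if curated:
        # Curated marker sets: rely on the gathering cutoffs stored in the models
        cmd += ["--hmmsearch_cutoff", "cut_ga"]
    else:
        # Custom marker sets: start from a plain E-value threshold
        cmd += ["--hmmsearch_cutoff", "evalue", "--hmmsearch_evalue", "1e-5"]
    for flag, value in PRESETS[preset].items():
        cmd += [f"--{flag}", str(value)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("testgenomes/Chloroflexi",
                                 "resources/models/UNI56.hmm", "strict")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.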
Input Requirements
Proteomes must be FASTA (*.faa). SGTree now normalizes all inputs internally to:
>IMG2684622718|2685462912
MLCAFAEEEAKIAETVGKVATELKVKKLLSDFATKEGEEHISTYNKIAMTAKAEGYADIEAMLCAFAEEEAKLQKL
Normalization behavior:
- Directory input (--genomedir <dir>): one proteome per *.faa; genome id is derived from filename stem.
- Single FASTA input (--genomedir <file>): if headers already contain genome|protein, the genome part is preserved.
- Headers and IDs are sanitized to avoid delimiter collisions.
- Malformed header joins (for example ...*>next_header) are repaired before parsing.
- Invalid amino-acid characters are replaced with X; * is removed.
- Header mapping is written as proteomes_header_map_<input>.tsv in --outdir.
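The directory-input case can be sketched in a few lines. This is an approximation of the behavior described above, not SGTree's actual code: the function names, the set of accepted residues (20 standard amino acids plus X), and the replacement of unsafe ID characters with `_` are all assumptions.

```python
import re
from pathlib import Path

# Assumed set of accepted residues; the real list may differ.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWYX")


def genome_id_from_path(path):
    """Directory input: the genome id is the filename stem of the .faa file."""
    return Path(path).stem


def normalize_record(genome_id, header, sequence):
    """Return (header, sequence) in the normalized >genome|protein form."""
    protein_id = header.lstrip(">").split()[0]
    # If the header already looks like genome|protein, keep only the protein part
    if "|" in protein_id:
        protein_id = protein_id.split("|", 1)[1]
    # Replace characters that could collide with the '|' delimiter
    genome_id = re.sub(r"[^A-Za-z0-9_.\-]", "_", genome_id)
    protein_id = re.sub(r"[^A-Za-z0-9_.\-]", "_", protein_id)
    # Drop stop-codon markers, then mask anything that is not a known residue
    cleaned = sequence.upper().replace("*", "")
    cleaned = "".join(aa if aa in VALID_AA else "X" for aa in cleaned)
    return f">{genome_id}|{protein_id}", cleaned


if __name__ == "__main__":
    gid = genome_id_from_path("testgenomes/Chloroflexi/IMG2684622718.faa")
    print(normalize_record(gid, ">2685462912 hypothetical protein", "MLC1FAEE*"))
```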
Output Structure
Nextflow output (--outdir):
<outdir>/
  tree.nwk
  tree_final.nwk                   # marker-selection mode
  tree_final.png                   # marker-selection mode
  marker_count_matrix.csv
  marker_count.txt                 # basic mode
  marker_counts.txt                # marker-selection mode
  marker_selection_rf_values.txt   # marker-selection mode
  color.txt
  log_genomes_removed.txt
  proteomes_header_map_<input>.tsv
Python output (--save_dir):
<save_dir>/
  tree.nwk or tree_final.nwk
  tree_final.png                   # marker-selection mode
  marker_count_matrix.csv
  marker_selection_rf_values.txt   # marker-selection mode
  log_genomes_removed.txt
  logfile_*.txt
  temp/
  *.zip
  itol/
Repository Structure
sgtree/
  sgtree/                  # Python package implementation
  bin/sgtree_wrapper.py    # backward-compatible wrapper
  main.nf                  # Nextflow entrypoint
  workflows/               # DSL2 workflow composition
  modules/                 # DSL2 process modules
  bin/                     # helper scripts and launch wrappers
  tests/
    regression_parity.py   # cross-engine parity checks
  resources/
    models/                # combined marker-set HMM files
  testgenomes/             # example query/reference data
  runs/                    # runtime outputs/work/logs (.gitkeep tracked)
  pixi.toml                # reproducible environment + tasks
  nextflow.config          # runtime defaults and CPU settings
Workflow
          +-------------------+
          |  Input Proteomes  |
          |   + HMM Models    |
          +---------+---------+
                    |
                    v
          +---------+---------+
          |     HMMSEARCH     |
          +---------+---------+
                    |
                    v
          +---------+---------+
          |  PARSE_HMMSEARCH  |
          |   marker matrix   |
          +---------+---------+
                    |
                    v
          +---------+---------+
          |   EXTRACT_SEQS    |
          +---------+---------+
                    |
                    v
          +---------+---------+
          | ALIGN (hmmalign/  |
          |   mafft/linsi)    |
          +---------+---------+
                    |
                    v
          +---------+---------+
          |  ELIM_DUPLICATES  |
          +---------+---------+
                    |
                    v
          +---------+---------+
          |      TRIMAL       |
          +---------+---------+
                    |
                    v
          +---------+---------+
          | BUILD_SUPERMATRIX |
          +---------+---------+
                    |
                    v
          +---------+---------+
          |   TREE_BUILDER    |
          |     tree.nwk      |
          +---------+---------+
                    |
            marker_selection?
               /         \
         no                  yes
          |                   |
          v                   v
   +------+------+   +--------+--------+
   |  iTOL TXT   |   |   per-marker    |
   |  marker_*   |   | TRIMAL+TREEBLD  |
   +-------------+   +--------+--------+
                              |
                              v
                      +-------+-------+
                      | RF_SELECTION  |
                      +-------+-------+
                              |
                          singles?
                         /         \
                   no                  yes
                    |                   |
                    v                   v
            +-------+-------+   +-------+-------+
            |  WRITE_CLEAN  |   |    REMOVE_    |
            |  ALIGNMENTS   |   |    SINGLES    |
            +-------+-------+   +-------+-------+
                      \               /
                        \           /
                          v       v
                      +---+-------+---+
                      | TRIMAL_FINAL  |
                      +-------+-------+
                              |
                              v
                      +-------+-------+
                      |  SUPERMATRIX  |
                      +-------+-------+
                              |
                              v
                      +-------+-------+
                      | TREE_BUILDER  |
                      |  tree_final   |
                      +-------+-------+
                              |
                              v
                   +----------+----------+
                   | tree_final.png      |
                   | marker_counts.txt   |
                   | marker_selection_rf |
                   +---------------------+
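The RF_SELECTION step and the marker_selection_rf_values.txt output suggest that per-marker trees are compared with the species tree by Robinson-Foulds (RF) distance; that reading is an assumption, and how SGTree turns the values into a selection is not described here. As a minimal, library-free sketch of the metric itself, the code below counts the splits that differ between two unrooted trees written as nested tuples of leaf names.

```python
def leaf_sets(tree, clades):
    """Return the leaves under `tree`, appending every internal clade to `clades`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, clades) for child in tree))
    clades.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree, each stored as one side."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)  # always store the side that lacks this leaf
    result = set()
    for clade in clades:
        side = clade if anchor not in clade else all_leaves - clade
        # Skip trivial splits (a single leaf versus everything else)
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


if __name__ == "__main__":
    species = ((("A", "B"), "C"), ("D", "E"))
    marker = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(species, marker))
```

A marker tree that agrees with the species tree scores 0, and each conflicting grouping adds to the distance, which is what makes the metric usable for ranking markers or candidate copies.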
Repository Hygiene
Use this command for a clean runtime workspace between runs:
pixi run clean-runtime
Authors and Contributors
| Author | Email | Date |
|---|---|---|
| Ewan Whittaker-Walker | ewanww@berkeley.edu | 05/19/2019 |
| Frederik Schulz | fschulz@lbl.gov | Since 2019 |
| Juan C. Villada | jvillada@lbl.gov | Since 2021 |
| Marianne Buscaglia | mbuscaglia@lbl.gov | Since 2022 |