Species tree construction from marker gene phylogenies

SGTree

SGTree is an end-to-end workflow for building phylogenetic trees. Use one of the provided HMM sets, or supply your own HMMs, to find the proteins of interest. When a marker occurs in more than one copy in a genome (paralogs or contamination), SGTree reconciles each gene tree against an approximate species tree and selects the copy that is most likely correct.

Setup

Install the Pixi environment:

pixi install

The environment is managed through pixi.toml only.

Run

Primary interface (Nextflow):

pixi run sgtree --help

Basic run:

pixi run sgtree \
  --genomedir <path to dir with protein faa files, one faa file per genome> \
  --modeldir <path to marker set .hmm>

Example run:

pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm

Marker-selection run with references and singleton filtering:

pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm \
  --outdir runs/nextflow/manual_full \
  --marker_selection true \
  --ref testgenomes/chlorref \
  --singles yes

Logs from pixi run sgtree are written automatically to runs/nextflow/logs/. Marker searches and the --aln hmmalign alignment step both run through pyhmmer, which produces HMMER-compatible search output.

Example with IQ-TREE and explicit HMM threshold mode:

pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm \
  --tree_method iqtree \
  --iqtree_fast true \
  --hmmsearch_cutoff cut_ga

Alternative interface (Python implementation, without Nextflow):

pixi run sgtree-python testgenomes/Chloroflexi resources/models/UNI56.hmm --num_cpus 8

Backward-compatible wrapper:

pixi run python ./bin/sgtree_wrapper.py testgenomes/Chloroflexi resources/models/UNI56.hmm --num_cpus 8

Settings

Core method controls:

  • --aln: hmmalign, mafft, or mafft-linsi (default hmmalign).
  • --tree_method: fasttree or iqtree (default fasttree) for both species tree and per-marker trees.
  • --iqtree_fast: apply -fast when --tree_method iqtree (default true).
  • --iqtree_model: IQ-TREE model string (default LG+F+I+G4).

HMM search thresholds:

  • --hmmsearch_cutoff cut_ga: use model gathering cutoffs (recommended for curated marker sets such as UNI56).
  • --hmmsearch_cutoff cut_tc: use model trusted cutoffs.
  • --hmmsearch_cutoff cut_nc: use model noise cutoffs.
  • --hmmsearch_cutoff evalue --hmmsearch_evalue <float>: use a plain E-value threshold.
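The four threshold modes are mutually exclusive: three rely on cutoffs stored in the model, and one takes an explicit E-value. The sketch below shows one way such a flag could be translated into search options. The helper name cutoff_options is hypothetical, and the returned keys (bit_cutoffs, E) assume pyhmmer-style pipeline keywords; SGTree's internal mapping may differ.

```python
def cutoff_options(cutoff, evalue=None):
    """Translate a --hmmsearch_cutoff value into search keyword options.

    Assumption: keys follow pyhmmer-style pipeline options (bit_cutoffs, E).
    This is an illustration, not SGTree's actual code.
    """
    model_cutoffs = {
        "cut_ga": "gathering",  # gathering cutoffs stored in the HMM
        "cut_tc": "trusted",    # trusted cutoffs stored in the HMM
        "cut_nc": "noise",      # noise cutoffs stored in the HMM
    }
    if cutoff in model_cutoffs:
        return {"bit_cutoffs": model_cutoffs[cutoff]}
    if cutoff == "evalue":
        if evalue is None:
            raise ValueError("--hmmsearch_cutoff evalue requires --hmmsearch_evalue")
        return {"E": float(evalue)}
    raise ValueError(f"unknown cutoff mode: {cutoff!r}")


print(cutoff_options("cut_ga"))
print(cutoff_options("evalue", 1e-5))
```

Keeping the model-defined cutoffs and the E-value in separate branches makes it harder to combine them by accident.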

Genome inclusion/exclusion criteria:

  • --percent_models (default 10): minimum percentage of markers that must be detected in a genome.
  • --max_sdup (default -1): maximum allowed copies of any single marker in one genome; -1 disables.
  • --max_dupl (default -1): maximum allowed fraction of markers present in multiple copies; -1 disables.
  • --lflt (default 0): optional per-marker length filter (% of median hit length).
  • --num_nei (default 0): optional singleton-removal neighbor count override (0 keeps auto mode).
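To make the first three criteria concrete, here is a minimal sketch of a per-genome filter. The function keep_genome and its input layout are hypothetical. It also assumes that the duplication fraction is computed over detected markers, which the description above does not specify.

```python
def keep_genome(marker_counts, n_models, percent_models=10, max_sdup=-1, max_dupl=-1):
    """Return True if a genome passes the inclusion criteria.

    marker_counts maps marker name -> number of hits in this genome.
    Assumption: the duplication fraction uses detected markers as denominator.
    """
    detected = [count for count in marker_counts.values() if count > 0]
    if not detected:
        return False
    # --percent_models: minimum percentage of markers detected.
    if 100.0 * len(detected) / n_models < percent_models:
        return False
    # --max_sdup: maximum copies of any single marker (-1 disables).
    if max_sdup >= 0 and max(detected) > max_sdup:
        return False
    # --max_dupl: maximum fraction of multi-copy markers (-1 disables).
    duplicated = sum(1 for count in detected if count > 1)
    if max_dupl >= 0 and duplicated / len(detected) > max_dupl:
        return False
    return True


counts = {"m1": 1, "m2": 2, "m3": 0, "m4": 1}  # made-up example genome
print(keep_genome(counts, n_models=4, percent_models=5))
print(keep_genome(counts, n_models=4, max_sdup=2, max_dupl=0.25))
```

In this made-up example one of three detected markers is duplicated, so the genome passes with the filters disabled and fails once --max_dupl is set to 0.25.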

nsgtree-style mapping:

  • minmarker -> --percent_models (fraction mapped to percent).
  • maxsdup -> --max_sdup.
  • maxdupl -> --max_dupl.
  • hmmsearch_cutoff -> --hmmsearch_cutoff and --hmmsearch_evalue.
  • tmethod -> --tree_method.
  • iq_* model controls -> --iqtree_model (and --iqtree_fast).
  • mafftv/mafft -> --aln mafft or --aln mafft-linsi (or --aln hmmalign).
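As an illustration of the mapping, the following sketch converts a few nsgtree-style settings into sgtree arguments, including the fraction-to-percent conversion for minmarker. The function name and the example values are hypothetical. Only the one-to-one mappings are covered, so the iq_* and mafft settings are left out.

```python
def nsgtree_to_sgtree(settings):
    """Convert a dict of nsgtree-style settings into sgtree CLI arguments."""
    args = []
    if "minmarker" in settings:
        # nsgtree uses a fraction (0-1); sgtree expects a percentage.
        percent = round(float(settings["minmarker"]) * 100, 6)
        args += ["--percent_models", f"{percent:g}"]
    direct = (
        ("maxsdup", "--max_sdup"),
        ("maxdupl", "--max_dupl"),
        ("tmethod", "--tree_method"),
    )
    for old_name, new_flag in direct:
        if old_name in settings:
            args += [new_flag, str(settings[old_name])]
    return args


# Example values are illustrative only.
print(nsgtree_to_sgtree({"minmarker": 0.1, "maxsdup": 4, "maxdupl": 0.3, "tmethod": "fasttree"}))
```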

Practical selection guide:

  • Curated marker sets (for example UNI56): start with --hmmsearch_cutoff cut_ga.
  • Less curated/custom marker sets: start with --hmmsearch_cutoff evalue --hmmsearch_evalue 1e-5, then tighten if false positives appear.
  • --aln hmmalign is the fastest stable default and keeps alignment behavior tied to each profile HMM.
  • --aln mafft-linsi is slower but can help when marker-specific profile alignment is not desired.
  • --tree_method fasttree is the quick default; --tree_method iqtree --iqtree_fast true is a practical higher-accuracy option.
  • Typical inclusion presets:
      • Balanced: --percent_models 10 --max_sdup 2 --max_dupl 0.25
      • Strict: --percent_models 30 --max_sdup 1 --max_dupl 0.10
      • Relaxed: --percent_models 5 --max_sdup -1 --max_dupl -1
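If you script several runs, the presets above can be kept in one place and expanded into a command line. This is a small sketch; PRESETS and build_command are hypothetical helpers, and only flags documented in this README are used.

```python
PRESETS = {
    "balanced": {"percent_models": 10, "max_sdup": 2, "max_dupl": 0.25},
    "strict": {"percent_models": 30, "max_sdup": 1, "max_dupl": 0.10},
    "relaxed": {"percent_models": 5, "max_sdup": -1, "max_dupl": -1},
}


def build_command(genomedir, modeldir, preset="balanced"):
    """Return the argument list for a pixi run sgtree call with a preset."""
    cmd = ["pixi", "run", "sgtree", "--genomedir", genomedir, "--modeldir", modeldir]
    for flag, value in PRESETS[preset].items():
        cmd += [f"--{flag}", str(value)]
    return cmd


print(" ".join(build_command("testgenomes/Chloroflexi", "resources/models/UNI56.hmm", "strict")))
```

The returned list can be passed to subprocess.run without shell quoting concerns.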

Input Requirements

Proteomes must be protein FASTA files (*.faa). SGTree normalizes every input record internally to a genome|protein header followed by the cleaned sequence, for example:

>IMG2684622718|2685462912
MLCAFAEEEAKIAETVGKVATELKVKKLLSDFATKEGEEHISTYNKIAMTAKAEGYADIEAMLCAFAEEEAKLQKL

Normalization behavior:

  • Directory input (--genomedir <dir>): one proteome per *.faa; genome id is derived from filename stem.
  • Single FASTA input (--genomedir <file>): if headers already contain genome|protein, the genome part is preserved.
  • Headers and IDs are sanitized to avoid delimiter collisions.
  • Malformed header joins (for example ...*>next_header) are repaired before parsing.
  • Invalid amino-acid characters are replaced with X; * is removed.
  • Header mapping is written as proteomes_header_map_<input>.tsv in --outdir.
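The normalization rules above can be sketched roughly as follows. This is an approximation rather than SGTree's implementation: the set of valid residue letters, the sanitizing pattern, and the function names are assumptions made for illustration.

```python
import re

# Assumption: the 20 standard residues plus common ambiguity/rare codes are valid.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWYBJOUXZ")


def sanitize_id(text):
    """Replace anything outside letters, digits, '.', '_' and '-' with '_'.

    This also replaces '|' so the genome|protein delimiter stays unambiguous.
    """
    return re.sub(r"[^A-Za-z0-9._-]", "_", text)


def normalize_record(genome_id, header, sequence):
    """Return (header, sequence) in genome|protein form with a cleaned sequence."""
    protein_id = header.lstrip(">").split()[0]
    seq = sequence.upper().replace("*", "")  # '*' is removed
    seq = "".join(aa if aa in VALID_AA else "X" for aa in seq)  # invalid -> X
    return f">{sanitize_id(genome_id)}|{sanitize_id(protein_id)}", seq


print(normalize_record("IMG2684622718", ">2685462912 hypothetical protein", "MLCAF1EE*"))
```

Sanitizing the genome and protein parts separately, before joining them with the delimiter, is what prevents a stray | inside an identifier from being read as a second delimiter.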

Output Structure

Nextflow output (--outdir):

<outdir>/
  tree.nwk
  tree_final.nwk                 # marker-selection mode
  tree_final.png                 # marker-selection mode
  marker_count_matrix.csv
  marker_count.txt               # basic mode
  marker_counts.txt              # marker-selection mode
  marker_selection_rf_values.txt # marker-selection mode
  color.txt
  log_genomes_removed.txt
  proteomes_header_map_<input>.tsv
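
A quick way to confirm that a Nextflow run finished is to check for the files listed above. The sketch below assumes that files without a mode comment are written in both modes. The helper missing_outputs is hypothetical, and the header map is skipped because its name depends on the input.

```python
from pathlib import Path

# Assumption: files without a mode comment are written in both modes.
COMMON = ["tree.nwk", "marker_count_matrix.csv", "color.txt", "log_genomes_removed.txt"]
BASIC_ONLY = ["marker_count.txt"]
SELECTION_ONLY = [
    "tree_final.nwk",
    "tree_final.png",
    "marker_counts.txt",
    "marker_selection_rf_values.txt",
]


def missing_outputs(outdir, marker_selection=False):
    """Return the expected output files that are absent from outdir."""
    expected = COMMON + (SELECTION_ONLY if marker_selection else BASIC_ONLY)
    outdir = Path(outdir)
    return [name for name in expected if not (outdir / name).is_file()]


print(missing_outputs("runs/nextflow/manual_full", marker_selection=True))
```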

Python output (--save_dir):

<save_dir>/
  tree.nwk or tree_final.nwk
  tree_final.png                  # marker-selection mode
  marker_count_matrix.csv
  marker_selection_rf_values.txt  # marker-selection mode
  log_genomes_removed.txt
  logfile_*.txt
  temp/
    *.zip
    itol/

Repository Structure

sgtree/
  sgtree/                 # Python package implementation
  bin/sgtree_wrapper.py   # backward-compatible wrapper
  main.nf                 # Nextflow entrypoint
  workflows/              # DSL2 workflow composition
  modules/                # DSL2 process modules
  bin/                    # helper scripts and launch wrappers
  tests/
    regression_parity.py  # cross-engine parity checks
  resources/
    models/               # combined marker-set HMM files
  testgenomes/            # example query/reference data
  runs/                   # runtime outputs/work/logs (.gitkeep tracked)
  pixi.toml               # reproducible environment + tasks
  nextflow.config         # runtime defaults and CPU settings

Workflow

                            +-------------------+
                            |  Input Proteomes  |
                            |  + HMM Models     |
                            +---------+---------+
                                      |
                                      v
                             +--------+--------+
                             |    HMMSEARCH    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | PARSE_HMMSEARCH |
                             | marker matrix   |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | EXTRACT_SEQS    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | ALIGN (hmmalign/|
                             | mafft/linsi)    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | ELIM_DUPLICATES |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |     TRIMAL      |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |BUILD_SUPERMATRIX|
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |  TREE_BUILDER   |
                             |   tree.nwk      |
                             +--------+--------+
                                      |
                          marker_selection?
                           /            \
                        no               yes
                        |                 |
                        v                 v
                  +-----+-----+   +-------+--------+
                  | iTOL TXT  |   | per-marker     |
                  | marker_*  |   | TRIMAL+TREEBLD |
                  +-----------+   +-------+--------+
                                         |
                                         v
                                  +------+------+
                                  | RF_SELECTION|
                                  +------+------+
                                         |
                                 singles?|
                                  /      \
                               no         yes
                               |           |
                               v           v
                      +--------+----+   +--+---------+
                      | WRITE_CLEAN |   |REMOVE_     |
                      | ALIGNMENTS  |   |SINGLES     |
                      +--------+----+   +--+---------+
                               \           /
                                \         /
                                 v       v
                               +--+---------+
                               |TRIMAL_FINAL|
                               +--+---------+
                                  |
                                  v
                             +----+-------+
                             |SUPERMATRIX |
                             +----+-------+
                                  |
                                  v
                             +----+-------+
                             |TREE_BUILDER|
                             | tree_final |
                             +----+-------+
                                  |
                                  v
                       +----------+-----------+
                       | tree_final.png       |
                       | marker_counts.txt    |
                       | marker_selection_rf  |
                       +----------------------+
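
The RF_SELECTION step compares per-marker trees with the species tree. As a rough illustration of the underlying idea, the sketch below computes an unrooted Robinson-Foulds distance between two trees written as nested tuples and picks the candidate tree closest to the species tree. The tree encoding, the function names, and the pick-the-minimum rule are assumptions for illustration; SGTree's actual selection logic may differ.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of a tree, one side per split."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # always report the side without the anchor leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


species = ((("A", "B"), "C"), ("D", "E"))
candidates = {  # hypothetical marker trees, one per candidate copy
    "copy1": (("A", "B"), ("C", ("D", "E"))),
    "copy2": ((("A", "D"), "C"), ("B", "E")),
}
best = min(candidates, key=lambda name: rf_distance(species, candidates[name]))
print(best, rf_distance(species, candidates[best]))
```

Here copy1 differs from the species tree only in root placement, so its unrooted distance is zero and it is the one selected.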

Repository Hygiene

Use this command for a clean runtime workspace between runs:

pixi run clean-runtime

Authors and Contributors

Author                 Email                 Date
Ewan Whittaker-Walker  ewanww@berkeley.edu   05/19/2019
Frederik Schulz        fschulz@lbl.gov       Since 2019
Juan C. Villada        jvillada@lbl.gov      Since 2021
Marianne Buscaglia     mbuscaglia@lbl.gov    Since 2022
