Species tree construction from marker gene phylogenies

SGTree

SGTree is an end-to-end workflow for building phylogenetic trees. Use one of the provided HMM sets, or supply your own HMMs, to find the proteins of interest. When a marker occurs in multiple copies in a genome (paralogs or contamination), SGTree reconciles each gene tree against an approximate species tree and keeps the copy most likely to be correct.

Setup

Install the Pixi environment:

pixi install

The environment is managed through pixi.toml only.

Release and Deploy (AGP)

Use this procedure whenever you change code in this sgtree repo and want to publish a new release to PyPI and the astrogenomics pixi/prefix channel.

Important naming:

  • PyPI project: astrogenomics-sgtree
  • Conda/pixi package: sgtree

1. Prerequisites

Ensure these environment variables are set in your shell:

export PYPI_API_TOKEN="..."
export PREFIX_API_KEY="..."

AGP config used for this project:

  • /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp/examples/sgtree.toml
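
A release that starts without both credentials tends to fail late, so a quick pre-flight check can be useful. The sketch below is a hypothetical helper (not part of AGP or SGTree) that reports which of the two variables named above are unset or empty.

```python
import os

# The two variables listed in the prerequisites above.
REQUIRED_VARS = ("PYPI_API_TOKEN", "PREFIX_API_KEY")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_env_vars({"PYPI_API_TOKEN": "pypi-example"}))
```

Calling `missing_env_vars()` with no arguments inspects the current shell environment; an empty list means both tokens are present.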

2. Run the AGP release

From the AGP repo, run:

cd /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp
pixi run -q agp \
  --config /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp/examples/sgtree.toml \
  --project /clusterfs/jgi/groups/science/homes/jvillada/my_software/sgtree \
  release <VERSION>

Example:

pixi run -q agp \
  --config /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp/examples/sgtree.toml \
  --project /clusterfs/jgi/groups/science/homes/jvillada/my_software/sgtree \
  release 2.0.1

What AGP does automatically:

  • updates version fields in sgtree/__init__.py and pixi.toml
  • builds and uploads astrogenomics-sgtree to PyPI
  • updates recipe.yaml version, source URL, and sha256
  • builds conda package sgtree and uploads it to prefix.dev/astrogenomics
  • does not create git tags, push commits, or create GitHub releases (disabled for this config)

3. Verify the release

Check prefix channel:

pixi search sgtree -c https://prefix.dev/astrogenomics -c conda-forge

Check PyPI:

python -m pip index versions astrogenomics-sgtree
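
The PyPI check can also be scripted. The following is a minimal sketch that queries the public PyPI JSON API (`https://pypi.org/pypi/<project>/json`) and looks for the version among the listed releases. The function names and the injectable `fetch` parameter are illustrative choices, not part of AGP.

```python
import json
from urllib.request import urlopen

PYPI_JSON_URL = "https://pypi.org/pypi/{project}/json"


def released_versions(payload):
    """Return the version strings in a PyPI JSON payload (plain string sort)."""
    return sorted(payload.get("releases", {}))


def is_released(version, project="astrogenomics-sgtree", fetch=None):
    """Check whether `version` appears among the project's PyPI releases."""
    if fetch is None:
        # Default fetcher: download and decode the JSON document from PyPI.
        def fetch(url):
            with urlopen(url, timeout=30) as response:
                return json.load(response)
    payload = fetch(PYPI_JSON_URL.format(project=project))
    return version in released_versions(payload)
```

For example, `is_released("2.0.1")` should return True once the upload has propagated. Passing a custom `fetch` makes the function usable offline or in tests.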

4. Install globally with pixi

pixi global install -c https://prefix.dev/astrogenomics -c conda-forge "sgtree==<VERSION>"

5. If prefix upload fails but build succeeded

Retry only the prefix upload with the built artifact:

cd /clusterfs/jgi/groups/science/homes/jvillada/my_software/agp
pixi run -q rattler-build upload prefix \
  --channel astrogenomics \
  --skip-existing \
  /clusterfs/jgi/groups/science/homes/jvillada/my_software/sgtree/dist/conda/noarch/sgtree-<VERSION>-*.conda
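
Before retrying, it can help to confirm that the built artifact actually exists. This sketch lists files matching the same `sgtree-<VERSION>-*.conda` pattern under `dist/conda/noarch`; the helper names are hypothetical and the directory layout is taken from the command above.

```python
from pathlib import Path


def artifact_pattern(version):
    """Glob pattern for the conda package built for a given version."""
    return f"sgtree-{version}-*.conda"


def find_conda_artifacts(project_dir, version):
    """List built conda packages for `version`, sorted by path."""
    artifact_dir = Path(project_dir, "dist", "conda", "noarch")
    return sorted(artifact_dir.glob(artifact_pattern(version)))
```

An empty list means there is nothing to upload and the build step needs to be rerun first.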

Run

Primary interface (Nextflow):

pixi run sgtree --help

Basic run:

pixi run sgtree \
  --genomedir <path to dir with protein faa files, one faa file per genome> \
  --modeldir <path to marker set .hmm>

Example run:

pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm

Marker-selection run with references and singleton filtering:

pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm \
  --outdir runs/nextflow/manual_full \
  --marker_selection true \
  --ref testgenomes/chlorref \
  --singles yes

pixi run sgtree writes logs automatically to runs/nextflow/logs/. Marker searches and --aln hmmalign alignments are run with pyhmmer, which produces HMMER-compatible search output.

Example with IQ-TREE and explicit HMM threshold mode:

pixi run sgtree \
  --genomedir testgenomes/Chloroflexi \
  --modeldir resources/models/UNI56.hmm \
  --tree_method iqtree \
  --iqtree_fast true \
  --hmmsearch_cutoff cut_ga
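
When runs are launched from a script, the same command lines can be assembled programmatically. The sketch below builds the argument list for `pixi run sgtree` from keyword options; `build_sgtree_command` is a hypothetical helper, and it only uses flags shown in the examples above.

```python
import shlex


def build_sgtree_command(genomedir, modeldir, **options):
    """Assemble the `pixi run sgtree` argument list from keyword options."""
    command = ["pixi", "run", "sgtree",
               "--genomedir", str(genomedir),
               "--modeldir", str(modeldir)]
    for name, value in options.items():
        # Booleans are written as true/false, as in the examples above.
        if isinstance(value, bool):
            value = "true" if value else "false"
        command += [f"--{name}", str(value)]
    return command


cmd = build_sgtree_command(
    "testgenomes/Chloroflexi",
    "resources/models/UNI56.hmm",
    tree_method="iqtree",
    iqtree_fast=True,
    hmmsearch_cutoff="cut_ga",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Options that take `yes`/`no` (such as `--singles`) should be passed as strings.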

Alternative interface (Python implementation without Nextflow):

pixi run sgtree-python testgenomes/Chloroflexi resources/models/UNI56.hmm --num_cpus 8

Backward-compatible wrapper:

pixi run python ./bin/sgtree_wrapper.py testgenomes/Chloroflexi resources/models/UNI56.hmm --num_cpus 8

Settings

Core method controls:

  • --aln: hmmalign, mafft, or mafft-linsi (default hmmalign).
  • --tree_method: fasttree or iqtree (default fasttree) for both species tree and per-marker trees.
  • --iqtree_fast: apply -fast when --tree_method iqtree (default true).
  • --iqtree_model: IQ-TREE model string (default LG+F+I+G4).

HMM search thresholds:

  • --hmmsearch_cutoff cut_ga: use model gathering cutoffs (recommended for curated marker sets such as UNI56).
  • --hmmsearch_cutoff cut_tc: use model trusted cutoffs.
  • --hmmsearch_cutoff cut_nc: use model noise cutoffs.
  • --hmmsearch_cutoff evalue --hmmsearch_evalue <float>: use a plain E-value threshold.
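
Since searches run through pyhmmer, the four modes plausibly correspond to pyhmmer's search options: `bit_cutoffs` (with the values `gathering`, `trusted`, `noise`) and `E` for an E-value threshold. The mapping below is a sketch under that assumption; how SGTree wires these options internally is not shown in this document, and the function name is hypothetical.

```python
# Assumed correspondence between SGTree cutoff modes and pyhmmer bit cutoffs.
BIT_CUTOFFS = {"cut_ga": "gathering", "cut_tc": "trusted", "cut_nc": "noise"}


def pyhmmer_threshold_options(cutoff, evalue=None):
    """Translate a --hmmsearch_cutoff mode into pyhmmer-style keyword options."""
    if cutoff in BIT_CUTOFFS:
        return {"bit_cutoffs": BIT_CUTOFFS[cutoff]}
    if cutoff == "evalue":
        if evalue is None:
            raise ValueError("the evalue mode needs an E-value threshold")
        return {"E": float(evalue)}
    raise ValueError(f"unknown cutoff mode: {cutoff}")


print(pyhmmer_threshold_options("cut_ga"))
print(pyhmmer_threshold_options("evalue", 1e-5))
```

The returned dictionary could be unpacked into a call such as `pyhmmer.hmmsearch(hmms, sequences, **options)`. Model-defined cutoffs only work when the HMM file actually carries GA/TC/NC lines.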

Genome inclusion/exclusion criteria:

  • --percent_models (default 10): minimum percentage of markers that must be detected in a genome.
  • --max_sdup (default -1): maximum allowed copies of any single marker in one genome; -1 disables.
  • --max_dupl (default -1): maximum allowed fraction of markers present in multiple copies; -1 disables.
  • --lflt (default 0): optional per-marker length filter (% of median hit length).
  • --num_nei (default 0): optional singleton-removal neighbor count override (0 keeps auto mode).
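
To make the interaction of the first three criteria concrete, here is a minimal sketch of a per-genome filter. It is an illustration, not SGTree's implementation: the function name is hypothetical, and it assumes the duplicated fraction is computed relative to the markers detected in that genome.

```python
def genome_passes(marker_counts, n_models,
                  percent_models=10, max_sdup=-1, max_dupl=-1):
    """Decide whether one genome is kept, given copies found per marker."""
    detected = [count for count in marker_counts.values() if count > 0]

    # --percent_models: enough of the marker set must be present.
    if 100.0 * len(detected) / n_models < percent_models:
        return False

    # --max_sdup: no single marker may exceed this copy number (-1 disables).
    if max_sdup >= 0 and detected and max(detected) > max_sdup:
        return False

    # --max_dupl: cap on the fraction of multi-copy markers (-1 disables).
    # Assumption: the fraction is taken over detected markers.
    if max_dupl >= 0 and detected:
        duplicated = sum(count > 1 for count in detected)
        if duplicated / len(detected) > max_dupl:
            return False
    return True


counts = {"marker1": 1, "marker2": 2, "marker3": 0, "marker4": 1}
print(genome_passes(counts, n_models=4, max_sdup=2, max_dupl=0.5))
```

In this example three of four markers are detected (75%), the highest copy number is 2, and one of the three detected markers is duplicated, so the genome passes with `max_dupl=0.5` but would be rejected with `max_dupl=0.25`.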

nsgtree-style mapping:

  • minmarker -> --percent_models (fraction mapped to percent).
  • maxsdup -> --max_sdup.
  • maxdupl -> --max_dupl.
  • hmmsearch_cutoff -> --hmmsearch_cutoff and --hmmsearch_evalue.
  • tmethod -> --tree_method.
  • iq_* model controls -> --iqtree_model (and --iqtree_fast).
  • mafftv/mafft -> --aln mafft or --aln mafft-linsi (or --aln hmmalign).
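
The one-to-one parts of this mapping can be expressed as a small translation table. The sketch below covers only the settings whose target flag is unambiguous in the list above and converts `minmarker` from a fraction to a percent; the function name is hypothetical, and the alignment and IQ-TREE settings are left out because their value formats are not specified here.

```python
# Settings with a direct one-to-one counterpart, per the list above.
NSGTREE_TO_SGTREE = {
    "minmarker": "percent_models",
    "maxsdup": "max_sdup",
    "maxdupl": "max_dupl",
    "hmmsearch_cutoff": "hmmsearch_cutoff",
    "tmethod": "tree_method",
}


def translate_nsgtree_settings(settings):
    """Map nsgtree-style settings onto SGTree option names."""
    translated = {}
    for key, value in settings.items():
        if key not in NSGTREE_TO_SGTREE:
            raise ValueError(f"no direct mapping for nsgtree setting: {key}")
        if key == "minmarker":
            # nsgtree uses a fraction; --percent_models expects a percent.
            value = round(float(value) * 100, 6)
        translated[NSGTREE_TO_SGTREE[key]] = value
    return translated


print(translate_nsgtree_settings({"minmarker": 0.1, "maxsdup": 4}))
```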

Practical selection guide:

  • Curated marker sets (for example UNI56): start with --hmmsearch_cutoff cut_ga.
  • Less curated/custom marker sets: start with --hmmsearch_cutoff evalue --hmmsearch_evalue 1e-5, then tighten if false positives appear.
  • --aln hmmalign is the fastest stable default and keeps alignment behavior tied to each profile HMM.
  • --aln mafft-linsi is slower but can help when marker-specific profile alignment is not desired.
  • --tree_method fasttree is the quick default; --tree_method iqtree --iqtree_fast true is a practical higher-accuracy option.
  • Typical inclusion presets:
      • Balanced: --percent_models 10 --max_sdup 2 --max_dupl 0.25
      • Strict: --percent_models 30 --max_sdup 1 --max_dupl 0.10
      • Relaxed: --percent_models 5 --max_sdup -1 --max_dupl -1

Input Requirements

Proteomes must be protein FASTA files (*.faa). SGTree normalizes all inputs internally to genome|protein headers, for example:

>IMG2684622718|2685462912
MLCAFAEEEAKIAETVGKVATELKVKKLLSDFATKEGEEHISTYNKIAMTAKAEGYADIEAMLCAFAEEEAKLQKL

Normalization behavior:

  • Directory input (--genomedir <dir>): one proteome per *.faa; genome id is derived from filename stem.
  • Single FASTA input (--genomedir <file>): if headers already contain genome|protein, the genome part is preserved.
  • Headers and IDs are sanitized to avoid delimiter collisions.
  • Malformed header joins (for example ...*>next_header) are repaired before parsing.
  • Invalid amino-acid characters are replaced with X; * is removed.
  • Header mapping is written as proteomes_header_map_<input>.tsv in --outdir.
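
The header and sequence rules above can be sketched as follows. This is an illustration rather than SGTree's code: the helper names are hypothetical, the set of accepted residue letters is an assumption, and the repair of malformed header joins is not covered.

```python
import re

# Assumption: the 20 standard residues plus common ambiguity/rare codes.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWYBZJUOX")


def sanitize_id(text):
    """Replace characters that could collide with the '|' delimiter."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", text)


def clean_sequence(sequence):
    """Drop '*' and whitespace; replace unrecognized residues with X."""
    residues = []
    for char in sequence:
        if char == "*" or char.isspace():
            continue
        residues.append(char if char.upper() in VALID_AA else "X")
    return "".join(residues)


def normalize_record(genome_id, header, sequence):
    """Return a (header, sequence) pair in genome|protein form."""
    protein_id = header.lstrip(">").split()[0]
    new_header = f">{sanitize_id(genome_id)}|{sanitize_id(protein_id)}"
    return new_header, clean_sequence(sequence)


print(normalize_record("IMG2684622718", ">2685462912 hypothetical protein", "MLC*AF1E"))
```

For directory input, `genome_id` would be the filename stem of the *.faa file, as described in the first bullet.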

Output Structure

Nextflow output (--outdir):

<outdir>/
  tree.nwk
  tree_final.nwk                 # marker-selection mode
  tree_final.png                 # marker-selection mode
  marker_count_matrix.csv
  marker_count.txt               # basic mode
  marker_counts.txt              # marker-selection mode
  marker_selection_rf_values.txt # marker-selection mode
  color.txt
  log_genomes_removed.txt
  proteomes_header_map_<input>.tsv

Python output (--save_dir):

<save_dir>/
  tree.nwk or tree_final.nwk
  tree_final.png                  # marker-selection mode
  marker_count_matrix.csv
  marker_selection_rf_values.txt  # marker-selection mode
  log_genomes_removed.txt
  logfile_*.txt
  temp/
    *.zip
    itol/
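
Downstream scripts usually want the most refined tree available. The sketch below prefers tree_final.nwk (written in marker-selection mode) and falls back to tree.nwk. The preference order and the helper names are assumptions made for illustration.

```python
from pathlib import Path

# Assumed preference: the marker-selection tree first, then the basic tree.
TREE_PREFERENCE = ("tree_final.nwk", "tree.nwk")


def pick_tree_file(filenames):
    """Return the preferred tree file name among those present, else None."""
    present = set(filenames)
    for name in TREE_PREFERENCE:
        if name in present:
            return name
    return None


def tree_path(result_dir):
    """Resolve the preferred tree inside an --outdir or --save_dir directory."""
    result_dir = Path(result_dir)
    name = pick_tree_file(path.name for path in result_dir.iterdir())
    if name is None:
        raise FileNotFoundError(f"no tree file found in {result_dir}")
    return result_dir / name
```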

Repository Structure

sgtree/
  sgtree/                 # Python package implementation
  main.nf                 # Nextflow entrypoint
  workflows/              # DSL2 workflow composition
  modules/                # DSL2 process modules
  bin/                    # helper scripts and launch wrappers
    sgtree_wrapper.py     # backward-compatible wrapper
  tests/
    regression_parity.py  # cross-engine parity checks
  resources/
    models/               # combined marker-set HMM files
  testgenomes/            # example query/reference data
  runs/                   # runtime outputs/work/logs (.gitkeep tracked)
  pixi.toml               # reproducible environment + tasks
  nextflow.config         # runtime defaults and CPU settings

Workflow

                            +-------------------+
                            |  Input Proteomes  |
                            |  + HMM Models     |
                            +---------+---------+
                                      |
                                      v
                             +--------+--------+
                             |    HMMSEARCH    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | PARSE_HMMSEARCH |
                             | marker matrix   |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | EXTRACT_SEQS    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | ALIGN (hmmalign/|
                             | mafft/linsi)    |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             | ELIM_DUPLICATES |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |     TRIMAL      |
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |BUILD_SUPERMATRIX|
                             +--------+--------+
                                      |
                                      v
                             +--------+--------+
                             |  TREE_BUILDER   |
                             |   tree.nwk      |
                             +--------+--------+
                                      |
                          marker_selection?
                           /            \
                        no               yes
                        |                 |
                        v                 v
                  +-----+-----+   +-------+--------+
                  | iTOL TXT  |   | per-marker     |
                  | marker_*  |   | TRIMAL+TREEBLD |
                  +-----------+   +-------+--------+
                                         |
                                         v
                                  +------+------+
                                  | RF_SELECTION|
                                  +------+------+
                                         |
                                 singles?|
                                  /      \
                               no         yes
                               |           |
                               v           v
                      +--------+----+  +---+--------+
                      | WRITE_CLEAN |  |REMOVE_     |
                      | ALIGNMENTS  |  |SINGLES     |
                      +--------+----+  +---+--------+
                                \           /
                                 \         /
                                  v       v
                              +-------+------+
                              | TRIMAL_FINAL |
                              +-------+------+
                                      |
                                      v
                              +-------+------+
                              | SUPERMATRIX  |
                              +-------+------+
                                      |
                                      v
                              +-------+------+
                              | TREE_BUILDER |
                              | tree_final   |
                              +-------+------+
                                      |
                                      v
                           +----------+-----------+
                           | tree_final.png       |
                           | marker_counts.txt    |
                           | marker_selection_rf  |
                           +----------------------+
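
The RF_SELECTION step compares per-marker trees with the species tree using Robinson-Foulds (RF) distances. The sketch below illustrates the general idea on nested-tuple trees: for a genome with two copies of a marker, each copy is kept in turn, the gene tree is pruned and relabeled to genome IDs, and the copy giving the smaller rooted RF distance is chosen. This is a simplified illustration with hypothetical function names, not SGTree's implementation, which may use unrooted distances and a tree library.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def rf_distance(tree_a, tree_b):
    """Rooted RF distance: number of clades found in exactly one tree."""
    return len(clades(tree_a)[1] ^ clades(tree_b)[1])


def prune(tree, drop):
    """Remove leaves in `drop`, collapsing nodes left with one child."""
    if isinstance(tree, str):
        return None if tree in drop else tree
    kept = [c for c in (prune(child, drop) for child in tree) if c is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def relabel(tree):
    """Reduce genome|protein leaf labels to the genome part."""
    if isinstance(tree, str):
        return tree.split("|", 1)[0]
    return tuple(relabel(child) for child in tree)


def best_copy(species_tree, gene_tree, copies):
    """Pick the copy whose pruned gene tree is closest to the species tree."""
    scores = {}
    for keep in copies:
        others = set(copies) - {keep}
        pruned = relabel(prune(gene_tree, others))
        scores[keep] = rf_distance(species_tree, pruned)
    return min(scores, key=scores.get), scores


species = (("A", "B"), ("C", "D"))
gene = ((("A|a1", "B|b1"), ("C|c1", "D|d1")), "A|a2")
best, scores = best_copy(species, gene, ["A|a1", "A|a2"])
print(best, scores)
```

Here copy a1 groups with genome B as in the species tree, while a2 sits outside all other genomes, so a1 has the lower distance and is selected.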

Repository Hygiene

Use this command for a clean runtime workspace between runs:

pixi run clean-runtime

Authors and Contributors

Author                 Email                  Date
Ewan Whittaker-Walker  ewanww@berkeley.edu    05/19/2019
Frederik Schulz        fschulz@lbl.gov        Since 2019
Juan C. Villada        jvillada@lbl.gov       Since 2021
Marianne Buscaglia     mbuscaglia@lbl.gov     Since 2022
