Skip to main content

A simple Snakemake pipeline for episodic selection analysis

Project description

BABAPPASnake

DOI

babappasnake is a command-line workflow for episodic positive selection analysis on one orthogroup at a time. It is built for practical comparative genomics: reproducible run directories, guided stepwise execution, resumable restarts after interruption, and robustness summaries across multiple alignment and trimming pathways.

This README is the user manual for the current command-line interface (CLI) and workflow behavior.

Contents

  1. What The Pipeline Does
  2. Installation And Environment
  3. Input Requirements
  4. Orthogroup Discovery Modes
  5. Running The Pipeline
  6. Resume And Recovery
  7. Workflow Stages
  8. Alignment And Robustness Design
  9. Recombination, Selection, And ASR
  10. Outputs And Run Directory Layout
  11. CLI Reference
  12. Troubleshooting
  13. Developer Notes
  14. License

What The Pipeline Does

A typical full run performs these stages:

  1. Define an orthogroup from a protein query and proteome panel, or stage an externally curated orthogroup FASTA.
  2. Map user-provided coding sequence (CDS) records to the selected orthogroup proteins with coding-quality filters.
  3. Build protein and codon alignments across one or more alignment methods.
  4. Run both raw and clipkit pathway variants for robustness.
  5. Infer pathway-specific trees with IQ-TREE.
  6. Optionally root trees with an outgroup text query.
  7. Optionally screen for recombination with HyPhy genetic algorithm for recombination detection (GARD).
  8. Run HyPhy adaptive branch-site random effects likelihood (aBSREL) and mixed effects model of evolution (MEME).
  9. Select branch-site foregrounds dynamically from aBSREL output.
  10. Run branch-site codeml.
  11. Optionally run pathway-level codeml ancestral sequence reconstruction (ASR).
  12. Optionally extract ancestor and descendant sequences plus branch substitutions.
  13. Write pathway summaries, robustness reports, and run provenance.

Design assumptions:

  • The workflow operates on one orthogroup at a time.
  • CDS can be unavailable at the start. The pipeline supports a protein-first checkpointed workflow.
  • Outgroup is optional. If it is absent or unusable, downstream analysis continues with the unrooted tree.
  • Robustness mode is always enforced internally as raw + clipkit, even if you request a single trimming mode.

Installation And Environment

Recommended one-command environment

conda create -n babappasnake -c conda-forge -c bioconda \
  python=3.11 blast orthofinder iqtree hyphy paml clipkit mafft prank pip
conda activate babappasnake
pip install babappasnake

External tools by feature

Always needed for a complete end-to-end run after orthogroup proteins are available:

  • iqtree or iqtree2 or iqtree3
  • hyphy
  • codeml from paml
  • clipkit

Needed only when using the default OrthoFinder-assisted orthogroup mode:

  • blastp
  • makeblastdb
  • orthofinder for --orthogroup-method orthofinder

Needed only for selected alignment methods:

  • mafft when --alignment-methods is 2 or 4
  • prank when --alignment-methods is 3 or 4

Python-side note:

  • babappalign is installed automatically as a dependency of babappasnake.

Verify installation

babappasnake --help
which blastp makeblastdb iqtree iqtree2 iqtree3 hyphy codeml clipkit
which orthofinder mafft prank
which babappasnake

Input Requirements

--prot

Directory of protein FASTA files, one file per species.

Requirements and behavior:

  • Supported extensions are .fa, .faa, and .fasta, case-insensitive.
  • Hidden files and macOS metadata sidecars such as ._* and .DS_Store are ignored.
  • Empty or malformed FASTA files are skipped with warnings.

--query

Protein FASTA containing exactly one query sequence.

--cds

Optional on the first run.

Behavior:

  • If CDS is supplied at the start, the workflow can proceed through codon, tree, HyPhy, codeml, and summary stages in one run.
  • If CDS is not supplied, the workflow stops after orthogroup definition and writes orthogroup/WAITING_FOR_CDS.txt.
  • You can then place CDS at OUTDIR/user_supplied/orthogroup_cds.fasta and resume.

--orthogroup-proteins

Optional externally curated orthogroup protein FASTA.

Behavior:

  • When provided, BABAPPASNAKE skips OrthoFinder and starts from this protein FASTA.
  • This is recommended when orthology has already been inferred, manually curated, or benchmarked outside the workflow.
  • The FASTA must contain the query protein plus at least one partner sequence.
  • If --query is also provided, its first sequence ID is used as the query ID in orthogroup metadata; otherwise the first record in --orthogroup-proteins is treated as the query.

--outgroup

Optional text query used to root pathway trees by case-insensitive substring matching against tip labels.

Behavior:

  • If omitted, rooting is skipped and the unrooted tree is propagated downstream.
  • If provided but unmatched or too broad, rooting falls back safely to the unrooted tree instead of crashing the run.

Orthogroup Discovery Modes

External curated orthogroup: --orthogroup-proteins

Orthology inference can be handled outside BABAPPASNAKE. Use --orthogroup-proteins to provide a curated protein FASTA from OrthoFinder, SonicParanoid, OMA, manual curation, or any other orthology workflow. The pipeline then performs CDS mapping, alignment, tree inference, HyPhy/Phylogenetic Analysis by Maximum Likelihood (PAML) analyses, robustness summaries, and provenance reporting without running orthology inference internally.

Default: --orthogroup-method orthofinder

OrthoFinder is the default built-in orthogroup helper when an external orthogroup FASTA is not supplied.

Behavior:

  1. Run OrthoFinder.
  2. BLAST the query against OrthoFinder orthogroup-member proteins.
  3. Rank orthogroups by query support.
  4. Extract members from the best supported orthogroup using --orthology-mode.

OrthoFinder query mapping details

The workflow does not assume that the original query ID is directly present inside an OrthoFinder orthogroup. Instead it does a query-to-members mapping step:

  1. Parse Orthogroups.tsv.
  2. Build a combined FASTA of orthogroup-member proteins.
  3. BLAST the query against that combined FASTA.
  4. Filter by query coverage threshold.
  5. Rank orthogroups by cross-species support and bitscore.
  6. Extract the top supported orthogroup and apply the configured orthology/paralogy mode.

Orthology and paralogy modes

  • --orthology-mode representative is the default. Single-copy species are retained directly; multi-copy species contribute the best query-supported copy.
  • --orthology-mode strict retains only species with exactly one member in the selected orthogroup.
  • --orthology-mode paralog retains all copies from each species in the selected orthogroup.

Running The Pipeline

Guided interactive mode

babappasnake

Behavior:

  • Prompts for run settings.
  • Executes the workflow step by step.
  • Shows step descriptions, previews, and output summaries.
  • Lets you choose run, skip, or stop at each step where appropriate.

Important defaults in guided mode:

  • Orthogroup method defaults to orthofinder.
  • Orthology mode defaults to representative.
  • Alignment methods default to 4 which means all three aligners.
  • Trimming is forced internally to robustness mode: raw + clipkit.
  • Recombination defaults to none.
  • ASR defaults to yes.
  • CDS is requested only after orthogroup definition.
  • Outgroup is requested after CDS handling and remains optional.

Non-interactive batch mode

Use this for scripted or high-performance-computing-style runs:

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --outdir run01 \
  --threads 12 \
  --interactive no \
  --guided no

Example: representative orthology mode with ASR disabled

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --orthology-mode representative \
  --run-asr no \
  --outdir run_representative_no_asr \
  --threads 12 \
  --interactive no \
  --guided no

Example: retain all paralog copies

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --orthology-mode paralog \
  --outdir run_paralog_mode \
  --threads 12 \
  --interactive no \
  --guided no

Example: start from externally curated orthogroup proteins

babappasnake \
  --orthogroup-proteins /path/to/curated_orthogroup_proteins.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --outdir run_external_orthogroup \
  --threads 12 \
  --interactive no \
  --guided no

Example: enable GARD

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --recombination gard \
  --gard-mode Faster \
  --gard-rate-classes 3 \
  --outdir run_gard \
  --threads 12 \
  --interactive no \
  --guided no

Two-stage run when CDS is not available initially

Stage 1:

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --outdir run01 \
  --interactive no \
  --guided no

After orthogroup definition, provide the CDS file and continue:

babappasnake --resume --outdir run01 --cds /path/to/orthogroup_cds.fasta

Resume And Recovery

Use resume whenever a run stops unexpectedly or intentionally:

babappasnake --resume --outdir run01

Typical reasons to resume:

  • workflow failure
  • external tool failure
  • terminal closure
  • power interruption
  • machine restart
  • manual stop after the orthogroup checkpoint

What --resume does:

  • reloads OUTDIR/config.yaml
  • reuses saved analysis settings
  • clears stale Snakemake lock state for that run directory
  • continues incomplete work instead of restarting from scratch

Guided resume behavior:

  • saved and detected completed steps are skipped automatically
  • the session restarts near the interruption point

Non-guided resume behavior:

  • Snakemake reruns incomplete work only

Files used for resume:

  • OUTDIR/config.yaml
  • OUTDIR/.babappasnake/resume_state.json

Allowed resume-time overrides:

  • --cds PATH
  • --outgroup TEXT
  • --threads INT
  • --guided {yes,no}
  • --snake-args "..."

Analysis settings that define workflow structure are intentionally not changeable on resume. For example, do not expect --resume to accept a new orthogroup method, new alignment mode, new recombination mode, or new ASR setting for an existing run directory.

Workflow Stages

  1. define_orthogroup Build the selected orthogroup with OrthoFinder or stage externally curated orthogroup proteins.
  2. map_cds Map CDS to selected proteins and filter low-quality CDS.
  3. align_proteins_all_methods Build protein alignments.
  4. align_cds_all_methods Build codon alignments.
  5. prepare_branch_inputs_all_pathways Expand each method into raw and clipkit analysis pathways.
  6. gard_all_pathways Optional recombination screening.
  7. iqtree_ml_all_pathways Infer pathway trees.
  8. root_iqtree_outgroup_all_pathways Root trees when possible, otherwise propagate unrooted trees.
  9. hyphy_exploratory_all_pathways Run aBSREL and MEME.
  10. parse_foregrounds_all_pathways Select significant foreground branches from aBSREL.
  11. prepare_foreground_trees_all_pathways Build branch-labeled trees for branch-site codeml.
  12. branchsite_batch_all_pathways Run branch-site codeml and Benjamini-Hochberg (BH)-correct results.
  13. codeml_asr_all_pathways Optional pathway-level ASR.
  14. extract_selected_branch_ancestors Optional branch-level ancestral and descendant sequence extraction.
  15. final_summary_all_pathways Write pathway-specific episodic selection summaries.
  16. robustness_reports Write cross-pathway robustness outputs.
  17. write_run_provenance Write machine-readable provenance for the run.

Alignment And Robustness Design

Alignment methods

--alignment-methods choices:

  • 1: babappalign
  • 2: mafft
  • 3: prank
  • 4: babappalign, mafft, and prank

Trimming behavior

The workflow accepts --trim-strategy, but runtime robustness mode is always enforced internally as:

  • raw
  • clipkit

This means each selected alignment method expands into two pathways.

Example for --alignment-methods 4:

  • babappalign_raw
  • babappalign_clipkit
  • mafft_raw
  • mafft_clipkit
  • prank_raw
  • prank_clipkit

This is deliberate. Robustness outputs assume both untrimmed and trimmed pathway variants are available.

Recombination, Selection, And ASR

Tree inference and rooting

  • IQ-TREE runs once per active pathway.
  • Rooting is optional.
  • If outgroup matching fails, the workflow continues with the unrooted tree.

Optional recombination screening with GARD

Controlled by --recombination {none,gard,auto}:

  • none: do not run GARD
  • gard: run HyPhy GARD
  • auto: currently an alias of gard

Additional controls:

  • --gard-mode {Normal,Faster}
  • --gard-rate-classes INT

GARD outputs:

  • recombination/<method>/<trim_state>/gard/gard.json
  • recombination/<method>/<trim_state>/gard/gard_summary.json
  • recombination/<method>/<trim_state>/gard/gard.stdout.txt
  • recombination/<method>/<trim_state>/gard/gard.stderr.txt

Current interpretation:

  • GARD is used as a screening and reporting layer.
  • Downstream branch-site codeml is still run on the full-length pathway alignment unless fragment-aware routing is added in a future release.

HyPhy and branch-site foreground selection

  • aBSREL and MEME run per pathway.
  • Foregrounds are selected dynamically from aBSREL.
  • Default aBSREL dynamic threshold settings:
    • start 0.05
    • increment 0.01
    • cap 0.2
  • Selected foregrounds are passed to branch-site codeml.
  • Branch-site results are BH-corrected.

Optional ASR block

When --run-asr yes, the workflow:

  1. runs codeml ASR per pathway
  2. maps selected branches to parent and child nodes
  3. extracts ancestor and descendant CDS and amino-acid sequences
  4. computes branch substitutions
  5. annotates overlaps with MEME and BEB where available

When --run-asr no:

  • the ASR extraction block is skipped entirely
  • HyPhy, branch-site, summaries, robustness reports, and run provenance still complete
  • per-pathway asr_done.json files are still written with status: skipped
  • the robustness matrix reports ASR as not completed rather than falsely completed

Outputs And Run Directory Layout

All outputs live inside --outdir.

Early-stage outputs

  • inputs/query.fasta
  • inputs/proteomes/
  • orthogroup/orthogroup_proteins.fasta
  • orthogroup/orthogroup_headers.txt
  • orthogroup/orthogroup_summary.tsv
  • orthogroup/WAITING_FOR_CDS.txt
  • user_supplied/orthogroup_cds.fasta
  • mapped_cds/cds_protein_mapping.tsv

Alignment and tree outputs

  • alignments/<method>/orthogroup_proteins.protein.aln.fasta
  • alignments/<method>/mapped_orthogroup_cds.codon.aln.fasta
  • alignments/<method>/<trim_state>/orthogroup_proteins.analysis.fasta
  • alignments/<method>/<trim_state>/mapped_orthogroup_cds.analysis.fasta
  • tree/<method>/<trim_state>/orthogroup.treefile
  • tree/<method>/<trim_state>/orthogroup.rooted.treefile

Selection outputs

  • hyphy/<method>/<trim_state>/absrel.json
  • hyphy/<method>/<trim_state>/meme.json
  • hyphy/<method>/<trim_state>/hyphy_done.json
  • hyphy/<method>/<trim_state>/significant_foregrounds.tsv
  • hyphy/<method>/<trim_state>/foreground_threshold.json
  • branchsite/<method>/<trim_state>/foreground_trees.tsv
  • branchsite/<method>/<trim_state>/branchsite_results.tsv

ASR outputs

Always present per pathway:

  • asr/<method>/<trim_state>/asr_done.json

Produced only when --run-asr yes:

  • asr/<method>/<trim_state>/mlc_asr.txt
  • asr/<method>/<trim_state>/rst
  • asr/branch_to_nodes.tsv
  • asr/ancestor_sequences_cds.fasta
  • asr/ancestor_sequences_aa.fasta
  • asr/descendant_sequences_cds.fasta
  • asr/descendant_sequences_aa.fasta
  • asr/branch_substitutions.tsv
  • asr/selected_branch_asr_summary.tsv
  • asr/asr_extraction_provenance.json
  • asr/asr_done.json

Summary and provenance outputs

  • summary/<method>/<trim_state>/episodic_selection_summary.txt
  • summary/robustness_matrix.tsv
  • summary/robustness_consensus.tsv
  • summary/robustness_narrative.txt
  • summary/comparative_reproducibility_summary.txt
  • summary/robustness_publication_table.tex
  • summary/run_provenance.json
  • summary/episodic_selection_summary.txt

Internal resume state

  • .babappasnake/resume_state.json
  • .snakemake/

CLI Reference

Basic form for a fresh run:

babappasnake --prot PROTEOMES_DIR --query QUERY_FASTA [options]

Basic form for a resumed run:

babappasnake --resume --outdir RUN_DIR [resume overrides]

Core options

  • --prot PATH
  • --query PATH
  • --cds PATH
  • --orthogroup-proteins PATH
  • --outdir PATH
  • --interactive {yes,no}
  • --guided {yes,no}
  • --resume
  • --snake-args "..."

Orthogroup and alignment options

  • --orthogroup-method {orthofinder}
  • --orthology-mode {strict,representative,paralog}
  • --coverage FLOAT
  • --alignment-methods {1,2,3,4}
  • --trim-strategy {raw,clipkit,both}
  • --use-clipkit {yes,no}
  • --clipkit-mode-protein TEXT
  • --clipkit-mode-codon TEXT

Practical note:

  • --trim-strategy and --use-clipkit are accepted, but runtime robustness mode is still forced to both.

Tree and selection options

  • --outgroup TEXT
  • --threads INT
  • --iqtree-bootstrap INT
  • --iqtree-bnni {yes,no}
  • --iqtree-model TEXT
  • --recombination {none,gard,auto}
  • --gard-mode {Normal,Faster}
  • --gard-rate-classes INT
  • --absrel-branches TEXT
  • --meme-branches TEXT
  • --codeml-codonfreq INT
  • --run-asr {yes,no}
  • --absrel-p FLOAT
  • --absrel-dynamic-start FLOAT
  • --absrel-dynamic-step FLOAT
  • --absrel-dynamic-max FLOAT
  • --meme-p FLOAT

Troubleshooting

Run stops at WAITING_FOR_CDS.txt

This is expected when CDS was not supplied yet.

Fix:

  1. provide OUTDIR/user_supplied/orthogroup_cds.fasta
  2. run babappasnake --resume --outdir OUTDIR

Outgroup was not available at the start

This is supported.

Behavior:

  • the run does not need to crash
  • downstream continues with the unrooted tree when no usable outgroup is present

Interrupted run after crash, reboot, or power loss

Use:

babappasnake --resume --outdir OUTDIR

The workflow will reload saved settings, clear stale lock state, and continue incomplete work.

OrthoFinder finds no usable orthogroup

The pipeline stops explicitly when the selected orthogroup and orthology mode retain no partner sequences.

Check:

  • query quality
  • proteome quality
  • taxon sampling
  • whether the query is biologically represented in the panel
  • whether a curated external orthogroup should be supplied with --orthogroup-proteins

macOS metadata files inside the proteome directory

They are ignored automatically, but you can clean them if you want:

find /path/to/proteomes -type f \( -name '._*' -o -name '.DS_Store' \) -delete

codeml or HyPhy warnings

Warnings are tolerated when required result files are present. Hard failure occurs when required outputs are missing or unusable.

--resume refuses my new analysis flags

This is intentional. Resume is designed to continue an existing run, not mutate its workflow definition midway through. Start a new --outdir if you want different analysis settings.

Developer Notes

Editable install

pip install -e .

Build distributions

python -m pip install --upgrade build twine
python -m build --sdist --wheel
twine check dist/*

Publish

twine upload dist/*

Release checklist

  1. Update version in pyproject.toml and babappasnake/__init__.py.
  2. Run the test suite.
  3. Build source and wheel distributions.
  4. Check distributions.
  5. Publish to PyPI.
  6. Tag the release in Git.

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

babappasnake-1.4.0.tar.gz (81.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

babappasnake-1.4.0-py3-none-any.whl (78.2 kB view details)

Uploaded Python 3

File details

Details for the file babappasnake-1.4.0.tar.gz.

File metadata

  • Download URL: babappasnake-1.4.0.tar.gz
  • Upload date:
  • Size: 81.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for babappasnake-1.4.0.tar.gz
Algorithm Hash digest
SHA256 b0b44a498565ac5685fa92d62e7cbbcfe61cfd000b852bcedb5f916664849aca
MD5 ae764df725312e372fda1029bba2224b
BLAKE2b-256 dcbedb841a380fd775a49befa1b66ff8342da8e2bb4b53eafb71d6cd88f46c5d

See more details on using hashes here.

Provenance

The following attestation bundles were made for babappasnake-1.4.0.tar.gz:

Publisher: python-publish.yml on sinhakrishnendu/babappasnake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file babappasnake-1.4.0-py3-none-any.whl.

File metadata

  • Download URL: babappasnake-1.4.0-py3-none-any.whl
  • Upload date:
  • Size: 78.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for babappasnake-1.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 86a19955c52bec95bd21f59a955959f4ec960ea704a0122419b794658368aed4
MD5 01940ec2f4849cd89d76f743ff1fb299
BLAKE2b-256 44987bf9cc655e6e8c2cb772ac784b23c87726a77429da27655d4710d60e6ed5

See more details on using hashes here.

Provenance

The following attestation bundles were made for babappasnake-1.4.0-py3-none-any.whl:

Publisher: python-publish.yml on sinhakrishnendu/babappasnake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page