A simple Snakemake pipeline for episodic selection analysis
BABAPPASnake
babappasnake is a command-line workflow for episodic positive selection analysis on one orthogroup at a time. It is built for practical comparative genomics: reproducible run directories, guided stepwise execution, resumable restarts after interruption, and robustness summaries across multiple alignment and trimming pathways.
This README is the user manual for the current command-line interface (CLI) and workflow behavior.
Contents
- What The Pipeline Does
- Installation And Environment
- Input Requirements
- Orthogroup Discovery Modes
- Running The Pipeline
- Resume And Recovery
- Workflow Stages
- Alignment And Robustness Design
- Recombination, Selection, And ASR
- Outputs And Run Directory Layout
- CLI Reference
- Troubleshooting
- Developer Notes
- License
What The Pipeline Does
A typical full run performs these stages:
- Define an orthogroup from a protein query and proteome panel, or stage an externally curated orthogroup FASTA.
- Map user-provided coding sequence (CDS) records to the selected orthogroup proteins with coding-quality filters.
- Build protein and codon alignments across one or more alignment methods.
- Run both raw and clipkit pathway variants for robustness.
- Infer pathway-specific trees with IQ-TREE.
- Optionally root trees with an outgroup text query.
- Optionally screen for recombination with HyPhy genetic algorithm for recombination detection (GARD).
- Run HyPhy adaptive branch-site random effects likelihood (aBSREL) and mixed effects model of evolution (MEME).
- Select branch-site foregrounds dynamically from aBSREL output.
- Run branch-site codeml.
- Optionally run pathway-level codeml ancestral sequence reconstruction (ASR).
- Optionally extract ancestor and descendant sequences plus branch substitutions.
- Write pathway summaries, robustness reports, and run provenance.
Design assumptions:
- The workflow operates on one orthogroup at a time.
- CDS can be unavailable at the start. The pipeline supports a protein-first checkpointed workflow.
- Outgroup is optional. If it is absent or unusable, downstream analysis continues with the unrooted tree.
- Robustness mode is always enforced internally as raw + clipkit, even if you request a single trimming mode.
Installation And Environment
Recommended one-command environment
conda create -n babappasnake -c conda-forge -c bioconda \
python=3.11 blast orthofinder iqtree hyphy paml clipkit mafft prank pip
conda activate babappasnake
pip install babappasnake
External tools by feature
Always needed for a complete end-to-end run after orthogroup proteins are available:
- iqtree, iqtree2, or iqtree3
- hyphy
- codeml from paml
- clipkit
Needed only when using the default OrthoFinder-assisted orthogroup mode:
- blastp
- makeblastdb
- orthofinder for --orthogroup-method orthofinder
Needed only for selected alignment methods:
- mafft when --alignment-methods is 2 or 4
- prank when --alignment-methods is 3 or 4
Python-side note:
- babappalign is installed automatically as a dependency of babappasnake.
Verify installation
babappasnake --help
which blastp makeblastdb iqtree iqtree2 iqtree3 hyphy codeml clipkit
which orthofinder mafft prank
which babappasnake
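The same check can be scripted. The following is a minimal sketch, not part of babappasnake: it groups the executables named under "External tools by feature" and reports which requirements have no candidate on PATH. The names missing_tools, ALWAYS, ORTHOFINDER_MODE, and ALIGNERS are hypothetical helpers introduced for illustration.

```python
import shutil

# Tool groups as listed under "External tools by feature".
# Each requirement is satisfied by any one of its candidate executables.
ALWAYS = {
    "iqtree": ("iqtree", "iqtree2", "iqtree3"),
    "hyphy": ("hyphy",),
    "codeml": ("codeml",),
    "clipkit": ("clipkit",),
}
ORTHOFINDER_MODE = {
    "blastp": ("blastp",),
    "makeblastdb": ("makeblastdb",),
    "orthofinder": ("orthofinder",),
}
ALIGNERS = {"mafft": ("mafft",), "prank": ("prank",)}


def missing_tools(groups, which=shutil.which):
    """Return requirement names for which no candidate executable is on PATH."""
    return [
        name
        for name, candidates in groups.items()
        if not any(which(exe) for exe in candidates)
    ]


if __name__ == "__main__":
    for label, groups in (
        ("always needed", ALWAYS),
        ("orthofinder mode", ORTHOFINDER_MODE),
        ("optional aligners", ALIGNERS),
    ):
        print(label, "missing:", missing_tools(groups) or "none")
```

The which parameter is injectable only so the function can be exercised without touching the real PATH.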
Input Requirements
--prot
Directory of protein FASTA files, one file per species.
Requirements and behavior:
- Supported extensions are .fa, .faa, and .fasta, case-insensitive.
- Hidden files and macOS metadata sidecars such as ._* and .DS_Store are ignored.
- Empty or malformed FASTA files are skipped with warnings.
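To preview which files in a proteome directory would pass the naming rules above, a short sketch like the following can help. It is an illustration of the documented rules only, not the workflow's own code; is_proteome_candidate and list_proteomes are hypothetical names, and the sketch does not check for empty or malformed FASTA content.

```python
from pathlib import Path

FASTA_SUFFIXES = {".fa", ".faa", ".fasta"}


def is_proteome_candidate(name):
    """True if a file name passes the documented extension and hidden-file rules."""
    if name.startswith("."):
        # Covers hidden files, ._* sidecars, and .DS_Store.
        return False
    return Path(name).suffix.lower() in FASTA_SUFFIXES


def list_proteomes(directory):
    """Sorted candidate FASTA files directly inside `directory`."""
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.is_file() and is_proteome_candidate(p.name)
    )
```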
--query
Protein FASTA containing exactly one query sequence.
--cds
Optional on the first run.
Behavior:
- If CDS is supplied at the start, the workflow can proceed through codon, tree, HyPhy, codeml, and summary stages in one run.
- If CDS is not supplied, the workflow stops after orthogroup definition and writes orthogroup/WAITING_FOR_CDS.txt.
- You can then place CDS at OUTDIR/user_supplied/orthogroup_cds.fasta and resume.
--orthogroup-proteins
Optional externally curated orthogroup protein FASTA.
Behavior:
- When provided, BABAPPASNAKE skips OrthoFinder and starts from this protein FASTA.
- This is recommended when orthology has already been inferred, manually curated, or benchmarked outside the workflow.
- The FASTA must contain the query protein plus at least one partner sequence.
- If --query is also provided, its first sequence ID is used as the query ID in orthogroup metadata; otherwise the first record in --orthogroup-proteins is treated as the query.
--outgroup
Optional text query used to root pathway trees by case-insensitive substring matching against tip labels.
Behavior:
- If omitted, rooting is skipped and the unrooted tree is propagated downstream.
- If provided but unmatched or too broad, rooting falls back safely to the unrooted tree instead of crashing the run.
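The matching rule can be tested against your own tip labels before a run. The sketch below is an assumption-laden illustration: the README does not define "too broad", so here a query that matches every tip is treated as too broad because no ingroup would remain. The function names match_outgroup and choose_outgroup are hypothetical.

```python
def match_outgroup(tip_labels, query):
    """Return tip labels that contain `query`, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [tip for tip in tip_labels if needle in tip.lower()]


def choose_outgroup(tip_labels, query):
    """Return matched tips, or None to signal "keep the unrooted tree".

    Assumption: a match is "too broad" when it covers every tip.
    The real workflow may apply a different criterion.
    """
    matches = match_outgroup(tip_labels, query)
    if not matches or len(matches) == len(tip_labels):
        return None
    return matches
```

Returning None rather than raising mirrors the documented fallback: an unusable outgroup is not an error.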
Orthogroup Discovery Modes
External curated orthogroup: --orthogroup-proteins
Orthology inference can be handled outside BABAPPASNAKE.
Use --orthogroup-proteins to provide a curated protein FASTA from OrthoFinder, SonicParanoid, OMA, manual curation, or any other orthology workflow.
The pipeline then performs CDS mapping, alignment, tree inference, HyPhy/Phylogenetic Analysis by Maximum Likelihood (PAML) analyses, robustness summaries, and provenance reporting without running orthology inference internally.
Default: --orthogroup-method orthofinder
OrthoFinder is the default built-in orthogroup helper when an external orthogroup FASTA is not supplied.
Behavior:
- Run OrthoFinder.
- BLAST the query against OrthoFinder orthogroup-member proteins.
- Rank orthogroups by query support.
- Extract members from the best supported orthogroup using --orthology-mode.
OrthoFinder query mapping details
The workflow does not assume that the original query ID is directly present inside an OrthoFinder orthogroup. Instead it does a query-to-members mapping step:
- Parse Orthogroups.tsv.
- Build a combined FASTA of orthogroup-member proteins.
- BLAST the query against that combined FASTA.
- Filter by query coverage threshold.
- Rank orthogroups by cross-species support and bitscore.
- Extract the top supported orthogroup and apply the configured orthology/paralogy mode.
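The filter-then-rank part of these steps can be sketched as follows. This is one plausible reading, not the workflow's implementation: it assumes coverage is a fraction between 0 and 1, that "cross-species support" means the number of distinct species with a passing hit, and that ties are broken by the best bitscore. The hit dictionary keys and the name rank_orthogroups are hypothetical.

```python
from collections import defaultdict


def rank_orthogroups(hits, min_coverage):
    """Rank orthogroups by supporting species count, then best bitscore.

    `hits` is an iterable of dicts with hypothetical keys:
    orthogroup, species, coverage (0-1), bitscore.
    """
    species = defaultdict(set)
    best = defaultdict(float)
    for hit in hits:
        if hit["coverage"] < min_coverage:
            continue  # the coverage filter is applied before ranking
        og = hit["orthogroup"]
        species[og].add(hit["species"])
        best[og] = max(best[og], hit["bitscore"])
    # Most species first, then highest bitscore, then name for determinism.
    return sorted(species, key=lambda og: (-len(species[og]), -best[og], og))
```

The first element of the returned list corresponds to the "top supported orthogroup" under these assumptions.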
Orthology and paralogy modes
- --orthology-mode representative is the default. Single-copy species are retained directly; multi-copy species contribute the best query-supported copy.
- --orthology-mode strict retains only species with exactly one member in the selected orthogroup.
- --orthology-mode paralog retains all copies from each species in the selected orthogroup.
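The difference between the three modes is easiest to see on a small example. The sketch below is illustrative only; it assumes "best query-supported copy" means the copy with the highest query bitscore, and select_members is a hypothetical name.

```python
def select_members(members, mode="representative"):
    """Pick orthogroup members per species under one orthology mode.

    `members` maps species -> list of (protein_id, query_bitscore) tuples.
    """
    selected = {}
    for species, copies in members.items():
        if not copies:
            continue
        if mode == "paralog":
            # Keep every copy from every species.
            selected[species] = [pid for pid, _ in copies]
        elif mode == "strict":
            # Keep only species with exactly one member.
            if len(copies) == 1:
                selected[species] = [copies[0][0]]
        elif mode == "representative":
            # Keep the single best-scoring copy per species.
            best = max(copies, key=lambda copy: copy[1])
            selected[species] = [best[0]]
        else:
            raise ValueError(f"unknown orthology mode: {mode}")
    return selected
```

With one single-copy species and one two-copy species, strict drops the multi-copy species, representative keeps one copy from each, and paralog keeps all three proteins.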
Running The Pipeline
Guided interactive mode
babappasnake
Behavior:
- Prompts for run settings.
- Executes the workflow step by step.
- Shows step descriptions, previews, and output summaries.
- Lets you choose run, skip, or stop at each step where appropriate.
Important defaults in guided mode:
- Orthogroup method defaults to orthofinder.
- Orthology mode defaults to representative.
- Alignment methods default to 4, which means all three aligners.
- Trimming is forced internally to robustness mode: raw + clipkit.
- Recombination defaults to none.
- ASR defaults to yes.
- CDS is requested only after orthogroup definition.
- Outgroup is requested after CDS handling and remains optional.
Non-interactive batch mode
Use this for scripted or high-performance-computing-style runs:
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--cds /path/to/orthogroup_cds.fasta \
--outdir run01 \
--threads 12 \
--interactive no \
--guided no
Example: representative orthology mode with ASR disabled
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--cds /path/to/orthogroup_cds.fasta \
--orthology-mode representative \
--run-asr no \
--outdir run_representative_no_asr \
--threads 12 \
--interactive no \
--guided no
Example: retain all paralog copies
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--cds /path/to/orthogroup_cds.fasta \
--orthology-mode paralog \
--outdir run_paralog_mode \
--threads 12 \
--interactive no \
--guided no
Example: start from externally curated orthogroup proteins
babappasnake \
--orthogroup-proteins /path/to/curated_orthogroup_proteins.fasta \
--cds /path/to/orthogroup_cds.fasta \
--outdir run_external_orthogroup \
--threads 12 \
--interactive no \
--guided no
Example: enable GARD
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--cds /path/to/orthogroup_cds.fasta \
--recombination gard \
--gard-mode Faster \
--gard-rate-classes 3 \
--outdir run_gard \
--threads 12 \
--interactive no \
--guided no
Two-stage run when CDS is not available initially
Stage 1:
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--outdir run01 \
--interactive no \
--guided no
After orthogroup definition, provide the CDS file and continue:
babappasnake --resume --outdir run01 --cds /path/to/orthogroup_cds.fasta
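When the two stages are driven from a script, the second command can be assembled from the state of the run directory. The following is a minimal sketch under stated assumptions: it only uses the flags and file locations documented in this README, it assumes the presence of orthogroup/WAITING_FOR_CDS.txt means the run paused for CDS, and resume_command is a hypothetical helper.

```python
from pathlib import Path


def resume_command(outdir, cds=None):
    """Build the documented resume command as an argv list.

    Returns None when the run is waiting for CDS and none is available,
    either as an argument or staged in the run directory.
    """
    outdir = Path(outdir)
    waiting = (outdir / "orthogroup" / "WAITING_FOR_CDS.txt").exists()
    staged = (outdir / "user_supplied" / "orthogroup_cds.fasta").exists()
    if waiting and cds is None and not staged:
        return None
    argv = ["babappasnake", "--resume", "--outdir", str(outdir)]
    if cds is not None:
        argv += ["--cds", str(cds)]
    return argv
```

The returned list can be passed to subprocess.run(argv, check=True); building it separately keeps the decision logic testable without launching the workflow.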
Resume And Recovery
Use resume whenever a run stops unexpectedly or intentionally:
babappasnake --resume --outdir run01
Typical reasons to resume:
- workflow failure
- external tool failure
- terminal closure
- power interruption
- machine restart
- manual stop after the orthogroup checkpoint
What --resume does:
- reloads OUTDIR/config.yaml
- reuses saved analysis settings
- clears stale Snakemake lock state for that run directory
- continues incomplete work instead of restarting from scratch
Guided resume behavior:
- saved and detected completed steps are skipped automatically
- the session restarts near the interruption point
Non-guided resume behavior:
- Snakemake reruns incomplete work only
Files used for resume:
- OUTDIR/config.yaml
- OUTDIR/.babappasnake/resume_state.json
Allowed resume-time overrides:
- --cds PATH
- --outgroup TEXT
- --threads INT
- --guided {yes,no}
- --snake-args "..."
Analysis settings that define workflow structure are intentionally not changeable on resume. For example, do not expect --resume to accept a new orthogroup method, new alignment mode, new recombination mode, or new ASR setting for an existing run directory.
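A wrapper script can catch disallowed flags before calling the CLI. The sketch below is a hypothetical pre-flight check, not a description of how babappasnake parses arguments: it compares long options against the documented override list and skips the value of --snake-args, which may itself begin with two dashes.

```python
# Flags accepted alongside --resume, per "Allowed resume-time overrides",
# plus --resume and --outdir themselves.
ALLOWED_ON_RESUME = {
    "--resume", "--outdir",
    "--cds", "--outgroup", "--threads", "--guided", "--snake-args",
}


def rejected_resume_flags(argv):
    """Return long options in `argv` that are not documented resume overrides."""
    rejected = []
    skip_next = False
    for arg in argv:
        if skip_next:
            skip_next = False
            continue
        if arg == "--snake-args":
            skip_next = True  # its value may start with "--"
            continue
        if arg.startswith("--"):
            flag = arg.split("=", 1)[0]
            if flag not in ALLOWED_ON_RESUME:
                rejected.append(flag)
    return rejected
```

An empty result means the argument list only uses documented overrides; anything else is a signal to start a new --outdir instead.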
Workflow Stages
- define_orthogroup: Build the selected orthogroup with OrthoFinder or stage externally curated orthogroup proteins.
- map_cds: Map CDS to selected proteins and filter low-quality CDS.
- align_proteins_all_methods: Build protein alignments.
- align_cds_all_methods: Build codon alignments.
- prepare_branch_inputs_all_pathways: Expand each method into raw and clipkit analysis pathways.
- gard_all_pathways: Optional recombination screening.
- iqtree_ml_all_pathways: Infer pathway trees.
- root_iqtree_outgroup_all_pathways: Root trees when possible, otherwise propagate unrooted trees.
- hyphy_exploratory_all_pathways: Run aBSREL and MEME.
- parse_foregrounds_all_pathways: Select significant foreground branches from aBSREL.
- prepare_foreground_trees_all_pathways: Build branch-labeled trees for branch-site codeml.
- branchsite_batch_all_pathways: Run branch-site codeml and Benjamini-Hochberg (BH)-correct results.
- codeml_asr_all_pathways: Optional pathway-level ASR.
- extract_selected_branch_ancestors: Optional branch-level ancestral and descendant sequence extraction.
- final_summary_all_pathways: Write pathway-specific episodic selection summaries.
- robustness_reports: Write cross-pathway robustness outputs.
- write_run_provenance: Write machine-readable provenance for the run.
Alignment And Robustness Design
Alignment methods
--alignment-methods choices:
- 1: babappalign
- 2: mafft
- 3: prank
- 4: babappalign, mafft, and prank
Trimming behavior
The workflow accepts --trim-strategy, but runtime robustness mode is always enforced internally as:
- raw
- clipkit
This means each selected alignment method expands into two pathways.
Example for --alignment-methods 4:
- babappalign_raw
- babappalign_clipkit
- mafft_raw
- mafft_clipkit
- prank_raw
- prank_clipkit
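The expansion from an --alignment-methods choice to pathway names can be written in a few lines. This sketch reproduces the documented mapping and naming; ALIGNMENT_METHODS, TRIM_STATES, and expand_pathways are hypothetical names, not identifiers from the package.

```python
# Mapping taken from the --alignment-methods choices above.
ALIGNMENT_METHODS = {
    1: ("babappalign",),
    2: ("mafft",),
    3: ("prank",),
    4: ("babappalign", "mafft", "prank"),
}
# Robustness mode always runs both trim states.
TRIM_STATES = ("raw", "clipkit")


def expand_pathways(choice):
    """Return pathway names for one --alignment-methods choice."""
    return [
        f"{method}_{trim}"
        for method in ALIGNMENT_METHODS[choice]
        for trim in TRIM_STATES
    ]
```

The number of pathways is always twice the number of selected aligners, which is a useful estimate of how many IQ-TREE, HyPhy, and codeml jobs a run will launch.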
This is deliberate. Robustness outputs assume both untrimmed and trimmed pathway variants are available.
Recombination, Selection, And ASR
Tree inference and rooting
- IQ-TREE runs once per active pathway.
- Rooting is optional.
- If outgroup matching fails, the workflow continues with the unrooted tree.
Optional recombination screening with GARD
Controlled by --recombination {none,gard,auto}:
- none: do not run GARD
- gard: run HyPhy GARD
- auto: currently an alias of gard
Additional controls:
- --gard-mode {Normal,Faster}
- --gard-rate-classes INT
GARD outputs:
- recombination/<method>/<trim_state>/gard/gard.json
- recombination/<method>/<trim_state>/gard/gard_summary.json
- recombination/<method>/<trim_state>/gard/gard.stdout.txt
- recombination/<method>/<trim_state>/gard/gard.stderr.txt
Current interpretation:
- GARD is used as a screening and reporting layer.
- Downstream branch-site codeml is still run on the full-length pathway alignment unless fragment-aware routing is added in a future release.
HyPhy and branch-site foreground selection
- aBSREL and MEME run per pathway.
- Foregrounds are selected dynamically from aBSREL.
- Default aBSREL dynamic threshold settings:
  - start 0.05
  - increment 0.01
  - cap 0.2
- Selected foregrounds are passed to branch-site codeml.
- Branch-site results are BH-corrected.
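The two numerical steps in this list can be sketched as follows. The dynamic threshold function is an interpretation, not the workflow's code: it assumes the threshold is relaxed from the start value in fixed increments until at least one branch passes or the cap is reached, and it does not say whether the workflow compares raw or corrected aBSREL p-values. The Benjamini-Hochberg function is the standard step-up adjustment. Both function names are hypothetical.

```python
def dynamic_foregrounds(branch_pvalues, start=0.05, step=0.01, cap=0.2):
    """Relax the threshold from `start` toward `cap` until a branch passes.

    Returns (threshold, branches); branches is empty if nothing passes at cap.
    """
    n_steps = int(round((cap - start) / step))
    for i in range(n_steps + 1):
        # Round to avoid floating-point drift such as 0.07999999999.
        threshold = round(start + i * step, 10)
        selected = sorted(b for b, p in branch_pvalues.items() if p <= threshold)
        if selected:
            return threshold, selected
    return cap, []


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Under this reading, a pathway with no branch below 0.05 can still contribute foregrounds at a looser threshold, and the threshold actually used is returned so it can be reported.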
Optional ASR block
When --run-asr yes, the workflow:
- runs codeml ASR per pathway
- maps selected branches to parent and child nodes
- extracts ancestor and descendant CDS and amino-acid sequences
- computes branch substitutions
- annotates overlaps with MEME and BEB where available
When --run-asr no:
- the ASR extraction block is skipped entirely
- HyPhy, branch-site, summaries, robustness reports, and run provenance still complete
- per-pathway asr_done.json files are still written with status: skipped
- the robustness matrix reports ASR as not completed rather than falsely completed
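A downstream script can use the marker file to tell a skipped ASR step from a completed one. The sketch below assumes asr_done.json is a JSON object with a top-level status key, which the text above suggests but does not fully specify; asr_status is a hypothetical helper.

```python
import json
from pathlib import Path


def asr_status(outdir, method, trim_state):
    """Return the `status` value from a pathway's asr_done.json, or None."""
    marker = Path(outdir) / "asr" / method / trim_state / "asr_done.json"
    if not marker.is_file():
        return None
    with marker.open() as handle:
        payload = json.load(handle)
    # Assumption: the payload is a JSON object with a top-level "status" key.
    if not isinstance(payload, dict):
        return None
    return payload.get("status")
```

A return value of "skipped" corresponds to a run with --run-asr no; None means the marker has not been written yet or has an unexpected shape.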
Outputs And Run Directory Layout
All outputs live inside --outdir.
Early-stage outputs
- inputs/query.fasta
- inputs/proteomes/
- orthogroup/orthogroup_proteins.fasta
- orthogroup/orthogroup_headers.txt
- orthogroup/orthogroup_summary.tsv
- orthogroup/WAITING_FOR_CDS.txt
- user_supplied/orthogroup_cds.fasta
- mapped_cds/cds_protein_mapping.tsv
Alignment and tree outputs
- alignments/<method>/orthogroup_proteins.protein.aln.fasta
- alignments/<method>/mapped_orthogroup_cds.codon.aln.fasta
- alignments/<method>/<trim_state>/orthogroup_proteins.analysis.fasta
- alignments/<method>/<trim_state>/mapped_orthogroup_cds.analysis.fasta
- tree/<method>/<trim_state>/orthogroup.treefile
- tree/<method>/<trim_state>/orthogroup.rooted.treefile
Selection outputs
- hyphy/<method>/<trim_state>/absrel.json
- hyphy/<method>/<trim_state>/meme.json
- hyphy/<method>/<trim_state>/hyphy_done.json
- hyphy/<method>/<trim_state>/significant_foregrounds.tsv
- hyphy/<method>/<trim_state>/foreground_threshold.json
- branchsite/<method>/<trim_state>/foreground_trees.tsv
- branchsite/<method>/<trim_state>/branchsite_results.tsv
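To confirm that a pathway finished its selection stages, the listed paths can be checked directly. This is a small illustrative sketch built only from the file names above; SELECTION_OUTPUTS and missing_selection_outputs are hypothetical names.

```python
from pathlib import Path

# Relative path templates copied from the "Selection outputs" list.
SELECTION_OUTPUTS = (
    "hyphy/{method}/{trim_state}/absrel.json",
    "hyphy/{method}/{trim_state}/meme.json",
    "hyphy/{method}/{trim_state}/hyphy_done.json",
    "hyphy/{method}/{trim_state}/significant_foregrounds.tsv",
    "hyphy/{method}/{trim_state}/foreground_threshold.json",
    "branchsite/{method}/{trim_state}/foreground_trees.tsv",
    "branchsite/{method}/{trim_state}/branchsite_results.tsv",
)


def missing_selection_outputs(outdir, method, trim_state):
    """Return the selection output paths that do not exist for one pathway."""
    root = Path(outdir)
    expected = [
        template.format(method=method, trim_state=trim_state)
        for template in SELECTION_OUTPUTS
    ]
    return [rel for rel in expected if not (root / rel).exists()]
```

For example, missing_selection_outputs("run01", "mafft", "clipkit") returns an empty list once that pathway has all seven files.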
ASR outputs
Always present per pathway:
- asr/<method>/<trim_state>/asr_done.json
Produced only when --run-asr yes:
- asr/<method>/<trim_state>/mlc_asr.txt
- asr/<method>/<trim_state>/rst
- asr/branch_to_nodes.tsv
- asr/ancestor_sequences_cds.fasta
- asr/ancestor_sequences_aa.fasta
- asr/descendant_sequences_cds.fasta
- asr/descendant_sequences_aa.fasta
- asr/branch_substitutions.tsv
- asr/selected_branch_asr_summary.tsv
- asr/asr_extraction_provenance.json
- asr/asr_done.json
Summary and provenance outputs
- summary/<method>/<trim_state>/episodic_selection_summary.txt
- summary/robustness_matrix.tsv
- summary/robustness_consensus.tsv
- summary/robustness_narrative.txt
- summary/comparative_reproducibility_summary.txt
- summary/robustness_publication_table.tex
- summary/run_provenance.json
- summary/episodic_selection_summary.txt
Internal resume state
- .babappasnake/resume_state.json
- .snakemake/
CLI Reference
Basic form for a fresh run:
babappasnake --prot PROTEOMES_DIR --query QUERY_FASTA [options]
Basic form for a resumed run:
babappasnake --resume --outdir RUN_DIR [resume overrides]
Core options
- --prot PATH
- --query PATH
- --cds PATH
- --orthogroup-proteins PATH
- --outdir PATH
- --interactive {yes,no}
- --guided {yes,no}
- --resume
- --snake-args "..."
Orthogroup and alignment options
- --orthogroup-method {orthofinder}
- --orthology-mode {strict,representative,paralog}
- --coverage FLOAT
- --alignment-methods {1,2,3,4}
- --trim-strategy {raw,clipkit,both}
- --use-clipkit {yes,no}
- --clipkit-mode-protein TEXT
- --clipkit-mode-codon TEXT
Practical note:
- --trim-strategy and --use-clipkit are accepted, but runtime robustness mode is still forced to both.
Tree and selection options
- --outgroup TEXT
- --threads INT
- --iqtree-bootstrap INT
- --iqtree-bnni {yes,no}
- --iqtree-model TEXT
- --recombination {none,gard,auto}
- --gard-mode {Normal,Faster}
- --gard-rate-classes INT
- --absrel-branches TEXT
- --meme-branches TEXT
- --codeml-codonfreq INT
- --run-asr {yes,no}
- --absrel-p FLOAT
- --absrel-dynamic-start FLOAT
- --absrel-dynamic-step FLOAT
- --absrel-dynamic-max FLOAT
- --meme-p FLOAT
Troubleshooting
Run stops at WAITING_FOR_CDS.txt
This is expected when CDS was not supplied yet.
Fix:
- provide OUTDIR/user_supplied/orthogroup_cds.fasta
- run babappasnake --resume --outdir OUTDIR
Outgroup was not available at the start
This is supported.
Behavior:
- the run does not crash when the outgroup is missing or unmatched
- downstream analysis continues with the unrooted tree when no usable outgroup is present
Interrupted run after crash, reboot, or power loss
Use:
babappasnake --resume --outdir OUTDIR
The workflow will reload saved settings, clear stale lock state, and continue incomplete work.
OrthoFinder finds no usable orthogroup
The pipeline stops explicitly when the selected orthogroup and orthology mode retain no partner sequences.
Check:
- query quality
- proteome quality
- taxon sampling
- whether the query is biologically represented in the panel
- whether a curated external orthogroup should be supplied with --orthogroup-proteins
macOS metadata files inside the proteome directory
They are ignored automatically, but you can clean them if you want:
find /path/to/proteomes -type f \( -name '._*' -o -name '.DS_Store' \) -delete
codeml or HyPhy warnings
Warnings are tolerated when required result files are present. Hard failure occurs when required outputs are missing or unusable.
--resume refuses my new analysis flags
This is intentional. Resume is designed to continue an existing run, not mutate its workflow definition midway through. Start a new --outdir if you want different analysis settings.
Developer Notes
Editable install
pip install -e .
Build distributions
python -m pip install --upgrade build twine
python -m build --sdist --wheel
twine check dist/*
Publish
twine upload dist/*
Release checklist
- Update version in pyproject.toml and babappasnake/__init__.py.
- Run the test suite.
- Build source and wheel distributions.
- Check distributions.
- Publish to PyPI.
- Tag the release in Git.
License
MIT