A simple Snakemake pipeline for episodic selection analysis
BABAPPASnake
babappasnake is a reproducible command-line workflow for episodic positive selection analysis on one orthogroup at a time.
It is designed for practical comparative genomics: resumable runs, interactive stepwise control, and robustness summaries across alignment and trimming choices.
Quick Start
1) Install tools and package
conda create -n babappasnake -c conda-forge -c bioconda \
python=3.11 blast orthofinder iqtree hyphy paml clipkit mafft prank pip
conda activate babappasnake
pip install babappasnake
2) Run (non-interactive)
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--cds /path/to/orthogroup_cds.fasta \
--alignment-methods 4 \
--outgroup culex \
--outdir run01 \
--threads 12 \
--interactive no \
--guided no
3) Run (interactive guided mode)
babappasnake
This prompts step by step, shows what each stage does, and supports run/skip/stop per stage.
Manual Contents
- What The Pipeline Does
- Installation And Environment
- Input Requirements
- Orthogroup Discovery Strategy
- Running Modes
- CDS Mapping And QC
- Alignment, Trimming, Tree, And Selection Steps
- ASR Extraction Of Selected Branches
- Outputs And Directory Layout
- Resume, Rerun, And Reproducibility
- CLI Reference
- Troubleshooting
- Developer And Release Notes
What The Pipeline Does
A complete run performs these stages:
- Build orthogroup candidates from your query and proteomes.
- Select the best orthogroup strategy by strict 1:1 ortholog support.
- Map user CDS to selected proteins with coding-quality filtering.
- Build protein alignments with selected MSA engines.
- Build codon alignments (native for BABAPPAlign; robust back-translation for MAFFT/PRANK).
- Run both trimming states (raw and clipkit) for robustness.
- Infer trees with IQ-TREE for each (method, trim_state) pathway.
- Optionally root trees with outgroup text query.
- Run HyPhy aBSREL + MEME per pathway.
- Select foreground branches dynamically from aBSREL.
- Run branch-site codeml per selected foreground.
- Run codeml ASR per pathway.
- Extract ancestor/descendant sequences and substitutions for selected branches.
- Write pathway summaries plus cross-pathway robustness reports.
Installation And Environment
Required external tools
- blastp
- makeblastdb
- orthofinder
- iqtree (or iqtree2 or iqtree3)
- hyphy
- codeml (from paml)
- clipkit
Optional but strongly recommended
- mafft (if using alignment method 2 or 4)
- prank (if using alignment method 3 or 4)
Python package
babappalign is installed automatically as a Python dependency of babappasnake.
Verify installation
babappasnake --help
which blastp makeblastdb orthofinder iqtree iqtree2 iqtree3 hyphy codeml clipkit
which babappalign mafft prank
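The same check can be scripted. The following is a minimal sketch that uses only the tool names listed above; missing_tools is a hypothetical helper, not part of babappasnake, and treating the three IQ-TREE executable names as alternatives is an assumption based on the required-tools list.

```python
import shutil

# Tool names taken from the required-tools list. IQ-TREE is satisfied by any
# one of its executable names, so it is modelled as a group of alternatives.
REQUIRED = [
    ("blastp",), ("makeblastdb",), ("orthofinder",),
    ("iqtree", "iqtree2", "iqtree3"),
    ("hyphy",), ("codeml",), ("clipkit",),
]


def missing_tools(groups=REQUIRED, which=shutil.which):
    """Return the groups for which no executable is found on PATH."""
    return [group for group in groups if not any(which(name) for name in group)]


if __name__ == "__main__":
    for group in missing_tools():
        print("missing:", " or ".join(group))
```

An empty result means every required group resolved to at least one executable in the active environment.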
Input Requirements
--prot proteomes directory
- Directory of protein FASTA files, one file per species.
- Supported extensions: .fa, .faa, .fasta (case-insensitive).
- Hidden/macOS metadata files are ignored (e.g. .DS_Store, ._*, hidden files).
- Empty or malformed FASTA files are skipped with warnings.
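The file-name rules above can be sketched as a small filter. This is an illustration only: is_proteome_file and list_proteomes are hypothetical names, and the sketch covers only the extension and hidden-file rules, not the empty or malformed FASTA check.

```python
from pathlib import Path

FASTA_EXTENSIONS = {".fa", ".faa", ".fasta"}


def is_proteome_file(name):
    """True if a file name looks like a usable proteome FASTA."""
    if name.startswith("."):  # covers .DS_Store, ._* and other hidden files
        return False
    return Path(name).suffix.lower() in FASTA_EXTENSIONS


def list_proteomes(directory):
    """Sorted proteome FASTA paths directly inside `directory`."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and is_proteome_file(p.name)
    )
```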
--query query FASTA
- Protein FASTA with exactly one query sequence.
--cds CDS FASTA
- Optional on first run.
- If not provided initially, workflow stops after orthogroup definition and writes WAITING_FOR_CDS.txt.
- Add CDS file at OUTDIR/user_supplied/orthogroup_cds.fasta and rerun.
--outgroup optional
- String used to root tree by case-insensitive substring match on tip headers.
- If omitted, unrooted trees are used downstream.
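Case-insensitive substring matching on tip headers can be sketched as follows. match_outgroup is a hypothetical helper, and how the pipeline treats several matching tips is not stated here, so the sketch simply returns all of them.

```python
def match_outgroup(tip_names, query):
    """Return the tip names that contain `query`, ignoring case."""
    if not query:
        return []
    needle = query.lower()
    return [tip for tip in tip_names if needle in tip.lower()]
```

With --outgroup culex, a tip such as Culex_quinquefasciatus|XP_2 would match, while an empty query matches nothing and leaves the tree unrooted.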
Orthogroup Discovery Strategy
Default strategy (--orthogroup-method rbh)
In default mode, orthogroup selection is a two-backend comparison:
- Run RBH stage.
- Run OrthoFinder stage.
- Compute strict 1:1 ortholog count for each.
- Select backend with larger strict 1:1 count.
- If tied, keep RBH deterministically.
- If both have zero strict 1:1 orthologs, stop with explicit error.
The selected backend and counts are printed explicitly.
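The decision rule described above is small enough to state in code. This is a sketch of the rule as documented, not the package's implementation; select_backend and its error message are hypothetical.

```python
def select_backend(rbh_count, orthofinder_count):
    """Pick the backend with more strict 1:1 orthologs; ties go to RBH."""
    if rbh_count == 0 and orthofinder_count == 0:
        raise ValueError("no strict 1:1 orthologs from either backend")
    return "orthofinder" if orthofinder_count > rbh_count else "rbh"
```

Because OrthoFinder wins only on a strictly larger count, equal counts always resolve to RBH, which keeps the choice deterministic.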
How OrthoFinder query mapping is done
OrthoFinder query mapping is BLAST-based, not query-ID membership based:
- Parse OrthoFinder Orthogroups.tsv to load all groups and members.
- Build a combined FASTA of orthogroup-member proteins with subject IDs encoded as: <orthogroup>||<species>||<member>
- Run blastp (query -> combined orthogroup members).
- Filter by query coverage threshold.
- Rank orthogroups by:
  - number of species with passing hits,
  - summed best bitscore per species,
  - top bitscore,
  - orthogroup ID (stable tie-break).
- Select top-ranked orthogroup for downstream extraction.
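The ranking step can be sketched as a sort over per-orthogroup statistics. The sketch assumes the hits have already passed the coverage filter, that larger counts and bitscores rank first, and that the orthogroup ID breaks ties in ascending order; rank_orthogroups is a hypothetical name.

```python
from collections import defaultdict


def rank_orthogroups(hits):
    """Rank orthogroups from (subject_id, bitscore) pairs.

    subject_id follows the <orthogroup>||<species>||<member> encoding.
    """
    best = defaultdict(dict)  # orthogroup -> species -> best bitscore
    for subject_id, bitscore in hits:
        orthogroup, species, _member = subject_id.split("||", 2)
        current = best[orthogroup].get(species, float("-inf"))
        best[orthogroup][species] = max(current, bitscore)

    def sort_key(orthogroup):
        scores = best[orthogroup].values()
        # More species, higher summed score, higher top score, then ID.
        return (-len(scores), -sum(scores), -max(scores), orthogroup)

    return sorted(best, key=sort_key)
```

The first element of the returned list corresponds to the top-ranked orthogroup.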
Strict 1:1 rule used downstream
A species contributes only if exactly one ortholog is retained for that species. This guarantees no duplicate ortholog entries for the same species in the final selected orthogroup.
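A minimal sketch of the strict 1:1 filter, assuming orthologs arrive as (species, protein_id) pairs; strict_one_to_one is a hypothetical helper.

```python
from collections import defaultdict


def strict_one_to_one(orthologs):
    """Keep only species represented by exactly one ortholog."""
    by_species = defaultdict(list)
    for species, protein_id in orthologs:
        by_species[species].append(protein_id)
    return {sp: ids[0] for sp, ids in by_species.items() if len(ids) == 1}
```

The length of the returned mapping is the strict 1:1 count used to compare backends.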
Compatibility mode (--orthogroup-method orthofinder)
- Runs OrthoFinder selection directly.
- Uses the same BLAST-based query-to-orthogroup mapping and strict 1:1 filtering.
Running Modes
Interactive guided mode (default)
babappasnake
Behavior:
- Prompts for required settings.
- Executes one rule at a time.
- Asks run/skip/stop for each stage.
- Shows step outputs and previews.
- Auto-skips already-completed steps safely.
Important guided-mode defaults:
- Orthogroup backend is fixed to rbh in interactive mode.
- RBH is always compared against OrthoFinder and the better strict 1:1 result is selected.
- Trimming is forced to robustness mode (raw + clipkit) for comparative summaries.
- CDS is asked only after orthogroup stage finishes.
- Outgroup prompt comes after CDS prompt and is optional.
Non-interactive mode
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--cds /path/to/orthogroup_cds.fasta \
--outdir run01 \
--threads 12 \
--interactive no \
--guided no
Use this for scripted runs and HPC wrappers.
Two-stage run (no CDS at start)
Stage 1:
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--outdir run01 \
--interactive no \
--guided no
After stage 1 completes, add:
run01/user_supplied/orthogroup_cds.fasta
Then rerun the same command to resume.
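For scripted two-stage runs, placing the CDS file can be automated. The sketch below only copies a file to the documented location; stage_cds is a hypothetical helper and does not invoke babappasnake itself.

```python
import shutil
from pathlib import Path


def stage_cds(outdir, cds_fasta):
    """Copy a CDS FASTA to OUTDIR/user_supplied/orthogroup_cds.fasta."""
    target = Path(outdir) / "user_supplied" / "orthogroup_cds.fasta"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cds_fasta, target)
    return target
```

Calling stage_cds("run01", "/path/to/orthogroup_cds.fasta") and then rerunning the stage 1 command would continue from the CDS mapping step.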
CDS Mapping And QC
During CDS mapping:
- Lowercase intronic segments are clipped out.
- Uppercase ORF window is retained.
- The retained ORF must start with an uppercase start codon and end with an uppercase stop codon.
- Frame consistency checks are applied.
- Failing CDS entries are excluded with warnings.
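The checks above can be illustrated with a short filter. Several details are assumptions made for the sketch and are not confirmed by this manual: ATG as the only start codon, the standard stop codons, and an internal-stop test standing in for the frame consistency checks. clean_cds is a hypothetical name.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def clean_cds(raw):
    """Return the uppercase ORF of `raw`, or None if it fails the checks."""
    # Drop lowercase (intronic) characters and anything that is not an
    # uppercase letter, keeping only the uppercase ORF window.
    orf = "".join(ch for ch in raw if ch.isupper())
    if len(orf) < 6 or len(orf) % 3 != 0:
        return None  # too short, or length not a multiple of three
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return None  # assumption: ATG start, standard stop at the end
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return None  # premature stop codon inside the ORF
    return orf
```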
Outputs:
- mapped_cds/mapped_orthogroup_cds.fasta
- mapped_cds/mapped_orthogroup_proteins.fasta
- mapped_cds/cds_protein_mapping.tsv
Alignment, Trimming, Tree, And Selection Steps
Alignment methods
--alignment-methods options:
- 1: babappalign
- 2: mafft
- 3: prank
- 4: all three
Trimming model
The pipeline always runs both trim states for robustness:
- raw
- clipkit
So each selected method is expanded into two pathways.
Example for method 4:
- babappalign_raw
- babappalign_clipkit
- mafft_raw
- mafft_clipkit
- prank_raw
- prank_clipkit
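The expansion from an --alignment-methods value to pathway names is a Cartesian product of methods and trim states. A minimal sketch, with expand_pathways as a hypothetical helper:

```python
from itertools import product

METHODS_BY_OPTION = {
    1: ["babappalign"],
    2: ["mafft"],
    3: ["prank"],
    4: ["babappalign", "mafft", "prank"],
}
TRIM_STATES = ["raw", "clipkit"]


def expand_pathways(option):
    """Pathway names for one --alignment-methods value."""
    return [
        f"{method}_{trim}"
        for method, trim in product(METHODS_BY_OPTION[option], TRIM_STATES)
    ]
```

Every downstream stage (tree, HyPhy, codeml, ASR) then runs once per pathway name.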
Tree inference and rooting
- IQ-TREE runs per pathway.
- Outgroup rooting is optional and is applied only when outgroup text is supplied and matches a tip.
- If no outgroup is supplied or matched, downstream steps continue with the unrooted tree.
HyPhy and branch-site selection
- aBSREL and MEME run per pathway.
- Foregrounds are selected by dynamic aBSREL threshold:
  - start 0.05
  - increment 0.01
  - cap 0.2
- Selected foregrounds feed branch-site codeml.
- Branch-site outputs are BH-corrected.
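The exact behavior of the dynamic threshold is not spelled out here. One plausible reading, shown purely as an assumption, is that the cutoff is relaxed from the start value in fixed increments until at least one branch passes or the cap is reached. The second function is the standard Benjamini-Hochberg adjustment. Both function names are hypothetical, and whether the comparison is strict or inclusive is also assumed.

```python
def dynamic_foregrounds(pvalues, start=0.05, step=0.01, cap=0.2):
    """Relax the cutoff stepwise until some branch passes (assumed behavior).

    pvalues maps branch name -> aBSREL p-value.
    Returns (threshold_used, sorted list of selected branches).
    """
    n_steps = int(round((cap - start) / step))
    for k in range(n_steps + 1):
        threshold = round(start + k * step, 10)  # avoid float drift
        selected = sorted(b for b, p in pvalues.items() if p <= threshold)
        if selected:
            return threshold, selected
    return cap, []


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

The default arguments mirror the documented start, increment, and cap, which correspond to --absrel-dynamic-start, --absrel-dynamic-step, and --absrel-dynamic-max.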
ASR Extraction Of Selected Branches
After branch-site selection, extract_selected_branch_ancestors performs these steps:
- Map selected branches onto canonical tree edges (parent_node -> child_node).
- Recover ancestor and descendant CDS/AA sequences.
- Compute codon and amino-acid substitutions per selected branch.
- Annotate overlaps with MEME/BEB where available.
- Write branch-level summary and provenance.
This stage is model-based reconstruction from codeml outputs.
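Computing substitutions along one branch amounts to a codon-by-codon comparison of the ancestor and descendant sequences. The sketch below assumes the standard genetic code and two equal-length, in-frame sequences; branch_substitutions is a hypothetical helper, not the pipeline's extraction rule.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def branch_substitutions(ancestor_cds, descendant_cds):
    """List (codon_index, anc_codon, desc_codon, anc_aa, desc_aa) changes."""
    if len(ancestor_cds) != len(descendant_cds) or len(ancestor_cds) % 3:
        raise ValueError("sequences must be equal-length and in frame")
    changes = []
    for start in range(0, len(ancestor_cds), 3):
        anc = ancestor_cds[start:start + 3].upper()
        desc = descendant_cds[start:start + 3].upper()
        if anc != desc:
            changes.append((
                start // 3 + 1,  # 1-based codon position
                anc, desc,
                CODON_TABLE.get(anc, "X"),   # X for gaps or ambiguity
                CODON_TABLE.get(desc, "X"),
            ))
    return changes
```

Entries whose two amino acids are equal are synonymous codon changes; the rest are amino-acid substitutions.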
Outputs And Directory Layout
All outputs are inside --outdir.
High-value files:
- orthogroup/orthogroup_proteins.fasta
- orthogroup/orthogroup_headers.txt
- orthogroup/rbh_summary.tsv
- mapped_cds/cds_protein_mapping.tsv
- alignments/<method>/<trim_state>/...
- tree/<method>/<trim_state>/orthogroup.treefile
- tree/<method>/<trim_state>/orthogroup.rooted.treefile
- hyphy/<method>/<trim_state>/absrel.json
- hyphy/<method>/<trim_state>/meme.json
- hyphy/<method>/<trim_state>/significant_foregrounds.tsv
- branchsite/<method>/<trim_state>/branchsite_results.tsv
- asr/<method>/<trim_state>/asr_done.json
- asr/branch_to_nodes.tsv
- asr/branch_substitutions.tsv
- asr/selected_branch_asr_summary.tsv
- summary/<method>/<trim_state>/episodic_selection_summary.txt
- summary/robustness_matrix.tsv
- summary/robustness_consensus.tsv
- summary/robustness_narrative.txt
- summary/comparative_reproducibility_summary.txt
- summary/robustness_publication_table.tex
- summary/run_provenance.json
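When post-processing results, the per-pathway paths in this layout can be built programmatically. The sketch uses only the file names listed in this section; pathway_outputs and its dictionary keys are hypothetical.

```python
from pathlib import Path


def pathway_outputs(outdir, method, trim_state):
    """Paths of the main per-pathway result files inside an output directory."""
    root = Path(outdir)
    sub = Path(method) / trim_state
    return {
        "tree": root / "tree" / sub / "orthogroup.treefile",
        "rooted_tree": root / "tree" / sub / "orthogroup.rooted.treefile",
        "absrel": root / "hyphy" / sub / "absrel.json",
        "meme": root / "hyphy" / sub / "meme.json",
        "foregrounds": root / "hyphy" / sub / "significant_foregrounds.tsv",
        "branchsite": root / "branchsite" / sub / "branchsite_results.tsv",
        "summary": root / "summary" / sub / "episodic_selection_summary.txt",
    }
```

For example, pathway_outputs("run01", "mafft", "raw")["absrel"] points at run01/hyphy/mafft/raw/absrel.json.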
Top-level compatibility aliases:
- summary/episodic_selection_summary.txt
- asr/asr_done.json
Resume, Rerun, And Reproducibility
- Re-running the same command resumes existing work.
- In guided mode, completed steps are auto-detected and skipped.
- Final provenance is written to summary/run_provenance.json.
- ASR extraction provenance is written to asr/asr_extraction_provenance.json.
CLI Reference
Basic form:
babappasnake --prot PROTEOMES_DIR --query QUERY_FASTA [options]
Core options:
- --cds PATH
- --orthogroup-method {rbh,orthofinder}
- --alignment-methods {1,2,3,4}
- --outgroup TEXT
- --outdir PATH
- --threads INT
- --interactive {yes,no}
- --guided {yes,no}
Selection and model options:
- --coverage FLOAT (RBH/BLAST mapping coverage threshold)
- --iqtree-bootstrap INT
- --iqtree-bnni {yes,no}
- --iqtree-model TEXT
- --absrel-branches TEXT
- --meme-branches TEXT
- --codeml-codonfreq INT
- --absrel-p FLOAT
- --absrel-dynamic-start FLOAT
- --absrel-dynamic-step FLOAT
- --absrel-dynamic-max FLOAT
- --meme-p FLOAT
- --clipkit-mode-protein TEXT
- --clipkit-mode-codon TEXT
- --snake-args "..."
Compatibility options:
- --trim-strategy {raw,clipkit,both} is accepted, but the runtime always enforces robustness mode (both).
- --use-clipkit {yes,no} is retained for backward compatibility.
Troubleshooting
"Missing required external tools"
Install missing binaries in the active environment and verify with which.
Run stops at WAITING_FOR_CDS.txt
This is expected when CDS is not yet supplied.
Add user_supplied/orthogroup_cds.fasta and rerun.
OrthoFinder/RBH finds no usable orthogroup
Pipeline stops explicitly if both strategies yield zero strict 1:1 ortholog support. Check proteome quality, query quality, and species composition.
macOS metadata files in proteomes
You can clean sidecar files before running:
find /path/to/proteomes -type f \( -name '._*' -o -name '.DS_Store' \) -delete
codeml warnings
Warnings are tolerated when required output files are present. Hard failure occurs only when mandatory result files are missing.
Developer And Release Notes
Local editable install
pip install -e .
Build package
python -m pip install --upgrade build twine
python -m build --sdist --wheel
twine check dist/*
Publish to PyPI
twine upload dist/*
Release checklist
- Update version in pyproject.toml.
- Run tests.
- Build distributions.
- Publish to PyPI.
- Push Git tag and GitHub release.
License
MIT