A simple Snakemake pipeline for episodic selection analysis
babappasnake
babappasnake is a command-line workflow for episodic positive selection analysis on a single orthogroup.
It is designed for practical use: one command to launch, automatic checkpointing, resumable execution, and clear summary outputs.
In terminal mode, it can run as an interactive guided engine instead of a black-box single-shot command.
What the pipeline does
- Runs reciprocal best-hit (RBH) ortholog discovery.
- Builds an orthogroup from your query and proteomes.
- Maps user CDS records to orthogroup proteins (after lowercase intron clipping and uppercase ORF window extraction).
- Creates protein and codon alignments with babappalign.
- Trims alignments with ClipKIT (kpic-smart-gap).
- Removes terminal stop codon artifacts after ClipKIT on the codon alignment.
- Infers an ML tree with IQ-TREE (-m MFP -B 1000 -redo).
- Roots the inferred tree using a user-supplied outgroup label query (case-insensitive header matching).
- Runs HyPhy aBSREL and MEME with user-selected branch scopes (default Leaves).
- Selects foreground branches from aBSREL using dynamic thresholding.
- Runs branch-site codeml only for selected branches (alt and null models).
- Runs codeml ancestral sequence reconstruction (ASR).
- Produces final summary and tabular outputs.
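The reciprocal best-hit criterion behind the first step can be illustrated with a minimal sketch. It assumes the best hits have already been extracted from BLAST output into plain dictionaries; the function name, the data layout, and the sequence IDs are hypothetical and are not taken from babappasnake's internal code.

```python
def reciprocal_best_hits(forward, reverse):
    """Return {query_id: subject_id} pairs that are each other's best hit.

    forward: best subject for each query sequence (query -> subject)
    reverse: best query for each subject sequence (subject -> query)
    Both mappings are hypothetical inputs used only for illustration.
    """
    pairs = {}
    for query_id, subject_id in forward.items():
        # Keep the pair only if the reverse search points back to the query.
        if reverse.get(subject_id) == query_id:
            pairs[query_id] = subject_id
    return pairs


# Made-up identifiers: queryB's best hit does not point back to queryB.
forward = {"queryA": "sp1_prot7", "queryB": "sp1_prot9"}
reverse = {"sp1_prot7": "queryA", "sp1_prot9": "queryC"}
rbh = reciprocal_best_hits(forward, reverse)
```

The sketch ignores the reciprocal coverage filter that the pipeline also applies (see --coverage in the CLI reference).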
Installation (for end users)
pip installs Python packages, but external bioinformatics binaries must also be available on PATH.
Recommended setup (conda + pip)
conda create -n babappasnake -c conda-forge -c bioconda \
python=3.11 blast iqtree hyphy paml clipkit pip
conda activate babappasnake
pip install babappalign babappasnake
Quick verification
babappasnake --help
which blastp makeblastdb hyphy codeml clipkit babappalign
Notes:
- IQ-TREE binary detection is flexible (iqtree2, iqtree3, or iqtree).
- On Apple Silicon, iqtree3 is common and is accepted automatically.
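The same PATH check can be scripted. The following is a minimal sketch, assuming only the tool names listed above; the function name and the `which` parameter are hypothetical, and the lookup order babappasnake itself uses for IQ-TREE is not stated here.

```python
import shutil

# Tool names taken from the verification command above.
REQUIRED_TOOLS = ["blastp", "makeblastdb", "hyphy", "codeml", "clipkit", "babappalign"]
# Any one of these IQ-TREE binary names is enough.
IQTREE_CANDIDATES = ["iqtree2", "iqtree3", "iqtree"]


def find_missing_tools(which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    missing = [tool for tool in REQUIRED_TOOLS if which(tool) is None]
    if not any(which(name) for name in IQTREE_CANDIDATES):
        missing.append("iqtree (any of: " + ", ".join(IQTREE_CANDIDATES) + ")")
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    print("All tools found" if not missing else "Missing: " + ", ".join(missing))
```

Run it inside the activated conda environment so that PATH matches the shell in which babappasnake will run.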
Input requirements
- --prot: directory containing proteome FASTA files.
- --query: protein FASTA containing the query sequence.
- --cds (optional at first run): CDS FASTA for the orthogroup.
- --outgroup: outgroup query string used to root the IQ-TREE output (e.g., culex matches headers containing culex).
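To make the outgroup matching rule concrete, here is a minimal sketch of a case-insensitive substring match against tip labels. The function name and the example labels are hypothetical; how babappasnake handles multiple or zero matches is not covered.

```python
def match_outgroup_tips(tip_labels, outgroup_query):
    """Return the tip labels that contain the query, ignoring case."""
    query = outgroup_query.strip().lower()
    if not query:
        return []  # an empty query selects nothing
    return [label for label in tip_labels if query in label.lower()]


# Made-up tip labels for illustration.
tips = [
    "Aedes_aegypti_XP_001",
    "Culex_quinquefasciatus_XP_002",
    "Anopheles_gambiae_XP_003",
]
hits = match_outgroup_tips(tips, "culex")
```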
CDS quality checks (when --cds is supplied):
- Lowercase intron characters are clipped from each CDS.
- For each CDS, the best uppercase ATG ... STOP ORF window is retained.
- CDS records that still fail ORF/start-stop/frame checks are excluded.
- Proteins without a qualifying CDS match are excluded from downstream codon and tree analyses.
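The checks above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's implementation: lowercase characters are simply dropped, "best" is interpreted as "longest", the standard stop codons TAA, TAG, and TGA are used, and both function names are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def clip_lowercase(seq):
    """Drop lowercase characters, treated here as intron sequence."""
    return "".join(ch for ch in seq if not ch.islower())


def best_orf(seq):
    """Return the longest in-frame ATG...STOP window, or None if there is none.

    "Best" is taken to mean "longest" for this sketch (an assumption).
    """
    best = None
    for match in re.finditer("ATG", seq):
        start = match.start()
        # Read codon by codon from the start codon until the first stop.
        for pos in range(start, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orf = seq[start:pos + 3]
                if best is None or len(orf) > len(best):
                    best = orf
                break
    return best


raw = "ccATGAAAgtaagtCCCTAAgg"  # made-up record with a lowercase intron
clipped = clip_lowercase(raw)
orf = best_orf(clipped)
```

A record for which `best_orf` returns None would correspond to a CDS that fails the ORF check and is excluded.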
Quick start
Guided interactive mode (default in terminal)
babappasnake
This mode prompts for pipeline settings, executes one rule at a time, asks run/skip/stop at every step, and prints per-step output previews.
It asks for the CDS file only after rbh_orthogroup finishes, then asks for optional outgroup text for rooting.
It also prints explicit orthogroup membership in the terminal: groups included and groups omitted at the RBH stage.
If outgroup is left empty, root_iqtree_outgroup is safely skippable in guided mode and downstream uses the unrooted IQ-TREE output.
Case A: you already have the CDS file
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--cds /path/to/orthogroup_cds.fasta \
--outgroup culex \
--outdir run01 \
--threads 8 \
--interactive no \
--guided no
Case B: two-stage run (CDS provided later)
Run once:
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--outgroup culex \
--outdir run01 \
--threads 8 \
--interactive no \
--guided no
The first run stops intentionally and writes:
- run01/orthogroup/orthogroup_proteins.fasta
- run01/orthogroup/orthogroup_headers.txt
- run01/orthogroup/WAITING_FOR_CDS.txt
Then place your CDS FASTA at:
run01/user_supplied/orthogroup_cds.fasta
Re-run the same command. The workflow resumes automatically.
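For scripted runs, the hand-off between the two stages can be automated. The helper below is a hypothetical sketch that relies only on the paths named above; it is not part of babappasnake.

```python
import shutil
from pathlib import Path


def stage_cds(outdir, cds_fasta):
    """Copy a CDS FASTA to the location the second run reads from."""
    outdir = Path(outdir)
    marker = outdir / "orthogroup" / "WAITING_FOR_CDS.txt"
    if not marker.exists():
        # The marker is written when the first run stops to wait for CDS input.
        raise FileNotFoundError(f"{marker} not found; has the first run finished?")
    target = outdir / "user_supplied" / "orthogroup_cds.fasta"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cds_fasta, target)
    return target
```

For example, `stage_cds("run01", "/path/to/orthogroup_cds.fasta")` followed by re-running the original command would reproduce the manual steps.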
Dynamic significance logic
Foreground selection from aBSREL uses dynamic p-thresholding:
- start at p <= 0.05
- if no branch passes, increase by 0.01
- stop as soon as at least one branch is found
- hard upper bound: 1.0
Only those selected branches go to branch-site codeml.
Each selected branch runs two codeml fits (alternative + null), then BH-FDR correction is applied.
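The thresholding rule and the multiple-testing correction can be sketched as follows. This is an illustration only: the function names are hypothetical, the p-values are made up, and the standard Benjamini-Hochberg step-up adjustment is assumed for "BH-FDR".

```python
def select_foreground(pvalues, start=0.05, step=0.01, maximum=1.0):
    """Raise the threshold from `start` in `step` increments until a branch passes.

    pvalues: {branch_name: aBSREL p-value}
    Returns (threshold, sorted list of selected branch names).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    threshold = start
    n_steps = 0
    while True:
        selected = sorted(b for b, p in pvalues.items() if p <= threshold)
        if selected or threshold >= maximum:
            return threshold, selected
        n_steps += 1
        # Recompute from the start value to limit floating-point drift.
        threshold = min(maximum, round(start + n_steps * step, 10))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input list."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up aBSREL p-values: nothing passes until the threshold reaches 0.08.
threshold, branches = select_foreground({"A": 0.2, "B": 0.074, "C": 0.5})
```

Because the threshold can rise as far as 1.0, a branch selected at a relaxed threshold is a candidate for branch-site testing rather than a significant aBSREL result in its own right; the chosen threshold is recorded in the outputs so this can be checked.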
Output guide
All outputs are written under --outdir (default: babappasnake_run).
Most important files:
- summary/episodic_selection_summary.txt: human-readable final report.
- hyphy/foreground_threshold.json: selected dynamic threshold and hit count.
- hyphy/significant_foregrounds.tsv: selected aBSREL foreground branches.
- branchsite/branchsite_results.tsv: codeml branch-site statistics and BH significance.
- asr/asr_done.json: ASR completion record.
- asr/mlc_asr.txt: codeml ASR main output.
- asr/rst: reconstructed ancestral states.
- tree/orthogroup.treefile: inferred ML tree (unrooted IQ-TREE output).
- tree/orthogroup.rooted.treefile: rooted tree used by HyPhy and codeml downstream steps.
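A quick way to confirm that a run finished is to check for these files. The sketch below uses only the relative paths listed above; the function name is hypothetical, and the internal layout of the JSON and TSV files is deliberately not assumed.

```python
from pathlib import Path

# Relative paths taken from the output list above.
KEY_OUTPUTS = [
    "summary/episodic_selection_summary.txt",
    "hyphy/foreground_threshold.json",
    "hyphy/significant_foregrounds.tsv",
    "branchsite/branchsite_results.tsv",
    "asr/asr_done.json",
    "asr/mlc_asr.txt",
    "asr/rst",
    "tree/orthogroup.treefile",
    "tree/orthogroup.rooted.treefile",
]


def missing_outputs(outdir="babappasnake_run"):
    """Return the key output files that do not exist under `outdir`."""
    root = Path(outdir)
    return [rel for rel in KEY_OUTPUTS if not (root / rel).is_file()]
```

If rooting was skipped in guided mode, tree/orthogroup.rooted.treefile may legitimately be absent.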
CLI reference
babappasnake --prot PROTEOMES_DIR --query QUERY_FASTA [options]
Options:
- --cds PATH: CDS FASTA (optional for initial run).
- --outgroup TEXT: outgroup query used for tree rooting (case-insensitive substring match against tip headers).
- --outdir PATH: output directory (default: babappasnake_run).
- --coverage FLOAT: RBH reciprocal coverage minimum (default: 0.70).
- --threads INT: parallel threads/cores (default: 4).
- --iqtree-bootstrap INT: UFBoot replicates for IQ-TREE (default: 1000; typical options: 1000, 5000, 10000).
- --iqtree-bnni {yes,no}: enable/disable IQ-TREE -bnni (default: no).
- --iqtree-model TEXT: IQ-TREE model string (default: MFP).
- --absrel-branches TEXT: HyPhy aBSREL branch selector (default: Leaves; common choices: Leaves, Internal, All).
- --meme-branches TEXT: HyPhy MEME branch selector (default: Leaves; common choices: Leaves, Internal, All).
- --codeml-codonfreq INT: codeml CodonFreq value for branch-site and ASR runs (default: 2; e.g. 1, 2, 7).
- --absrel-p FLOAT: compatibility parameter retained in config (dynamic mode is used for branch selection).
- --absrel-dynamic-start FLOAT: dynamic foreground start p-value (default: 0.05).
- --absrel-dynamic-step FLOAT: dynamic foreground increment (default: 0.01).
- --absrel-dynamic-max FLOAT: dynamic foreground max p-value (default: 1.0).
- --meme-p FLOAT: MEME reporting threshold in summary (default: 0.1).
- --use-clipkit {yes,no}: enable/disable ClipKIT (default: yes).
- --clipkit-mode-protein TEXT: ClipKIT mode for protein trimming (default: kpic-smart-gap).
- --clipkit-mode-codon TEXT: ClipKIT mode for codon trimming (default: kpic-smart-gap).
- --snake-args "...": extra raw arguments forwarded to Snakemake.
- --interactive {yes,no}: prompt for settings at launch (default: yes).
- --guided {yes,no}: execute one rule at a time with confirmation (default: yes).
Example with additional Snakemake flags:
babappasnake \
--prot /path/to/proteomes \
--query /path/to/query.fasta \
--cds /path/to/orthogroup_cds.fasta \
--outgroup culex \
--outdir run01 \
--threads 8 \
--snake-args "--keep-going"
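When launching many runs from a script, the same command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of babappasnake; it only maps keyword names to the flags documented above.

```python
import shlex


def build_command(prot, query, **options):
    """Assemble a babappasnake argument list from keyword options.

    Keyword names map to flags by replacing underscores with hyphens,
    e.g. iqtree_bootstrap=5000 becomes --iqtree-bootstrap 5000.
    """
    argv = ["babappasnake", "--prot", str(prot), "--query", str(query)]
    for name, value in options.items():
        if value is None:
            continue  # skip options that were left unset
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv


argv = build_command(
    "/path/to/proteomes",
    "/path/to/query.fasta",
    cds="/path/to/orthogroup_cds.fasta",
    outgroup="culex",
    outdir="run01",
    threads=8,
    snake_args="--keep-going",
)
print(shlex.join(argv))  # shell-quoted form of the command
```

The resulting list can be passed to `subprocess.run(argv, check=True)`; passing a list avoids shell quoting problems with the value of --snake-args.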
Troubleshooting
"Missing required external tools"
Install missing binaries and ensure they are on PATH in the same shell where you run babappasnake.
Run stops with WAITING_FOR_CDS.txt
This is expected for two-stage usage.
Place your CDS FASTA at user_supplied/orthogroup_cds.fasta inside the output directory and re-run the same command.
codeml returns non-zero or writes warnings
The workflow accepts codeml runs that emit warnings or a non-zero exit status, provided valid output files are produced. A hard failure is triggered only when required codeml result files are missing.
Can I resume after interruption?
Yes. Re-run the same command; Snakemake resumes and re-runs incomplete jobs as needed.
Developer/local source install
For local development:
pip install -e .
The package entry-point command is still babappasnake.
Maintainer release checklist (GitHub + PyPI)
- Update version in pyproject.toml.
- Commit and push to GitHub.
- Build distributions:
python -m pip install --upgrade build twine
python -m build
twine check dist/*
- Publish to PyPI:
twine upload dist/*
- Optionally create and push a matching git tag.
License
MIT