A simple Snakemake pipeline for episodic selection analysis

These details have not been verified by PyPI

Project description

BABAPPASnake

babappasnake is a reproducible command-line workflow for episodic positive selection analysis on one orthogroup at a time. It is designed for practical comparative genomics: resumable runs, interactive stepwise control, and robustness summaries across alignment and trimming choices.

Quick Start

1) Install tools and package

conda create -n babappasnake -c conda-forge -c bioconda \
  python=3.11 blast orthofinder iqtree hyphy paml clipkit mafft prank pip
conda activate babappasnake
pip install babappasnake

2) Run (non-interactive)

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --alignment-methods 4 \
  --outgroup culex \
  --outdir run01 \
  --threads 12 \
  --interactive no \
  --guided no

3) Run (interactive guided mode)

babappasnake

This prompts step by step, shows what each stage does, and supports run/skip/stop per stage.

Manual Contents

What The Pipeline Does
Installation And Environment
Input Requirements
Orthogroup Discovery Strategy
Running Modes
CDS Mapping And QC
Alignment, Trimming, Tree, And Selection Steps
ASR Extraction Of Selected Branches
Outputs And Directory Layout
Resume, Rerun, And Reproducibility
CLI Reference
Troubleshooting
Developer And Release Notes

What The Pipeline Does

A complete run performs these stages:

Build orthogroup candidates from your query and proteomes.
Select the best orthogroup strategy by strict 1:1 ortholog support.
Map user CDS to selected proteins with coding-quality filtering.
Build protein alignments with selected MSA engines.
Build codon alignments (native for BABAPPAlign; robust back-translation for MAFFT/PRANK).
Run both trimming states (raw and clipkit) for robustness.
Infer trees with IQ-TREE for each (method, trim_state) pathway.
Optionally root trees with outgroup text query.
Optionally run HyPhy GARD recombination screening per pathway.
Run HyPhy aBSREL + MEME per pathway.
Select foreground branches dynamically from aBSREL.
Run branch-site codeml per selected foreground.
Run codeml ASR per pathway.
Extract ancestor/descendant sequences and substitutions for selected branches.
Write pathway summaries plus cross-pathway robustness reports.

Installation And Environment

Required external tools

blastp
makeblastdb
orthofinder
iqtree (or iqtree2 or iqtree3)
hyphy
codeml (from paml)
clipkit

Optional but strongly recommended

mafft (if using alignment method 2 or 4)
prank (if using alignment method 3 or 4)

Python package

babappalign is installed automatically as a Python dependency of babappasnake.

Verify installation

babappasnake --help
which blastp makeblastdb orthofinder iqtree iqtree2 iqtree3 hyphy codeml clipkit
which babappalign mafft prank

Input Requirements

`--prot` proteomes directory

Directory of protein FASTA files, one file per species.
Supported extensions: .fa, .faa, .fasta (case-insensitive).
Hidden/macOS metadata files are ignored (e.g. .DS_Store, ._*, hidden files).
Empty or malformed FASTA files are skipped with warnings.

`--query` query FASTA

Protein FASTA with exactly one query sequence.

`--cds` CDS FASTA

Optional on first run.
If not provided initially, workflow stops after orthogroup definition and writes WAITING_FOR_CDS.txt.
Add CDS file at OUTDIR/user_supplied/orthogroup_cds.fasta and rerun.

`--outgroup` optional

String used to root tree by case-insensitive substring match on tip headers.
If omitted, or if the query is too broad / does not match, unrooted trees are used downstream.

Orthogroup Discovery Strategy

Default strategy (`--orthogroup-method rbh`)

In default mode, orthogroup selection uses RBH only:

Run RBH stage.
Compute strict 1:1 ortholog count from RBH output.
If RBH yields zero strict 1:1 orthologs, stop with explicit error.

Fallback comparison mode (`--orthogroup-method rbh_fallback`)

This preserves the previous behavior as an explicit opt-in:

Run RBH stage.
Run OrthoFinder stage.
Compute strict 1:1 ortholog count for each.
Select backend with larger strict 1:1 count.
If tied, keep RBH deterministically.
If both have zero strict 1:1 orthologs, stop with explicit error.

The selected backend and counts are printed explicitly.

How OrthoFinder query mapping is done

OrthoFinder query mapping is BLAST-based, not query-ID membership based:

Parse OrthoFinder Orthogroups.tsv to load all groups and members.
Build a combined FASTA of orthogroup-member proteins with subject IDs encoded as: <orthogroup>||<species>||<member>
Run blastp(query -> combined orthogroup members).
Filter by query coverage threshold.
Rank orthogroups by:
- number of species with passing hits,
- summed best bitscore per species,
- top bitscore,
- orthogroup ID (stable tie-break).
Select top-ranked orthogroup for downstream extraction.

Strict 1:1 rule used downstream

A species contributes only if exactly one ortholog is retained for that species. This guarantees no duplicate ortholog entries for the same species in the final selected orthogroup.

Direct OrthoFinder mode (`--orthogroup-method orthofinder`)

Runs OrthoFinder selection directly.
Uses the same BLAST-based query-to-orthogroup mapping and strict 1:1 filtering.

Running Modes

Interactive guided mode (default)

babappasnake

Behavior:

Prompts for required settings.
Executes one rule at a time.
Asks run/skip/stop for each stage.
Shows step outputs and previews.
Auto-skips already-completed steps safely.

Important guided-mode defaults:

Orthogroup backend defaults to rbh in interactive mode.
You can choose rbh, orthofinder, or rbh_fallback.
Trimming is forced to robustness mode (raw + clipkit) for comparative summaries.
CDS is asked only after orthogroup stage finishes.
Outgroup prompt comes after CDS prompt and is optional.

Non-interactive mode

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --outdir run01 \
  --threads 12 \
  --interactive no \
  --guided no

Use this for scripted runs and HPC wrappers.

Example: enable optional GARD screening

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --alignment-methods 4 \
  --recombination gard \
  --gard-mode Faster \
  --gard-rate-classes 3 \
  --outdir run01_gard \
  --threads 12 \
  --interactive no \
  --guided no

Two-stage run (no CDS at start)

Stage 1:

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --outdir run01 \
  --interactive no \
  --guided no

After stage 1 completes, add:

run01/user_supplied/orthogroup_cds.fasta

Then rerun same command to resume.

CDS Mapping And QC

During CDS mapping:

Lowercase intronic segments are clipped out.
Uppercase ORF window is retained.
Must start with uppercase start codon and end with uppercase stop codon.
Frame consistency checks are applied.
Failing CDS entries are excluded with warnings.

Outputs:

mapped_cds/mapped_orthogroup_cds.fasta
mapped_cds/mapped_orthogroup_proteins.fasta
mapped_cds/cds_protein_mapping.tsv

Alignment, Trimming, Tree, And Selection Steps

Alignment methods

--alignment-methods options:

1: babappalign
2: mafft
3: prank
4: all three

Trimming model

Pipeline is enforced to run both trim states for robustness:

raw
clipkit

So each selected method is expanded into two pathways. Example for method 4:

babappalign_raw
babappalign_clipkit
mafft_raw
mafft_clipkit
prank_raw
prank_clipkit

Tree inference and rooting

IQ-TREE runs per pathway.
Outgroup rooting is optional and applied if query text is supplied and matched.
If no outgroup is supplied/matched, downstream continues with unrooted tree.

Optional recombination screening (HyPhy GARD)

Controlled by --recombination {none,gard,auto} (default: none).
none: skip recombination screening (legacy/default behavior).
gard: run HyPhy GARD per active (method, trim_state) pathway.
auto: currently an alias of gard.
Additional controls:
- --gard-mode {Normal,Faster} (default: Faster)
- --gard-rate-classes INT (default: 3)

Output locations:

recombination/<method>/<trim_state>/gard/gard.json
recombination/<method>/<trim_state>/gard/gard_summary.json
recombination/<method>/<trim_state>/gard/gard.stdout.txt
recombination/<method>/<trim_state>/gard/gard.stderr.txt

Interpretation note:

In the current implementation, GARD is a screening/reporting module.
Downstream branch-site/codeml analyses remain full-length by default unless explicit fragment-aware routing is added in a future release.

HyPhy and branch-site selection

aBSREL and MEME run per pathway.
Foregrounds are selected by dynamic aBSREL threshold:
- start 0.05
- increment 0.01
- cap 0.2
Selected foregrounds feed branch-site codeml.
Branch-site outputs are BH-corrected.

ASR Extraction Of Selected Branches

After branch-site selection, extract_selected_branch_ancestors does:

Map selected branches onto canonical tree edges (parent_node -> child_node).
Recover ancestor and descendant CDS/AA sequences.
Compute codon and amino-acid substitutions per selected branch.
Annotate overlaps with MEME/BEB where available.
Write branch-level summary and provenance.

This stage is model-based reconstruction from codeml outputs.

Outputs And Directory Layout

All outputs are inside --outdir.

High-value files:

orthogroup/orthogroup_proteins.fasta
orthogroup/orthogroup_headers.txt
orthogroup/rbh_summary.tsv
mapped_cds/cds_protein_mapping.tsv
alignments/<method>/<trim_state>/...
tree/<method>/<trim_state>/orthogroup.treefile
tree/<method>/<trim_state>/orthogroup.rooted.treefile
hyphy/<method>/<trim_state>/absrel.json
hyphy/<method>/<trim_state>/meme.json
hyphy/<method>/<trim_state>/significant_foregrounds.tsv
recombination/<method>/<trim_state>/gard/gard_summary.json
branchsite/<method>/<trim_state>/branchsite_results.tsv
asr/<method>/<trim_state>/asr_done.json
asr/branch_to_nodes.tsv
asr/branch_substitutions.tsv
asr/selected_branch_asr_summary.tsv
summary/<method>/<trim_state>/episodic_selection_summary.txt
summary/robustness_matrix.tsv
summary/robustness_consensus.tsv
summary/robustness_narrative.txt
summary/comparative_reproducibility_summary.txt
summary/robustness_publication_table.tex
summary/run_provenance.json

Top-level compatibility aliases:

summary/episodic_selection_summary.txt
asr/asr_done.json

Resume, Rerun, And Reproducibility

Re-running the same command resumes existing work.
In guided mode, completed steps are auto-detected and skipped.
Final provenance is written to summary/run_provenance.json.
ASR extraction provenance is written to asr/asr_extraction_provenance.json.

CLI Reference

Basic form:

babappasnake --prot PROTEOMES_DIR --query QUERY_FASTA [options]

Core options:

--cds PATH
--orthogroup-method {rbh,orthofinder,rbh_fallback}
--alignment-methods {1,2,3,4}
--outgroup TEXT
--outdir PATH
--threads INT
--interactive {yes,no}
--guided {yes,no}

Selection and model options:

--coverage FLOAT (RBH/BLAST mapping coverage threshold)
--iqtree-bootstrap INT
--iqtree-bnni {yes,no}
--iqtree-model TEXT
--absrel-branches TEXT
--meme-branches TEXT
--codeml-codonfreq INT
--recombination {none,gard,auto}
--gard-mode {Normal,Faster}
--gard-rate-classes INT
--absrel-p FLOAT
--absrel-dynamic-start FLOAT
--absrel-dynamic-step FLOAT
--absrel-dynamic-max FLOAT
--meme-p FLOAT
--clipkit-mode-protein TEXT
--clipkit-mode-codon TEXT
--snake-args "..."

Compatibility options:

--trim-strategy {raw,clipkit,both} accepted, but runtime robustness mode is enforced to both.
--use-clipkit {yes,no} retained for backward compatibility.

Troubleshooting

"Missing required external tools"

Install missing binaries in the active environment and verify with which.

Run stops at `WAITING_FOR_CDS.txt`

This is expected when CDS is not yet supplied. Add user_supplied/orthogroup_cds.fasta and rerun.

OrthoFinder/RBH finds no usable orthogroup

Pipeline stops explicitly if both strategies yield zero strict 1:1 ortholog support. Check proteome quality, query quality, and species composition.

macOS metadata files in proteomes

You can clean sidecar files before running:

find /path/to/proteomes -type f \( -name '._*' -o -name '.DS_Store' \) -delete

codeml warnings

Warnings are tolerated when required output files are present. Hard failure occurs only when mandatory result files are missing.

Developer And Release Notes

Local editable install

pip install -e .

Build package

python -m pip install --upgrade build twine
python -m build --sdist --wheel
twine check dist/*

Publish to PyPI

twine upload dist/*

Release checklist

Update version in pyproject.toml.
Run tests.
Build distributions.
Publish to PyPI.
Push Git tag and GitHub release.

License

MIT

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

1.4.0

Jun 7, 2026

1.3.4

May 4, 2026

This version

1.3.0

May 1, 2026

1.2.0

Apr 30, 2026

1.1.1

Apr 4, 2026

1.0.0

Apr 4, 2026

0.9.3

Apr 3, 2026

0.9.2

Mar 25, 2026

0.9.1

Mar 24, 2026

0.9.0

Mar 24, 2026

0.8.1

Mar 24, 2026

0.8.0

Mar 24, 2026

0.7.2

Mar 24, 2026

0.7.1

Mar 24, 2026

0.7.0

Mar 23, 2026

0.6.0

Mar 23, 2026

0.5.0

Mar 23, 2026

0.4.1

Mar 23, 2026

0.4.0

Mar 23, 2026

0.3.1

Mar 23, 2026

0.2.1

Mar 16, 2026

0.1.0

Mar 16, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

babappasnake-1.3.0.tar.gz (72.5 kB view details)

Uploaded May 1, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

babappasnake-1.3.0-py3-none-any.whl (73.0 kB view details)

Uploaded May 1, 2026 Python 3

File details

Details for the file babappasnake-1.3.0.tar.gz.

File metadata

Download URL: babappasnake-1.3.0.tar.gz
Upload date: May 1, 2026
Size: 72.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for babappasnake-1.3.0.tar.gz
Algorithm	Hash digest
SHA256	`5dce00eca31c5e3fc40c2f549505d82f65caa318aa0686cf4c10af93bc47ff3c`
MD5	`b196ed88c833f52845763b15c67078fc`
BLAKE2b-256	`2900eff1994faf5f98e79f4e100153d9a73357173bf523ef1c89bbfd45793fe2`

See more details on using hashes here.

Provenance

The following attestation bundles were made for babappasnake-1.3.0.tar.gz:

Publisher: python-publish.yml on sinhakrishnendu/babappasnake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: babappasnake-1.3.0.tar.gz
- Subject digest: 5dce00eca31c5e3fc40c2f549505d82f65caa318aa0686cf4c10af93bc47ff3c
- Sigstore transparency entry: 1417791918
- Sigstore integration time: May 1, 2026
Source repository:
- Permalink: sinhakrishnendu/babappasnake@105023b1fa4ee92390c83ef46e7b09f919858823
- Branch / Tag: refs/tags/v1.3.0
- Owner: https://github.com/sinhakrishnendu
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@105023b1fa4ee92390c83ef46e7b09f919858823
- Trigger Event: release

File details

Details for the file babappasnake-1.3.0-py3-none-any.whl.

File metadata

Download URL: babappasnake-1.3.0-py3-none-any.whl
Upload date: May 1, 2026
Size: 73.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for babappasnake-1.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d8cba3db23cfe24a5f82584b25a585428d94c27be4fde75d73a98dc2074feb45`
MD5	`105ecdf1c0d50b9874ccc7ad945a793c`
BLAKE2b-256	`6e41f402729de1eefb404286d759f2622e0e0e89c19567ec0e75c5c7ef41456f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for babappasnake-1.3.0-py3-none-any.whl:

Publisher: python-publish.yml on sinhakrishnendu/babappasnake

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: babappasnake-1.3.0-py3-none-any.whl
- Subject digest: d8cba3db23cfe24a5f82584b25a585428d94c27be4fde75d73a98dc2074feb45
- Sigstore transparency entry: 1417791931
- Sigstore integration time: May 1, 2026
Source repository:
- Permalink: sinhakrishnendu/babappasnake@105023b1fa4ee92390c83ef46e7b09f919858823
- Branch / Tag: refs/tags/v1.3.0
- Owner: https://github.com/sinhakrishnendu
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@105023b1fa4ee92390c83ef46e7b09f919858823
- Trigger Event: release

babappasnake 1.3.0

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

BABAPPASnake

Quick Start

1) Install tools and package

2) Run (non-interactive)

3) Run (interactive guided mode)

Manual Contents

What The Pipeline Does

Installation And Environment

Required external tools

Optional but strongly recommended

Python package

Verify installation

Input Requirements

--prot proteomes directory

--query query FASTA

--cds CDS FASTA

--outgroup optional

Orthogroup Discovery Strategy

Default strategy (--orthogroup-method rbh)

Fallback comparison mode (--orthogroup-method rbh_fallback)

How OrthoFinder query mapping is done

Strict 1:1 rule used downstream

Direct OrthoFinder mode (--orthogroup-method orthofinder)

Running Modes

Interactive guided mode (default)

Non-interactive mode

Example: enable optional GARD screening

Two-stage run (no CDS at start)

CDS Mapping And QC

Alignment, Trimming, Tree, And Selection Steps

Alignment methods

Trimming model

Tree inference and rooting

Optional recombination screening (HyPhy GARD)

HyPhy and branch-site selection

ASR Extraction Of Selected Branches

Outputs And Directory Layout

Resume, Rerun, And Reproducibility

CLI Reference

Troubleshooting

"Missing required external tools"

Run stops at WAITING_FOR_CDS.txt

OrthoFinder/RBH finds no usable orthogroup

macOS metadata files in proteomes

codeml warnings

Developer And Release Notes

Local editable install

Build package

Publish to PyPI

Release checklist

License

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

`--prot` proteomes directory

`--query` query FASTA

`--cds` CDS FASTA

`--outgroup` optional

Default strategy (`--orthogroup-method rbh`)

Fallback comparison mode (`--orthogroup-method rbh_fallback`)

Direct OrthoFinder mode (`--orthogroup-method orthofinder`)

Run stops at `WAITING_FOR_CDS.txt`