A simple Snakemake pipeline for episodic selection analysis

babappasnake

babappasnake is a command-line workflow for episodic positive selection analysis on a single orthogroup. It is designed for practical use: one command to launch, automatic checkpointing, resumable execution, and clear summary outputs.

What the pipeline does

  1. Runs reciprocal best-hit (RBH) ortholog discovery.
  2. Builds an orthogroup from your query and proteomes.
  3. Maps user CDS records to orthogroup proteins.
  4. Creates protein and codon alignments with babappalign.
  5. Trims alignments with ClipKIT (kpic-smart-gap).
  6. Removes terminal stop codon artifacts after ClipKIT on the codon alignment.
  7. Infers an ML tree with IQ-TREE (-m MFP -B 1000 -redo).
  8. Runs HyPhy aBSREL and MEME on leaf branches (--branches Leaves).
  9. Selects foreground branches from aBSREL using dynamic thresholding.
  10. Runs branch-site codeml only for selected branches (alt and null models).
  11. Runs codeml ancestral sequence reconstruction (ASR).
  12. Produces final summary and tabular outputs.
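As a rough illustration of step 1, the sketch below shows the general idea behind reciprocal best-hit filtering with a coverage cutoff. It is not the package's implementation: the function name, the input dictionaries, and the way coverage is represented are all assumptions made for this example. Only the 0.70 default is taken from the --coverage option documented below.

```python
# Hypothetical sketch of reciprocal best-hit (RBH) filtering.
# Input shapes are assumptions, not babappasnake's internal data model.

def reciprocal_best_hits(forward, reverse, min_coverage=0.70):
    """Return (query_id, subject_id) pairs that are each other's best hit.

    forward: query_id   -> (best subject_id, alignment coverage in [0, 1])
    reverse: subject_id -> (best query_id,   alignment coverage in [0, 1])
    A pair is kept only if both directions agree and both coverages
    reach min_coverage (assumed meaning of "reciprocal coverage minimum").
    """
    pairs = []
    for query_id, (subject_id, cov_forward) in forward.items():
        back = reverse.get(subject_id)
        if back is None:
            continue  # subject has no hit back to the query set
        back_query_id, cov_reverse = back
        if (
            back_query_id == query_id
            and cov_forward >= min_coverage
            and cov_reverse >= min_coverage
        ):
            pairs.append((query_id, subject_id))
    return pairs
```

A pair is dropped if either direction points elsewhere or falls below the coverage cutoff, which is why raising --coverage can shrink the orthogroup.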

Installation (for end users)

pip installs Python packages, but external bioinformatics binaries must also be available on PATH.

Recommended setup (conda + pip)

conda create -n babappasnake -c conda-forge -c bioconda \
  python=3.11 blast iqtree hyphy paml clipkit pip
conda activate babappasnake
pip install babappalign babappasnake

Quick verification

babappasnake --help
which blastp makeblastdb hyphy codeml clipkit babappalign

Notes:

  • IQ-TREE binary detection is flexible (iqtree2, iqtree3, or iqtree).
  • On Apple Silicon, iqtree3 is common and is accepted automatically.
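If you prefer to check the environment from Python, a minimal sketch along these lines does the same job as the `which` command above. The tool list is copied from that command and the IQ-TREE names from the note; the function names and the order in which IQ-TREE names are tried are assumptions and may not match what babappasnake does internally.

```python
import shutil

# Tool names taken from the verification command above.
REQUIRED_TOOLS = ["blastp", "makeblastdb", "hyphy", "codeml", "clipkit", "babappalign"]
# Any one of these is enough; the search order here is an assumption.
IQTREE_CANDIDATES = ["iqtree2", "iqtree3", "iqtree"]


def find_iqtree(which=shutil.which):
    """Return the first IQ-TREE binary name found on PATH, or None."""
    for name in IQTREE_CANDIDATES:
        if which(name) is not None:
            return name
    return None


def missing_tools(which=shutil.which):
    """Return the names of required tools that are not on PATH."""
    missing = [tool for tool in REQUIRED_TOOLS if which(tool) is None]
    if find_iqtree(which) is None:
        missing.append("iqtree2|iqtree3|iqtree")
    return missing
```

Calling `missing_tools()` in the activated conda environment should return an empty list when everything is installed.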

Input requirements

  • --prot: directory containing proteome FASTA files.
  • --query: protein FASTA containing the query sequence.
  • --cds (optional at first run): CDS FASTA for the orthogroup.

Quick start

Case A: you already have the CDS file

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --outdir run01 \
  --threads 8

Case B: two-stage run (CDS provided later)

Run once:

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --outdir run01 \
  --threads 8

The first run stops intentionally and writes:

  • run01/orthogroup/orthogroup_proteins.fasta
  • run01/orthogroup/orthogroup_headers.txt
  • run01/orthogroup/WAITING_FOR_CDS.txt

Then place your CDS FASTA at:

  • run01/user_supplied/orthogroup_cds.fasta

Re-run the same command. The workflow resumes automatically.
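When scripting the two-stage run, it can help to check where a run directory stands before re-launching. The sketch below uses only the file paths listed above; the function name and the status labels are hypothetical, and whether the workflow removes WAITING_FOR_CDS.txt after resuming is not assumed either way.

```python
from pathlib import Path


def cds_stage_status(outdir):
    """Classify a run directory for the two-stage (CDS provided later) workflow.

    Returns one of three illustrative labels:
      "ready_to_resume" - a non-empty CDS FASTA is in place
      "waiting_for_cds" - the first run stopped and no CDS has been supplied
      "no_marker"       - neither file is present
    """
    root = Path(outdir)
    cds = root / "user_supplied" / "orthogroup_cds.fasta"
    marker = root / "orthogroup" / "WAITING_FOR_CDS.txt"

    if cds.is_file() and cds.stat().st_size > 0:
        return "ready_to_resume"
    if marker.exists():
        return "waiting_for_cds"
    return "no_marker"
```

For example, `cds_stage_status("run01")` would return "waiting_for_cds" right after the first run in Case B.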

Dynamic significance logic

Foreground selection from aBSREL uses dynamic p-thresholding:

  • start at p <= 0.05
  • if no branch passes, increase by 0.01
  • stop as soon as at least one branch is found
  • hard upper bound: 1.0
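The rules above can be sketched in a few lines of Python. This is an illustration of the stated logic, not the package's code: the function name and the input format (a mapping from branch name to aBSREL p-value) are assumptions.

```python
def select_foregrounds(pvalues, start=0.05, step=0.01, upper=1.0):
    """Relax the p-value threshold until at least one branch passes.

    pvalues: mapping of branch name -> aBSREL p-value (assumed input shape).
    Returns (threshold_used, sorted list of selected branch names).
    """
    # Count steps as integers so repeated float addition cannot drift.
    n_steps = round((upper - start) / step)
    for i in range(n_steps + 1):
        threshold = round(start + i * step, 10)
        hits = sorted(b for b, p in pvalues.items() if p <= threshold)
        if hits:
            return threshold, hits
    # Nothing passed even at the upper bound (e.g. no branches were tested).
    return upper, []
```

With p-values of 0.20, 0.073, and 0.50, no branch passes at 0.05, 0.06, or 0.07, so the threshold settles at 0.08 and only the 0.073 branch is selected. Because the threshold can rise well above 0.05, the chosen value recorded in hyphy/foreground_threshold.json is worth checking when interpreting results.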

Only the selected branches go to branch-site codeml. For each selected branch, codeml is fitted twice (alternative and null models), and Benjamini-Hochberg (BH) FDR correction is then applied.
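For readers unfamiliar with the correction step, here is a minimal sketch of the standard Benjamini-Hochberg adjustment. It assumes each selected branch yields one p-value from comparing its alternative and null fits; how babappasnake derives those p-values is not shown here, and the function name is illustrative.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A branch is then called significant when its adjusted value is at or below the chosen FDR level. With a single selected branch the adjusted value equals the raw p-value.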

Output guide

All outputs are written under --outdir (default: babappasnake_run).

Most important files:

  • summary/episodic_selection_summary.txt: human-readable final report.
  • hyphy/foreground_threshold.json: selected dynamic threshold and hit count.
  • hyphy/significant_foregrounds.tsv: selected aBSREL foreground branches.
  • branchsite/branchsite_results.tsv: codeml branch-site statistics and BH significance.
  • asr/asr_done.json: ASR completion record.
  • asr/mlc_asr.txt: codeml ASR main output.
  • asr/rst: reconstructed ancestral states.
  • tree/orthogroup.treefile: inferred ML tree.
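To pull the key results into a script or notebook, something like the following sketch may be enough. It relies only on the file paths listed above and assumes the TSV has a header row; it deliberately does not assume any particular JSON keys or column names, and the function name is hypothetical.

```python
import csv
import json
from pathlib import Path


def load_results(outdir="babappasnake_run"):
    """Load the dynamic-threshold record and the branch-site table.

    Returns (threshold_record, rows), where threshold_record is whatever
    JSON object the file contains and rows is a list of dicts keyed by
    the TSV header (assumes the first line is a header).
    """
    root = Path(outdir)
    with open(root / "hyphy" / "foreground_threshold.json") as fh:
        threshold_record = json.load(fh)
    with open(root / "branchsite" / "branchsite_results.tsv", newline="") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    return threshold_record, rows
```

Printing `threshold_record` and `rows[0].keys()` is a quick way to see which fields your installed version actually writes.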

CLI reference

babappasnake --prot PROTEOMES_DIR --query QUERY_FASTA [options]

Options:

  • --cds PATH: CDS FASTA (optional for initial run).
  • --outdir PATH: output directory (default: babappasnake_run).
  • --coverage FLOAT: RBH reciprocal coverage minimum (default: 0.70).
  • --threads INT: parallel threads/cores (default: 4).
  • --absrel-p FLOAT: retained for compatibility and stored in the config; branch selection itself uses dynamic thresholding.
  • --meme-p FLOAT: MEME reporting threshold in summary (default: 0.1).
  • --use-clipkit {yes,no}: enable/disable ClipKIT (default: yes).
  • --snake-args "...": extra raw arguments forwarded to Snakemake.

Example with additional Snakemake flags:

babappasnake \
  --prot /path/to/proteomes \
  --query /path/to/query.fasta \
  --cds /path/to/orthogroup_cds.fasta \
  --outdir run01 \
  --threads 8 \
  --snake-args "--keep-going"
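When launching many orthogroups from a script, the same command can be assembled in Python. The helper below is a sketch that uses only the options documented above; the function name and its defaults-handling are this example's own, not part of the package.

```python
import subprocess


def build_command(prot, query, outdir="babappasnake_run", cds=None,
                  threads=4, snake_args=None):
    """Build a babappasnake argument list from the documented options."""
    cmd = [
        "babappasnake",
        "--prot", str(prot),
        "--query", str(query),
        "--outdir", str(outdir),
        "--threads", str(threads),
    ]
    if cds is not None:
        cmd += ["--cds", str(cds)]
    if snake_args:
        # Passed as one argument, matching the quoted form used on the shell.
        cmd += ["--snake-args", snake_args]
    return cmd


# Example (not executed here):
# subprocess.run(build_command("/path/to/proteomes", "/path/to/query.fasta",
#                              outdir="run01", threads=8), check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.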

Troubleshooting

"Missing required external tools"

Install missing binaries and ensure they are on PATH in the same shell where you run babappasnake.

Run stops with WAITING_FOR_CDS.txt

This is expected for two-stage usage. Place your CDS FASTA at user_supplied/orthogroup_cds.fasta inside the output directory and re-run the same command.

codeml returns non-zero or writes warnings

The workflow tolerates codeml runs that emit warnings or return a non-zero exit code, as long as valid output files are produced. A hard failure is triggered only when required codeml result files are missing.

Can I resume after interruption?

Yes. Re-run the same command; Snakemake resumes and re-runs incomplete jobs as needed.

Developer/local source install

For local development:

pip install -e .

The package entry-point command is still babappasnake.

Maintainer release checklist (GitHub + PyPI)

  1. Update version in pyproject.toml.
  2. Commit and push to GitHub.
  3. Build distributions:
       python -m pip install --upgrade build twine
       python -m build
       twine check dist/*
  4. Publish to PyPI:
       twine upload dist/*
  5. Optionally create and push a matching git tag.

License

MIT
