Simulation-trained branch-site selection support from user-supplied codon MSAs and trees

BABAPPA

BABAPPA is the Branch-site Alignment-Bias-Aware Probabilistic Positive-selection Analyzer.

Current source version: 0.5.2-alpha
Release archive label: 0.5.0-alpha
Status: research-alpha, simulation-trained, guarded empirical diagnostic workflow

BABAPPA supports branch-site positive-selection investigation from a user-supplied codon MSA and treefile. The main user-facing command treats the supplied MSA as the authoritative alignment, scores requested foreground branches, and reports candidate branch-site episodic-selection support using a deployable simulation-trained model. Alignment ensembles, codeml/HyPhy comparison, and matched-null calibration are optional diagnostic layers for deeper evaluation.

BABAPPA is not a finalized empirical positive-selection discovery engine. Empirical positive-selection claims remain blocked until simulation-matched null calibration, reference-tool comparison, biological controls, and dataset-specific interpretation are complete.

Contents

  • Project status and scientific boundary
  • What BABAPPA does
  • What BABAPPA does not do
  • Installation
  • Quick start
  • Typical workflows
  • Input requirements
  • Aligners
  • Output interpretation
  • Reproducibility
  • Storage cleanup and user maintenance
  • Troubleshooting
  • Citation and manuscript status
  • Developer notes

Project Status And Scientific Boundary

BABAPPA has completed conservative explicit branch-truth simulation validation at 100,000 families on Apple Silicon/MPS. It has a validated deployable simulation-trained model package:

deployable_model_conservative_branch_site_100k_mps

The deployable package validates successfully:

  • status: ok
  • failures: 0
  • warnings: 0

The empirical bridge can process small real empirical diagnostic pilots, but BABAPPA scores are not final discovery claims.

Historical validation note: Branch-conditioned 10K streamed validation completed before the final 100K MPS run. Branch-conditioned labels may be proxy-derived in older or non-explicit workflows, so BABAPPA now distinguishes those cases from explicit branch-site simulator truth. A previous gate stated, "Final 100K is deferred until explicit branch-truth validation passes"; that gate has now been satisfied with a conditional-pass 100K explicit-truth validation, while empirical discovery claims remain blocked.

The simulation phase is oracle-supervised because simulator truth is known during validation. That oracle-supervised evidence is never supplied as an empirical inference input.

Empirical interpretation warning

A BABAPPA diagnostic-positive result is not, by itself, a publishable empirical positive-selection claim. It must be interpreted with matched-null calibration, reference-tool comparison, biological controls, and dataset-specific justification.

What BABAPPA Does

BABAPPA can:

  • predict branch-site support directly from a user-provided aligned codon MSA and matching treefile;
  • score one foreground tip, a comma-separated set of foreground tips, or all tree tips;
  • validate empirical CDS FASTA and tree inputs;
  • run optional alignment ensembles for diagnostic sensitivity analysis;
  • construct site maps and method-policy reports;
  • extract conservative empirical branch-site features;
  • audit empirical feature tables for forbidden truth-derived columns;
  • score branch-site rows using a packaged simulation-trained model;
  • classify empirical inputs as in_domain, borderline, or out_of_domain;
  • mark OOD cases as diagnostic_only;
  • produce guarded diagnostic reports;
  • prepare and parse codeml/HyPhy-style reference workflows;
  • plan simulation-matched empirical calibration;
  • audit storage and generate safe cleanup scripts for large reproducible outputs.

BABAPPA helps decide whether a dataset is suitable for deeper positive-selection analysis. It is a diagnostic decision-support framework, not an automatic discovery machine.

What BABAPPA Does Not Do

BABAPPA does not:

  • prove positive selection by itself;
  • replace codeml, HyPhy, biological controls, or expert interpretation;
  • make final empirical discovery claims without calibration and controls;
  • use simulator truth during empirical inference;
  • silently accept out-of-domain empirical inputs as positive-selection calls;
  • serve as a clinical, agricultural, regulatory, or policy decision tool.

Long-Run Handoff Policy

Codex and other assisted-maintenance sessions should not execute heavy empirical calibration, broad empirical scans, retraining, 10K/100K simulations, or long aligner/reference batches. The expected workflow is to generate reproducible USER-RUN scripts, validators, parsers, and reports; the user runs long jobs locally or offline and returns summaries/logs for interpretation.

Installation

Install from PyPI:

python -m pip install babappa

Clone and install from source:

git clone <REPOSITORY_URL> BABAPPA
cd BABAPPA
python -m pip install -e .

For neural scoring, install BABAPPA in an environment with PyTorch available, for example the molevo conda environment used during development. The PyPI/source package includes the lightweight deployable model package used by the default predictor.

For development and tests:

python -m pip install -e ".[dev]"

Check the installed version:

babappa --version

Run tests:

python -m pytest -q

Expected test result at the time of the handoff:

351 passed, 58 skipped

External Dependencies

Required Python dependencies are installed through the package. Empirical and reference workflows may also need the external command-line tools below, and neural scoring needs PyTorch:

  • MAFFT
  • MUSCLE
  • BABAPPAlign
  • optional IQ-TREE2/IQ-TREE for tree building
  • optional codeml from PAML
  • optional HyPhy
  • optional PyTorch for deployable model scoring

Check aligners:

babappa check-aligners

BABAPPAlign requires the BABAPPAScore model cache:

mkdir -p "$HOME/.cache/babappalign/models"
curl -L "https://zenodo.org/record/18053201/files/babappascore.pt" -o "$HOME/.cache/babappalign/models/babappascore.pt"

The BABAPPAlign model is small enough to keep. Generated BABAPPAlign embedding caches can be very large; they are safe to delete because they can be regenerated.

Apple Silicon / MPS

Apple Silicon/MPS support is research-alpha. It is useful for smoke tests, lightweight empirical scoring, and the completed 100K MPS validation.

Recommended shell settings:

export PYTORCH_ENABLE_MPS_FALLBACK=1
export OMP_NUM_THREADS=8
export MKL_NUM_THREADS=8
export OPENBLAS_NUM_THREADS=8
export NUMEXPR_NUM_THREADS=8

Check neural environment:

babappa check-neural-env

Run the MPS smoke test:

babappa smoke-mps-training --outdir mps_smoke --device auto --batch-size 32 --max-items 512
babappa validate-mps-smoke --smoke-dir mps_smoke

Light benchmark:

babappa benchmark-apple-silicon --outdir apple_silicon_benchmark --device auto --batch-sizes 32,64,128 --max-items 1024

If MPS fails, retry the relevant scoring stage with --device cpu or a smaller batch size.
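The fallback advice above can be sketched in Python. This is a minimal illustration of how a device string might be resolved, not BABAPPA's actual implementation: the function name resolve_device and the preference order (MPS, then CUDA, then CPU) are assumptions, and only standard PyTorch availability checks are used.

```python
def resolve_device(requested="auto"):
    """Pick a PyTorch device string, falling back to CPU when nothing else is usable.

    Hypothetical helper: the preference order is an assumption, not BABAPPA's code.
    """
    try:
        import torch  # optional dependency; absent in lightweight installs
    except ImportError:
        return "cpu"
    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    has_mps = bool(mps is not None and mps.is_available())
    has_cuda = torch.cuda.is_available()
    if requested == "auto":
        return "mps" if has_mps else "cuda" if has_cuda else "cpu"
    # An explicit request for an unavailable accelerator degrades to CPU.
    if requested == "mps" and not has_mps:
        return "cpu"
    if requested == "cuda" and not has_cuda:
        return "cpu"
    return requested


device = resolve_device("auto")
print(device)
```

Degrading to CPU instead of raising keeps a smoke test usable on machines without an accelerator, at the cost of hiding a misconfigured environment; babappa check-neural-env remains the authoritative check.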

Quick Start

Inspect commands:

babappa --help

Launch the interactive predictor:

babappa

BABAPPA will ask for:

  1. aligned codon MSA FASTA path
  2. treefile path
  3. foreground mode: leaves/all/specific

The default mode, leaves, scores every tree tip. The specific mode prompts for comma-separated tree-tip labels.

Main End-User Command: MSA + Tree To Branch-Site Calls

If you already have a codon MSA and a tree whose tip labels match the MSA IDs, this is the intended front door:

babappa predict-branch-sites \
  --msa my_gene.codon_aligned.fasta \
  --tree my_gene.treefile \
  --foreground all \
  --model-package deployable_model_conservative_branch_site_100k_mps \
  --outdir my_gene_babappa_prediction \
  --device auto
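Because this command expects tree tip labels to match the MSA IDs, it can help to compare the two sets before running it. The sketch below is a standalone illustration, not part of the BABAPPA CLI: the helper names are hypothetical, and the regular expression handles only plain unquoted Newick labels (no quoted names or bracketed comments).

```python
import re


def fasta_ids(fasta_text):
    """Return sequence IDs (first token after '>') in file order."""
    return [line[1:].split()[0] for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()]


def newick_tip_labels(newick_text):
    """Return unquoted tip labels: names that directly follow '(' or ','.

    Internal node labels follow ')' and are therefore not matched.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick_text)


def compare_ids(fasta_text, newick_text):
    """Report IDs present on only one side of the MSA/tree pair."""
    msa = set(fasta_ids(fasta_text))
    tips = set(newick_tip_labels(newick_text))
    return {"only_in_msa": sorted(msa - tips), "only_in_tree": sorted(tips - msa)}


# Toy inputs for illustration only.
example_fasta = ">A\nATGAAA\n>B\nATGAAG\n>D\nATGAAA\n"
example_tree = "((A:0.1,B:0.2):0.3,C:0.4);"
report = compare_ids(example_fasta, example_tree)
print(report)
```

Both lists being empty is the condition the command relies on; any entry in either list points to a label mismatch to resolve first.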

To score only selected tree tips as foreground branches:

babappa predict-branch-sites \
  --msa my_gene.codon_aligned.fasta \
  --tree my_gene.treefile \
  --foreground Arabidopsis_thaliana,Arabidopsis_lyrata \
  --model-package deployable_model_conservative_branch_site_100k_mps \
  --outdir my_gene_babappa_prediction \
  --device mps

BABAPPA does not realign input for this command. The user-supplied MSA is the alignment used for prediction. The prediction table reports both msa_codon_site/aligned_codon_site and branch_degapped_codon_site, so users can locate a call in the alignment column and in the de-gapped sequence coordinate of the scored branch.
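The relationship between the two coordinate systems can be sketched as follows. This is an illustration under stated assumptions, not BABAPPA's code: it assumes 1-based codon numbering and treats any codon column containing a gap character as having no de-gapped coordinate. The function name is hypothetical.

```python
def degapped_codon_sites(aligned_seq, gap_chars="-."):
    """Map each 1-based alignment codon column to the 1-based codon index
    in the de-gapped sequence, or None when the codon column contains a gap."""
    if len(aligned_seq) % 3 != 0:
        raise ValueError("aligned codon sequence length must be a multiple of 3")
    mapping = {}
    degapped_index = 0
    for col, start in enumerate(range(0, len(aligned_seq), 3), start=1):
        codon = aligned_seq[start:start + 3]
        if any(ch in gap_chars for ch in codon):
            mapping[col] = None  # fully or partially gapped codon: no coordinate
        else:
            degapped_index += 1
            mapping[col] = degapped_index
    return mapping


# Alignment columns 1..4; column 2 is a gap in this sequence.
site_map = degapped_codon_sites("ATG---AAACCC")
print(site_map)
```

In this toy sequence, alignment codon column 3 corresponds to de-gapped codon 2, which is the kind of offset the two reported columns let a reader resolve.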

Main outputs:

  • branch_site_predictions.tsv: site-by-branch scores and calls
  • branch_predictions.tsv: branch-level support summary
  • gene_summary.tsv: gene-level diagnostic summary
  • prediction_report.md: human-readable report
  • qc_report.md: input/applicability summary

Dry-run mode validates the MSA/tree and builds the feature table without model scoring:

babappa predict-branch-sites \
  --msa my_gene.codon_aligned.fasta \
  --tree my_gene.treefile \
  --foreground all \
  --outdir my_gene_babappa_dryrun \
  --dry-run

Internal Pipeline Commands

Validate the deployable package:

babappa validate-deployable-model-package --package-dir deployable_model_conservative_branch_site_100k_mps

Validate a tiny empirical input:

babappa validate-empirical-input \
  --cds-fasta tests/data/empirical_smoke/tiny_empirical.cds.fasta \
  --tree tests/data/empirical_smoke/tiny_empirical.treefile \
  --foreground taxon1 \
  --outdir empirical_input_smoke

Run a tiny empirical alignment ensemble:

babappa run-empirical-alignment-ensemble \
  --cds-fasta tests/data/empirical_smoke/tiny_empirical.cds.fasta \
  --tree tests/data/empirical_smoke/tiny_empirical.treefile \
  --foreground taxon1 \
  --outdir empirical_alignment_smoke \
  --methods identity,mafft,babappalign,muscle \
  --require-babappalign true \
  --threads 4

Extract empirical branch-site features:

babappa extract-empirical-branch-site-features \
  --empirical-validation-dir empirical_input_smoke \
  --alignment-dir empirical_alignment_smoke \
  --deployable-model-package deployable_model_conservative_branch_site_100k_mps \
  --outdir empirical_features_smoke \
  --foreground taxon1

Audit feature safety:

babappa audit-empirical-features \
  --features empirical_features_smoke/empirical_branch_site_features.tsv \
  --deployable-model-package deployable_model_conservative_branch_site_100k_mps \
  --outdir empirical_feature_audit_smoke

Run applicability/OOD gate:

babappa empirical-applicability \
  --empirical-validation-dir empirical_input_smoke \
  --empirical-feature-dir empirical_features_smoke \
  --deployable-model-package deployable_model_conservative_branch_site_100k_mps \
  --outdir empirical_applicability_smoke

Score only after validation, feature audit, and applicability have run:

babappa score-empirical-branch-sites \
  --features empirical_features_smoke/empirical_branch_site_features.tsv \
  --deployable-model-package deployable_model_conservative_branch_site_100k_mps \
  --applicability-dir empirical_applicability_smoke \
  --outdir empirical_scores_smoke \
  --device auto

Plan simulation-matched calibration before writing the final diagnostic report:

babappa plan-simulation-matched-calibration \
  --empirical-validation-dir empirical_input_smoke \
  --deployable-model-package deployable_model_conservative_branch_site_100k_mps \
  --outdir simulation_matched_calibration_plan_smoke

Generate report:

babappa make-empirical-branch-site-report \
  --outdir empirical_report_smoke \
  --empirical-validation-dir empirical_input_smoke \
  --alignment-dir empirical_alignment_smoke \
  --feature-dir empirical_features_smoke \
  --feature-audit-dir empirical_feature_audit_smoke \
  --applicability-dir empirical_applicability_smoke \
  --scoring-dir empirical_scores_smoke \
  --simulation-matched-calibration-plan simulation_matched_calibration_plan_smoke \
  --deployable-model-package deployable_model_conservative_branch_site_100k_mps

Typical Workflows

1. Simulation Validation Workflow

Use simulation commands for development and validation, not empirical discovery.

Tiny simulation:

babappa simulate --outdir sim_smoke --n-families 3 --n-taxa 6 --n-codons 60 --seed 42 --positive-rate 0.5 --saturation-tier moderate
babappa validate-sim --sim-dir sim_smoke
babappa audit-sim --sim-dir sim_smoke --outdir sim_smoke/audit

Alignment and feature-building commands include:

babappa align-sim --sim-dir sim_smoke --outdir align_smoke
babappa validate-align --align-dir align_smoke
babappa build-site-map --sim-dir sim_smoke --align-dir align_smoke --outdir site_map_smoke
babappa validate-site-map --site-map-dir site_map_smoke

Heavy 10K/100K plans are user-run only and should not be launched casually.

2. Deployable Model Package Validation

The validated package is:

deployable_model_conservative_branch_site_100k_mps

Validate package integrity:

babappa validate-deployable-model-package --package-dir deployable_model_conservative_branch_site_100k_mps

Smoke-load package:

babappa smoke-load-deployable-model \
  --package-dir deployable_model_conservative_branch_site_100k_mps \
  --device auto \
  --outdir deployable_model_load_smoke

The package includes:

  • model_manifest.json
  • model_card.md
  • feature_schema.json
  • calibration_schema.json
  • training_envelope.json
  • tier_models/
  • tier_calibrations/
  • checksums.sha256
  • validation_summary.json
  • limitations.md
  • README.md

3. Real Empirical Input Staging

Prepare a real pilot workspace:

babappa prepare-real-empirical-pilot-workspace --workspace real_empirical_pilot --max-families 12
babappa prepare-real-pilot-inputs --workspace real_empirical_pilot --manifest real_empirical_pilot_panel.tsv --outdir real_empirical_pilot/input_staging

Canonical input paths:

real_empirical_pilot/input/cds/<panel_id>.cds.fasta
real_empirical_pilot/input/trees/<panel_id>.treefile

Import one family:

babappa import-real-pilot-family \
  --workspace real_empirical_pilot \
  --panel-id FAMILY_ID \
  --gene-family "GENE_FAMILY" \
  --species-group "SPECIES_GROUP" \
  --cds-fasta /path/to/family.cds.fasta \
  --tree-file /path/to/family.treefile \
  --foreground TAXON_NAME \
  --expected-category likely_positive \
  --reference-status planned \
  --notes "real pilot candidate"

Batch import:

babappa import-real-pilot-batch --workspace real_empirical_pilot --batch-manifest real_empirical_pilot/import_batch.tsv

Validate readiness:

babappa validate-real-pilot-readiness \
  --workspace real_empirical_pilot \
  --manifest real_empirical_pilot_panel.tsv \
  --outdir real_empirical_pilot/readiness

Do not run the pilot until the readiness report shows ready_to_run: true.

4. Empirical Diagnostic Workflow

Screen a family before scoring:

babappa prefilter-empirical-family \
  --cds-fasta real_empirical_pilot/input/cds/FAMILY_ID.cds.fasta \
  --tree-file real_empirical_pilot/input/trees/FAMILY_ID.treefile \
  --foreground TAXON_NAME \
  --outdir real_empirical_pilot/prefilter/FAMILY_ID \
  --max-mean-pdistance 0.35 \
  --min-taxa 6 \
  --min-codons 100
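To make the --max-mean-pdistance threshold concrete, the sketch below computes a mean pairwise p-distance (the proportion of differing sites among compared sites) from already-aligned sequences. It is an assumption-laden illustration: BABAPPA's own treatment of gaps, ambiguity codes, and nucleotide versus codon comparison is not documented here, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b, skip="-?N"):
    """Proportion of differing sites among sites where neither sequence
    has a gap or an ambiguity character listed in skip."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in skip or y in skip:
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else float("nan")


def mean_pairwise_p_distance(seqs):
    """Average p-distance over all sequence pairs, ignoring undefined pairs."""
    values = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    values = [v for v in values if v == v]  # NaN is the only value unequal to itself
    return sum(values) / len(values) if values else float("nan")


# Toy alignment for illustration only.
aligned = ["ATGAAACCC", "ATGAAGCCC", "ATG---CCT"]
mean_p = mean_pairwise_p_distance(aligned)
passes_prefilter = mean_p <= 0.35  # threshold taken from the command above
print(round(mean_p, 6), passes_prefilter)
```

A family such as WRKY_candidate_01, with a reported mean p-distance of 0.725799, would fail this threshold by a wide margin.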

Run a small guarded panel:

babappa run-empirical-pilot-panel \
  --panel-manifest real_empirical_pilot/manifest/real_empirical_pilot_panel.tsv \
  --deployable-model-package deployable_model_conservative_branch_site_100k_mps \
  --outdir real_empirical_pilot/babappa_run \
  --methods identity,mafft,babappalign,muscle \
  --device auto \
  --max-families 12

Summarize and validate the panel:

babappa summarize-empirical-pilot-panel --panel-run real_empirical_pilot/babappa_run --outdir real_empirical_pilot/summary
babappa validate-empirical-pilot-summary --summary-dir real_empirical_pilot/summary

5. WRKY-Style Close-Taxa Pilot Workflow

For Arabidopsis-like WRKY families, do not mix very distant plant taxa at first. Start with closer Brassicaceae-heavy taxa:

babappa recommend-target-taxa --pilot-type plant_close --outdir real_empirical_pilot/target_taxa_recommendations

Plan an OOD-aware family build:

babappa plan-ood-aware-family-build \
  --family-id WRKY_candidate_02_close \
  --query-species Arabidopsis_thaliana \
  --query-gene-or-locus AT2G38470 \
  --target-taxa-file real_empirical_pilot/target_taxa_recommendations/recommended_target_taxa.tsv \
  --outdir real_empirical_pilot/acquisition_plans/WRKY_candidate_02_close \
  --max-mean-pdistance 0.35 \
  --min-taxa 6 \
  --min-codons 100

Current WRKY interpretation:

  • WRKY_candidate_01: OOD stress test, mean p-distance 0.725799, diagnostic-only, no positive call.
  • WRKY_candidate_02_close: in-domain close-taxa WRKY33/AT2G38470 diagnostic pilot, BABAPPA diagnostic-positive, max gene support 0.177189, called branch-site rows 6954.
  • codeml Model A vs null: LRT 0.0, p-value 1.0, negative.
  • HyPhy aBSREL foreground p-value: 1.0, negative.
  • HyPhy MEME minimum p-value: 0.0641705, negative at 0.05.
  • Concordance: BABAPPA_only.
  • Matched-null calibration: 100 feature-level matched nulls completed and validated with the deployable model package.
  • Null result: called branch-site rows were unusual versus the feature-matched null (p_empirical_called_rows=0.009900990099009901), but max gene support was not unusual (p_empirical_support=1.0).

Correct interpretation: BABAPPA-only with mixed feature-level null support; still inconclusive as an empirical discovery claim because codeml and HyPhy are negative and the null calibration is feature-level rather than full raw sequence simulation/alignment replay.
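The reported p_empirical_called_rows value equals 1/101, which is what the common add-one empirical p-value estimator (r + 1) / (n + 1) gives when none of n = 100 null replicates reaches the observed statistic. Whether BABAPPA uses exactly this estimator is an assumption; the sketch below only shows the arithmetic, and the null values in it are made up for illustration.

```python
def empirical_p_value(observed, null_values):
    """Add-one upper-tail empirical p-value: (r + 1) / (n + 1),
    where r counts null replicates at least as large as the observed value."""
    r = sum(1 for v in null_values if v >= observed)
    return (r + 1) / (len(null_values) + 1)


# Made-up null replicates (not BABAPPA output): 100 called-row counts, all below 6954.
toy_null_called_rows = list(range(100))
p_called_rows = empirical_p_value(6954, toy_null_called_rows)

# Made-up null where every replicate matches or exceeds the observed support.
toy_null_support = [0.2] * 100
p_support = empirical_p_value(0.177189, toy_null_support)

print(p_called_rows, p_support)
```

With 100 nulls, 1/101 is the smallest value this estimator can return, so the called-rows result says only that the observation exceeded every null replicate, not how far beyond the null it lies.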

6. Simulation-Matched Calibration Planning

Plan calibration from empirical QC:

babappa plan-simulation-matched-calibration \
  --empirical-validation-dir real_empirical_pilot/babappa_run/per_family/FAMILY_ID/empirical_input_validation \
  --deployable-model-package deployable_model_conservative_branch_site_100k_mps \
  --outdir real_empirical_pilot/babappa_run/per_family/FAMILY_ID/simulation_matched_calibration_plan

Summarize plan:

babappa summarize-simulation-matched-calibration-plan \
  --plan-dir real_empirical_pilot/babappa_run/per_family/FAMILY_ID/simulation_matched_calibration_plan \
  --outdir real_empirical_pilot/babappa_run/per_family/FAMILY_ID/simulation_matched_calibration_summary

The WRKY 100-null feature-level matched calibration has completed once under user control. It should be treated as diagnostic support only, not as a final empirical p-value system or discovery proof.

Dry-run the evidence-pack calibration command before launching anything long:

babappa run-simulation-matched-null-calibration \
  --evidence-pack real_empirical_pilot/evidence_packs/WRKY_candidate_02_close \
  --outdir real_empirical_pilot/calibration_runs/WRKY_candidate_02_close_null100_dryrun \
  --n-null 100 \
  --seed 20260530 \
  --device mps \
  --dry-run

Dry-run mode validates the evidence pack and writes:

  • calibration_run_plan.json
  • calibration_run_plan.md
  • calibration_input_validation.tsv
  • calibration_status.json
  • calibration_status.md

It does not write null distributions, null percentiles, or discovery-supporting results.

To rerun the feature-level matched-null calibration:

babappa run-simulation-matched-null-calibration \
  --evidence-pack real_empirical_pilot/evidence_packs/WRKY_candidate_02_close \
  --outdir real_empirical_pilot/calibration_runs/WRKY_candidate_02_close_null100 \
  --n-null 100 \
  --seed 20260530 \
  --device mps

Current implementation note: the evidence-pack command is operational for safe dry-run/planning and for conservative feature-level matched-null scoring through the deployable model package. This is diagnostic calibration support, not a full raw sequence simulation plus alignment replay. Do not interpret staged or dry-run files as completed calibration, and do not treat feature-level null support as a standalone empirical discovery claim.

7. Classical Reference Workflow Planning

Plan codeml/HyPhy templates:

babappa plan-classical-reference-workflows \
  --panel-manifest real_empirical_pilot/manifest/real_empirical_pilot_panel.tsv \
  --outdir real_empirical_pilot/reference_plan \
  --tools codeml,hyphy

Check reference tools:

babappa check-reference-tools --outdir real_empirical_pilot/reference_runs/WRKY_candidate_02_close/tool_check

Parse prepared outputs:

babappa parse-codeml-reference \
  --codeml-dir real_empirical_pilot/reference_runs/WRKY_candidate_02_close/codeml \
  --outdir real_empirical_pilot/reference_runs/WRKY_candidate_02_close/codeml_parsed

babappa parse-hyphy-reference \
  --hyphy-dir real_empirical_pilot/reference_runs/WRKY_candidate_02_close/hyphy \
  --outdir real_empirical_pilot/reference_runs/WRKY_candidate_02_close/hyphy_parsed

Build reference results:

babappa build-reference-results-table \
  --panel-id WRKY_candidate_02_close \
  --codeml-parsed real_empirical_pilot/reference_runs/WRKY_candidate_02_close/codeml_parsed \
  --hyphy-parsed real_empirical_pilot/reference_runs/WRKY_candidate_02_close/hyphy_parsed \
  --outdir real_empirical_pilot/reference_results/WRKY_candidate_02_close

Compare:

babappa compare-empirical-reference-results \
  --babappa-panel-run real_empirical_pilot/babappa_run_wrky_close_raw_alignmentaware \
  --reference-results real_empirical_pilot/reference_results/WRKY_candidate_02_close/reference_results.tsv \
  --outdir real_empirical_pilot/comparison/WRKY_candidate_02_close
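When reading the codeml side of such a comparison, a likelihood-ratio test (LRT) of 0.0 with p-value 1.0, as reported for the WRKY pilot, arises when the alternative model fits no better than the null. The sketch below assumes a chi-square reference distribution with one degree of freedom, a common conservative choice for branch-site tests; whether BABAPPA's parser uses this or a mixture distribution is an assumption, and the log-likelihoods are hypothetical.

```python
import math


def lrt_statistic(lnl_alt, lnl_null):
    """Twice the log-likelihood difference, floored at zero to absorb
    small negative values caused by optimizer noise."""
    return max(0.0, 2.0 * (lnl_alt - lnl_null))


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.
    For X = Z**2 with Z standard normal, P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods: the alternative model fits no better than the null.
stat = lrt_statistic(lnl_alt=-1234.5, lnl_null=-1234.5)
p_value = chi2_sf_df1(stat)
print(stat, p_value)
```

The familiar 5% critical value of about 3.84 for one degree of freedom falls out of the same function, which gives a quick sanity check when reading parsed reference tables.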

Input Requirements

Empirical inputs should include:

  • CDS FASTA with codon-valid sequences;
  • tree file with tips matching FASTA IDs;
  • foreground taxon or branch label;
  • optional metadata describing expected category and reference status;
  • close enough taxa for the current training envelope;
  • at least 6 taxa preferred;
  • at least 100 codons preferred.

Input checks include:

  • duplicate sequence IDs;
  • CDS length divisibility by 3;
  • internal stop codons;
  • ambiguous base fraction;
  • gap fraction;
  • pairwise p-distance;
  • saturation proxy;
  • foreground validity;
  • tree-tip compatibility.
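Several of the sequence-level checks above can be sketched in a few lines. This is an illustration of what such checks typically compute, not the implementation behind babappa validate-empirical-input: it assumes the standard genetic code for stop codons, treats anything other than A, C, G, T as ambiguous, and uses hypothetical function and key names.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def check_cds(records):
    """records: list of (seq_id, sequence) pairs. Returns duplicate IDs and
    per-sequence findings; a duplicated ID keeps its last record's findings."""
    counts = Counter(seq_id for seq_id, _ in records)
    report = {"duplicate_ids": sorted(i for i, n in counts.items() if n > 1),
              "per_sequence": {}}
    for seq_id, seq in records:
        seq = seq.upper()
        ungapped = seq.replace("-", "")
        usable = len(ungapped) - len(ungapped) % 3  # ignore a trailing partial codon
        codons = [ungapped[i:i + 3] for i in range(0, usable, 3)]
        # A stop codon is "internal" when it is not the final codon.
        internal_stops = [i + 1 for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
        ambiguous = sum(ch not in "ACGT" for ch in ungapped)
        report["per_sequence"][seq_id] = {
            "length_divisible_by_3": len(ungapped) % 3 == 0,
            "internal_stop_codons": internal_stops,  # 1-based codon positions
            "ambiguous_fraction": ambiguous / len(ungapped) if ungapped else 0.0,
            "gap_fraction": seq.count("-") / len(seq) if seq else 0.0,
        }
    return report


# Toy records for illustration only.
records = [("t1", "ATGAAATAA"), ("t2", "ATGTAGAAA"), ("t3", "ATGNN-"), ("t1", "ATGAAATAA")]
report = check_cds(records)
print(report["duplicate_ids"], report["per_sequence"]["t2"])
```

The p-distance, saturation, foreground, and tree-tip checks need the alignment and tree as well, so they are left out of this per-sequence sketch.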

Do not provide simulator truth or oracle labels during empirical inference. Forbidden empirical input columns include:

  • branch_site_truth
  • selected_sites
  • truth
  • branch_truth
  • oracle
  • y_branch_site
  • y_site
  • gene_label
  • positive_label
  • simulated_label
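A feature table can be screened for these column names before it reaches any scoring step. The sketch below is a minimal standalone check, not the logic of babappa audit-empirical-features: it assumes a tab-separated file with a header row and exact, case-insensitive name matching, and the function name is hypothetical.

```python
import csv
import io

# Column names copied from the list above.
FORBIDDEN_COLUMNS = {
    "branch_site_truth", "selected_sites", "truth", "branch_truth", "oracle",
    "y_branch_site", "y_site", "gene_label", "positive_label", "simulated_label",
}


def forbidden_columns_in_tsv(handle):
    """Return header columns whose names match the forbidden list
    (case-insensitive, exact match after stripping whitespace)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    return sorted(col for col in header if col.strip().lower() in FORBIDDEN_COLUMNS)


# Toy header for illustration only.
example = io.StringIO("family_id\tsite\ty_site\tgap_fraction\tOracle\n")
found = forbidden_columns_in_tsv(example)
print(found)
```

Exact matching will miss renamed or prefixed truth columns, so a real audit may need stricter rules than this sketch applies.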

Aligners

For the main command, BABAPPA does not run aligners. The supplied codon MSA is the authoritative input:

babappa predict-branch-sites --msa aligned.codon.fasta --tree treefile --foreground all --outdir prediction

Optional diagnostic alignment/sensitivity workflows can use:

  • identity
  • mafft
  • babappalign
  • muscle

Diagnostic-only aligners:

  • PRANK
  • T-Coffee

Alignment ensemble robustness matters only when the user wants to test sensitivity to homology uncertainty. It is not required for the core user-supplied-MSA prediction workflow.

Output Interpretation

Common terms:

  • diagnostic-positive: BABAPPA scored support above its current diagnostic threshold. This is not a discovery claim.
  • diagnostic_only: output may be useful for stress testing or triage but should not be interpreted as positive selection.
  • in_domain: empirical input appears compatible with the training envelope.
  • borderline: empirical input has warnings and should be interpreted cautiously.
  • out_of_domain: empirical input falls outside the current training envelope; abstain from biological interpretation.
  • BABAPPA_only: BABAPPA is positive but reference tools are negative or pending; treat as inconclusive until calibration and controls.
  • concordant_positive: BABAPPA and at least one reference workflow support compatible evidence, subject to calibration and controls.
  • reference_only: reference tool positive but BABAPPA not supportive; inspect alignment, OOD, and model limitations.
  • calibration_pending: matched-null empirical calibration has not completed; do not report calibrated empirical significance.
  • feature_matched_calibration_complete: feature-level matched null scoring has completed; interpret as diagnostic calibration support, not as a full raw sequence simulation/alignment replay.
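The three concordance terms above follow a simple decision rule, sketched below. The function name is hypothetical, a pending reference result is represented as None, and the no_support label for the case where nothing is positive is an invented placeholder that does not appear in BABAPPA's vocabulary.

```python
def concordance_label(babappa_positive, reference_calls):
    """reference_calls maps a tool name to True (positive), False (negative),
    or None (pending). Returns one of the concordance terms defined above."""
    any_reference_positive = any(call is True for call in reference_calls.values())
    if babappa_positive and any_reference_positive:
        return "concordant_positive"
    if babappa_positive:
        return "BABAPPA_only"  # references negative or pending
    if any_reference_positive:
        return "reference_only"
    return "no_support"  # hypothetical placeholder label, not a BABAPPA term


# WRKY_candidate_02_close as reported earlier: BABAPPA diagnostic-positive,
# codeml and aBSREL negative, MEME minimum p-value 0.0641705 (negative at 0.05).
wrky_label = concordance_label(
    True,
    {"codeml": False, "hyphy_absrel": False, "hyphy_meme": 0.0641705 < 0.05},
)
print(wrky_label)
```

Every label remains subject to the calibration and control caveats stated in the definitions; the rule only names the pattern of agreement.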

Responsible reporting language:

  • use "diagnostic support" or "guarded empirical score";
  • report applicability/OOD status;
  • report aligner/method-policy status;
  • report codeml/HyPhy comparison;
  • state whether simulation-matched calibration is pending or complete;
  • avoid "BABAPPA discovered positive selection" unless future calibration/reference/control criteria are met.

Reproducibility

Important retained artifacts:

  • deployable package: deployable_model_conservative_branch_site_100k_mps
  • final 100K validation report: explicit_branch_truth_100k_mps_final_validation_report.md/json/tsv
  • cross-tier summary: explicit_branch_truth_100k_mps_cross_tier_summary/
  • truth audit: branch_truth_status_audit_explicit_branch_truth_100k_mps/
  • WRKY evidence pack: real_empirical_pilot/evidence_packs/WRKY_candidate_02_close/
  • Git readiness report: GIT_PUSH_READINESS_REPORT.md

Existing Zenodo-ready archive:

BABAPPA_0.5.0-alpha_release_zenodo_20260530.tar.xz

Checksum:

cc259617f19d9634fd6e11906903910498ab78d3797a10df1bb24b7db014dc30
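The checksum is 64 hexadecimal characters, which is consistent with SHA-256; the algorithm is not stated in this document, so treat that as an assumption. Under that assumption, a downloaded archive can be verified with the standard library as sketched below. The helper name is hypothetical, and the demonstration hashes a temporary file, not the real archive.

```python
import hashlib
import os
import tempfile

# Digest copied from above; SHA-256 is assumed from its 64-hex-character length.
EXPECTED_SHA256 = "cc259617f19d9634fd6e11906903910498ab78d3797a10df1bb24b7db014dc30"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-check on a temporary file, since the real archive is not assumed to be present.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"babappa")
    demo_path = tmp.name
demo_ok = sha256_of_file(demo_path) == hashlib.sha256(b"babappa").hexdigest()
os.remove(demo_path)
print(demo_ok)
```

For the real archive, compare sha256_of_file("BABAPPA_0.5.0-alpha_release_zenodo_20260530.tar.xz") against EXPECTED_SHA256 and treat any mismatch as a corrupted or different file.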

Validate package:

babappa validate-deployable-model-package --package-dir deployable_model_conservative_branch_site_100k_mps

Validate WRKY evidence pack:

babappa validate-empirical-evidence-pack --evidence-pack real_empirical_pilot/evidence_packs/WRKY_candidate_02_close

Run tests:

python -m pytest -q

Storage Cleanup And User Maintenance

BABAPPA simulations can generate very large reproducible outputs. Audit before deleting anything:

babappa audit-storage --root . --outdir storage_cleanup_audit --target-size-gb 10

Outputs include:

  • storage_inventory.tsv
  • storage_inventory.json
  • storage_summary.md
  • keep_list.tsv
  • remove_candidates.tsv
  • archive_candidates.tsv
  • cleanup_dry_run.md
  • du_top_100.txt
  • quarantine_large_reproducible_outputs.sh
  • delete_quarantine_after_review.sh
  • archive_key_reports.sh
  • validate_after_cleanup.sh

Move candidates to quarantine only:

bash storage_cleanup_audit/quarantine_large_reproducible_outputs.sh

Validate after cleanup:

bash storage_cleanup_audit/validate_after_cleanup.sh

Do not run the permanent delete script until the quarantine has been manually reviewed. The delete script requires CONFIRM_DELETE=YES.

Recent storage note: the large system storage issue was caused by a generated BABAPPAlign embeddings cache at $HOME/.cache/babappalign/embeddings, not by the BABAPPA Git checkout. The required model file $HOME/.cache/babappalign/models/babappascore.pt should be preserved.

Troubleshooting

Missing aligners

Run:

babappa check-aligners

If BABAPPAlign reports a missing model, install babappascore.pt into $HOME/.cache/babappalign/models/.

MPS/CUDA/CPU device problems

Run:

babappa check-neural-env

Use --device cpu if MPS/CUDA fails or if a tensor operation is unsupported.

Very high p-distance or OOD input

Use closer taxa. For plant WRKY pilots, start with close Brassicaceae panels rather than broad monocot/dicot/legume mixtures.

codeml/HyPhy disagreement

Treat disagreement conservatively. BABAPPA-only positive signals require matched-null calibration, controls, and biological review.

Pruned intermediates

Some raw 100K intermediates were intentionally pruned after validation. Use retained summaries, audits, stage markers, model artifacts, checksums, and cleanup manifests for reproducibility.

Package validation failure

Check that model_manifest.json, schemas, checksums, tier models, tier calibrations, and validation summary are present.

Git cleanup confusion

Generated heavy outputs should not be committed. Use:

git status --short
git diff --stat
git diff --cached --stat

Citation And Manuscript Status

BABAPPA is currently described by a research-alpha software/methods manuscript in:

Manuscript/BABAPPA_method_paper_auxiliary_saturation.tex

No final publication DOI is available yet. Use the repository and release archive metadata until a formal citation is assigned.

Citation placeholder:

Sinha K. BABAPPA: a research-alpha, simulation-trained framework for guarded branch-site positive-selection support under alignment uncertainty. Manuscript in preparation.

PyPI Release Workflow

The package metadata lives in pyproject.toml, and the console entry point is:

babappa = "babappa.cli:main"

Build locally:

python -m pip install -e ".[dev]"
python -m build
python -m twine check dist/*

Upload to TestPyPI first:

python -m twine upload --repository testpypi dist/*

Then test installation in a fresh environment. Upload to PyPI only after the TestPyPI package installs and babappa --version plus babappa --help work.

Developer Notes

Check version:

babappa --version

Run tests:

python -m pytest -q

Inspect Git state:

git status --short
git diff --stat
git diff --cached --stat

Do not commit:

  • raw 10K/100K simulations;
  • raw alignments;
  • tensor shards;
  • branch-site datasets;
  • prediction tables from heavy runs;
  • logs;
  • temporary work directories;
  • generated BABAPPAlign embeddings caches;
  • raw empirical downloads;
  • BLAST databases or downloaded genomes/proteomes.

Commit and archive:

  • source code;
  • tests;
  • docs;
  • examples;
  • manuscript source/PDF;
  • deployable package metadata and selected lightweight model artifacts;
  • final validation reports;
  • evidence-pack manifests and summaries;
  • checksums;
  • cleanup manifests.

Scientific Bottom Line

BABAPPA is ready for guarded research-alpha software/methods communication and reproducible evaluation. It is not ready for unsupported empirical positive-selection discovery claims. The next empirical step is to add close-taxa negative controls and interpret BABAPPA outputs jointly with codeml/HyPhy references, biological controls, and any future full raw sequence matched-null calibration.
