scBOA: scRNA-seq Bayesian Optimization and Analysis pipeline

These details have not been verified by PyPI

Project links

Project description

In scBOA/README.md

scBOA: scRNA-seq Bayesian Optimization and Analysis

scBOA is an integrated, two-stage computational pipeline for single-cell RNA sequencing (scRNA-seq) analysis. It automates the discovery of optimal processing parameters using Bayesian Optimization (Stage 1) and then applies these parameters to a comprehensive downstream analysis workflow (Stage 2). The pipeline also features an optional multi-level refinement process (Stage 3/4) to iteratively re-analyze and improve annotations for low-confidence cell clusters.

Key Features

Automated Parameter Tuning: Uses Bayesian Optimization to find the best parameters (n_highly_variable_genes, n_pcs, n_neighbors, resolution) for clustering and cell type annotation.
Multi-Metric Objective Function: Optimizes for a balanced score that considers annotation accuracy (CAS), marker gene specificity (MCS), and cluster separation (Silhouette score).
Single & Multi-Sample Modes: Natively supports analysis of a single dataset or the integration of two datasets (e.g., control vs. treated) using Harmony.
Iterative Refinement: Automatically identifies low-confidence cell clusters and re-runs the entire optimization and analysis pipeline on them to improve annotation granularity and accuracy.
Comprehensive Outputs: Generates publication-quality plots, detailed metric reports, annotated data objects (.h5ad), and summary tables for easy interpretation.

Step-by-Step Workflow

1. Prerequisites

Python 3.9 or newer.
Access to a Linux or macOS command line.

2. Installation

It is highly recommended to install scboa in a dedicated virtual environment to avoid conflicts with other Python projects.

# Create a virtual environment named 'scboa-env'
python3 -m venv scboa-env

# Activate the environment
source scboa-env/bin/activate

# Now, install scBOA from PyPI
pip install scboa

# To deactivate the environment later, simply run: deactivate

3. Prepare Your Data

scRNA-seq Data: Ensure your Cell Ranger output (the folder containing barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz) is accessible.
CellTypist Model: Download a pre-trained CellTypist model (.pkl file). You can find available models on the official CellTypist models website.

4. Run the Pipeline

Here is an example command for a single-sample analysis:

scboa \
  --data_dir /path/to/your/cellranger_output/ \
  --output_dir ./my_analysis_output/ \
  --model_path /path/to/your/celltypist_model.pkl \
# --- Bayesian Optimization Parameters ---
  --n_calls 50 \
  --target all \
  --model_type biological \
  --seed 42 \
# --- Analysis & Clustering Parameters ---
  --hvg_min_mean 0.0125 \
  --hvg_max_mean 3.0 \
  --hvg_min_disp 0.3 \
  --cas_aggregation_method leiden

For Developers (Contributing)

This clones only the latest commit, making it much faster and smaller

git clone --depth 1 https://github.com/QiangSu/scBOA.git cd scBOA

Install in editable mode, which also installs development dependencies

pip install -e .

Command-Line Arguments Explained

`Stage 1 & 2: Main I/O and Mode`

Argument	Description	Explanation/Usage
`--data_dir <path>`	Path to 10x Genomics data.	(Single-Sample Mode) Provide the path to the directory containing `matrix.mtx.gz`, etc.
`--multi_sample <path1> <path2>`	Two paths for WT and Treated 10x data.	(Multi-Sample Mode) Provide two paths, first for control/WT, second for treated/perturbed. This mode enables Harmony integration.
`--output_dir <path>`	Path for all output files.	The main directory where all results, plots, and logs will be saved. Subdirectories for each stage will be created here.
`--model_path <path>`	Path to CellTypist model (`.pkl`).	Required. The pre-trained model used for cell type annotation.
`--output_prefix <str>`	Base prefix for Stage 1 output files.	Default: `bayesian_opt`. Used for naming optimization reports and plots.

`Stage 1: Optimization Parameters`

Argument	Description	Explanation/Usage
`--seed <int>`	Global random seed for reproducibility.	Default: `42`. Ensures that results are identical if run with the same data and parameters.
`--n_calls <int>`	Number of trials for EACH optimization strategy.	Default: `50`. The script runs three strategies (Explore, Exploit, BO-EI), so `50` means a total of 150 optimization steps.
`--model_type <choice>`	Optimization objective function type.	`biological`: Balances annotation agreement (CAS) and marker specificity (MCS). `structural` (default): Adds cluster separation (Silhouette Score) to the biological metrics for more robust clusters. `silhouette`: Optimizes solely for the best Silhouette Score.
`--marker_gene_model <choice>`	Genes to use for MCS calculation.	`all`: All genes are considered. `non-mitochondrial` (default): Excludes mitochondrial genes, which often act as non-specific markers of cell stress.
`--target <choice>`	Optimization target metric.	`all` (default): Runs a single, balanced optimization (equivalent to `--model_type`). `weighted_cas`, `simple_cas`, `mcs`: Runs optimization targeting only that specific metric.
`--cas_aggregation_method <choice>`	Method for calculating Simple CAS.	`leiden` (default): Averages the purity score of each raw Leiden cluster. Best for assessing technical cluster quality. `consensus`: Merges clusters with the same final cell type label before averaging purity. Best for assessing biological group quality.

`Stage 1 & 2: HVG Selection Method`

Argument	Description	Explanation/Usage
`--hvg_min_mean <float>`	Min mean for two-step HVG selection.	If set, activates a pre-filtering step on genes based on expression and dispersion before selecting the top `n_hvg`.
`--hvg_max_mean <float>`	Max mean for two-step HVG selection.	See above.
`--hvg_min_disp <float>`	Min dispersion for two-step HVG selection.	See above.

`Stage 1 & 2: QC & Filtering Parameters`

Argument	Description	Explanation/Usage
`--min_genes <int>`	Min genes per cell.	Default: `200`. Filters out low-quality cells/empty droplets.
`--max_genes <int>`	Max genes per cell.	Default: `7000`. Filters out potential doublets.
`--max_pct_mt <float>`	Max mitochondrial percentage.	Default: `10.0`. Filters out stressed or dying cells.
`--min_cells <int>`	Min cells per gene.	Default: `3`. Filters out genes with negligible expression.

`Stage 2 & Optional Refinement: Final Run Parameters`

Argument	Description	Explanation/Usage
`--final_run_prefix <str>`	Prefix for Stage 2 output files.	Default: `sc_analysis_repro`.
`--fig_dpi <int>`	Resolution (DPI) for saved figures.	Default: `500`.
`--n_pcs_compute <int>`	Number of principal components to compute.	Default: `105`. A higher number allows for a wider search space for the optimal `n_pcs`.
`--n_top_genes <int>`	Number of top marker genes to show.	Default: `5`. Affects dot plots, heatmaps, and marker gene tables.
`--cellmarker_db <path>`	Path to a cell marker database (.csv).	(Optional) If provided, performs a "manual-style" annotation based on cluster marker genes and calculates a Marker Capture Score.
`--n_degs_for_capture <int>`	DEGs per cluster for Marker Capture Score.	Default: `50`. Number of top differentially expressed genes used to match against the marker DB.
`--cas_refine_threshold <float>`	CAS threshold to trigger refinement.	(Optional) If a cluster's CAS score is below this value (e.g., `90`), its cells are pooled for a new round of optimization and analysis.
`--refinement_depth <int>`	Maximum number of refinement iterations.	Default: `1`. If refinement is triggered, this controls how many times the process can repeat on the subsequently failing cells.

Output Directory Structure

The script generates a structured output directory. Below is an example structure and an explanation of key files.

<output_dir>/
├── stage_1_bayesian_optimization/
│   ├── bayesian_opt_structural_balanced_FINAL_annotated.h5ad
│   ├── bayesian_opt_structural_balanced_FINAL_best_params.txt
│   ├── bayesian_opt_structural_balanced_yield_scores_report.csv
│   ├── bayesian_opt_structural_balanced_optimizer_convergence.png
│   ├── bayesian_opt_structural_balanced_BO-EI_opt_result.skopt
│   ├── ... (other plots and strategy files) ...
│   └── refinement_depth_1/
│       ├── ... (mirrors the structure above, but for the refined subset of cells) ...
│
└── stage_2_final_analysis/
    ├── sc_analysis_repro_final_processed.h5ad
    ├── sc_analysis_repro_final_processed_with_refinement.h5ad
    ├── sc_analysis_repro_all_annotations.csv
    ├── sc_analysis_repro_all_annotations_with_refinement.csv
    ├── sc_analysis_repro_leiden_cluster_annotation_scores.csv
    ├── sc_analysis_repro_consensus_group_annotation_scores.csv
    ├── sc_analysis_repro_combined_cluster_annotation_scores.csv
    ├── sc_analysis_repro_cell_type_journey_summary.csv
    ├── sc_analysis_repro_umap_leiden.png
    ├── sc_analysis_repro_cluster_celltypist_umap.png
    ├── sc_analysis_repro_umap_low_confidence_greyed.png
    ├── ... (many other plots and result files) ...
    └── refinement_depth_1/
        ├── sc_analysis_repro_refinement_depth_1_final_processed.h5ad
        ├── sc_analysis_repro_refinement_depth_1_umap_cumulative_result.png
        └── ... (mirrors Stage 2 structure for the refined subset) ...

Key File Explanations

Stage 1: `stage_1_bayesian_optimization/`

*_FINAL_best_params.txt: A summary of the optimal parameters found and the final performance metrics. This is the most important summary file.
*_FINAL_annotated.h5ad: The AnnData object processed with the best parameters, containing all final annotations from the single best run.
*_yield_scores_report.csv: A detailed log of every trial from every optimization strategy, including parameters tested and all resulting scores (CAS, MCS, Silhouette).
*_optimizer_convergence.png: A plot showing how the best score improved over time for each strategy.
*_opt_result.skopt: A saved state of the optimization process, which can be reloaded.

Stage 2: `stage_2_final_analysis/`

*_final_processed.h5ad: The final, fully annotated AnnData object from the initial Stage 2 run. Contains UMAP coordinates, clustering, and all annotations.
*_final_processed_with_refinement.h5ad: (If refinement runs) The master AnnData object with the final, combined annotations after all refinement levels are complete.
*_all_annotations_with_refinement.csv: A cell-by-cell table of all annotations, including the final combined_annotation column after refinement.
*_cluster_annotation_scores.csv: Tables detailing the Cell Annotation Score (CAS) for each Leiden cluster and each consensus cell type group.
*_combined_cluster_annotation_scores.csv: A concatenation of all CAS reports from the initial run and all refinement levels.
*_cell_type_journey_summary.csv: A wide-format table showing how the cell count and CAS score for each cell type change across refinement stages.
*_umap_low_confidence_greyed.png: A UMAP plot from the initial run where cells belonging to clusters that failed the CAS threshold are colored grey.
refinement_depth_1/*_umap_cumulative_result.png: A UMAP plot showing the state of the data after a refinement level, with newly-annotated cells colored and any still-failing cells shown in grey.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.10

Apr 25, 2026

0.1.9

Apr 25, 2026

0.1.8

Jan 2, 2026

This version

0.1.7

Dec 18, 2025

0.1.6

Nov 24, 2025

0.1.5

Nov 23, 2025

0.1.4

Nov 21, 2025

0.1.3

Nov 21, 2025

0.1.2

Nov 21, 2025

0.1.1

Nov 20, 2025

0.1.0

Nov 20, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scboa-0.1.7.tar.gz (40.2 kB view details)

Uploaded Dec 18, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

scboa-0.1.7-py3-none-any.whl (36.6 kB view details)

Uploaded Dec 18, 2025 Python 3

File details

Details for the file scboa-0.1.7.tar.gz.

File metadata

Download URL: scboa-0.1.7.tar.gz
Upload date: Dec 18, 2025
Size: 40.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for scboa-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`f151943e551c93966f2ae50e4df636d57362dab6672d4f0053074f80533e34b9`
MD5	`1e73236bbfdb90d346cd945d5c779166`
BLAKE2b-256	`dd1b6d8608506c5aedb876d40a7c473620165ec1a1255848f393ac739a1b674f`

See more details on using hashes here.

File details

Details for the file scboa-0.1.7-py3-none-any.whl.

File metadata

Download URL: scboa-0.1.7-py3-none-any.whl
Upload date: Dec 18, 2025
Size: 36.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for scboa-0.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cc241a084ceb0e6d3d0c459ff82d4cef263bbcc837ceaadd75591d22b690de76`
MD5	`597074f2807e4b53c66dda43549de3a2`
BLAKE2b-256	`b1da69cb75e00146479cb866b6588778f3108e1487da36487633413ca43b9fc9`

See more details on using hashes here.

scboa 0.1.7

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

In scBOA/README.md

scBOA: scRNA-seq Bayesian Optimization and Analysis

Key Features

Step-by-Step Workflow

1. Prerequisites

2. Installation

3. Prepare Your Data

4. Run the Pipeline

For Developers (Contributing)

This clones only the latest commit, making it much faster and smaller

Install in editable mode, which also installs development dependencies

Command-Line Arguments Explained

Stage 1 & 2: Main I/O and Mode

Stage 1: Optimization Parameters

Stage 1 & 2: HVG Selection Method

Stage 1 & 2: QC & Filtering Parameters

Stage 2 & Optional Refinement: Final Run Parameters

Output Directory Structure

Key File Explanations

Stage 1: stage_1_bayesian_optimization/

Stage 2: stage_2_final_analysis/

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`Stage 1 & 2: Main I/O and Mode`

`Stage 1: Optimization Parameters`

`Stage 1 & 2: HVG Selection Method`

`Stage 1 & 2: QC & Filtering Parameters`

`Stage 2 & Optional Refinement: Final Run Parameters`

Stage 1: `stage_1_bayesian_optimization/`

Stage 2: `stage_2_final_analysis/`