Adaptive batch grouping for single-cell RNA-seq data
B2G: Batch-to-Group
B2G (Batch-to-Group) is an intelligent batch grouping tool for single-cell RNA-seq data that combines metacell/Leiden clustering with PERMANOVA-based prior selection.
Why B2G and What We Do
Motivation
Recent studies increasingly focus on dissecting cellular subpopulations from single-cell atlases of longitudinal clinical cohorts. However, residual batch effects within these subpopulations are difficult to eliminate, because cells in the same subpopulation occupy highly similar states and technical variation is hard to separate from biological variation. Existing methods typically treat each sample as its own batch during subpopulation-level batch correction, failing to recognize that patients with similar clinical metadata often form batch-effect-free groups. This leads to over-correction and loss of biologically meaningful signals.
Key Challenges:
- Residual Batch Effects: Global batch correction lacks resolution, leaving batch effects in specific cell types
- Over-correction Risk: High transcriptional similarity within subpopulations makes it difficult to distinguish technical noise from biological variation
- Manual Grouping Limitations: The complexity of possible grouping combinations makes manual optimization impractical
- Cell Type Specificity: Batch effects vary across different cell types, requiring distinct grouping strategies
Solution: B2G Framework
We propose Batch-to-Group (B2G), which infers batch-effect-free groups based on clinical metadata to enable precise batch correction. B2G addresses these challenges through:
- Adaptive Prior Selection: Automatically evaluates and selects informative biological priors using PERMANOVA
- Intelligent Grouping: Identifies patients with similar clinical features that exhibit minimal batch effects
- Dual Clustering Support: Provides both metacell and Leiden clustering for flexible analysis
- Group-level Correction: Performs batch correction at group level rather than individual sample level
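To make the idea of PERMANOVA-based prior scoring concrete, here is a minimal NumPy sketch. It is not B2G's implementation: the function names `pseudo_f` and `permanova`, the toy distance matrix, and the choice to score each prior independently are all assumptions made for illustration. It computes the standard PERMANOVA pseudo-F statistic for a candidate prior and a permutation p-value, so that priors can be ranked by how much of the between-batch distance they explain.

```python
import numpy as np


def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F for a square distance matrix and one label per row."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    a = len(groups)
    sq = dist ** 2
    # Total sum of squares: squared distances over all unordered pairs, divided by n.
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares, computed the same way inside each group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, labels, n_perm=999, seed=0):
    """Return (pseudo-F, permutation p-value) for one candidate prior."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    f_obs = pseudo_f(dist, labels)
    hits = sum(pseudo_f(dist, rng.permutation(labels)) >= f_obs for _ in range(n_perm))
    return f_obs, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 12 batches embedded in 2-D.
# 'disease' separates the two clouds, 'sex' is balanced across them.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 1, (6, 2)), rng.normal(5, 1, (6, 2))])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
priors = {
    "disease": np.array(["ctrl"] * 6 + ["case"] * 6),
    "sex": np.array(["F", "M"] * 6),
}
scores = {name: permanova(dist, lab) for name, lab in priors.items()}
for name, (f_stat, p_val) in scores.items():
    print(f"{name}: pseudo-F={f_stat:.2f}, p={p_val:.3f}")
```

In this toy setup the informative prior receives a much larger pseudo-F than the uninformative one. How B2G combines such scores with `min_priors`, `max_priors`, and `gap_threshold` is not shown here.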
Main Findings
Across multiple single-cell datasets, B2G demonstrates:
- Better Batch Effect Removal: More effective elimination of technical and batch effects
- Biological Variation Preservation: Maintains genuine biological signals while removing technical noise
- Automated Workflow: Eliminates the need for manual trial-and-error when choosing a grouping strategy
- Scalable Analysis: Efficiently handles large-scale longitudinal cohort data
Features
- Adaptive Prior Selection: Automatically selects informative biological priors using PERMANOVA
- Flexible Clustering: Supports both metacell and Leiden clustering methods
- Batch Effect Mitigation: Groups batches intelligently to minimize confounding effects
- Seamless Integration: Works directly with Scanpy/AnnData workflows
- Comprehensive Visualization: Generates dendrograms and evaluation plots
Installation
From PyPI (recommended)
pip install b2g_tools
From Source
git clone https://github.com/lyotvincent/b2g.git
cd b2g
pip install -e .
Environment Setup with Conda (alternative)
# Create environment with all dependencies
conda create -n b2g python=3.10 scanpy scikit-learn scikit-bio leidenalg -c conda-forge -c bioconda -y
conda activate b2g
# Install pip-only packages
pip install metacells dynamicTreeCut scib
pip install b2g-tools
User Guide
Basic Usage (No Prior Knowledge)
import scanpy as sc
import b2g
# Load your data (with raw counts)
adata = sc.read_h5ad('your_data.h5ad')
# Run B2G grouping with default settings
b2g.group(
adata,
batch_key='donor_id',
method='metacell',
target_metacell_size=48,
key_added='groups'
)
# The grouping result is now in adata.obs['groups']
print(adata.obs['groups'].value_counts())
# Continue with your standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=30)
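# Use the grouping for batch correction (requires the harmonypy package)
sc.external.pp.harmony_integrate(adata, key='groups')
# The corrected embedding is stored in adata.obsm['X_pca_harmony']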
Usage with Biological Priors
import scanpy as sc
import b2g
# Load data
adata = sc.read_h5ad('lung_dataset.h5ad')
# Filter to specific cell type (optional but recommended)
adata_subset = adata[adata.obs['cell_type'] == 'Endothelial cells'].copy()
# Run B2G with biological priors
b2g.group(
adata_subset,
batch_key='donor_id',
additional_features=[
{'column': 'disease', 'description': 'Disease status'},
{'column': 'sex', 'description': 'Biological sex'},
{'column': 'self_reported_ethnicity', 'description': 'Ethnicity'},
{'column': 'development_stage', 'description': 'Development stage'}
],
method='metacell',
target_metacell_size=48,
min_priors=1,
max_priors=3,
gap_threshold=0.1,
output_dir='b2g_results',
fig_path='b2g_results/figures',
met_path='b2g_results/metrics',
key_added='groups'
)
# Check grouping results
print(adata_subset.obs['groups'].value_counts())
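Beyond cell counts per group, it is often useful to see which donors ended up together. The following sketch is not part of B2G; `donors_per_group` is a hypothetical helper, and the small DataFrame stands in for `adata_subset.obs` so the example runs without any data file.

```python
import pandas as pd


def donors_per_group(obs, batch_key="donor_id", group_key="groups"):
    """Map each group label to the sorted list of batches (donors) assigned to it."""
    pairs = obs[[batch_key, group_key]].drop_duplicates()
    return pairs.groupby(group_key, observed=True)[batch_key].apply(sorted).to_dict()


# Stand-in for adata_subset.obs after b2g.group(...); values are made up.
obs = pd.DataFrame(
    {
        "donor_id": ["d1", "d1", "d2", "d3", "d3", "d2"],
        "groups": ["G1", "G1", "G1", "G2", "G2", "G1"],
    }
)
membership = donors_per_group(obs)
print(membership)
```

With a real object, pass `adata_subset.obs` and the `batch_key` and `key_added` values used in the `b2g.group()` call.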
Using Leiden Clustering Instead of Metacell
b2g.group(
adata,
batch_key='donor_id',
method='leiden',
leiden_resolution=1.0,
additional_features=[
{'column': 'disease', 'description': 'Disease status'},
{'column': 'sex', 'description': 'Biological sex'}
],
output_dir='b2g_leiden_results',
fig_path='b2g_leiden_results/figures',
met_path='b2g_leiden_results/metrics',
key_added='groups'
)
Parameters
Main Function: b2g.group()
| Parameter | Type | Default | Description |
|---|---|---|---|
| `adata` | AnnData | Required | AnnData object with raw counts |
| `batch_key` | str | `'batch'` | Column name in `.obs` for batches |
| `method` | str | `'metacell'` | Clustering method: `'metacell'` or `'leiden'` |
| `additional_features` | list | `None` | Biological priors (see format below) |
| `target_metacell_size` | int | `48` | Target metacell size (metacell method only) |
| `leiden_resolution` | float | `1.0` | Leiden resolution (leiden method only) |
| `min_priors` | int | `1` | Minimum priors to select |
| `max_priors` | int | `None` | Maximum priors to select |
| `gap_threshold` | float | `0.1` | Minimum score threshold for filtering |
| `output_dir` | str | `None` | Output directory path |
| `fig_path` | str | `None` | Figures directory path |
| `met_path` | str | `None` | Metrics directory path |
| `key_added` | str | `None` | Column name for grouping results |
Additional Features Format
additional_features = [
{'column': 'column_name_in_obs', 'description': 'Human readable description'},
# ... more features
]
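A typo in a column name is easy to make when listing priors. The sketch below shows one way to check the list before calling `b2g.group()`. The helper `check_additional_features` is hypothetical and not part of the B2G API; it only relies on the `'column'` key shown in the format above and makes no assumption about whether `'description'` is mandatory.

```python
def check_additional_features(additional_features, obs_columns):
    """Hypothetical pre-flight check: return a list of human-readable problems."""
    obs_columns = set(obs_columns)
    problems = []
    seen = set()
    for i, feature in enumerate(additional_features):
        if not isinstance(feature, dict) or "column" not in feature:
            problems.append(f"entry {i}: expected a dict with a 'column' key")
            continue
        column = feature["column"]
        if column not in obs_columns:
            problems.append(f"entry {i}: column {column!r} not found in .obs")
        if column in seen:
            problems.append(f"entry {i}: column {column!r} listed more than once")
        seen.add(column)
    return problems


# Example with a deliberately missing column ('smoker' is made up).
features = [
    {"column": "disease", "description": "Disease status"},
    {"column": "sex", "description": "Biological sex"},
    {"column": "smoker", "description": "Smoking status"},
]
problems = check_additional_features(features, ["donor_id", "disease", "sex"])
print(problems)
```

In practice the second argument would be `adata.obs.columns`.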
Input Requirements
- Data Format: AnnData object compatible with Scanpy
- Count Matrix: Raw UMI counts (integers) required for metacell method
  - Should be in `adata.X` or `adata.raw.X`
  - Not normalized, log-transformed, or scaled
- Batch Column: Categorical column in `adata.obs` identifying batches
- Prior Columns (optional): Categorical columns in `adata.obs` for biological priors
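Since the metacell method expects raw integer counts, a quick heuristic check can catch an accidentally normalized matrix. This is a sketch under stated assumptions: `looks_like_raw_counts` is a hypothetical helper, not a B2G function, and for non-NumPy inputs it assumes a scipy.sparse-style object whose stored values are exposed as `.data`.

```python
import numpy as np


def looks_like_raw_counts(X, max_check=100_000):
    """Heuristic: True if the inspected values are non-negative whole numbers."""
    if isinstance(X, np.ndarray):
        values = X
    else:
        # Assumption: sparse matrices expose their stored (non-zero) values as .data
        values = np.asarray(getattr(X, "data", X))
    values = np.ravel(values)[:max_check]
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


counts = np.array([[0, 3, 1], [2, 0, 7]])
logged = np.log1p(counts)
print(looks_like_raw_counts(counts), looks_like_raw_counts(logged))
```

Only the first `max_check` values are inspected, so a passing result is an indication rather than a guarantee.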
Output
In AnnData Object
- `adata.obs[key_added]`: Batch group assignments (e.g., 'G1', 'G2', 'G3')
- Additional columns created during processing (e.g., `metacell`, `prior_group`)
Output Files (if paths specified)
output_dir/
├── figures/
│   ├── dendrogram.png                  # Hierarchical clustering dendrogram
│   ├── prior_selection_results.png     # Prior evaluation visualization
│   └── alpha_optimization.png          # Alpha parameter optimization
└── metrics/
    ├── prior_selection_log.csv         # Detailed prior selection log
    ├── alpha_optimization.csv          # Alpha evaluation results
    └── clustering_results/
        ├── distance_matrix_square.npy
        ├── distance_matrix_condensed.npy
        └── linkage_matrix_Z.npy
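The saved arrays can be reloaded for inspection or for trying a different cut of the tree. The sketch below assumes the `.npy` files are plain NumPy arrays in SciPy's conventions (a condensed distance vector as produced by `pdist` and a linkage matrix as produced by `linkage`); this has not been verified against the B2G source. Both function names are hypothetical, and the labels produced by `recut` need not match the ones B2G writes to `adata.obs`.

```python
from pathlib import Path

import numpy as np
from scipy.cluster.hierarchy import fcluster
from scipy.spatial.distance import squareform


def load_clustering_results(metrics_dir):
    """Load the three arrays from <metrics_dir>/clustering_results/."""
    folder = Path(metrics_dir) / "clustering_results"
    return {
        "square": np.load(folder / "distance_matrix_square.npy"),
        "condensed": np.load(folder / "distance_matrix_condensed.npy"),
        "Z": np.load(folder / "linkage_matrix_Z.npy"),
    }


def recut(Z, n_groups):
    """Cut the tree into at most n_groups flat clusters (not B2G's dynamic tree cut)."""
    flat = fcluster(Z, t=n_groups, criterion="maxclust")
    return [f"G{k}" for k in flat]


def condensed_matches_square(results):
    """Sanity check: the two saved distance matrices describe the same distances."""
    return bool(np.allclose(squareform(results["condensed"]), results["square"]))
```

For example, `results = load_clustering_results('b2g_results/metrics')` followed by `recut(results['Z'], 3)` would give a three-group alternative to compare with the automatic grouping.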
Advanced Usage
Custom Configuration
from b2g import AnalysisConfig, group_batches
# Create custom configuration
config = AnalysisConfig()
config.clustering_method = 'metacell'
config.b2g_metacell_params['target_metacell_size'] = 64
config.b2g_min_priors = 2
config.b2g_max_priors = 4
config.leiden_params['resolution'] = 2.0
# Run with custom config
adata = group_batches(adata, config, key_added='custom_groups')
Integration with Batch Correction Methods
import scanpy as sc
import b2g
# Step 1: Load data and subset to specific cell type
adata = sc.read_h5ad('data.h5ad')
adata_subset = adata[adata.obs['cell_type'] == 'Endothelial cells'].copy()
# Step 2: Run B2G grouping
b2g.group(
adata_subset,
batch_key='donor_id',
method='metacell',
target_metacell_size=48,
additional_features=[
{'column': 'disease', 'description': 'Disease status'},
{'column': 'sex', 'description': 'Biological sex'}
],
key_added='groups'
)
# Step 3: Standard preprocessing
sc.pp.normalize_total(adata_subset, target_sum=1e4)
sc.pp.log1p(adata_subset)
sc.pp.highly_variable_genes(adata_subset, n_top_genes=2000)
sc.pp.pca(adata_subset, n_comps=50)
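# Step 4: Use B2G groups for batch correction (pick one of the methods below)
# With Harmony (corrected embedding is written to adata_subset.obsm['X_pca_harmony'])
sc.external.pp.harmony_integrate(adata_subset, key='groups')
# With Scanorama (adds obsm['X_scanorama'] to each AnnData in the list)
import scanorama
adata_list = [adata_subset[adata_subset.obs['groups'] == g].copy()
              for g in adata_subset.obs['groups'].unique()]
scanorama.integrate_scanpy(adata_list)
# With ComBat (lives in scanpy.pp, not scanpy.external; corrects adata_subset.X in place)
sc.pp.combat(adata_subset, key='groups')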
How It Works
1. Prior Selection: Evaluates biological priors using PERMANOVA to identify informative features
2. Clustering: Builds metacells or Leiden clusters, optionally grouped by selected priors
3. Feature Extraction: Creates batch-feature matrix from clustering results
4. Distance Calculation: Computes weighted distances between batches
5. Hierarchical Clustering: Groups batches based on their feature similarity
6. Dynamic Tree Cutting: Automatically determines optimal batch groups
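The last four steps can be sketched end to end with NumPy and SciPy. This is an illustration under simplifying assumptions, not B2G's code: the batch-feature matrix here is simply each batch's cluster composition, the distance is unweighted Euclidean, the linkage is average, and `fcluster` with a fixed number of groups stands in for dynamic tree cutting. The function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def batch_feature_matrix(batches, clusters):
    """Rows: batches; columns: clusters; values: fraction of a batch's cells per cluster."""
    batch_ids, b_idx = np.unique(batches, return_inverse=True)
    cluster_ids, c_idx = np.unique(clusters, return_inverse=True)
    counts = np.zeros((len(batch_ids), len(cluster_ids)))
    np.add.at(counts, (b_idx, c_idx), 1)  # count cells per (batch, cluster) pair
    return batch_ids, counts / counts.sum(axis=1, keepdims=True)


def group_batches_sketch(batches, clusters, n_groups):
    """Group batches whose cluster compositions are similar."""
    batch_ids, features = batch_feature_matrix(batches, clusters)
    condensed = pdist(features, metric="euclidean")  # unweighted stand-in
    Z = linkage(condensed, method="average")
    flat = fcluster(Z, t=n_groups, criterion="maxclust")  # stand-in for dynamic tree cut
    return {str(b): f"G{k}" for b, k in zip(batch_ids, flat)}


# Toy input: four donors with 10 cells each; d1/d2 share one profile, d3/d4 another.
batches = np.array(["d1"] * 10 + ["d2"] * 10 + ["d3"] * 10 + ["d4"] * 10)
clusters = np.array(
    ["c0"] * 8 + ["c1"] * 2      # d1
    + ["c0"] * 7 + ["c1"] * 3    # d2
    + ["c0"] * 2 + ["c1"] * 8    # d3
    + ["c0"] * 1 + ["c1"] * 9    # d4
)
groups = group_batches_sketch(batches, clusters, n_groups=2)
print(groups)
```

Unlike this sketch, dynamic tree cutting does not need the number of groups in advance, which is what lets B2G determine the grouping automatically.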
Benchmark Datasets
B2G has been validated on four large-scale single-cell or single-nucleus RNA-seq datasets and one imaging (Cell Painting) dataset:
1. Human Skin Dataset
- Scale: 155,402 cells × 32,983 genes from 24 samples
- Sample Types: Basal cell carcinoma in the face (BCCface), healthy human skin in the face (UV-exposed), and inguino-iliac skin in the body (UV-protected)
- Cell Types: T cells, myeloid cells, fibroblasts, pericytes, Neuronal_Schwann cells
- Clinical Metadata: Treatment status (group), tissue location, age
- Original Data: ArrayExpress E-MTAB-13085
- Processed Data: Spatial Skin Atlas
2. Breast Tissue Dataset
- Scale: 714,331 cells × 32,383 genes from 126 female donors
- Cell Types: Fibroblasts, basal cells, vascular cells, T cells, myeloid cells
- Clinical Metadata: Sample source, suspension dissociation time, sequencing platform, tissue location, BMI group, procedure group, age group, breast density, self-reported ethnicity, developmental stage
- Original Data: GEO GSE195665
- Processed Data: CZ CELLxGENE
3. Intestinal Tissue Dataset
- Scale: 155,232 cells × 30,172 genes
- Sample Source: Multiple endodermal organs of the respiratory and gastrointestinal tracts during human development
- Cell Types: Mesenchymal cells, epithelial cells, immune cells, neuronal cells, endothelial cells
- Clinical Metadata: Sequencing platform, annotated organ, annotated tissue, sex, tissue type, developmental stage, alignment software
- Original Data: ArrayExpress E-MTAB-10187
- Processed Data: CZ CELLxGENE
4. Lung Tissue Dataset
- Scale: 116,313 nuclei × 33,523 genes from 26 individuals
- Technology: Single-nucleus RNA sequencing (snRNA-seq)
- Sample Source: Autopsy lung tissues from 19 deceased COVID-19 patients and lung tissues from 7 pre-pandemic control individuals
- Cell Types: Epithelial cells, myeloid cells, fibroblasts, endothelial cells, T cells
- Clinical Metadata: Disease status, sex, self-reported ethnicity, developmental stage
- Original Data: GEO GSE171524
- Processed Data: CZ CELLxGENE
5. Cell Painting Dataset
- Scale: 259 plates (99,440 wells total) from 12 laboratories
- Technology: Open reading frame (ORF) overexpression dataset from the Joint Undertaking for Morphological Profiling (JUMP, cpg0016)
- Features: 1,446 selected features (from 7,638 pre-extracted features using CellProfiler for brightfield and fluorescence images)
- Plate Format: Typically 16 rows × 24 columns (384 wells per plate)
- Data Processing: Excluded Laboratory 12 and BR00123528A plate due to quality issues
- Original Data:
- Raw data: AWS Open Data Registry
- Metadata and annotations: JUMP-MOA and JUMP Datasets
- Processed Data: [To be added]
Requirements
Tested Environment:
- Python: 3.10.19
Package Versions:
scanpy: 1.11.5
anndata: 0.11.4
numpy: 2.2.6
pandas: 2.3.3
scikit-learn: 1.7.2
scikit-bio: 0.7.1.post1
metacells: 0.9.5
dynamicTreeCut: 0.1.1
leidenalg: 0.11.0
These versions have been tested and verified to work correctly with B2G.