Pseudobulk and metacell aggregation for single-cell RNA-seq (AnnData / scanpy)
scagg
Pseudobulk and metacell aggregation for single-cell RNA-seq
scagg provides three aggregation strategies for turning single-cell data into
sample-level or metacell-level objects, ready for bulk-style differential
expression (DESeq2, edgeR, dream) or other downstream analyses:
| Strategy | Function | When to use |
|---|---|---|
| Pseudobulk | make_pseudobulk | One aggregate per unique sample × cell-type combination |
| Metacells by size | make_metacells_by_size | ~N cells per metacell, maximises usage of all cells |
| Metacells by count | make_metacells_by_num | Exactly N metacells per group, balances replication |
Available for Python (AnnData/scanpy) and R (Seurat). Full cell → metacell traceability is built in.
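To make the idea of summed aggregation concrete, here is a minimal plain-Python sketch of pseudobulking: counts from all cells that share the same grouping values are added gene by gene. The `cells` list and the `pseudobulk_sum` helper are hypothetical and are not part of scagg, which operates on AnnData or Seurat objects.

```python
from collections import defaultdict

# Toy data (hypothetical): each cell carries its labels and raw counts for three genes.
cells = [
    {"cell_type": "ExN", "sample": "s1", "counts": [3, 0, 1]},
    {"cell_type": "ExN", "sample": "s1", "counts": [1, 2, 0]},
    {"cell_type": "InN", "sample": "s1", "counts": [0, 5, 2]},
    {"cell_type": "ExN", "sample": "s2", "counts": [4, 1, 1]},
]


def pseudobulk_sum(cells, group_vars):
    """Sum raw counts over all cells sharing the same values of group_vars."""
    totals = {}
    n_cells = defaultdict(int)
    for cell in cells:
        key = tuple(cell[var] for var in group_vars)
        if key not in totals:
            totals[key] = [0] * len(cell["counts"])
        # Element-wise addition of this cell's counts to the group's running total.
        totals[key] = [a + b for a, b in zip(totals[key], cell["counts"])]
        n_cells[key] += 1
    return totals, dict(n_cells)


totals, n_cells = pseudobulk_sum(cells, ["cell_type", "sample"])
for key, summed in totals.items():
    print(key, summed, "n_cells =", n_cells[key])
```

Each key in `totals` corresponds to one row of the aggregated object, and `n_cells` mirrors the per-aggregate cell count that the result metadata reports.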
Installation
Python
pip install git+https://github.com/ethanfenton/scagg.git
Requirements: Python ≥ 3.10, anndata, numpy, pandas, scipy (all present in any standard scanpy environment).
R
if (!requireNamespace("remotes")) install.packages("remotes")
remotes::install_github("ethanfenton/scagg", subdir = "R/scagg")
Requirements: Seurat, dplyr.
Quick start — Python
import scanpy as sc
import scagg
adata = sc.read_h5ad("my_data.h5ad")
# True pseudobulk: one obs per sample × cell_type
pb = scagg.make_pseudobulk(
    adata,
    group_vars = ["cell_type", "sample"],
    save_membership = "results/pseudobulk_membership.csv",
)
# Metacells of ~10 cells each
mc_size = scagg.make_metacells_by_size(
    adata,
    group_vars = ["cell_type", "sample"],
    cell_size = 10,
    save_membership = "results/metacell_size10_membership.csv",
)
# Exactly 5 metacells per cell_type × sample group
mc_num = scagg.make_metacells_by_num(
    adata,
    group_vars = ["cell_type", "sample"],
    n_metacells = 5,
    save_membership = "results/metacell_num5_membership.csv",
)
Access results
# Result is a standard AnnData
mc_size.obs # metacell metadata (n_cells, cell_type, sample, …)
mc_size.X # summed counts matrix (metacells × genes)
mc_size.uns["metacell_membership"] # DataFrame: cell_barcode → metacell_id
CLI
# Pseudobulk
scagg pseudobulk \
  --input data.h5ad --output results/pb.h5ad \
  --group-vars cell_type sample \
  --save-membership results/pb_membership.csv
# Metacells by size
scagg metacells-by-size \
  --input data.h5ad --output results/mc.h5ad \
  --group-vars cell_type sample \
  --cell-size 10 \
  --save-membership results/mc_membership.csv
# Metacells by count
scagg metacells-by-num \
  --input data.h5ad --output results/mc.h5ad \
  --group-vars cell_type sample \
  --n-metacells 5
Quick start — R (Seurat)
library(scagg)
# True pseudobulk
res <- make_pseudobulk(
  so_obj = seurat_obj,
  group_vars = c("cell_type", "sample"),
  assays = c("SCT", "RNA"),
  save_membership = "results/pseudobulk_membership.csv"
)
pb_so <- res$obj # aggregated Seurat object
membership <- res$membership # data.frame: barcode → metacell_id
# Metacells of ~10 cells
res <- make_metacells_by_size(
  so_obj = seurat_obj,
  group_vars = c("cell_type", "sample"),
  cell_size = 10,
  save_membership = "results/mc_size10_membership.csv"
)
# Exactly 5 metacells per group
res <- make_metacells_by_num(
  so_obj = seurat_obj,
  group_vars = c("cell_type", "sample"),
  n_metacells = 5
)
Carry custom metadata
res <- make_metacells_by_size(
  so_obj = seurat_obj,
  group_vars = c("cell_type", "sample"),
  cell_size = 10,
  meta_vars = c("Treatment", "Timepoint", "sex", "Prepper", "batch")
)
API reference
Python
scagg.make_pseudobulk(adata, group_vars, *, meta_vars, layer, save_membership)
scagg.make_metacells_by_size(adata, group_vars, cell_size, *, cell_min, meta_vars,
                             layer, seed, save_membership)
scagg.make_metacells_by_num(adata, group_vars, n_metacells, *, min_cells, meta_vars,
                            layer, seed, save_membership)
| Parameter | Default | Notes |
|---|---|---|
| group_vars | required | One or more obs column names |
| cell_size | required | Target cells per metacell |
| cell_min | 70 % of cell_size | Min cells to include a group |
| n_metacells | required | Metacells per group |
| min_cells | n_metacells | Min cells to include a group |
| meta_vars | all obs columns | Which metadata to carry forward |
| layer | None (use X) | AnnData layer to aggregate |
| seed | 42 | RNG seed for reproducibility |
| save_membership | None | CSV path for traceability output |
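The two threshold defaults in the table can be illustrated with a small sketch. The helper names below are hypothetical, and the rounding rule (70 % rounded up to a whole cell) is an assumption: the table only states "70 % of cell_size" and does not say how fractional values are handled.

```python
def default_cell_min(cell_size):
    """70 % of cell_size, rounded up (rounding direction is an assumption)."""
    # Integer arithmetic avoids floating-point surprises: ceil(7 * n / 10).
    return (7 * cell_size + 9) // 10


def default_min_cells(n_metacells):
    """By default a group needs at least as many cells as requested metacells."""
    return n_metacells


def group_is_kept(n_cells_in_group, threshold):
    """Groups below the threshold are excluded; their cells end up unassigned."""
    return n_cells_in_group >= threshold


print(default_cell_min(10))    # threshold for cell_size = 10
print(group_is_kept(6, default_cell_min(10)))
print(group_is_kept(5, default_min_cells(5)))
```

Under these assumptions, a group of 6 cells would be skipped at cell_size = 10, while a group of 5 cells would just qualify for n_metacells = 5.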
R
make_pseudobulk(so_obj, group_vars, meta_vars, assays, save_membership,
                idents_col = NULL)
make_metacells_by_size(so_obj, group_vars, cell_size, cell_min, meta_vars,
                       assays, seed, save_membership, idents_col = NULL)
make_metacells_by_num(so_obj, group_vars, n_metacells, min_cells, meta_vars,
                      assays, seed, save_membership, idents_col = NULL)
idents_col: name of any column in the result's metadata to set as the active
Seurat Idents. NULL (default) leaves Idents unset. The column can be a
pre-existing metadata column (e.g. "cell_type") or one you derive after
aggregation (an example appears under "Migrating from your existing code").
All R functions return a named list:
- $obj: aggregated Seurat object
- $membership: data.frame mapping every input barcode to its metacell ID
Traceability output
Every function can optionally write a CSV (to the path given in save_membership) with one row per input cell:
cell_barcode, metacell_id, cell_type, sample, Treatment, ...
AAACCCAGTCCGAACC-1, mc3_A_s1, ExN, s1, KO, ...
AAACCCATCAGCTTGC-1, mc3_A_s1, ExN, s1, KO, ...
AAACGAAGTCCTGTAG-1, mc1_B_s2, InN, s2, WT, ...
AAAGGATAGCTCCATG-1, unassigned, ExN, s3, KO, ... ← below cell_min threshold
The metacell_id column uses the format mc{N}_{group_vars_joined}.
Cells below the minimum threshold are marked unassigned.
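As a rough illustration of how the membership file might be used, the sketch below reads a CSV of this shape with the standard library and counts cells per metacell. The inline text is a stand-in for a real file: it reuses the example rows above, omits the elided trailing columns, and drops the spaces after the commas.

```python
import csv
import io
from collections import Counter

# Stand-in for a membership CSV written via save_membership (contents assumed).
membership_csv = """cell_barcode,metacell_id,cell_type,sample,Treatment
AAACCCAGTCCGAACC-1,mc3_A_s1,ExN,s1,KO
AAACCCATCAGCTTGC-1,mc3_A_s1,ExN,s1,KO
AAACGAAGTCCTGTAG-1,mc1_B_s2,InN,s2,WT
AAAGGATAGCTCCATG-1,unassigned,ExN,s3,KO
"""

rows = list(csv.DictReader(io.StringIO(membership_csv)))

# Number of input cells that went into each metacell.
cells_per_metacell = Counter(
    row["metacell_id"] for row in rows if row["metacell_id"] != "unassigned"
)

# Barcodes that fell below the minimum threshold and were left out.
unassigned = [row["cell_barcode"] for row in rows if row["metacell_id"] == "unassigned"]

print(dict(cells_per_metacell))
print(unassigned)
```

For a real file, replace `io.StringIO(membership_csv)` with an opened file handle for the path passed to save_membership.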
Choosing a strategy
Pseudobulk is the gold standard for DE testing (recommended when you have ≥ 4 biological replicates per condition). It produces one observation per sample × cell-type and is unambiguous.
Metacells by size is best when samples have variable cell counts and you
want to maximise the number of metacells from well-represented groups while
naturally excluding sparse ones (below cell_min).
Metacells by count is best when you want equal replication across all groups regardless of cell count, e.g. for downstream tools that assume balanced designs.
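The difference between the two metacell strategies can be sketched as two ways of partitioning the cells of a single group. This is only an illustration under stated assumptions: scagg's actual assignment algorithm is not described here, and the `split_by_num` and `split_by_size` helpers are hypothetical. The sketch assumes a seeded shuffle followed by an even split.

```python
import random


def split_by_num(barcodes, n_metacells, seed=42):
    """Shuffle, then deal cells into exactly n_metacells near-equal chunks."""
    shuffled = list(barcodes)
    random.Random(seed).shuffle(shuffled)
    # Striding by n_metacells gives chunk sizes that differ by at most one.
    return [shuffled[i::n_metacells] for i in range(n_metacells)]


def split_by_size(barcodes, cell_size, seed=42):
    """Choose the chunk count so each chunk holds roughly cell_size cells."""
    n_chunks = max(1, round(len(barcodes) / cell_size))
    return split_by_num(barcodes, n_chunks, seed)


group = [f"cell{i}" for i in range(23)]  # one cell_type × sample group (toy data)

by_size = split_by_size(group, cell_size=10)
by_num = split_by_num(group, n_metacells=5)

print([len(chunk) for chunk in by_size])
print([len(chunk) for chunk in by_num])
```

With 23 cells, the size-based split yields two chunks of about 10 cells and uses every cell, whereas the count-based split always yields five chunks regardless of group size. Groups with fewer cells than chunks are assumed to have been filtered out beforehand by the minimum-cell threshold.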
In all cases, expression is aggregated by summing raw counts. If you are
using normalised/scaled values (SCT slot), pass layer= pointing to the raw
counts layer for biologically meaningful aggregation.
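A small worked example shows why the input to summation matters. Here `math.log1p` stands in for any nonlinear normalisation (an assumption for illustration; SCT applies a different transform): summing transformed values is not the same as transforming the summed counts.

```python
import math

# Raw counts of one gene in the three cells that make up a metacell (toy values).
raw_counts = [0, 1, 9]

summed_raw = sum(raw_counts)                          # aggregate, then transform later
sum_of_logs = sum(math.log1p(c) for c in raw_counts)  # transform, then aggregate
log_of_sum = math.log1p(summed_raw)

print(summed_raw)
print(round(sum_of_logs, 3), round(log_of_sum, 3))
```

The two log-scale numbers differ (ln 20 versus ln 11), so aggregates built from already-normalised values no longer behave like counts, which is what count-based DE tools expect.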
Migrating from your existing code
If you were using make_metacells_by_size(so_obj, cell_size=10) with
hardcoded ct3/sample columns and a hardcoded time_treat Idents:
# Old (hardcoded ct3/sample, hardcoded time_treat Idents)
metacell_so <- make_metacells_by_size(so_obj, cell_size = 10)
# New (scagg, generalised)
res <- make_metacells_by_size(
  so_obj,
  group_vars = c("ct3", "sample"),
  cell_size = 10,
  meta_vars = c("Treatment", "Timepoint", "sex", "Prepper"),
  assays = "SCT"
)
metacell_so <- res$obj
# Add a derived column, then point Idents at it
metacell_so$time_treat <- paste(metacell_so$Timepoint, metacell_so$Treatment, sep = "_")
Seurat::Idents(metacell_so) <- "time_treat"
# OR: pass idents_col directly if the column already exists in the result's metadata
res <- make_metacells_by_size(
  so_obj,
  group_vars = c("ct3", "sample"),
  cell_size = 10,
  meta_vars = c("Treatment", "Timepoint", "sex", "Prepper"),
  idents_col = "ct3" # set Idents to cell type
)
License
MIT © 2026 Ethan Fenton