
Pseudobulk and metacell aggregation for single-cell RNA-seq (AnnData / scanpy)


scagg

Pseudobulk and metacell aggregation for single-cell RNA-seq

scagg provides three aggregation strategies for turning single-cell data into sample-level or metacell-level objects, ready for bulk-style differential expression (DESeq2, edgeR, dream) or other downstream analyses:

Strategy             Function                  When to use
Pseudobulk           make_pseudobulk           One aggregate per unique sample × cell-type combination
Metacells by size    make_metacells_by_size    ~N cells per metacell, maximises usage of all cells
Metacells by count   make_metacells_by_num     Exactly N metacells per group, balances replication

Available for Python (AnnData/scanpy) and R (Seurat). Full cell → metacell traceability is built in.



Installation

Python

pip install git+https://github.com/ethanfenton/scagg.git

Requirements: Python ≥ 3.10, anndata, numpy, pandas, scipy (all present in any standard scanpy environment).
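Before installing into an existing environment, it can help to confirm that the listed requirements are already present. The following is a minimal sketch using only the standard library; the helper name missing_requirements is hypothetical and not part of scagg.

```python
import sys
from importlib.util import find_spec

# Package names taken from the requirements line above.
REQUIRED = ("anndata", "numpy", "pandas", "scipy")


def missing_requirements(packages=REQUIRED, min_python=(3, 10)):
    """Return a list of problems; an empty list means the environment looks fine."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    for name in packages:
        # find_spec returns None when a top-level package cannot be imported.
        if find_spec(name) is None:
            problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    for line in missing_requirements() or ["environment looks fine"]:
        print(line)
```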

R

if (!requireNamespace("remotes")) install.packages("remotes")
remotes::install_github("ethanfenton/scagg", subdir = "R/scagg")

Requirements: Seurat, dplyr.


Quick start — Python

import scanpy as sc
import scagg

adata = sc.read_h5ad("my_data.h5ad")

# True pseudobulk: one obs per sample × cell_type
pb = scagg.make_pseudobulk(
    adata,
    group_vars = ["cell_type", "sample"],
    save_membership = "results/pseudobulk_membership.csv",
)

# Metacells of ~10 cells each
mc_size = scagg.make_metacells_by_size(
    adata,
    group_vars = ["cell_type", "sample"],
    cell_size  = 10,
    save_membership = "results/metacell_size10_membership.csv",
)

# Exactly 5 metacells per cell_type × sample group
mc_num = scagg.make_metacells_by_num(
    adata,
    group_vars  = ["cell_type", "sample"],
    n_metacells = 5,
    save_membership = "results/metacell_num5_membership.csv",
)

Access results

# Result is a standard AnnData
mc_size.obs          # metacell metadata (n_cells, cell_type, sample, …)
mc_size.X            # summed counts matrix (metacells × genes)
mc_size.uns["metacell_membership"]  # DataFrame: cell_barcode → metacell_id

CLI

# Pseudobulk
scagg pseudobulk \
  --input data.h5ad --output results/pb.h5ad \
  --group-vars cell_type sample \
  --save-membership results/pb_membership.csv

# Metacells by size
scagg metacells-by-size \
  --input data.h5ad --output results/mc.h5ad \
  --group-vars cell_type sample \
  --cell-size 10 \
  --save-membership results/mc_membership.csv

# Metacells by count
scagg metacells-by-num \
  --input data.h5ad --output results/mc_num.h5ad \
  --group-vars cell_type sample \
  --n-metacells 5

Quick start — R (Seurat)

library(scagg)

# True pseudobulk
res <- make_pseudobulk(
  so_obj     = seurat_obj,
  group_vars = c("cell_type", "sample"),
  assays     = c("SCT", "RNA"),
  save_membership = "results/pseudobulk_membership.csv"
)
pb_so      <- res$obj         # aggregated Seurat object
membership <- res$membership  # data.frame: barcode → metacell_id

# Metacells of ~10 cells
res <- make_metacells_by_size(
  so_obj     = seurat_obj,
  group_vars = c("cell_type", "sample"),
  cell_size  = 10,
  save_membership = "results/mc_size10_membership.csv"
)

# Exactly 5 metacells per group
res <- make_metacells_by_num(
  so_obj      = seurat_obj,
  group_vars  = c("cell_type", "sample"),
  n_metacells = 5
)

Carry custom metadata

res <- make_metacells_by_size(
  so_obj     = seurat_obj,
  group_vars = c("cell_type", "sample"),
  cell_size  = 10,
  meta_vars  = c("Treatment", "Timepoint", "sex", "Prepper", "batch")
)

API reference

Python

scagg.make_pseudobulk(adata, group_vars, *, meta_vars, layer, save_membership)
scagg.make_metacells_by_size(adata, group_vars, cell_size, *, cell_min, meta_vars,
                              layer, seed, save_membership)
scagg.make_metacells_by_num(adata, group_vars, n_metacells, *, min_cells, meta_vars,
                             layer, seed, save_membership)

Parameter         Default              Notes
group_vars        required             One or more obs column names
cell_size         required             Target cells per metacell
cell_min          70 % of cell_size    Min cells to include a group
n_metacells       required             Metacells per group
min_cells         n_metacells          Min cells to include a group
meta_vars         all obs columns      Which metadata to carry forward
layer             None (use X)         AnnData layer to aggregate
seed              42                   RNG seed for reproducibility
save_membership   None                 CSV path for traceability output
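To make the two default thresholds concrete, here is a small sketch of how they decide which groups are included. The helper names are hypothetical, and the inclusive comparison (a group is kept when it has at least the threshold number of cells) is an assumption; the table does not say how boundary cases or rounding are handled.

```python
def default_threshold(function, *, cell_size=None, n_metacells=None):
    """Default minimum group size, as listed in the parameter table."""
    if function == "by_size":
        return 0.7 * cell_size   # cell_min: 70 % of cell_size
    if function == "by_num":
        return n_metacells       # min_cells: n_metacells
    raise ValueError(f"unknown function: {function}")


def kept_groups(group_sizes, threshold):
    # Assumption: a group is kept when it has at least `threshold` cells.
    return [group for group, n in group_sizes.items() if n >= threshold]


# Hypothetical cell counts per (cell_type, sample) group.
group_sizes = {("ExN", "s1"): 23, ("InN", "s2"): 6, ("ExN", "s3"): 4}

print(kept_groups(group_sizes, default_threshold("by_size", cell_size=10)))
print(kept_groups(group_sizes, default_threshold("by_num", n_metacells=5)))
```

With these numbers the 6-cell group falls below the by-size threshold but passes the by-count threshold, which is why the two functions can keep different sets of groups on the same data.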

R

make_pseudobulk(so_obj, group_vars, meta_vars, assays, save_membership,
                idents_col = NULL)
make_metacells_by_size(so_obj, group_vars, cell_size, cell_min, meta_vars,
                       assays, seed, save_membership, idents_col = NULL)
make_metacells_by_num(so_obj, group_vars, n_metacells, min_cells, meta_vars,
                      assays, seed, save_membership, idents_col = NULL)

idents_col: name of a column in the result's metadata to set as the active Seurat Idents. NULL (default) leaves Idents unset. The column must already exist in the aggregated metadata (e.g. "cell_type"). For a column you derive after aggregation, set Idents yourself with Seurat::Idents(), as in the migration example.

All R functions return a named list:

  • $obj — aggregated Seurat object
  • $membership — data.frame mapping every input barcode to its metacell ID

Traceability output

Each function can optionally write a CSV (set save_membership to a file path) with one row per input cell:

cell_barcode, metacell_id, cell_type, sample, Treatment, ...
AAACCCAGTCCGAACC-1, mc3_A_s1, ExN, s1, KO, ...
AAACCCATCAGCTTGC-1, mc3_A_s1, ExN, s1, KO, ...
AAACGAAGTCCTGTAG-1, mc1_B_s2, InN, s2, WT, ...
AAAGGATAGCTCCATG-1, unassigned, ExN, s3, KO, ...   ← below cell_min threshold

The metacell_id column uses the format mc{N}_{group_vars_joined}, with the group values joined by underscores. Cells in groups below the minimum threshold (cell_min or min_cells) are marked unassigned.
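Because the membership file is plain CSV, it can be checked without loading the expression data. The sketch below counts cells per metacell and collects unassigned barcodes using only the standard library. It reuses the sample rows above with the elided "..." columns dropped; the function name summarise_membership is hypothetical.

```python
import csv
import io
from collections import Counter

# Sample rows from the traceability example, elided columns dropped.
SAMPLE_CSV = """\
cell_barcode, metacell_id, cell_type, sample, Treatment
AAACCCAGTCCGAACC-1, mc3_A_s1, ExN, s1, KO
AAACCCATCAGCTTGC-1, mc3_A_s1, ExN, s1, KO
AAACGAAGTCCTGTAG-1, mc1_B_s2, InN, s2, WT
AAAGGATAGCTCCATG-1, unassigned, ExN, s3, KO
"""


def summarise_membership(handle):
    """Return ({metacell_id: n_cells}, [unassigned barcodes]) for a membership CSV."""
    # skipinitialspace tolerates the ", " separators used in the sample above.
    reader = csv.DictReader(handle, skipinitialspace=True)
    sizes = Counter()
    unassigned = []
    for row in reader:
        if row["metacell_id"] == "unassigned":
            unassigned.append(row["cell_barcode"])
        else:
            sizes[row["metacell_id"]] += 1
    return dict(sizes), unassigned


sizes, unassigned = summarise_membership(io.StringIO(SAMPLE_CSV))
print(sizes)
print(unassigned)
```

For a real file, pass an open file handle (for example open("results/mc_membership.csv", newline="")) instead of the in-memory string.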


Choosing a strategy

Pseudobulk is the gold standard for DE testing and is recommended when you have ≥ 4 biological replicates per condition. It produces exactly one observation per sample × cell-type combination, so each aggregate maps unambiguously to one biological replicate.
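The idea behind pseudobulk aggregation can be shown with plain Python lists. This is an illustrative sketch only, not scagg's implementation (which presumably works on the AnnData matrix directly); pseudobulk_sum and the toy counts are hypothetical.

```python
from collections import defaultdict


def pseudobulk_sum(counts, labels):
    """Sum per-cell count vectors within each group.

    counts: list of per-cell gene-count lists (cells x genes)
    labels: one hashable group key per cell, e.g. (cell_type, sample)
    Returns ({group: summed counts}, {group: number of cells}).
    """
    n_genes = len(counts[0]) if counts else 0
    totals = defaultdict(lambda: [0] * n_genes)
    n_cells = defaultdict(int)
    for row, key in zip(counts, labels):
        acc = totals[key]
        for j, value in enumerate(row):
            acc[j] += value
        n_cells[key] += 1
    return dict(totals), dict(n_cells)


# Four cells, three genes; two cells share the (ExN, s1) group.
counts = [
    [1, 0, 3],
    [2, 1, 0],
    [0, 4, 1],
    [5, 0, 0],
]
labels = [("ExN", "s1"), ("ExN", "s1"), ("InN", "s1"), ("ExN", "s2")]

pb, n_cells = pseudobulk_sum(counts, labels)
for group, summed in pb.items():
    print(group, summed, n_cells[group])
```

Each group ends up as a single row of summed counts, which is the shape bulk-style tools such as DESeq2 or edgeR expect.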

Metacells by size is best when samples have variable cell counts and you want to maximise the number of metacells from well-represented groups while naturally excluding sparse ones (below cell_min).

Metacells by count is best when you want equal replication across all groups regardless of cell count, e.g. for downstream tools that assume balanced designs.
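The difference between the two metacell strategies is easiest to see on group sizes. The sketch below is one plausible way to partition a group's cells, written under stated assumptions: cells are shuffled with a seeded RNG, split round-robin so no cell is dropped, and a group is kept when it has at least the threshold number of cells. The function names and the partitioning scheme are illustrative and may differ from what scagg actually does.

```python
import random


def _shuffled(cell_ids, seed):
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)  # seeded, so the split is reproducible
    return ids


def split_by_size(cell_ids, cell_size, cell_min=None, seed=42):
    """About cell_size cells per metacell; the number of metacells varies."""
    ids = _shuffled(cell_ids, seed)
    if cell_min is None:
        cell_min = 0.7 * cell_size
    if len(ids) < cell_min:
        return []  # group excluded
    n_meta = max(1, len(ids) // cell_size)
    # Round-robin keeps every cell: leftovers are spread over the metacells.
    return [ids[i::n_meta] for i in range(n_meta)]


def split_by_num(cell_ids, n_metacells, min_cells=None, seed=42):
    """Exactly n_metacells per group; the metacell size varies."""
    ids = _shuffled(cell_ids, seed)
    if min_cells is None:
        min_cells = n_metacells
    if len(ids) < min_cells:
        return []  # group excluded
    return [ids[i::n_metacells] for i in range(n_metacells)]


# Hypothetical groups of 50, 23 and 6 cells.
groups = {"big": range(50), "medium": range(23), "small": range(6)}
for name, cells in groups.items():
    by_size = [len(m) for m in split_by_size(cells, cell_size=10)]
    by_num = [len(m) for m in split_by_num(cells, n_metacells=5)]
    print(name, "by size:", by_size, "by count:", by_num)
```

In this sketch the by-size split gives large groups more metacells and drops the 6-cell group, while the by-count split gives every kept group five metacells of varying size.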

In all cases, expression is aggregated by summing raw counts. If the matrix you would otherwise aggregate holds normalised or scaled values (for example SCT-transformed data), pass layer= pointing to the raw-counts layer so that the sums are biologically meaningful.
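A short arithmetic example shows why the input should be raw counts. Summing log1p-transformed values is equivalent to multiplying (count + 1) across cells, so the back-transformed total no longer reflects the number of molecules observed. The numbers below are made up for illustration.

```python
import math

# Raw counts of one gene in four cells of the same group (hypothetical).
raw = [0, 3, 10, 1]

# Aggregating raw counts: the total number of molecules observed.
sum_of_counts = sum(raw)

# Aggregating log1p-transformed values instead.
sum_of_logs = sum(math.log1p(c) for c in raw)

# Back-transforming the summed logs gives prod(count + 1) - 1, not the total.
back_transformed = math.expm1(sum_of_logs)

print("sum of raw counts:", sum_of_counts)
print("back-transformed sum of log1p values:", round(back_transformed, 3))
```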


Migrating from your existing code

If you were using an earlier make_metacells_by_size(so_obj, cell_size = 10) with hardcoded ct3/sample grouping columns and hardcoded time_treat Idents, the equivalent scagg calls are:

# Old (hardcoded ct3/sample, hardcoded time_treat Idents)
metacell_so <- make_metacells_by_size(so_obj, cell_size = 10)

# New (scagg, generalised)
res <- make_metacells_by_size(
  so_obj,
  group_vars = c("ct3", "sample"),
  cell_size  = 10,
  meta_vars  = c("Treatment", "Timepoint", "sex", "Prepper"),
  assays     = "SCT"
)
metacell_so <- res$obj

# Add a derived column, then point Idents at it
metacell_so$time_treat <- paste(metacell_so$Timepoint, metacell_so$Treatment, sep = "_")
Seurat::Idents(metacell_so) <- "time_treat"

# Or pass idents_col directly if the column already exists in the result's metadata:
res <- make_metacells_by_size(
  so_obj,
  group_vars = c("ct3", "sample"),
  cell_size  = 10,
  meta_vars  = c("Treatment", "Timepoint", "sex", "Prepper"),
  idents_col = "ct3"   # set Idents to cell type
)

License

MIT © 2026 Ethan Fenton
