
Rust + PyO3 reimplementation of the full SCENIC+ pipeline — GRN, AUCell, topics, cistarget, peak calling, cell QC, enhancer→gene, eRegulon assembly. Installs and runs where arboreto+pyscenic+pycisTopic no longer do.

Project description

rustscenic


A Rust + PyO3 replacement for the SCENIC / SCENIC+ compute stack: one install, modern Python, low-memory CPU execution, and atlas-scale regulatory-network analysis without Java, dask, CUDA, or fragile multi-tool environments.

pip install rustscenic

Five runtime dependencies (numpy, pandas, pyarrow, scipy, anndata). Python 3.10–3.13, Linux + macOS (x86_64 + aarch64). No dask, no Java, no CUDA.

The practical SCENIC+ compute path in one package:

```mermaid
flowchart LR
    rna["RNA<br/>AnnData"] --> grn["GRN"]
    atac["ATAC<br/>AnnData/fragments"] --> chrom["topics<br/>cisTarget<br/>enhancer links"]
    grn --> ereg["eRegulons"]
    chrom --> ereg
    grn --> auc["AUCell<br/>cells x regulons"]
```

Status

Current release: v0.4.3 on PyPI.

  • v0.4.0: established publishable real-data end-to-end runs on PBMC and mouse brain E18 multiome via the public pipeline.run.
  • v0.4.1: fixes the pipeline.run(tfs="hs"/"mm") species shortcuts.
  • v0.4.2: adds motif-annotation cisTarget pruning (synthetic-validated; real-data Kamath rerun pending), addressing the regulon-pruning gap surfaced by the Kamath DA-neuron community run (#68).
  • v0.4.3: corrects PipelineResult.pruned_regulons_path to be None on pruning fallback, makes the validation scripts NA-safe, and softens earlier scope claims.

See CHANGELOG and validation/ for evidence and caveats.

Open follow-ups tracked for v0.5+:

  • AUCell wall-time refresh against the current SCENIC+ stack (current numbers measured 2026-04, pre-v0.4.x).
  • Region-cistarget kernel parity vs ctxcore.
  • Normalised enrichment scores (NES) on top of cistarget AUCs, to match pycistarget's output scale.
  • The six-dataset v0.4.x benchmark sweep (see docs/v0.4.x-benchmark-plan.md).
  • Raw 10x pipeline.run without caller-side ATAC pre-subset (current docs require the subset).

Goal

rustscenic is being built as the single-install replacement for the practical SCENIC / SCENIC+ workflow: RNA GRN inference, AUCell regulon activity, motif enrichment, ATAC fragment preprocessing, topic modelling, enhancer-gene linking, and eRegulon assembly in one package.

The project is intentionally not a thin wrapper around the old stack. The target is a simpler architecture that makes regulatory-network analysis easier to install, cheaper to run on CPU, deterministic under a fixed seed, and robust to real atlas conventions such as ENSEMBL var_names, duplicate gene symbols, backed AnnData, and UCSC/Ensembl chromosome mismatches.

What it does

Rust-native replacements for the compute stages plus the glue that scenicplus builds eRegulons from:

| Stage | rustscenic | Replaces |
|---|---|---|
| Gene-regulatory network inference | `rustscenic.grn.infer` | `arboreto.grnboost2` |
| Per-cell regulon activity scoring | `rustscenic.aucell.score` | `pyscenic.aucell.aucell` |
| Topic modelling on scATAC peaks (Online VB) | `rustscenic.topics.fit` | pycisTopic (gensim VB) |
| Topic modelling, K ≥ 30 (Mallet-class collapsed Gibbs) | `rustscenic.topics.fit_gibbs` | pycisTopic (Mallet, Java) |
| Motif-regulon enrichment | `rustscenic.cistarget.enrich` | pycistarget AUC kernel |
| ATAC fragments → cells × peaks matrix | `rustscenic.preproc.fragments_to_matrix` | pycisTopic fragment loader |
| Cell QC (TSS enrichment, FRiP, insert size) | `rustscenic.preproc.qc` | `pycisTopic.qc` |
| Enhancer → gene correlation | `rustscenic.enhancer.link_peaks_to_genes` | scenicplus p2g linking |
| eRegulon assembly (TF × enhancers × target genes) | `rustscenic.eregulon.build_eregulons` | scenicplus eRegulon builder |
| End-to-end pipeline orchestrator | `rustscenic.pipeline.run` | scenicplus snakemake |

Bundled with the wheel: HGNC (1,839 human) and MGI (1,721 mouse) TF lists via rustscenic.data.tfs(species). Motif rankings can be fetched and cached via rustscenic.data.download_motif_rankings. Cellxgene-curated h5ads (ENSEMBL IDs in var_names, gene symbols in var["feature_name"]) are auto-detected so atlas data works without manual patching.
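The cellxgene convention can also be normalised by hand; a minimal pandas sketch of the same idea (the detection heuristic here is illustrative, not rustscenic's exact rule):

```python
import pandas as pd

def symbols_for(var: pd.DataFrame) -> pd.Index:
    """If var_names are ENSEMBL IDs and a feature_name column exists,
    prefer the gene symbols (cellxgene-curated h5ad convention)."""
    looks_ensembl = var.index.str.match(r"ENS[A-Z]*G\d+").all()
    if looks_ensembl and "feature_name" in var.columns:
        return pd.Index(var["feature_name"].astype(str))
    return var.index

# Toy var frame in the cellxgene layout: ENSEMBL IDs as index, symbols in a column.
var = pd.DataFrame(
    {"feature_name": ["SPI1", "PAX5"]},
    index=["ENSG00000066336", "ENSG00000196092"],
)
print(symbols_for(var).tolist())  # ['SPI1', 'PAX5']
```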

Quick example (PBMC-3k, RNA GRN + AUCell)

```python
import anndata as ad

import rustscenic.aucell
import rustscenic.data
import rustscenic.grn

adata = ad.read_h5ad("rna.h5ad")
tfs = rustscenic.data.tfs("hs")  # bundled HGNC list (1,839 TFs)

# 1. GRN inference
grn = rustscenic.grn.infer(adata, tf_names=tfs, n_estimators=5000, seed=777)

# 2. Build top-50-target regulons and score per-cell activity
regulons = [
    (f"{tf}_regulon", grn[grn["TF"] == tf].nlargest(50, "importance")["target"].tolist())
    for tf in grn["TF"].unique()
]
auc = rustscenic.aucell.score(adata, regulons, top_frac=0.05)
```

Full RNA example script: examples/pbmc3k_end_to_end.py. Runs in about 3 minutes on an 8-core laptop with n_estimators=500. docs/tester-quickstart.md is the collaborator smoke-test path.

Measured against the pyscenic / arboreto reference

Same input on both sides. Every row has a log file under validation/.

| Axis | pyscenic / arboreto | rustscenic |
|---|---|---|
| Installs on fresh Python 3.10–3.13 venv | arboreto: `TypeError: Must supply at least one delayed object` (dask_expr); pyscenic: `ModuleNotFoundError: pkg_resources` in current stacks | PyPI wheels and sdist install; core APIs import |
| AUCell wall-time, Ziegler 2021 atlas (31,602 × 59; measured 2026-04 pre-v0.4.x; refresh deferred to v0.5) | 6.81 s | 0.25 s |
| AUCell wall-time, 10x Multiome (10,290 × 1,457; measured 2026-04 pre-v0.4.x; refresh deferred to v0.5) | 18.6 s | 0.21 s |
| Peak RSS, 4 stages on 100,000 cells × 20,292 genes | > 40 GB (reported) | 6.3 GB |
| Cistarget kernel vs `ctxcore.recovery.aucs` | reference | Pearson 1.0000, mean abs diff 2.4 × 10⁻⁵ |
| AUCell per-cell Pearson vs pyscenic (Ziegler, 31,602 cells; measured 2026-04 pre-v0.4.x; refresh deferred to v0.5) | reference | 0.984 mean, 91.7 % of cells > 0.95 |
| Canonical airway TFs matching literature (Ziegler, n = 14) | 8 / 14 (unit weights) | 8 / 14 (same hits, same 6/14 misses) |
| Bit-identical output under same seed across threaded runs | no (dask non-determinism) | yes |
| Runtime dependencies | 40+ | 5 |

Tool-to-tool variation (same hits, same misses on the same 14 canonical TFs) is smaller than the dataset-inherent noise, consistent with rustscenic being numerically equivalent to pyscenic at the per-cell level.

Per-stage detail

Numbers are rustscenic's values. The measurement context (dataset, n_cells, version) is in each row. The parity refresh against current upstream stacks (six-dataset sweep) is now planned for v0.5+; see docs/v0.4.x-benchmark-plan.md for the dataset list and success criteria.

GRN — arboreto.grnboost2 replacement

| Measurement | Value |
|---|---|
| Per-edge Spearman vs arboreto (PBMC-3k scanpy, n_estimators=5000, 480,680 shared edges, v0.3.10) | 0.611 |
| Within-TF Spearman, mean across 1,274 TFs (same fixture) | 0.632 (median 0.649) |
| Per-edge Spearman vs arboreto (multiome3k, n_estimators=5000, 816k common edges, 2026-04) | 0.58 |
| Per-target TF-ranking Spearman, mean | 0.57 |
| TRRUST known TF→target edges recovered (PBMC-3k) | 17 / 18 (94 %) |
| Lineage TFs correctly enriched in expected cell types (PBMC-10k) | 8 / 8 (SPI1, PAX5, EBF1, TCF7, LEF1, TBX21, CEBPD, IRF8) |
| Cortex marker TFs present in regulon set (E18 multiome, 4,770 cells, v0.3.10; name-presence, not cell-type enrichment) | 9 / 9 (Pax6, Neurod2, Sox2, Ascl1, Tbr1, Neurog2, Fezf2, Eomes, Foxg1) |
| MITF regulon activity, Tirosh 2016 melanoma (malignant vs TME) | 3.48× |
| Wall vs pyscenic on PBMC-3k (n_estimators=5000, seed 777, Apple M5, v0.3.10; pyscenic in sync mode, not apples-to-apples against dask-parallel) | 214 s vs 381 s (1.78×) |
| 100k-cell bootstrap, n_estimators=100 | 17 min / 5.0 GB peak RSS |

Edge rankings disagree with arboreto at fine grain (per-edge Spearman 0.611 on PBMC-3k v0.3.10, 0.58 on multiome3k 2026-04; top-10k Jaccard 0.20), an expected consequence of independent histogram-GBM quantisation. Coarse biology converges: per-TF Spearman ≈ 0.65, and all canonical lineage TFs are recovered on both human PBMC and mouse cortex. Downstream AUCell agreement with pyscenic is ≈ 0.99 per cell, so the edge-ranking differences do not propagate.

AUCell — pyscenic.aucell replacement

| Measurement | Value |
|---|---|
| Per-cell Pearson vs pyscenic (10x Multiome, 2,588 × 1,457) | 0.988 mean, 99.5 % of cells > 0.95 |
| Per-cell Pearson vs pyscenic (Ziegler atlas, 31,602 × 59) | 0.984 mean, 91.7 % of cells > 0.95 |
| Per-regulon Pearson (10x Multiome) | 0.87 mean, 90.5 % > 0.80 |
| Exact top-regulon-per-cell match (Multiome) | 88.4 % |
| Wall-time, 10k cells × 1,457 regulons | 0.21 s (vs 18.6 s pyscenic) |
| 100k cells × 500 regulons | 10 s, 5.6 GB peak RSS |
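The score itself is a recovery-curve AUC: rank a cell's genes by expression, walk down the top `top_frac` of the ranking, and integrate how many regulon genes have been recovered at each step. A minimal numpy sketch of that statistic (not rustscenic's kernel; the normalisation shown is one common convention and details may differ):

```python
import numpy as np

def aucell_one_cell(expr, genes, regulon, top_frac=0.05):
    """Recovery-curve AUC for one cell, normalised by the best possible AUC."""
    k = max(1, int(len(genes) * top_frac))
    top = np.argsort(expr)[::-1][:k]              # top-k genes by expression
    hits = np.isin(np.asarray(genes)[top], list(regulon))
    recovery = np.cumsum(hits)                    # regulon genes recovered at each rank
    m = min(len(regulon), k)
    best = m * (m + 1) / 2 + m * (k - m)          # all regulon genes ranked first
    return float(recovery.sum() / best)

genes = ["A", "B", "C", "D", "E"]
expr = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
print(aucell_one_cell(expr, genes, {"A", "B"}, top_frac=0.6))  # 1.0
```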

Topics — pycisTopic LDA replacement (Online VB + collapsed Gibbs)

Two algorithms ship side-by-side:

  • rustscenic.topics.fit — Online VB LDA, fastest at K ≤ 10.
  • rustscenic.topics.fit_gibbs — collapsed Gibbs (Mallet's algorithm class). Add n_threads=N for parallel AD-LDA.

Real PBMC 3k Multiome ATAC, 1,500 cells × 98,319 peaks, K = 30, intrinsic top-10 NPMI on the training corpus:

| Tool | Wall | Unique topics (of 30) | Top-10 NPMI mean |
|---|---|---|---|
| `rustscenic.topics.fit` (Online VB) | 104 s | 2 / 30 (collapsed) | +0.012 |
| `rustscenic.topics.fit_gibbs` (serial) | 191 s | 22 / 30 | +0.031 |
| `rustscenic.topics.fit_gibbs` (8-thread) | 84 s | 25 / 30 | +0.019 |
| Mallet (pycisTopic reference) | n/a | 24 / 30 | 0.196 (extrinsic) |

Collapsed Gibbs gives ~11× more distinct topics than Online VB on sparse scATAC at K = 30 and ~2.7× higher intrinsic NPMI; the parallel AD-LDA path adds a 2.56× wall-clock speedup at 8 threads while preserving topic diversity. Mallet's published 0.196 is an extrinsic NPMI (different protocol, not directly comparable in absolute scale). See docs/topic-collapse.md and docs/bench-vs-references.md. Reproduce with python validation/scaling/bench_npmi_head_to_head.py and python validation/scaling/bench_gibbs_parallel.py.
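Top-10 NPMI scores a topic by the normalised pointwise mutual information of its top word pairs over the corpus: +1 for perfect co-occurrence, 0 for independence, -1 for words that never co-occur. A minimal sketch of the pairwise statistic (document-level co-occurrence counts and smoothing are illustrative; implementations differ, which is part of why intrinsic and extrinsic scales are not comparable):

```python
import math
from itertools import combinations

def npmi(top_words, docs, eps=1e-12):
    """Mean NPMI over all pairs of top_words; docs are sets of words."""
    n = len(docs)
    p = lambda *ws: sum(all(w in d for w in ws) for d in docs) / n
    scores = []
    for a, b in combinations(top_words, 2):
        pa, pb, pab = p(a), p(b), p(a, b)
        if pab == 0:
            scores.append(-1.0)  # pair never co-occurs
        else:
            scores.append(math.log(pab / (pa * pb)) / -math.log(pab + eps))
    return sum(scores) / len(scores)

docs = [{"x", "y"}, {"x", "y"}, {"z"}, {"x", "z"}]
print(round(npmi(["x", "y"], docs), 3))  # 0.415
```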

Cistarget — pycistarget AUC kernel replacement

Validated on the aertslab hg38 v10 feather database (5,876 motifs × 27,015 genes):

| Measurement | Value |
|---|---|
| Per-regulon Pearson vs `ctxcore.recovery.aucs` (58 TRRUST regulons) | 1.0000 (all > 0.9999, abs diff 2.4 × 10⁻⁵) |
| Self-consistency (motif's own top-500 genes → rank #1) | 10 / 10 |
| TRRUST at scale (166 TFs ≥ 10 targets): TF-annotated motif ranks #1 | 19 % |
| Same benchmark: any TF motif in top-100 | 68–100 % (rises with regulon size) |
| Mouse mm10 cross-species (5 TRRUST TFs) | 2 / 5 rank #1, 4 / 5 in top-5 |
| 100k-cell workload × 100 regulons | 2.6 s, 6.3 GB peak RSS |

Bit-identical to ctxcore.recovery.aucs at float32 precision. The 19 % rank-#1 rate is the scaled-out TRRUST-vs-motif-binding benchmark, a property of the gold-standard mismatch, not the implementation.

End-to-end + determinism

| Pipeline | Wall | Peak RSS | Stages |
|---|---|---|---|
| Reference (arboreto + pyscenic + tomotopy), 10x Multiome 3k | 11.8 min | n/a | 4 |
| rustscenic, 10x Multiome 3k | 9.1 min | n/a | 4 |
| rustscenic, 10x PBMC 3k multiome real-data (v0.3.9, measured 2026-05-02) | 7.5 min | 3.67 GB | 7 (all) |
| rustscenic, 10x brain E18 5k multiome real-data (v0.3.10, measured 2026-05-04) | 13.8 min | 4.01 GB | 7 (all) |
| rustscenic, 10x PBMC granulocyte 10k multiome real-data (v0.4.3, measured 2026-05-11) | 38.1 min | 5.39 GB | 7 (all) |
| rustscenic, 100k synthetic multiome E2E (v0.3.10, measured 2026-04-27) | 12.7 min | 7.09 GB | 7 (all) |
| rustscenic, 200k synthetic multiome E2E (v0.3.10, measured 2026-04-27) | 16.8 min | 7.44 GB | 7 (all) |

Cross-dataset scaling on real 10x multiomes: a 4.2× cell scale-up (2,767 → 11,620 cells) produces 5.1× wall time (slope ~1.21× over the full span; the intermediate-pair slopes are 1.06× and 1.14×, so the trajectory is slightly accelerating) and 1.47× peak RSS (sub-linear in cells). GRN dominates, at 78 % of wall on the 10k run with n_estimators=100.

Biology check on the latest run: 10 of 10 canonical PBMC and granulocyte transcription factors recovered by name (SPI1, CEBPA, CEBPB, CEBPE, IRF8, PAX5, EBF1, GATA3, TBX21, FOXP3); the brain E18 5k run recovered 9 of 9 cortex TFs. These are name-presence checks against a regulon set of ~1,500 names from a TF list of ~1,800, not cell-type enrichment; the per-cluster AUCell F-test is tracked as a v0.5 follow-up.

Memory: the 100k synthetic multiome 7-stage E2E peaks at 7.09 GB RSS (measured on v0.3.10; v0.4.x motif pruning may shift this, refresh pending), vs the scenicplus stack's reported > 40 GB at comparable scale.

Determinism and robustness: bit-identical output under the same seed across threaded runs, verified across three consecutive runs per stage; 10 / 10 robustness edge-case tests pass (foreign genes, NaN input, duplicate gene names, all-zero cells, large regulons, object-dtype rankings, n_topics=0, very sparse matrices).

Reproduce the real-data runs with the scripts under validation/multiome_pipeline_run_*.sh; reproduce the synthetic runs with python validation/scaling/bench_e2e_100k_synthetic.py and the 200k script.
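The determinism claim is cheap to verify yourself: run any stage twice with the same seed into two output directories and compare the artifacts byte for byte. A generic stdlib sketch (the run_a/run_b paths are hypothetical placeholders for your own two runs):

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Streaming SHA-256 of a file, for byte-exact output comparison."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# After two seeded runs of the same stage into run_a/ and run_b/:
a, b = Path("run_a/grn.parquet"), Path("run_b/grn.parquet")
if a.exists() and b.exists():
    assert sha256_of(a) == sha256_of(b), "outputs differ: run is not deterministic"
```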

Scope and alternatives

rustscenic covers the practical SCENIC / SCENIC+ compute path on CPU. Adjacent tools with different scope:

  • GPU / CUDA: flashSCENIC (uses RegDiffusion, a different algorithm from GENIE3 / GRNBoost2, so outputs are not pyscenic-numerical).
  • Multiomic enhancer-aware GRN: scenicplus (joint scRNA + scATAC enhancer inference; a superset of this scope).
  • TF-activity scoring from prebuilt regulons, no GRN inference: decoupler-py with CollecTRI.
  • R / Bioconductor ecosystem: the original R SCENIC or Epiregulon.

rustscenic does not bundle the aertslab motif ranking feather databases (300 MB – 35 GB). Users fetch them from resources.aertslab.org and pass the resulting DataFrame to cistarget.enrich.

CLI

```shell
# End-to-end orchestrator (recommended):
rustscenic pipeline  --rna data.h5ad --tfs tfs.txt --output out/

# Per-stage CLI:
rustscenic grn       --expression data.h5ad --tfs tfs.txt --output grn.parquet
rustscenic aucell    --expression data.h5ad --regulons grn.parquet --output auc.parquet
rustscenic topics    --expression atac.h5ad --output topics --n-topics 30
rustscenic cistarget --rankings motifs.feather --regulons grn.parquet --output enrichment.tsv
```

Repo layout

  • crates/ — Rust workspace: rustscenic-{grn, aucell, topics, preproc, py}
  • python/rustscenic/ — Python package, CLI entry point, type stubs
  • examples/pbmc3k_end_to_end.py — RNA GRN + AUCell script on real PBMC-3k
  • validation/ — reproducible benchmark scripts + measurement reports for every number above, plus VALIDATION_SUMMARY.md
  • tests/ — pytest suite (169 Python tests, 1 skipped) + Rust crate tests (57)
  • manuscript/ — preprint source
  • docs/topic-collapse.md — known algorithmic caveat

License

MIT. Algorithm implementations follow the aertslab Python references — original method credit to Aibar et al. 2017 (SCENIC), Bravo González-Blas et al. 2023 (SCENIC+), Hoffman-Blei-Bach 2010 (Online VB LDA).

Citation and attribution

If you use rustscenic in a paper, report, benchmark, derivative package, or lab workflow, cite the exact release used. GitHub citation metadata is in CITATION.cff.

rustscenic was created and is maintained by Ekin Kahraman. See AUTHORS.md and docs/collaboration-and-authorship.md for contribution and authorship expectations.

Contact

File issues at github.com/Ekin-Kahraman/rustscenic/issues. Bug, correctness, and validation-report templates pre-fill the fields we need. If you ran the pipeline on real data and want the result folded into the v0.4.x sweep, see docs/tester-reporting.md. If reporting ARI or related clustering metrics, include the comparator; see docs/evaluation-metrics.md. Coordinated vulnerability disclosure: see SECURITY.md.

Project details


Download files


Source Distribution

rustscenic-0.4.3.tar.gz (140.1 kB view details)

Uploaded: Source

Built Distributions


rustscenic-0.4.3-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (611.9 kB view details)

Uploaded: CPython 3.10+, manylinux: glibc 2.17+, x86-64

rustscenic-0.4.3-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (591.9 kB view details)

Uploaded: CPython 3.10+, manylinux: glibc 2.17+, ARM64

rustscenic-0.4.3-cp310-abi3-macosx_11_0_arm64.whl (553.7 kB view details)

Uploaded: CPython 3.10+, macOS 11.0+, ARM64

rustscenic-0.4.3-cp310-abi3-macosx_10_12_x86_64.whl (580.2 kB view details)

Uploaded: CPython 3.10+, macOS 10.12+, x86-64

File details

Details for the file rustscenic-0.4.3.tar.gz.

File metadata

  • Download URL: rustscenic-0.4.3.tar.gz
  • Upload date:
  • Size: 140.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for rustscenic-0.4.3.tar.gz
Algorithm Hash digest
SHA256 f352f2dd523ea94914ef409b0e716f66fbe64f59c3356fd9d9c25a9c3e182290
MD5 78769aa41ce75faf640016c4135a5709
BLAKE2b-256 fe28a8d43c22537b6f5d458b7de521a2c10350438a21804d15f6df5b31e3521e


Provenance

The following attestation bundles were made for rustscenic-0.4.3.tar.gz:

Publisher: release.yml on Ekin-Kahraman/rustscenic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rustscenic-0.4.3-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for rustscenic-0.4.3-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 987d8de3829dc892b0ff8c8b61144477784139887ecc0bf6c503182604f2b981
MD5 2e747e53cb8b710a09303b7368e34eb4
BLAKE2b-256 439f7fc005c56fa894baa85d434b75787ec1ab119ac7c47badf0f42ffc081e0c


Provenance

The following attestation bundles were made for rustscenic-0.4.3-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on Ekin-Kahraman/rustscenic


File details

Details for the file rustscenic-0.4.3-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for rustscenic-0.4.3-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 eb63ef3c7ef4d26a30edeb9b3207d4df15639b95f67bb2966001dd8c8d76292f
MD5 1cbad59966b126127aca4c37ba4ba00f
BLAKE2b-256 2a8e18e572902669c7bb47231f25f00579d8bc20769edbdfc03cea387d0eba3f


Provenance

The following attestation bundles were made for rustscenic-0.4.3-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on Ekin-Kahraman/rustscenic


File details

Details for the file rustscenic-0.4.3-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for rustscenic-0.4.3-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4fece83931bc45dea69f17f0bf9cf532b499330d7d97893bb3439ee707a8244e
MD5 48f9ae23fded131e9143e2d4c9f9dabf
BLAKE2b-256 3b481dd62f04a6a5e1d291457f472e96260da591a520c82dbfe22df362cb3fcf


Provenance

The following attestation bundles were made for rustscenic-0.4.3-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on Ekin-Kahraman/rustscenic


File details

Details for the file rustscenic-0.4.3-cp310-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for rustscenic-0.4.3-cp310-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 501815254c200d9d5dd0f00b17ea1fd5b2e5c590b7e20150293318431df67294
MD5 629bee1f1686e0196c3c28d65bf32219
BLAKE2b-256 64a2393703cdd413af90f402d0fbf811e9a68a65e4e79fbf31ecaa5d6b77eb55


Provenance

The following attestation bundles were made for rustscenic-0.4.3-cp310-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on Ekin-Kahraman/rustscenic

