
Opinionated, report-first CLI for single-cell and multi-omics analysis. Sane defaults baked in, defensible deliverables out.


scellrun


scellrun stops two analysts — or two LLM agents — from getting two different answers on the same single-cell data.

Why this exists

Single-cell analysis has a documented reproducibility problem. From the field's own retrospective (Perspectives on rigor and reproducibility in single cell genomics): "in my group's experience, it is not unusual for reanalysis to find 20% fewer or more clusters in datasets" — same raw data, different analyst, different answer. The same review notes that of ~50 high-impact single-cell papers surveyed, "just a handful" reported any external validation. Most of the choices that drive that 20% divergence — mt% ceiling, HVG count, integration method, clustering resolution, panel pick — are made ad-hoc in a notebook and never written down.

A modern LLM agent handed an .h5ad and scanpy will write working code and produce a report. That solves "can the work happen". It does not solve "will two agents on the same data produce the same answer", "can a reviewer reconstruct why mt% was 20 and not 10 six months later", or "is the panel choice consistent with this lab's working practice on this tissue". Vanilla agents improvise thresholds, do not record the rationale in any machine-readable form, and have no way to encode the consensus a clinical-bioinformatics team has built over years of dogfooding.

scellrun fills that gap:

  • Every threshold has a tested default with a one-sentence rationale.
  • Every choice the pipeline makes — auto, user-override, or LLM-recommended — is appended to a 00_decisions.jsonl file you can grep.
  • Tissue-specific working practice ships as profiles/ (cartilage today, contribute yours).
  • Each stage runs a self-check against PI-defined trigger thresholds and surfaces an actionable suggestion before the user sees the downstream finding.

This is a different layer from a workflow manager: if you need orchestration across a cluster, use nf-core; scellrun is what you call from inside one of those. It is not a replacement for scanpy either — it calls scanpy under the hood, with opinionated parameters and a decision log on top.

Who this is for

  • The LLM agent (Claude Code, Hermes, Codex) handling a clinician's request. This is the primary user. The agent SSHes to the machine holding the data, runs scellrun analyze, reads the artifacts, and translates them for the clinician. skills/scellrun/SKILL.md is the operational guide it reads.
  • The clinician-bioinformatics team that wants every project to look the same in a report. Same QC layout, same decision table, same provenance trail across samples, students, and rotations.
  • The reviewer asking "why mt% 20?" The answer is 00_decisions.jsonl line 14, verbatim: "mt% ceiling 20.0% — joint tissue is stress-prone, the textbook 10% silently drops real chondrocytes (PI cohort 2024-2026, AIO PM=20)".
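That grep really is the whole workflow for the reviewer. A self-contained sketch: the decision row below is copied from the sample log later in this README and written to a local file, since the real path (scellrun_out/run-<ts>/00_decisions.jsonl) depends on your run timestamp.

```shell
# Recreate one decision row locally (copied from the sample log in this README),
# then answer "why mt% 20?" the way a reviewer would: grep the log.
cat > 00_decisions.jsonl <<'EOF'
{"schema_version":1,"stage":"qc","key":"max_pct_mt","value":20.0,"default":20.0,"source":"auto","rationale":"mt% ceiling 20.0% — joint tissue is stress-prone, the textbook 10% silently drops real chondrocytes (PI cohort 2024-2026, AIO PM=20)","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:20:32+00:00"}
EOF
grep '"key":"max_pct_mt"' 00_decisions.jsonl
```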

Quick start

conda create -n scellrun python=3.11 -y
conda activate scellrun
pip install scellrun

scellrun analyze data.h5ad --tissue "OA cartilage"
# → ./scellrun_out/run-<ts>/05_report/index.html

Don't have an .h5ad? Cell Ranger output works directly:

scellrun scrna convert path/to/cellranger_outs -o data.h5ad
scellrun analyze data.h5ad --tissue "OA cartilage"

Add --lang zh for a Chinese report. Add --profile joint-disease if the tissue is cartilage / synovium / subchondral bone (auto-loads the Fan 2024 chondrocyte panel and tightens hb% for avascular cartilage). Walkthrough in docs/quickstart.md; contribution notes in docs/contributing.md.

How an agent uses this

Drop skills/scellrun/SKILL.md into your agent's skills directory and the agent will know which command maps to which user intent, how to read the decision log, when to surface a self-check trigger before answering, and which profile to pick by tissue keyword. docs/agent-demo.md is a verbatim transcript of a Claude Code agent running scellrun end-to-end on real OA cartilage scRNA data — including the agent quoting the decision log when the user asks "why resolution 0.3?" and switching panels when the deterministic call is wrong.

What's in the decision log

00_decisions.jsonl is the single source of truth for every non-trivial choice the pipeline made. One JSON object per line. Sample shape (real, from docs/v1demo/decisions.jsonl):

{"schema_version":1,"stage":"qc","key":"max_pct_mt","value":20.0,"default":20.0,"source":"auto","rationale":"mt% ceiling 20.0% — joint tissue is stress-prone, the textbook 10% silently drops real chondrocytes (PI cohort 2024-2026, AIO PM=20)","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:20:32+00:00"}
{"schema_version":1,"stage":"analyze","key":"method_downgrade","value":"none","default":"harmony","source":"auto","rationale":"no sample/batch column in obs — single-sample input; auto-downgraded --method from harmony to none","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:18:51+00:00"}
{"schema_version":1,"stage":"analyze","key":"chosen_resolution_for_annotate","value":0.3,"default":null,"source":"auto","rationale":"fewest singletons → most balanced (every resolution fragmented) — picked res=0.3: n_clusters=13, largest=31.5%, smallest=0.2%, singletons=2","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:20:52+00:00"}
{"schema_version":1,"stage":"analyze","key":"annotate.auto_panel","value":"celltype_broad","default":null,"source":"auto","rationale":"swapped to celltype_broad: chondrocyte_hits=2, broad_hits=9; required >=1.5x margin to keep chondrocyte panel.","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:24:19+00:00"}
{"schema_version":1,"stage":"annotate","key":"panel","value":"celltype_broad","default":null,"source":"auto","rationale":"orchestrator-injected panel 'celltype_broad' (auto-pick or self-check fix)","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:24:49+00:00"}

Every choice the pipeline made, with a one-sentence rationale, in a file you can grep. source is one of auto (a built-in heuristic), user (a CLI override), or ai (an LLM call). fix_payload is non-null only on self-check *.suggest rows — it carries the structured fix the orchestrator can mechanically apply when --auto-fix is on. attempt_id groups rows by invocation. The full schema is in skills/scellrun/SKILL.md.
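Because the log is one JSON object per line, it is trivial to load programmatically as well as to grep. A minimal sketch: the field names follow the schema shown above, but the sample row is abbreviated here (rationale shortened, attempt_id replaced) for readability — real rows carry the full sentence.

```python
import json

# Parse the decision log the way an agent or reviewer might.
# Field names follow the documented schema; this sample row is abbreviated.
sample = (
    '{"schema_version":1,"stage":"qc","key":"max_pct_mt","value":20.0,'
    '"default":20.0,"source":"auto","rationale":"mt% ceiling 20.0%",'
    '"fix_payload":null,"attempt_id":"demo","ts":"2026-04-30T15:20:32+00:00"}'
)

def load_decisions(lines):
    """One JSON object per line -> list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

rows = load_decisions(sample.splitlines())
# Rows where source != "auto" are the user/LLM overrides worth reviewing first.
overrides = [r for r in rows if r["source"] != "auto"]
print(rows[0]["key"], "=", rows[0]["value"], f'({rows[0]["source"]})')
# → max_pct_mt = 20.0 (auto)
```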

Profiles

A profile is community-encoded working practice for a tissue domain — defaults plus marker panels — in one Python file. v1.0 ships two:

  • default — fresh-tissue 10x v3 chemistry, joint-tissue-aware mt% ceiling at 20% (the textbook 10% silently drops real chondrocytes; the OARSI working group ceiling is 20%).
  • joint-disease — same QC plus tighter hb% (cartilage is avascular), the Fan 2024 chondrocyte 11-subtype panel, and a 15-group celltype_broad panel. Auto-swaps from chondrocyte to broad when the data is immune-rich (subchondral bone, infiltrated synovium, joint fluid) so the report doesn't blindly mis-label pericytes / plasmacytoid DCs / osteoclasts as chondrocyte subtypes.

scellrun profiles list
scellrun profiles show joint-disease   # prints thresholds + panels

If your tissue or disease has working practice that diverges from the defaults, contribute a profile — one Python file under src/scellrun/profiles/.
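The profile file format itself isn't shown in this README, so the following is purely illustrative: a sketch of the two things the text says a profile encodes, QC thresholds plus marker panels. The names (QC_DEFAULTS, PANELS), the hb% value, and the marker genes are assumptions, not scellrun's actual API; only the 20% mt% ceiling and the panel name come from the text above.

```python
# Hypothetical sketch of what a tissue profile encodes. QC_DEFAULTS and
# PANELS are illustrative names, NOT scellrun's actual profile API.
QC_DEFAULTS = {
    "max_pct_mt": 20.0,  # joint tissue is stress-prone; the textbook 10% drops real chondrocytes
    "max_pct_hb": 1.0,   # cartilage is avascular, so hb% is tightened (value illustrative)
}

PANELS = {
    # A panel maps cell-type labels to marker genes (canonical examples shown).
    "celltype_broad": {
        "T_cell": ["CD3D", "CD3E"],
        "B_cell": ["CD79A", "MS4A1"],
    },
}
```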

Roadmap

v0.1 → v1.0.1 has shipped: per-stage QC / integrate / markers / annotate, the analyze one-shot, the decision log, self-check + --auto-fix, the joint-disease profile, panel auto-pick, single-sample auto-downgrade, agent demo, Dockerfile, and the v1.0.1 SKILL.md sync. The CLI surface is frozen for the v1.x series; new stages and profiles land additively. Post-v1.0 directions tracked in ROADMAP.md: conda-forge feedstock, registry-pushed Docker image, bulk RNA-seq subcommand, metabolomics composite scoring, proteomics integration.

License

MIT — see LICENSE.

Acknowledgements

Defaults trace to the in-house R AIO pipeline (Liu lab) and clinician-bioinformatics working practice for OARSI / MSK research. Built with assistance from Claude (Anthropic).
