
Opinionated, report-first CLI for single-cell and multi-omics analysis. Sane defaults baked in, defensible deliverables out.


scellrun

CI License: MIT PyPI

scellrun stops two analysts — or two LLM agents — from getting two different answers on the same single-cell data.

Why this exists

Single-cell analysis has a documented reproducibility problem. From the field's own retrospective (Perspectives on rigor and reproducibility in single cell genomics): "in my group's experience, it is not unusual for reanalysis to find 20% fewer or more clusters in datasets" — same raw data, different analyst, different answer. The same review notes that of ~50 high-impact single-cell papers surveyed, "just a handful" reported any external validation. Most of the choices that drive that 20% divergence — mt% ceiling, HVG count, integration method, clustering resolution, panel pick — are made ad-hoc in a notebook and never written down.

A modern LLM agent handed an .h5ad and scanpy will write working code and produce a report. That solves "can the work happen". It does not solve "will two agents on the same data produce the same answer", "can a reviewer reconstruct why mt% was 20 and not 10 six months later", or "is the panel choice consistent with this lab's working practice on this tissue". Vanilla agents improvise thresholds, do not record the rationale in any machine-readable form, and have no way to encode the consensus a clinical-bioinformatics team has built over years of dogfooding.

scellrun fills that gap. Every threshold has a tested default with a one-sentence rationale; every choice the pipeline makes (auto, user override, or LLM-recommended) is appended to a 00_decisions.jsonl file you can grep; tissue-specific working practice ships as profiles/ (cartilage today, contribute yours); and each stage runs a self-check against PI-defined trigger thresholds, surfacing an actionable suggestion before the user sees the downstream finding. scellrun sits at a different layer from a workflow manager: if you need orchestration across a cluster, use nf-core; scellrun is what you call from inside one of those pipelines. Nor is it a replacement for scanpy: it calls scanpy under the hood, with opinionated parameters and a decision log on top.

Who this is for

  • The LLM agent (Claude Code, Hermes, Codex) handling a clinician's request. This is the primary user. The agent ssh's into the machine holding the data, runs scellrun analyze, reads the artifacts, and translates them for the clinician. skills/scellrun/SKILL.md is the operational guide it reads.
  • The clinician-bioinformatics team that wants every project to look the same in a report. Same QC layout, same decision table, same provenance trail across samples, students, and rotations.
  • The reviewer asking "why mt% 20?" The answer is 00_decisions.jsonl line 14, verbatim: "mt% ceiling 20.0% — joint tissue is stress-prone, the textbook 10% silently drops real chondrocytes (PI cohort 2024-2026, AIO PM=20)".

Quick start

conda create -n scellrun python=3.11 -y
conda activate scellrun
pip install scellrun

scellrun analyze data.h5ad --tissue "OA cartilage"
# → ./scellrun_out/run-<ts>/05_report/index.html

Don't have an .h5ad? Cell Ranger output works directly:

scellrun scrna convert path/to/cellranger_outs -o data.h5ad
scellrun analyze data.h5ad --tissue "OA cartilage"

Add --lang zh for a Chinese report. Add --profile joint-disease if the tissue is cartilage / synovium / subchondral bone (auto-loads the Fan 2024 chondrocyte panel and tightens hb% for avascular cartilage). Walkthrough in docs/quickstart.md; contribution notes in docs/contributing.md.

How an agent uses this

Drop skills/scellrun/SKILL.md into your agent's skills directory and the agent will know which command maps to which user intent, how to read the decision log, when to surface a self-check trigger before answering, and which profile to pick by tissue keyword. docs/agent-demo.md is a verbatim transcript of a Claude Code agent running scellrun end-to-end on real OA cartilage scRNA data — including the agent quoting the decision log when the user asks "why resolution 0.3?" and switching panels when the deterministic call is wrong.

What's in the decision log

00_decisions.jsonl is the single source of truth for every non-trivial choice the pipeline made. One JSON object per line. Sample shape (real, from docs/v1demo/decisions.jsonl):

{"schema_version":1,"stage":"qc","key":"max_pct_mt","value":20.0,"default":20.0,"source":"auto","rationale":"mt% ceiling 20.0% — joint tissue is stress-prone, the textbook 10% silently drops real chondrocytes (PI cohort 2024-2026, AIO PM=20)","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:20:32+00:00"}
{"schema_version":1,"stage":"analyze","key":"method_downgrade","value":"none","default":"harmony","source":"auto","rationale":"no sample/batch column in obs — single-sample input; auto-downgraded --method from harmony to none","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:18:51+00:00"}
{"schema_version":1,"stage":"analyze","key":"chosen_resolution_for_annotate","value":0.3,"default":null,"source":"auto","rationale":"fewest singletons → most balanced (every resolution fragmented) — picked res=0.3: n_clusters=13, largest=31.5%, smallest=0.2%, singletons=2","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:20:52+00:00"}
{"schema_version":1,"stage":"analyze","key":"annotate.auto_panel","value":"celltype_broad","default":null,"source":"auto","rationale":"swapped to celltype_broad: chondrocyte_hits=2, broad_hits=9; required >=1.5x margin to keep chondrocyte panel.","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:24:19+00:00"}
{"schema_version":1,"stage":"annotate","key":"panel","value":"celltype_broad","default":null,"source":"auto","rationale":"orchestrator-injected panel 'celltype_broad' (auto-pick or self-check fix)","fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:24:49+00:00"}

Every choice the pipeline made, with a one-sentence rationale, in a file you can grep. source is one of auto (a built-in heuristic), user (a CLI override), or ai (an LLM call). fix_payload is non-null only on self-check *.suggest rows — it carries the structured fix the orchestrator can mechanically apply when --auto-fix is on. attempt_id groups rows by invocation. The full schema is in skills/scellrun/SKILL.md.
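"A file you can grep" also means a file you can parse. A minimal sketch of reading the log programmatically, assuming only the schema shown above (the two rows are copied verbatim from the sample; the helper names are ours, not part of scellrun):

```python
import json

# Two rows copied verbatim from the sample above; a real run would read
# ./scellrun_out/run-<ts>/00_decisions.jsonl line by line instead.
SAMPLE_LOG = (
    '{"schema_version":1,"stage":"qc","key":"max_pct_mt","value":20.0,'
    '"default":20.0,"source":"auto","rationale":"mt% ceiling 20.0% — joint '
    'tissue is stress-prone, the textbook 10% silently drops real chondrocytes '
    '(PI cohort 2024-2026, AIO PM=20)","fix_payload":null,'
    '"attempt_id":"cae89793d0f9470a9c7f38894928f304","ts":"2026-04-30T15:20:32+00:00"}\n'
    '{"schema_version":1,"stage":"analyze","key":"method_downgrade","value":"none",'
    '"default":"harmony","source":"auto","rationale":"no sample/batch column in obs '
    '— single-sample input; auto-downgraded --method from harmony to none",'
    '"fix_payload":null,"attempt_id":"cae89793d0f9470a9c7f38894928f304",'
    '"ts":"2026-04-30T15:18:51+00:00"}\n'
)

def load_decisions(text: str) -> list[dict]:
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def by_source(rows: list[dict], source: str) -> list[dict]:
    """Filter rows by provenance: 'auto', 'user', or 'ai'."""
    return [r for r in rows if r["source"] == source]

rows = load_decisions(SAMPLE_LOG)
print([r["key"] for r in by_source(rows, "auto")])
# → ['max_pct_mt', 'method_downgrade']
```

The same two-liner works for auditing a finished run: load the file, filter by stage, key, or source, and read the rationale strings directly.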

Profiles

A profile is community-encoded working practice for a tissue domain — defaults plus marker panels — in one Python file. v1.0 ships two:

  • default — fresh-tissue 10x v3 chemistry, joint-tissue-aware mt% ceiling at 20% (the textbook 10% silently drops real chondrocytes; the OARSI working group ceiling is 20%).
  • joint-disease — same QC plus tighter hb% (cartilage is avascular), the Fan 2024 chondrocyte 11-subtype panel, and a 15-group celltype_broad panel. Auto-swaps from chondrocyte to broad when the data is immune-rich (subchondral bone, infiltrated synovium, joint fluid) so the report doesn't blindly mis-label pericytes / plasmacytoid DCs / osteoclasts as chondrocyte subtypes.

scellrun profiles list
scellrun profiles show joint-disease   # prints thresholds + panels
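The auto-swap rule can be sketched as follows. This is one plausible reading of the logged rule, not scellrun's implementation: the >=1.5x margin and the hit counts come from the sample decision-log row earlier, and the function name is illustrative.

```python
def pick_panel(chondrocyte_hits: int, broad_hits: int, margin: float = 1.5) -> str:
    """Keep the chondrocyte panel only when its marker hits beat the broad
    panel by at least the required margin; otherwise fall back to broad."""
    if chondrocyte_hits >= margin * broad_hits:
        return "chondrocyte"
    return "celltype_broad"

pick_panel(chondrocyte_hits=2, broad_hits=9)
# → 'celltype_broad', matching the auto_panel log row above
```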

If your tissue or disease has working practice that diverges from the defaults, contribute a profile — one Python file under src/scellrun/profiles/.
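To make "defaults plus marker panels in one Python file" concrete, here is a hypothetical sketch of the kind of content a profile carries. The authoritative format is whatever the shipped files under src/scellrun/profiles/ define; every name and value below is illustrative.

```python
# Hypothetical sketch only: QC defaults plus named marker panels.
# Not the real scellrun profile format; consult the shipped files
# under src/scellrun/profiles/ for that.
MY_TISSUE_PROFILE = {
    "name": "my-tissue",
    "qc": {
        "max_pct_mt": 20.0,  # joint-tissue-aware ceiling, per the Profiles section
        "max_pct_hb": 1.0,   # illustrative value only
    },
    "panels": {
        "celltype_broad": {
            "T cell": ["CD3D", "CD3E"],
            "B cell": ["CD79A", "MS4A1"],
        },
    },
}
```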

Roadmap

v0.1 → v1.0.1 has shipped: per-stage QC / integrate / markers / annotate, the analyze one-shot, the decision log, self-check + --auto-fix, the joint-disease profile, panel auto-pick, single-sample auto-downgrade, agent demo, Dockerfile, and the v1.0.1 SKILL.md sync. The CLI surface is frozen for the v1.x series; new stages and profiles land additively. Post-v1.0 directions tracked in ROADMAP.md: conda-forge feedstock, registry-pushed Docker image, bulk RNA-seq subcommand, metabolomics composite scoring, proteomics integration.

License

MIT — see LICENSE.

Acknowledgements

Defaults trace to the in-house R AIO pipeline (Liu lab) and clinician-bioinformatics working practice for OARSI / MSK research. Built with assistance from Claude (Anthropic).
