AnnData-native deep learning baselines for single-cell data.

scDLKit


Train, evaluate, compare, and visualize baseline deep-learning models for single-cell data without writing PyTorch from scratch.

Choose an entrypoint based on your goal:

  • stable embeddings and baseline models: TaskRunner
  • experimental labeled annotation adaptation: adapt_annotation(...)
  • lower-level control and custom models: Trainer plus adapters

Quick Start

Start here if you want the shortest path from AnnData to a learned embedding and, for reconstruction-capable models, predicted or reconstructed gene-expression values:

  1. load an AnnData
  2. fit a baseline model
  3. get the learned embedding
  4. optionally get predicted or reconstructed gene-expression values
  5. continue in Scanpy

import scanpy as sc
from scdlkit import TaskRunner

adata = sc.datasets.pbmc3k_processed()

runner = TaskRunner(
    model="vae",
    task="representation",
    label_key="louvain",
    device="auto",
    epochs=20,
    batch_size=128,
    model_kwargs={"kl_weight": 1e-3},
)

runner.fit(adata)

# Cell embedding for downstream Scanpy analysis.
adata.obsm["X_scdlkit_vae"] = runner.encode(adata)

# Predicted / reconstructed gene expression for reconstruction-capable models.
predicted_expression = runner.reconstruct(adata)

Then keep the normal Scanpy path:

sc.pp.neighbors(adata, use_rep="X_scdlkit_vae")
sc.tl.umap(adata)
sc.pl.umap(adata, color="louvain")

Notes:

  • runner.encode(...) returns the latent embedding.
  • runner.reconstruct(...) returns reconstructed gene-expression values for reconstruction-capable models such as autoencoder, vae, denoising_autoencoder, and transformer_ae.
  • runner.predict(...) remains backward compatible, but reconstruct(...) is the clearer public path for reconstructed expression.
  • Classification models return class predictions instead of reconstructed expression.
  • Frozen scGPT in the experimental foundation path exposes embeddings only, not reconstructed expression.
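The shape contract behind encode(...) and reconstruct(...) can be illustrated without scDLKit at all; the toy linear encoder/decoder below is purely for intuition (none of these arrays or weights come from the package):

```python
import numpy as np

# Toy illustration (not scDLKit code): the shape contract that
# encode(...) and reconstruct(...) follow for reconstruction-capable models.
rng = np.random.default_rng(0)
n_cells, n_genes, latent_dim = 100, 50, 8

X = rng.normal(size=(n_cells, n_genes))         # expression matrix (cells x genes)
W_enc = rng.normal(size=(n_genes, latent_dim))  # stand-in encoder weights
W_dec = rng.normal(size=(latent_dim, n_genes))  # stand-in decoder weights

embedding = X @ W_enc               # analogue of runner.encode(adata)
reconstruction = embedding @ W_dec  # analogue of runner.reconstruct(adata)

print(embedding.shape)       # (100, 8)  -> cells x latent, goes into adata.obsm
print(reconstruction.shape)  # (100, 50) -> same shape as adata.X
```

Whatever the model, the embedding is cells-by-latent-dimensions and the reconstruction matches the input matrix shape, which is why the embedding slots directly into adata.obsm.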

What you get from this quickstart:

  • a learned embedding in adata.obsm
  • reconstructed gene-expression values when the model supports them
  • training metrics and saved reports
  • a direct continuation path into Scanpy

Fine-Tuning Quickstart

If your main goal is cell-type annotation on a labeled human AnnData, the experimental scGPT wrapper path is now also a first-class quickstart:

from scdlkit import adapt_annotation

runner = adapt_annotation(
    adata,
    label_key="cell_type",
    output_dir="artifacts/scgpt_annotation",
)

runner.annotate_adata(adata, obs_key="scgpt_label", embedding_key="X_scgpt_best")
runner.save("artifacts/scgpt_annotation/best_model")

This path is designed for researchers who want:

  • a low-code fine-tuning workflow
  • frozen versus tuned strategy comparison
  • predictions written back into adata.obs
  • latent embeddings written back into adata.obsm
  • a saved runner they can reload later

This fine-tuning path is still experimental and intentionally narrow:

  • human scRNA-seq only
  • official scGPT whole-human checkpoint only
  • annotation only
  • default quickstart comparison is frozen_probe plus head
  • LoRA remains opt-in through strategies=("frozen_probe", "head", "lora")
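As a rough mental model of the strategy comparison, the sweep fits each strategy and keeps the one with the best validation score; the sketch below is hypothetical (the score values and dictionary layout are invented, not the runner's real report format):

```python
# Hypothetical sketch (not the scDLKit API): how a strategy sweep like
# adapt_annotation's frozen_probe / head / lora comparison picks a winner.
results = {
    "frozen_probe": {"val_accuracy": 0.87},  # linear probe on frozen embeddings
    "head": {"val_accuracy": 0.91},          # tuned classification head
    "lora": {"val_accuracy": 0.93},          # low-rank adapters in the backbone
}

# Keep the strategy with the highest validation accuracy.
best_strategy = max(results, key=lambda name: results[name]["val_accuracy"])
print(best_strategy)  # lora
```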

Start Here

  • Documentation site: https://uddamvathanak.github.io/scDLKit/
  • Primary notebook tutorial: examples/train_vae_pbmc.ipynb
  • Install path for tutorials: python -m pip install "scdlkit[tutorials]"
  • Experimental foundation path: python -m pip install "scdlkit[foundation,tutorials]"
  • CPU and GPU use the same notebook path through device="auto"
  • Core learning path: quickstart -> downstream Scanpy -> comparison -> reconstruction sanity check
  • Researcher shortcut for labeled data: quickstart -> experimental scGPT cell-type annotation -> experimental dataset-specific annotation
  • Secondary notebooks: examples/compare_models_pbmc.ipynb, examples/classification_demo.ipynb
  • Downstream Scanpy notebook: examples/downstream_scanpy_after_scdlkit.ipynb
  • Reconstruction notebook: examples/reconstruction_sanity_pbmc.ipynb
  • Custom model notebook: examples/custom_model_extension.ipynb
  • Experimental foundation notebook: examples/scgpt_pbmc_embeddings.ipynb
  • Experimental annotation fine-tuning notebook: examples/scgpt_cell_type_annotation.ipynb
  • Experimental dataset-specific wrapper notebook: examples/scgpt_dataset_specific_annotation.ipynb
  • API routing page: docs/api/index.md
  • Synthetic smoke examples: examples/first_run_synthetic.ipynb, examples/first_run_synthetic.py

Why scDLKit

  • AnnData-native workflow for single-cell users.
  • Baseline-first model zoo: AE, VAE, DAE, Transformer AE, and MLP classification.
  • Built-in training, evaluation, comparison, and plotting.
  • Reproducible reports and notebooks for portfolio-ready demonstrations.
  • Built-in benchmark gates on small Scanpy datasets before tutorial defaults change.
  • Gene-expression-focused scope while the core toolkit stabilizes.
  • Experimental frozen scGPT embeddings for human PBMC workflows.
  • Experimental scGPT annotation fine-tuning with head-only and LoRA strategies.

Supported platforms

  • Linux: supported
  • macOS: supported
  • Windows: supported

Installation

Primary tutorial install path:

python -m pip install "scdlkit[tutorials]"

Windows note: if you install into a deeply nested virtual environment path, Jupyter dependencies can hit Windows path-length limits. Use a short environment path such as C:\venvs\scdlkit, or enable Windows Long Paths if needed.

Optional extras (plain python -m pip install scdlkit installs the core package without extras):

python -m pip install "scdlkit[scanpy]"
python -m pip install "scdlkit[notebook]"
python -m pip install "scdlkit[foundation]"
python -m pip install "scdlkit[dev,docs]"

For GPU users, install the matching PyTorch build first using the official PyTorch install selector, then install scdlkit[tutorials]. The same notebook examples run on CPU or GPU with device="auto".

Scanpy Quickstart

This is the primary tutorial example. The notebook runs a quickstart profile by default and exposes a full profile in its first config cell:

  • quickstart: CPU-friendly, docs-friendly, reproducible
  • full: longer run for stronger qualitative separation

For the PBMC quickstart, use a light VAE KL term so the latent UMAP preserves broad cell-type structure instead of collapsing into a uniform blob. A healthy result should show broad cell-type groups as visibly separated regions rather than a single mixed cloud.
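For context, the kl_weight passed in the quickstart plays the role of the weight β on the KL term of the standard VAE objective (this is the generic β-weighted VAE loss, not scDLKit-specific notation):

```latex
\mathcal{L}(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\bigl[-\log p_\theta(x \mid z)\bigr]}_{\text{reconstruction}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\bigl(q_\phi(z \mid x)\,\|\,p(z)\bigr)}_{\text{regularization}}
```

With a small β such as 1e-3, the reconstruction term dominates the objective, which is why the latent space keeps broad cell-type structure instead of collapsing toward the prior.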

Notebook-First Examples

Most researchers should start with the Scanpy PBMC quickstart:

python -m pip install "scdlkit[tutorials]"
jupyter notebook examples/train_vae_pbmc.ipynb

This notebook:

  • loads PBMC data through Scanpy
  • trains a VAE baseline with scDLKit
  • writes the latent representation into adata.obsm
  • continues with Scanpy neighbors and UMAP
  • points to the downstream Scanpy and reconstruction tutorials for the next interpretation steps
  • explains the quickstart versus full tutorial profiles
  • works on CPU or GPU through device="auto"

Additional Scanpy-first notebooks:

  • examples/downstream_scanpy_after_scdlkit.ipynb: take the scDLKit embedding through Leiden clustering, marker ranking, dotplots, and coarse annotation
  • examples/compare_models_pbmc.ipynb: compare PCA, autoencoder, vae, and transformer_ae
  • examples/reconstruction_sanity_pbmc.ipynb: inspect reconstructed gene-expression outputs with a dedicated reconstruction baseline
  • examples/classification_demo.ipynb: run the mlp_classifier baseline and inspect a confusion matrix
  • examples/custom_model_extension.ipynb: wrap a raw PyTorch autoencoder and train it through Trainer
  • examples/scgpt_pbmc_embeddings.ipynb: run the experimental frozen whole-human scGPT embedding workflow and return to Scanpy through adata.obsm
  • examples/scgpt_cell_type_annotation.ipynb: compare PCA + logistic regression, frozen scGPT, head-only tuning, and LoRA tuning for labeled PBMC annotation
  • examples/scgpt_dataset_specific_annotation.ipynb: use the new wrapper-first adapt_annotation(...) flow on a second labeled PBMC dataset and save the best fitted runner
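The core check in the reconstruction sanity notebook can be sketched independently of scDLKit: compare observed and reconstructed expression gene by gene. The helper below is illustrative (per_gene_correlation is not a package function):

```python
import numpy as np

def per_gene_correlation(X, X_hat):
    """Pearson correlation between each gene's observed and reconstructed values."""
    Xc = X - X.mean(axis=0)
    Hc = X_hat - X_hat.mean(axis=0)
    num = (Xc * Hc).sum(axis=0)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (Hc ** 2).sum(axis=0))
    return num / np.where(den == 0, 1.0, den)  # guard against zero-variance genes

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                 # toy "observed" expression
X_hat = X + 0.1 * rng.normal(size=X.shape)     # toy near-perfect "reconstruction"

corr = per_gene_correlation(X, X_hat)
print(corr.shape)  # (30,) -- one correlation per gene
```

A healthy reconstruction shows most per-gene correlations close to 1; genes with low correlation are the ones the model fails to capture.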

The synthetic notebook and script are still available, but they are now the smoke-test path rather than the primary researcher onboarding flow:

python -m pip install "scdlkit[notebook]"
jupyter notebook examples/first_run_synthetic.ipynb

python examples/first_run_synthetic.py

These write small reproducible artifacts to artifacts/first_run_notebook/ and artifacts/first_run/.

Optional contributor Conda environment

Conda is kept for contributors and demos. It is not the primary public install path.

From the repo root:

conda env create -f environment.yml
conda activate scdlkit

Core APIs

High-level:

from scdlkit import TaskRunner

Lower-level:

from scdlkit import Trainer, create_model, prepare_data

Custom-model adapters:

from scdlkit.adapters import wrap_classification_module, wrap_reconstruction_module

Custom wrapped models are supported through Trainer first. TaskRunner remains the built-in high-level path for bundled scDLKit models.
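The adapter idea itself is simple enough to sketch in plain Python; the class names below are hypothetical and only illustrate the pattern of giving an arbitrary model the uniform interface a trainer expects (the real wrap_* helpers target PyTorch modules):

```python
# Illustrative adapter pattern (names are hypothetical, not scDLKit's API):
# wrap any object exposing forward(x) so a generic trainer can call one
# uniform method without knowing the model's internals.
class ReconstructionAdapter:
    def __init__(self, module):
        self.module = module

    def training_step(self, batch):
        output = self.module.forward(batch)
        # Mean squared error between input and reconstruction.
        return sum((o - b) ** 2 for o, b in zip(output, batch)) / len(batch)

class IdentityModel:
    """Trivial stand-in 'model' that reconstructs its input exactly."""
    def forward(self, x):
        return list(x)

adapter = ReconstructionAdapter(IdentityModel())
loss = adapter.training_step([1.0, 2.0, 3.0])
print(loss)  # 0.0 -- perfect reconstruction
```

The trainer only ever sees the adapter's interface, so any model that can be wrapped this way plugs into the same training loop.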

Experimental foundation helpers:

from scdlkit.foundation import load_scgpt_model, prepare_scgpt_data

Experimental scGPT annotation tuning:

from scdlkit.foundation import (
    load_scgpt_annotation_model,
    prepare_scgpt_data,
    split_scgpt_data,
)

Experimental wrapper-first adaptation:

from scdlkit import adapt_annotation

runner = adapt_annotation(
    adata,
    label_key="cell_type",
    output_dir="artifacts/scgpt_annotation",
)
runner.annotate_adata(adata)
runner.save("artifacts/scgpt_annotation/best_model")

Comparison:

from scdlkit import compare_models

benchmark = compare_models(
    adata,
    models=["autoencoder", "vae", "transformer_ae"],
    task="representation",
    shared_kwargs={"epochs": 10, "label_key": "cell_type"},
    output_dir="artifacts/compare",
)

Supported models

  • autoencoder
  • vae
  • denoising_autoencoder
  • transformer_ae
  • mlp_classifier

Supported tasks

  • representation
  • reconstruction
  • classification

Current scope

  • Gene-expression baselines for AnnData workflows
  • Scanpy-first tutorial and downstream embedding usage
  • Built-in deep-learning baselines plus classical comparison context in notebooks
  • Adapter-based custom PyTorch model support through Trainer
  • Experimental scGPT frozen embedding support for human PBMC workflows
  • Experimental scGPT annotation fine-tuning for labeled human PBMC workflows through Trainer
  • Experimental wrapper-first scGPT dataset adaptation for users who want a simpler compare-predict-save loop

Broader foundation-model support, full-backbone fine-tuning, spatial omics, and multimodal workflows remain future work until the gene-expression toolkit's quality gates are consistently stable.

Documentation

Project documentation is published as a Sphinx-based scientific docs site at https://uddamvathanak.github.io/scDLKit/.

GitHub Pages setup

The docs workflow expects GitHub Pages to be enabled once at the repository level.

  1. Open Settings -> Pages for this repo: https://github.com/uddamvathanak/scDLKit/settings/pages
  2. Under Build and deployment, set Source to GitHub Actions.
  3. Save the setting.
  4. Re-run the docs workflow.

Without that one-time setting, GitHub returns a 404 when actions/configure-pages or actions/deploy-pages tries to access the Pages site.
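For orientation, the deploy side of such a docs workflow typically has the shape sketched below; this is a generic GitHub Actions Pages deployment, not a copy of this repo's workflow files, and the artifact path is an assumption:

```yaml
# Sketch of a Pages deploy job; requires Settings -> Pages -> Source = GitHub Actions.
permissions:
  pages: write      # allow the Pages deployment
  id-token: write   # allow OIDC verification by deploy-pages

jobs:
  deploy-docs:
    runs-on: ubuntu-latest
    environment:
      name: github-pages
    steps:
      - uses: actions/upload-pages-artifact@v3
        with:
          path: docs/_build/html  # assumed Sphinx output directory
      - uses: actions/deploy-pages@v4
```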

Optional automatic Pages enablement

If you want the workflow to bootstrap Pages automatically instead of doing the one-time manual setup:

  1. Create a repository secret named PAGES_ENABLEMENT_TOKEN.
  2. Use a Personal Access Token with repo scope or Pages write permission.
  3. Re-run the docs workflow.

Release flow

  • Stage to TestPyPI first with release-testpypi.yml.
  • Publish the final release from a v* tag with release.yml.
  • Use trusted publishing instead of long-lived PyPI API tokens.
  • See RELEASING.md for the full checklist.

Examples

  • examples/train_vae_pbmc.ipynb is the primary Scanpy-first notebook tutorial.
  • examples/compare_models_pbmc.ipynb compares autoencoder, vae, and transformer_ae on PBMC data.
  • examples/classification_demo.ipynb covers the mlp_classifier workflow and confusion-matrix reporting.
  • examples/first_run_synthetic.ipynb is the secondary smoke-test notebook with minimal setup.
  • examples/first_run_synthetic.py is the secondary smoke-test script.

Roadmap

Immediate roadmap targets:

  • keep the built-in TaskRunner story stable for bundled baselines
  • preserve adapter-first custom-model support through Trainer
  • keep the experimental scGPT frozen-embedding and annotation-tuning paths narrow and inspectable
  • expand experimental adaptation workflows cautiously without broadening the toolkit too early

Released so far:

v0.1

  • Expanded core workflow with training, evaluation, reporting, and plotting.
  • Staged TestPyPI and PyPI publishing.
  • Cross-platform smoke validation and reproducible notebooks.

Later:

  • broader foundation-model fine-tuning beyond annotation once the experimental scGPT path is stable
  • spatial baselines only after the gene-expression toolkit is stable

Citation

If you use scDLKit, cite the software entry in CITATION.cff.
