scDLKit
AnnData-native deep-learning baselines for single-cell data.
Train, evaluate, compare, and visualize baseline deep-learning models for single-cell data without writing PyTorch from scratch.
Quick Start
Start here if you want the shortest path from AnnData to a learned embedding and, for reconstruction-capable models, predicted or reconstructed gene-expression values:
- load an AnnData
- fit a baseline model
- get the learned embedding
- optionally get predicted or reconstructed gene-expression values
- continue in Scanpy
```python
import scanpy as sc
from scdlkit import TaskRunner

adata = sc.datasets.pbmc3k_processed()

runner = TaskRunner(
    model="vae",
    task="representation",
    label_key="louvain",
    device="auto",
    epochs=20,
    batch_size=128,
    model_kwargs={"kl_weight": 1e-3},
)
runner.fit(adata)

# Cell embedding for downstream Scanpy analysis.
adata.obsm["X_scdlkit_vae"] = runner.encode(adata)

# Predicted / reconstructed gene expression for reconstruction-capable models.
predicted_expression = runner.reconstruct(adata)
```
Then keep the normal Scanpy path:
```python
sc.pp.neighbors(adata, use_rep="X_scdlkit_vae")
sc.tl.umap(adata)
sc.pl.umap(adata, color="louvain")
```
Notes:
- runner.encode(...) returns the latent embedding.
- runner.reconstruct(...) returns reconstructed gene-expression values for reconstruction-capable models such as autoencoder, vae, denoising_autoencoder, and transformer_ae.
- runner.predict(...) remains backward compatible, but reconstruct(...) is the clearer public path for reconstructed expression.
- Classification models return class predictions instead of reconstructed expression (see the sketch after this list).
- Frozen scGPT in the experimental foundation path exposes embeddings only, not reconstructed expression.
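For the classification case, a minimal sketch, assuming the same TaskRunner keywords as the quickstart (the exact return type of predict(...) for classifiers is not pinned down here):

```python
from scdlkit import TaskRunner

# Hypothetical classification run: "mlp_classifier" and task="classification"
# are the bundled names from the Supported models / Supported tasks lists below.
clf = TaskRunner(
    model="mlp_classifier",
    task="classification",
    label_key="louvain",
    device="auto",
    epochs=20,
)
clf.fit(adata)

# predict(...) on a classification model returns class predictions,
# not reconstructed expression.
predicted_labels = clf.predict(adata)
```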
What you get from this quickstart:
- a learned embedding in adata.obsm
- reconstructed gene-expression values when the model supports them
- training metrics and saved reports
- a direct continuation path into Scanpy
Start Here
- Documentation site: https://uddamvathanak.github.io/scDLKit/
- Primary notebook tutorial: examples/train_vae_pbmc.ipynb
- Install path for tutorials: python -m pip install "scdlkit[tutorials]"
- Experimental foundation path: python -m pip install "scdlkit[foundation,tutorials]"
- CPU and GPU use the same notebook path through device="auto"
- Core learning path: quickstart -> downstream Scanpy -> comparison -> reconstruction sanity check
- Secondary notebooks: examples/compare_models_pbmc.ipynb, examples/classification_demo.ipynb
- Downstream Scanpy notebook: examples/downstream_scanpy_after_scdlkit.ipynb
- Reconstruction notebook: examples/reconstruction_sanity_pbmc.ipynb
- Custom model notebook: examples/custom_model_extension.ipynb
- Experimental foundation notebook: examples/scgpt_pbmc_embeddings.ipynb
- Experimental annotation fine-tuning notebook: examples/scgpt_cell_type_annotation.ipynb
- Synthetic smoke examples: examples/first_run_synthetic.ipynb, examples/first_run_synthetic.py
Why scDLKit
- AnnData-native workflow for single-cell users.
- Baseline-first model zoo: AE, VAE, DAE, Transformer AE, and MLP classification.
- Built-in training, evaluation, comparison, and plotting.
- Reproducible reports and notebooks for portfolio-ready demonstrations.
- Built-in benchmark gates on small Scanpy datasets that must pass before tutorial defaults change.
- Gene-expression-focused scope while the core toolkit stabilizes.
- Experimental frozen scGPT embeddings for human PBMC workflows.
- Experimental scGPT annotation fine-tuning with head-only and LoRA strategies.
Supported platforms
- Linux: supported
- macOS: supported
- Windows: supported
Installation
Primary tutorial install path:
python -m pip install "scdlkit[tutorials]"
Windows note: if you install into a deeply nested virtual environment path, Jupyter dependencies can hit Windows path-length limits. Use a short environment path such as C:\venvs\scdlkit, or enable Windows Long Paths if needed.
Other install paths and optional extras:
```
python -m pip install "scdlkit[scanpy]"
python -m pip install "scdlkit[notebook]"
python -m pip install "scdlkit[foundation]"
python -m pip install scdlkit              # core package without extras
python -m pip install "scdlkit[dev,docs]"  # contributor tooling and docs build
```
For GPU users, install the matching PyTorch build first using the official selector at https://pytorch.org/get-started/locally/, then install scdlkit[tutorials]. The same notebook examples run on CPU or GPU with device="auto".
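For example, a CUDA 12.1 wheel (the cu121 index URL below is only an illustration; copy the exact command the selector generates for your platform):

```
python -m pip install torch --index-url https://download.pytorch.org/whl/cu121
python -m pip install "scdlkit[tutorials]"
```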
Scanpy Quickstart
Primary tutorial example. The notebook uses a quickstart profile by default and exposes a full profile in its first config cell:
- quickstart: CPU-friendly, docs-friendly, reproducible
- full: longer run for stronger qualitative separation
For the PBMC quickstart, use a light VAE KL term so the latent UMAP preserves broad cell-type structure instead of collapsing into a uniform blob. A healthy result should show broad cell-type groups as visibly separated regions rather than a single mixed cloud.
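As a quick, library-agnostic sanity check (an illustrative sketch, not a scDLKit API; the 1e-4 cutoff is an arbitrary choice), you can count how many latent dimensions retain variance after training:

```python
import numpy as np

# A collapsed VAE latent shows near-zero variance in most dimensions;
# a healthy embedding keeps many dimensions active.
z = np.asarray(adata.obsm["X_scdlkit_vae"])
active = int((z.var(axis=0) > 1e-4).sum())  # 1e-4 is an arbitrary cutoff
print(f"active latent dims: {active} / {z.shape[1]}")
```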
Notebook-First Examples
Most researchers should start with the Scanpy PBMC quickstart:
```
python -m pip install "scdlkit[tutorials]"
jupyter notebook examples/train_vae_pbmc.ipynb
```
This notebook:
- loads PBMC data through Scanpy
- trains a VAE baseline with scDLKit
- writes the latent representation into adata.obsm
- continues with Scanpy neighbors and UMAP
- points to the downstream Scanpy and reconstruction tutorials for the next interpretation steps
- explains the quickstart versus full tutorial profiles
- works on CPU or GPU through device="auto"
Additional Scanpy-first notebooks:
- examples/downstream_scanpy_after_scdlkit.ipynb: take the scDLKit embedding through Leiden clustering, marker ranking, dotplots, and coarse annotation
- examples/compare_models_pbmc.ipynb: compare PCA, autoencoder, vae, and transformer_ae
- examples/reconstruction_sanity_pbmc.ipynb: inspect reconstructed gene-expression outputs with a dedicated reconstruction baseline
- examples/classification_demo.ipynb: run the mlp_classifier baseline and inspect a confusion matrix
- examples/custom_model_extension.ipynb: wrap a raw PyTorch autoencoder and train it through Trainer
- examples/scgpt_pbmc_embeddings.ipynb: run the experimental frozen whole-human scGPT embedding workflow and return to Scanpy through adata.obsm
- examples/scgpt_cell_type_annotation.ipynb: compare PCA + logistic regression, frozen scGPT, head-only tuning, and LoRA tuning for labeled PBMC annotation
The synthetic notebook and script are still available, but they are now the smoke-test path rather than the primary researcher onboarding flow:
python -m pip install "scdlkit[notebook]"
jupyter notebook examples/first_run_synthetic.ipynb
python examples/first_run_synthetic.py
These write small reproducible artifacts to artifacts/first_run_notebook/ and artifacts/first_run/.
Optional contributor Conda environment
Conda is kept for contributors and demos. It is not the primary public install path.
Official installers:
- Miniconda install guide: https://www.anaconda.com/docs/getting-started/miniconda/install
- Anaconda Distribution download: https://www.anaconda.com/download
From the repo root:
```
conda env create -f environment.yml
conda activate scdlkit
```
Core APIs
High-level:
```python
from scdlkit import TaskRunner
```
Lower-level:
```python
from scdlkit import Trainer, create_model, prepare_data
```
Custom-model adapters:
```python
from scdlkit.adapters import wrap_classification_module, wrap_reconstruction_module
```
Custom wrapped models are supported through Trainer first; TaskRunner remains the built-in high-level path for bundled scDLKit models. A minimal sketch of the adapter pattern follows.
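This sketch assumes wrap_reconstruction_module accepts a plain nn.Module and that Trainer takes the wrapped model plus quickstart-style keywords; both signatures are assumptions here, and examples/custom_model_extension.ipynb shows the supported pattern:

```python
import torch.nn as nn
from scdlkit import Trainer
from scdlkit.adapters import wrap_reconstruction_module

class TinyAE(nn.Module):
    """A plain PyTorch autoencoder with no scDLKit-specific code."""

    def __init__(self, n_genes: int, n_latent: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, 64), nn.ReLU(), nn.Linear(64, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 64), nn.ReLU(), nn.Linear(64, n_genes))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Hypothetical call pattern: wrap the module, then let Trainer drive training.
model = wrap_reconstruction_module(TinyAE(adata.n_vars))
trainer = Trainer(model, epochs=10, device="auto")
trainer.fit(adata)
```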
Experimental foundation helpers:
```python
from scdlkit.foundation import load_scgpt_model, prepare_scgpt_data
```
Experimental scGPT annotation tuning:
```python
from scdlkit.foundation import (
    load_scgpt_annotation_model,
    prepare_scgpt_data,
    split_scgpt_data,
)
```
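A rough sketch of the frozen-embedding path. load_scgpt_model and prepare_scgpt_data are the exported helpers above, but their call pattern and the encode step are assumptions; examples/scgpt_pbmc_embeddings.ipynb is the authoritative walkthrough:

```python
from scdlkit.foundation import load_scgpt_model, prepare_scgpt_data

# Hypothetical call pattern for the experimental frozen whole-human model.
scgpt = load_scgpt_model()              # assumed to return a frozen model wrapper
batch = prepare_scgpt_data(adata)       # assumed to align genes to the scGPT vocabulary
adata.obsm["X_scgpt"] = scgpt.encode(batch)  # hypothetical encode step
```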
Comparison:
```python
from scdlkit import compare_models

benchmark = compare_models(
    adata,
    models=["autoencoder", "vae", "transformer_ae"],
    task="representation",
    shared_kwargs={"epochs": 10, "label_key": "cell_type"},
    output_dir="artifacts/compare",
)
```
Supported models
- autoencoder
- vae
- denoising_autoencoder
- transformer_ae
- mlp_classifier
Supported tasks
- representation
- reconstruction
- classification
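To make the pairing concrete, a hedged sketch using quickstart-style TaskRunner keywords (the model and task names come from the two lists above; everything else mirrors the quickstart and is an assumption beyond that):

```python
from scdlkit import TaskRunner

# Reconstruction task with the denoising-autoencoder baseline.
runner = TaskRunner(
    model="denoising_autoencoder",
    task="reconstruction",
    device="auto",
    epochs=20,
)
runner.fit(adata)
denoised = runner.reconstruct(adata)
```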
Current scope
- Gene-expression baselines for AnnData workflows
- Scanpy-first tutorial and downstream embedding usage
- Built-in deep-learning baselines plus classical comparison context in notebooks
- Adapter-based custom PyTorch model support through Trainer
- Experimental scGPT frozen embedding support for human PBMC workflows
- Experimental scGPT annotation fine-tuning for labeled human PBMC workflows through Trainer
Broader foundation-model support, full-backbone fine-tuning, spatial omics, and multimodal workflows remain future work until the gene-expression toolkit's quality gates are consistently stable.
Documentation
Project documentation is published as a Sphinx-based scientific docs site:
- Docs site: https://uddamvathanak.github.io/scDLKit/
- Tutorials: Scanpy-first notebook walkthroughs rendered in the docs site
- API reference: docs/api/index.md
- Example notebooks: examples/
GitHub Pages setup
The docs workflow expects GitHub Pages to be enabled once at the repository level.
- Open Settings -> Pages for this repo: https://github.com/uddamvathanak/scDLKit/settings/pages
- Under Build and deployment, set Source to GitHub Actions.
- Save the setting.
- Re-run the docs workflow.
Without that one-time setting, GitHub returns a 404 when actions/configure-pages or actions/deploy-pages tries to access the Pages site.
Optional automatic Pages enablement
If you want the workflow to bootstrap Pages automatically instead of doing the one-time manual setup:
- Create a repository secret named PAGES_ENABLEMENT_TOKEN.
- Use a Personal Access Token with repo scope or Pages write permission.
- Re-run the docs workflow.
Release flow
- Stage to TestPyPI first with release-testpypi.yml.
- Publish the final release from a v* tag with release.yml.
- Use trusted publishing instead of long-lived PyPI API tokens.
- See RELEASING.md for the full checklist.
Examples
- examples/train_vae_pbmc.ipynb is the primary Scanpy-first notebook tutorial.
- examples/compare_models_pbmc.ipynb compares autoencoder, vae, and transformer_ae on PBMC data.
- examples/classification_demo.ipynb covers the mlp_classifier workflow and confusion-matrix reporting.
- examples/first_run_synthetic.ipynb is the secondary smoke-test notebook with minimal setup.
- examples/first_run_synthetic.py is the secondary smoke-test script.
Roadmap
Immediate roadmap target:
- keep the built-in TaskRunner story stable for bundled baselines
- preserve adapter-first custom-model support through Trainer
- keep the experimental scGPT frozen-embedding and annotation-tuning paths narrow and inspectable
- expand experimental adaptation workflows cautiously without broadening the toolkit too early
Released so far:
v0.1
- Expanded core workflow with training, evaluation, reporting, and plotting.
- Staged TestPyPI and PyPI publishing.
- Cross-platform smoke validation and reproducible notebooks.
Later:
- broader foundation-model fine-tuning beyond annotation once the experimental scGPT path is stable
- spatial baselines only after the gene-expression toolkit is stable
Citation
If you use scDLKit, cite the software entry in CITATION.cff.