Automatic annotation of single cell data using a labelled reference dataset including various methods and giving certainty across those methods.

PopV

PopV uses a popular vote across a variety of cell-type transfer tools to classify cell types in a query dataset based on a labelled reference dataset. We compute the agreement between these algorithms and use it to predict which query cells correspond, with high likelihood, to cell types observed in the reference.

Algorithms

Currently implemented algorithms are:

  • K-nearest neighbor classification after dataset integration with BBKNN
  • K-nearest neighbor classification after dataset integration with SCANORAMA
  • K-nearest neighbor classification after dataset integration with scVI
  • Random forest classification
  • Support vector machine classification
  • OnClass cell type classification
  • scANVI label transfer
  • Celltypist cell type classification

All algorithms are implemented as classes in popv/algorithms. To add a new method, a class must implement the following methods:

  • algorithm.compute_integration: Computes dataset integration to yield an integrated latent space.
  • algorithm.predict: Computes cell-type labels based on the specific classifier.
  • algorithm.compute_embedding: Computes UMAP embedding of previously computed integrated latent space.

We illustrate the implementation of a new classifier in a scaffold. Adding a new class with these methods automatically tells PopV to include it among its classifiers and to use it as another expert.
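As a rough illustration of the three-method interface described above, the following sketch defines a toy classifier. It is not the scaffold shipped with PopV: the base class BaseAlgorithm, the class NearestCentroid, the helper run_algorithm and the plain dict standing in for an AnnData object are all hypothetical and chosen only to keep the example free of dependencies.

```python
from abc import ABC, abstractmethod


class BaseAlgorithm(ABC):
    """Hypothetical interface mirroring the three methods named above."""

    @abstractmethod
    def compute_integration(self, data): ...

    @abstractmethod
    def predict(self, data): ...

    @abstractmethod
    def compute_embedding(self, data): ...


class NearestCentroid(BaseAlgorithm):
    """Toy expert: assigns each cell the label of the closest class centroid."""

    result_key = "toy_centroid_prediction"  # hypothetical output key

    def compute_integration(self, data):
        # No real integration here: the "latent space" is a copy of the features.
        data["latent"] = [list(row) for row in data["features"]]

    def predict(self, data):
        # Build one centroid per label from the labelled (reference) cells.
        sums, counts = {}, {}
        for vec, label in zip(data["latent"], data["labels"]):
            if label is None:  # query cells carry no label
                continue
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, value in enumerate(vec):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }

        def closest(vec):
            # Squared Euclidean distance to each centroid.
            return min(
                centroids,
                key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
            )

        data[self.result_key] = [closest(vec) for vec in data["latent"]]

    def compute_embedding(self, data):
        # Stand-in for UMAP: keep the first two latent dimensions.
        data["embedding"] = [vec[:2] for vec in data["latent"]]


def run_algorithm(algorithm, data):
    """Call the three methods in the order the list above suggests."""
    algorithm.compute_integration(data)
    algorithm.predict(data)
    algorithm.compute_embedding(data)
    return data


# Three reference cells (labelled) followed by two query cells (label None).
example = {
    "features": [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [0.1, 0.1], [4.9, 5.0]],
    "labels": ["T cell", "T cell", "B cell", None, None],
}
result = run_algorithm(NearestCentroid(), example)
```

A real PopV class would operate on an AnnData object and write its results there; the dict is used here only so the sketch runs without additional packages.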

All algorithms that allow pre-training are pre-trained. By design, this excludes BBKNN and SCANORAMA, as both construct a new embedding space. Pretrained models are stored on Zenodo (https://zenodo.org/record/7580707) and are downloaded automatically in the Colab notebook linked below. We encourage pre-training models when implementing new classes.

All input parameters are defined during the initial call to Process_Query and are stored in the unstructured field (.uns) of the generated AnnData object. PopV offers three prediction modes of differing complexity:

  • retrain trains all classifiers from scratch. For 50k cells this takes up to an hour of computing time using a GPU.
  • inference uses pretrained classifiers to annotate query as well as reference cells and constructs a joint embedding using all integration methods listed above. For 50k cells this takes, in our experience, up to half an hour of computing time using a GPU.
  • fast uses only methods with pretrained classifiers and annotates only query cells. For 50k cells this takes 5 minutes without a GPU (without UMAP embedding).

A user-defined selection of classification algorithms can be passed when calling annotate_data. Advanced users can also set non-standard parameters here for the integration methods and the classifiers.
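The interplay between the three modes and a user-defined selection can be sketched as follows. This is an illustration of the rules stated above, not PopV's own code: the function select_methods, the table PRETRAINABLE and the method identifiers are hypothetical names, and the only facts taken from the text are that BBKNN and SCANORAMA cannot be pre-trained and that fast mode uses only methods with pretrained classifiers.

```python
# Hypothetical identifiers for the eight algorithms listed above, mapped to
# whether they can be pre-trained (False for BBKNN and SCANORAMA, per the text).
PRETRAINABLE = {
    "knn_on_bbknn": False,
    "knn_on_scanorama": False,
    "knn_on_scvi": True,
    "rf": True,
    "svm": True,
    "onclass": True,
    "scanvi": True,
    "celltypist": True,
}

MODES = ("retrain", "inference", "fast")


def select_methods(mode, requested=None):
    """Return the methods to run for a mode, optionally limited to a user selection."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}, expected one of {MODES}")

    if mode == "fast":
        # Fast mode: only methods for which a pretrained classifier can exist.
        available = [name for name, ok in PRETRAINABLE.items() if ok]
    else:
        # Retrain and inference: all methods, including the integration-only ones.
        available = list(PRETRAINABLE)

    if requested is None:
        return available

    unknown = [name for name in requested if name not in PRETRAINABLE]
    if unknown:
        raise ValueError(f"unknown methods: {unknown}")
    # Keep only requested methods that the mode allows, in canonical order.
    return [name for name in available if name in requested]


fast_methods = select_methods("fast")
custom = select_methods("fast", requested=["svm", "knn_on_bbknn"])
```

Silently dropping a requested method that the mode does not support is a design choice of this sketch; raising an error instead would be equally defensible.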

Output

PopV outputs a cell-type classification for each of the classifiers used, as well as the majority vote across all classifiers. For the OnClass prediction, PopV additionally uses the ontology to go through the full set of ontology descendants (disabled in fast mode). This method will be described further when PopV is published.

PopV also outputs a score that counts the number of classifiers agreeing with the PopV prediction. This score can be read as the certainty that the prediction is correct for each cell in the query data. We generally found predictions with a single dissenting expert to still be highly reliable, while disagreement of more than two classifiers signifies less reliable results. The aim of PopV is not to fully annotate a dataset but to highlight cells that would potentially benefit from careful manual annotation.

PopV also outputs UMAP embeddings of all integrated latent spaces if compute_embedding==True in Process_Query, and computes certainties for every classifier used if return_probabilities==True in Process_Query.
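To make the majority vote and the agreement score concrete, here is a minimal sketch using only the standard library. It assumes a plain majority over per-classifier labels and ignores the ontology-aware handling of OnClass mentioned above; the function popular_vote and the example labels are hypothetical and do not reflect PopV's internal implementation.

```python
from collections import Counter


def popular_vote(predictions):
    """Compute a per-cell consensus label and agreement score.

    predictions maps a classifier name to its list of predicted labels,
    one label per cell. Ties go to the label encountered first, following
    the insertion order of the classifiers.
    """
    lengths = {len(labels) for labels in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all classifiers must predict the same number of cells")

    n_cells = lengths.pop()
    consensus, scores = [], []
    for i in range(n_cells):
        votes = Counter(labels[i] for labels in predictions.values())
        label, count = votes.most_common(1)[0]
        consensus.append(label)
        scores.append(count)  # number of classifiers agreeing with the consensus
    return consensus, scores


# Three hypothetical classifiers voting on three cells.
example_predictions = {
    "rf": ["T cell", "B cell", "NK cell"],
    "svm": ["T cell", "B cell", "T cell"],
    "knn_on_scvi": ["T cell", "monocyte", "B cell"],
}
consensus, scores = popular_vote(example_predictions)
```

In this example all three classifiers agree on the first cell, two agree on the second, and none agree on the third, so the third cell is the kind of low-score case the text suggests flagging for manual review.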

Installation

We suggest using a package manager like conda or mamba to install the package. OnClass files for annotation based on Tabula sapiens are deposited in popv/ontology. We use Cell Ontology as an ontology throughout our experiments. PopV will automatically look for the ontology in this folder. If you want to provide your user-edited ontology, we will provide notebooks to create the Natural Language Model used in OnClass for this user-defined ontology.

conda create -n yourenv python=3.8
conda activate yourenv
pip install git+https://github.com/czbiohub/PopV

Example notebook

We deposited an example notebook in Google Colab:

This notebook will guide you through annotating a dataset based on the annotated Tabula sapiens reference and demonstrates how to run annotation on your own query dataset. This notebook requires that all cells are annotated based on a cell ontology. We strongly encourage the use of a common cell ontology, see also Osumi-Sutherland et al. Using a cell ontology is a requirement to run OnClass as a prediction algorithm.

However, for organisms other than human, no ontology exists. We therefore allow running PopV without a cell ontology. A second notebook demonstrating the use of PopV without an existing ontology is planned and will be released here.

Memory requirements exceed the free limit in Colab, so we recommend Colab Pro access to run the notebook.
