Batch balanced KNN

BBKNN is a fast and intuitive batch effect removal tool that can be used directly in the scanpy workflow. It serves as an alternative to scanpy.pp.neighbors(), with both functions creating a neighbour graph for subsequent use in clustering, pseudotime and UMAP visualisation. The standard approach begins by identifying the k nearest neighbours of each cell across the entire dataset, with the candidates then transformed to exponentially related connectivities that serve as the basis for further analyses. If technical artifacts are present in the data (whether from differing data acquisition technologies, protocol alterations or even particularly severe operator effects), they make it challenging to link corresponding cell types across different batches.

[Figure: KNN]
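To make the problem concrete, here is a minimal sketch of a plain, batch-unaware k nearest neighbour search on toy coordinates. It is a brute-force illustration only, not scanpy's implementation; the function name global_knn and the toy data are hypothetical. With a strong shift between the two batches, every neighbour comes from the cell's own batch, so the graph never links them.

```python
import math

def global_knn(points, k):
    """Brute-force k nearest neighbours over the whole dataset (batch-unaware)."""
    neighbours = []
    for p in points:
        # rank every cell (including the query cell itself) by Euclidean distance
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        neighbours.append(order[:k])
    return neighbours

# toy data: two cells per batch, with batch "b" shifted far away from batch "a"
toy_points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
toy_batches = ["a", "a", "b", "b"]

toy_neighbours = global_knn(toy_points, k=2)
for cell, found in enumerate(toy_neighbours):
    # print each cell's own batch next to the batches of its neighbours
    print(cell, toy_batches[cell], [toy_batches[j] for j in found])
```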

As such, BBKNN actively combats this effect by taking each cell and identifying a (smaller) number k of nearest neighbours in each batch separately, rather than in the dataset as a whole. These nearest neighbours for each batch are then merged into a final neighbour list for the cell. This helps create connections between analogous cells in different batches without altering the counts or PCA space.

[Figure: BBKNN]
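The per-batch search and merge described above can be sketched in a few lines. This is a brute-force toy version under stated assumptions, not BBKNN's actual code (which relies on approximate neighbour libraries); the function name batch_balanced_neighbours, the parameter k_within_batch and the toy data are all hypothetical.

```python
import math

def batch_balanced_neighbours(points, batches, k_within_batch):
    """For each cell, take its k nearest neighbours in every batch separately
    and merge them into a single neighbour list (brute force, Euclidean)."""
    batch_labels = sorted(set(batches))
    merged = []
    for p in points:
        neighbours = []
        for label in batch_labels:
            # candidate cells drawn from this batch only
            # (the query cell itself is a candidate within its own batch)
            candidates = [j for j, b in enumerate(batches) if b == label]
            candidates.sort(key=lambda j: math.dist(p, points[j]))
            neighbours.extend(candidates[:k_within_batch])
        merged.append(neighbours)
    return merged

# toy data: two cells per batch, with batch "b" shifted far away from batch "a"
demo_points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
demo_batches = ["a", "a", "b", "b"]

demo_neighbours = batch_balanced_neighbours(demo_points, demo_batches, k_within_batch=1)
for cell, found in enumerate(demo_neighbours):
    # every cell now has one neighbour from each batch
    print(cell, [demo_batches[j] for j in found])
```

Each cell ends up with k_within_batch neighbours per batch, so the merged list always spans all batches regardless of how far apart they sit in the input space.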

Citation

If you use BBKNN in your work, please cite the paper:

@article{polanski2019bbknn,
  title={BBKNN: Fast Batch Alignment of Single Cell Transcriptomes},
  author={Pola{\'n}ski, Krzysztof and Young, Matthew D and Miao, Zhichao and Meyer, Kerstin B and Teichmann, Sarah A and Park, Jong-Eun},
  doi={10.1093/bioinformatics/btz625},
  journal={Bioinformatics},
  year={2019}
}

Installation

BBKNN depends on Cython, numpy, scipy, annoy, pynndescent, umap-learn and scikit-learn. The package is available on pip and conda, and can be easily installed as follows:

pip3 install bbknn

or

conda install -c bioconda bbknn

BBKNN can also make use of faiss. Consult the official installation instructions; the easiest way to get it is via conda.
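After installing, a quick way to see which of the neighbour-search backends are importable in the current environment is a check along these lines. It is a small convenience sketch using only the standard library; the helper name available is hypothetical, and the module names are assumed to match the package names listed above.

```python
import importlib.util

def available(modules):
    """Map each module name to True if it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

# annoy and pynndescent are listed as dependencies above; faiss is optional
status = available(["bbknn", "annoy", "pynndescent", "faiss"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```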

Usage and Documentation

BBKNN has the option to immediately slot into the spot occupied by scanpy.pp.neighbors() in the Seurat-inspired scanpy workflow. It computes a batch aligned variant of the neighbourhood graph, with its uses within scanpy including clustering, diffusion map pseudotime inference and UMAP visualisation. The basic syntax to run BBKNN on scanpy's AnnData object (with PCA computed via scanpy.tl.pca()) is as follows:

import bbknn

bbknn.bbknn(adata)

You can provide which adata.obs column to use for batch discrimination via the batch_key parameter. This defaults to 'batch', which is created by scanpy when you merge multiple AnnData objects (e.g. if you were to import multiple samples separately and then concatenate them).

Integration can be improved by using ridge regression on both a technical effect and a biological grouping prior to BBKNN, following a workflow from Park et al., 2020. In the event of not having a biological grouping at hand, a coarse clustering obtained from a BBKNN-corrected graph can be used in its place. This creates the following basic workflow syntax:

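import bbknn
import scanpy

# first pass: batch-balanced graph, used here to obtain a coarse clustering
bbknn.bbknn(adata)
scanpy.tl.leiden(adata)
# ridge regression with batch as the technical effect and the clustering as the biological grouping
bbknn.ridge_regression(adata, batch_key=['batch'], confounder_key=['leiden'])
# recompute PCA after the regression, then build the final graph
scanpy.tl.pca(adata)
bbknn.bbknn(adata)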

Alternatively, you can just provide a PCA matrix with cells as rows and a matching vector of batch assignments for each of the cells, and call BBKNN as follows (with connectivities being the primary graph output of interest):

import bbknn.matrix

distances, connectivities, parameters = bbknn.matrix.bbknn(pca_matrix, batch_list)
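Before calling the matrix interface, it can help to confirm that the two inputs line up. The sketch below is a hypothetical helper (check_bbknn_inputs is not part of BBKNN). It assumes only what the text above states, namely one row per cell and one batch label per cell, plus the observation that finding k neighbours within a batch requires that batch to contain at least k cells. The parameter name k is illustrative.

```python
from collections import Counter

def check_bbknn_inputs(pca_matrix, batch_list, k):
    """Return per-batch cell counts, raising ValueError if the inputs cannot work."""
    if len(pca_matrix) != len(batch_list):
        raise ValueError("need exactly one batch label per row of the PCA matrix")
    counts = Counter(batch_list)
    # a batch with fewer than k cells cannot supply k neighbours
    too_small = sorted(b for b, n in counts.items() if n < k)
    if too_small:
        raise ValueError(f"batches with fewer than {k} cells: {too_small}")
    return dict(counts)

# toy inputs: four cells with two principal components each, two batches
toy_pca = [[0.1, 0.2], [0.0, 0.3], [1.1, 0.9], [1.0, 1.2]]
toy_batch_list = ["a", "a", "b", "b"]
print(check_bbknn_inputs(toy_pca, toy_batch_list, k=2))
```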

An HTML render of the BBKNN function docstring, detailing all the parameters, can be accessed at ReadTheDocs. BBKNN use, along with using ridge regression to improve the integration, is shown in a demonstration notebook.

BBKNN in R

At this point, there is no plan to create a BBKNN R package. However, it can be run quite easily via reticulate. Using the base functions is the same as in Python. If you have a PCA matrix and a batch assignment vector and want to get UMAP coordinates out of them, you can use the following code snippet to do so. The odd-looking step of computing a PCA and then replacing it with your original values is unfortunately necessary due to how AnnData internals operate from the reticulate level. Provide your Python path in use_python():

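library(reticulate)
# point reticulate at your own python installation
use_python("/usr/bin/python3")

anndata = import("anndata", convert=FALSE)
bbknn = import("bbknn", convert=FALSE)
sc = import("scanpy", convert=FALSE)

# pca: PCA matrix with cells as rows; batch: batch assignment vector
adata = anndata$AnnData(X=pca, obs=batch)
# compute a placeholder PCA, then overwrite it with the original values
sc$tl$pca(adata)
adata$obsm$X_pca = pca
bbknn$bbknn(adata, batch_key=0)
sc$tl$umap(adata)
umap = py_to_r(adata$obsm[["X_umap"]])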

If you wish to change any integer arguments (such as neighbors_within_batch), you'll have to as.integer() the value so python understands it as an integer.

When testing locally, faiss refused to work when BBKNN was reticulated. As such, provide use_faiss=FALSE to the BBKNN call if you run into this problem.

Example Notebooks

demo.ipynb is the main demonstration, applying BBKNN to some pancreas data with a batch effect. The notebook also uses ridge regression to improve the integration.

The BBKNN paper makes use of the following analyses:

  • simulation.ipynb applies BBKNN to simulated data with a known ground truth, and demonstrates the utility of graph trimming by introducing an unrelated cell population. This simulated data is then used to benchmark BBKNN against mnnCorrect, CCA, Scanorama and Harmony in benchmark.ipynb. benchmark2.ipynb finishes off by benchmarking a BBKNN variant reluctant to work within R/reticulate and visualising the findings. benchmark3-new-R-methods.ipynb adds some newer R approaches to the benchmark.
  • mouse.ipynb runs a collection of murine atlases through BBKNN. mouse-harmony.ipynb applies Harmony to the same data.

The BBKNN preprint performed some additional analyses that got left out of the final manuscript. Archival notebooks are stored in a separate repository.
