Skip to main content

Single Cell GeneVector Library

Project description

alt text alt text

Installation

Install using pip

pip install genevector

or install from Github

python3 -m venv gvenv
source gvenv/bin/activate
python3 pip install -r requirements.txt
python3 setup.py install

Tutorials (see "example" directory)

  1. PBMC workflow: Identification of interferon stimulated metagene and cell type annotation.
  2. TICA workflow: Cell type assignment.
  3. SPECTRUM workflow: Vector arithmetic for site specific metagenes.
  4. FITNESS workflow: Identifying increasing metagenes in time series.

GeneVector Workflow

Loading scanpy dataset into GeneVector.

GeneVector makes use of Scanpy anndata objects and requires that the raw count data be loaded into .X matrix. It is highly recommended to subset genes using the seurat_v3 flavor in Scanpy. The device="cuda" flag should be omitted if there is no GPU available. All downstream analysis requires a GeneVectorDataset object.

from genevector.data import GeneVectorDataset
from genevector.model import GeneVectorTrainer
from genevector.embedding import GeneEmbedding, CellEmbedding

import scanpy as sc

dataset = GeneVectorDataset(adata, device="cuda")

Training GeneVector

After loading the expression, creating a GeneVector object will compute the mutual information between genes (can take up to 15 min for a dataset of 250k cells). This object is only required if you wish to train a model. Model training times vary depending on datasize. The 10k PBMC dataset can be trained in less than five minutes. emb_dimension sets the size of the learned gene vectors. Smaller values decrease training time, but values smaller than 50 may not provide optimal results.

cmps = GeneVector(dataset,
                  output_file="genes.vec",
                  emb_dimension=100,
                  threshold=1e-6,
                  device="cuda")
cmps.train(1000, threshold=1e-6) # run for 1000 iterations or loss delta below 1e-6.

To understand model convergence, a loss plot by epoch can be generated.

cmps.plot()

Loading Gene Embedding

After training, two vector files are produced (for input and output weights). It is recommended to take the average of both weights vector="average"). The GeneEmbedding class has several important analysis and visualization methods listed below.

gembed = GeneEmbedding("genes.vec", dataset, vector="average")

1. Computing gene similarities

A pandas dataframe can be generated using compute_similarities that includes the most similar genes and their cosine similarities for a given gene query. A barplot figure with a specified number of the most similar genes can be generated using plots_similarities.

df = gembed.compute_similarities("CD8A")
gembed.plot_similarities("CD8A",n_genes=10)

2. Generating Metagenes

get_adata produces and AnnData object that houses the gene embedding. This allows the use of Scanpy and AnnData visualization functions. The resolution parameter is passed directly to sc.tl.leiden to cluster the co-expression graph. get_metagenes returns a dictionary that stores each metagene as a list associated with an id. For a given id, the metagene can be visualized on a UMAP embedding using the plot_metagene function.

gdata = embed.get_adata(resolution=40)
metagenes = embed.get_metagenes(gdata)
embed.plot_metagene(gdata, mg=isg_metagene)

Loading the Cell Embedding

Using the GeneEmbedding object and the GeneVectorDataset object, a CellEmbedding object can be instantiated and used to produce a Scanpy AnnData object with get_adata. This method stores cell embedding under X_genevector in layers and generates a UMAP embedding using Scanpy. Scanpy functionality can be used to visualize UMAPS (sc.pl.umap).

cembed = CellEmbedding(dataset, embed)
adata = cembed.get_adata()

The cell embedding can be batch corrected using cembed.batch_correct. The user is required to select a valid column present in the obs dataframe and specify a reference label. This is a very fast operation.

cembed.batch_correct(column="sample",reference="control")
adata = cembed.get_adata()

Scoring Cells by Metagene

To score expression for each metagene against all cells, we can call the GeneEmbedding function score_metagenes with the cell-based AnnData object. To plot a heatmap of all metagenes over a set of cell labels, use the plot_metagenes_scores function. Metagenes are scored with the Scanpy sc.tl.score_genes function.

embed.score_metagenes(adata, metagenes)
embed.plot_metagenes_scores(adata,mgs,"cell_type")

Performing Cell Type Assignment

Using a dictionary of cell type annotations to marker genes, each cell can be classified using the CellEmbedding function phenotype_probability. This function returns a new annotated AnnData object, where the resulting classification can be found in .obs["genevector"] (the user can also supply a column name using column=). A separate column in the obs dataframe is created to hold the pseudo-probabilities for each cell type. These probabilties can be shown on a UMAP using standard the Scanpy function sc.pl.umap.

markers = dict()
markers["T Cell"] = ["CD3D","CD3G","CD3E","TRAC","IL32","CD2"]
markers["B/Plasma"] = ["CD79A","CD79B","MZB1","CD19","BANK1"]
markers["Myeloid"] = ["LYZ","CST3","AIF1","CD68","C1QA","C1QB","C1QC"]

annotated_adata = cembed.phenotype_probability(adata,markers)

prob_cols = [x for x in annotated_adata.obs.columns.tolist() if "Pseudo-probability" in x]
sc.pl.umap(annotated_adata,color=prob_cols+["genevector"],size=25)
All additional analyses described in the manuscript, including comparisons to LDVAE and CellAssign, can be found in Jupyter notebooks in the examples directory. Results were computed for v0.0.1.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genevector-0.2.2.tar.gz (54.5 MB view details)

Uploaded Source

Built Distribution

genevector-0.2.2-py3-none-any.whl (16.6 kB view details)

Uploaded Python 3

File details

Details for the file genevector-0.2.2.tar.gz.

File metadata

  • Download URL: genevector-0.2.2.tar.gz
  • Upload date:
  • Size: 54.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for genevector-0.2.2.tar.gz
Algorithm Hash digest
SHA256 7f10fdf7a0c30b6b2bdc6f8be9d0f8bcfd62bf1ebaf88a82f721be9f8eeb1cea
MD5 ec666689fd1c1010c7a1f46e33ef355f
BLAKE2b-256 571a927e17e263d2abfdaeb207fba51b2612bbe879e7e4d9b606b800e8de870f

See more details on using hashes here.

File details

Details for the file genevector-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: genevector-0.2.2-py3-none-any.whl
  • Upload date:
  • Size: 16.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for genevector-0.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 442c6b33152152a797103597d2c82686abe25ba8e32a93774afa878e4ad707f3
MD5 d118092c7a27fb3c30bb1fb4e6328e4a
BLAKE2b-256 e0884c0a4fdac6cce5909fa54e1b2e3e1834b460c2e8ad9ccbe114db1a5fce4a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page