Single Cell GeneVector Library
Project description
Installation
Install using pip
pip install genevector
or install from Github
python3 -m venv gvenv
source gvenv/bin/activate
python3 pip install -r requirements.txt
python3 setup.py install
Tutorials (see "example" directory)
- PBMC workflow: Identification of interferon stimulated metagene and cell type annotation.
- TICA workflow: Cell type assignment.
- SPECTRUM workflow: Vector arithmetic for site specific metagenes.
- FITNESS workflow: Identifying increasing metagenes in time series.
GeneVector Workflow
Loading scanpy dataset into GeneVector.
GeneVector makes use of Scanpy anndata objects and requires that the raw count data be loaded into .X matrix. It is highly recommended to subset genes using the seurat_v3 flavor in Scanpy. The device="cuda"
flag should be omitted if there is no GPU available. All downstream analysis requires a GeneVectorDataset object.
from genevector.data import GeneVectorDataset
from genevector.model import GeneVectorTrainer
from genevector.embedding import GeneEmbedding, CellEmbedding
import scanpy as sc
dataset = GeneVectorDataset(adata, device="cuda")
Training GeneVector
After loading the expression, creating a GeneVector object will compute the mutual information between genes (can take up to 15 min for a dataset of 250k cells). This object is only required if you wish to train a model. Model training times vary depending on datasize. The 10k PBMC dataset can be trained in less than five minutes. emb_dimension
sets the size of the learned gene vectors. Smaller values decrease training time, but values smaller than 50 may not provide optimal results.
cmps = GeneVector(dataset,
output_file="genes.vec",
emb_dimension=100,
threshold=1e-6,
device="cuda")
cmps.train(1000, threshold=1e-6) # run for 1000 iterations or loss delta below 1e-6.
To understand model convergence, a loss plot by epoch can be generated.
cmps.plot()
Loading Gene Embedding
After training, two vector files are produced (for input and output weights). It is recommended to take the average of both weights vector="average"
). The GeneEmbedding class has several important analysis and visualization methods listed below.
gembed = GeneEmbedding("genes.vec", dataset, vector="average")
1. Computing gene similarities
A pandas dataframe can be generated using compute_similarities
that includes the most similar genes and their cosine similarities for a given gene query. A barplot figure with a specified number of the most similar genes can be generated using plots_similarities
.
df = gembed.compute_similarities("CD8A")
gembed.plot_similarities("CD8A",n_genes=10)
2. Generating Metagenes
get_adata
produces and AnnData object that houses the gene embedding. This allows the use of Scanpy and AnnData visualization functions. The resolution parameter is passed directly to sc.tl.leiden
to cluster the co-expression graph. get_metagenes
returns a dictionary that stores each metagene as a list associated with an id. For a given id, the metagene can be visualized on a UMAP embedding using the plot_metagene
function.
gdata = embed.get_adata(resolution=40)
metagenes = embed.get_metagenes(gdata)
embed.plot_metagene(gdata, mg=isg_metagene)
Loading the Cell Embedding
Using the GeneEmbedding object and the GeneVectorDataset object, a CellEmbedding object can be instantiated and used to produce a Scanpy AnnData object with get_adata
. This method stores cell embedding under X_genevector
in layers and generates a UMAP embedding using Scanpy. Scanpy functionality can be used to visualize UMAPS (sc.pl.umap
).
cembed = CellEmbedding(dataset, embed)
adata = cembed.get_adata()
The cell embedding can be batch corrected using cembed.batch_correct
. The user is required to select a valid column present in the obs dataframe and specify a reference label. This is a very fast operation.
cembed.batch_correct(column="sample",reference="control")
adata = cembed.get_adata()
Scoring Cells by Metagene
To score expression for each metagene against all cells, we can call the GeneEmbedding function score_metagenes
with the cell-based AnnData object. To plot a heatmap of all metagenes over a set of cell labels, use the plot_metagenes_scores
function. Metagenes are scored with the Scanpy sc.tl.score_genes
function.
embed.score_metagenes(adata, metagenes)
embed.plot_metagenes_scores(adata,mgs,"cell_type")
Performing Cell Type Assignment
Using a dictionary of cell type annotations to marker genes, each cell can be classified using the CellEmbedding function phenotype_probability
. This function returns a new annotated AnnData object, where the resulting classification can be found in .obs["genevector"]
(the user can also supply a column name using column=
). A separate column in the obs dataframe is created to hold the pseudo-probabilities for each cell type. These probabilties can be shown on a UMAP using standard the Scanpy function sc.pl.umap
.
markers = dict()
markers["T Cell"] = ["CD3D","CD3G","CD3E","TRAC","IL32","CD2"]
markers["B/Plasma"] = ["CD79A","CD79B","MZB1","CD19","BANK1"]
markers["Myeloid"] = ["LYZ","CST3","AIF1","CD68","C1QA","C1QB","C1QC"]
annotated_adata = cembed.phenotype_probability(adata,markers)
prob_cols = [x for x in annotated_adata.obs.columns.tolist() if "Pseudo-probability" in x]
sc.pl.umap(annotated_adata,color=prob_cols+["genevector"],size=25)
All additional analyses described in the manuscript, including comparisons to LDVAE and CellAssign, can be found in Jupyter notebooks in the examples directory. Results were computed for v0.0.1.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file genevector-0.2.2.tar.gz
.
File metadata
- Download URL: genevector-0.2.2.tar.gz
- Upload date:
- Size: 54.5 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 7f10fdf7a0c30b6b2bdc6f8be9d0f8bcfd62bf1ebaf88a82f721be9f8eeb1cea |
|
MD5 | ec666689fd1c1010c7a1f46e33ef355f |
|
BLAKE2b-256 | 571a927e17e263d2abfdaeb207fba51b2612bbe879e7e4d9b606b800e8de870f |
File details
Details for the file genevector-0.2.2-py3-none-any.whl
.
File metadata
- Download URL: genevector-0.2.2-py3-none-any.whl
- Upload date:
- Size: 16.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 442c6b33152152a797103597d2c82686abe25ba8e32a93774afa878e4ad707f3 |
|
MD5 | d118092c7a27fb3c30bb1fb4e6328e4a |
|
BLAKE2b-256 | e0884c0a4fdac6cce5909fa54e1b2e3e1834b460c2e8ad9ccbe114db1a5fce4a |