Single Cell GeneVector Library

GeneVector

GeneVector is a Python library for single-cell RNA sequencing analysis that learns distributed gene representations using a neural embedding approach. It enables gene co-expression analysis, cell type annotation, and metagene discovery through vector arithmetic operations.

Key Features

  • Gene Embeddings: Learn distributed representations of genes based on co-expression patterns
  • Cell Type Annotation: Automated cell type assignment using marker gene sets
  • Metagene Discovery: Identify functionally related gene modules through clustering
  • Vector Arithmetic: Perform gene relationship analysis using vector operations
  • Batch Correction: Fast batch effect correction for multi-sample datasets
  • Pluggable Targets: Train on mutual information, Pearson, Spearman, Jaccard, cosine, or custom metrics
  • High Performance: Numba JIT, Rust native, and CUDA GPU backends for MI computation
  • Smart Caching: Computed scores cached to disk, instant reload on re-runs

[Figure: Framework Overview]

Installation

From PyPI (Recommended)

pip install genevector

With Numba Acceleration (Recommended)

pip install "genevector[fast]"

From Source

git clone https://github.com/nceglia/genevector.git
cd genevector
pip install -e .

# with numba support
pip install -e ".[fast]"

# with rust backend (requires Rust toolchain)
pip install maturin
maturin develop --release

Dependencies

  • Python ≥ 3.9
  • PyTorch
  • Scanpy
  • NumPy, SciPy, Pandas
  • Matplotlib, Seaborn

Optional: numba for JIT-accelerated MI computation (strongly recommended).

Quick Start

import scanpy as sc
from genevector.data import GeneVectorDataset
from genevector.model import GeneVector
from genevector.embedding import GeneEmbedding, CellEmbedding

# Load your single-cell data
adata = sc.read_h5ad("your_data.h5ad")

# Create dataset (auto-selects fastest MI backend)
dataset = GeneVectorDataset(adata)

# Train the model
model = GeneVector(dataset, output_file="genes.vec", emb_dimension=100)
model.train(1000, threshold=1e-6)

# Load gene embeddings
gene_embed = GeneEmbedding("genes.vec", dataset, vector="average")

# Generate cell embeddings
cell_embed = CellEmbedding(dataset, gene_embed)
adata_embedded = cell_embed.get_adata()

# Analyze gene similarities
similarities = gene_embed.compute_similarities("CD8A")
gene_embed.plot_similarities("CD8A", n_genes=10)

MI Computation Backends

GeneVector computes pairwise co-expression scores between all genes as the training target. For mutual information (the default target), multiple backends are available with automatic dispatch.
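To make the training target concrete, here is a minimal sketch of a histogram-based mutual information estimate for a single gene pair. It is not GeneVector's implementation: the function names, the equal-width binning, and the default bin count are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def discretize(values, n_bins=10):
    """Map values to equal-width bin indices in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant gene: everything in one bin
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=10):
    """Plug-in MI estimate (in nats) from the joint histogram of x and y."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    joint = Counter(zip(bx, by))   # joint bin counts
    px, py = Counter(bx), Counter(by)  # marginal bin counts
    mi = 0.0
    for (i, j), c in joint.items():
        p_ij = c / n
        mi += p_ij * math.log(p_ij / ((px[i] / n) * (py[j] / n)))
    return mi


# Toy expression values for two genes across four cells (made up)
gene_a = [0, 1, 2, 3]
print(mutual_information(gene_a, gene_a, n_bins=2))        # log(2)
print(mutual_information(gene_a, [5, 5, 5, 5], n_bins=2))  # 0.0
```

Running this over every gene pair in a Python loop is the slow part, which is what the accelerated backends described below address.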

Dispatch Priority

When mi_backend="auto" (the default), GeneVector selects the fastest available backend:

| Priority | Backend  | Speedup vs. legacy | How it works                                 | Availability                              |
|----------|----------|--------------------|----------------------------------------------|-------------------------------------------|
| 1        | CUDA GPU | ~200-1000x         | PyTorch scatter-based histograms on GPU      | Requires device="cuda" and an NVIDIA GPU  |
| 2        | Rust     | ~100-800x          | Native compiled with rayon parallelism       | Requires a maturin develop --release build |
| 3        | Numba    | ~100-500x          | JIT-compiled, parallel across CPU cores      | Requires pip install numba                |
| 4        | NumPy    | ~10-30x            | Vectorized discretization, Python pair loop  | Always available (default fallback)       |
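The priority order in the table amounts to a first-available lookup. The sketch below is illustrative only: `pick_backend` and its boolean arguments are hypothetical and not part of the GeneVector API, while the returned strings mirror the `mi_backend` values used in this README.

```python
def pick_backend(has_cuda=False, has_rust=False, has_numba=False):
    """Return the first available backend, in the priority order of the table."""
    candidates = [
        ("gpu", has_cuda),
        ("rust", has_rust),
        ("numba", has_numba),
        ("numpy", True),  # fallback that is always available
    ]
    for name, available in candidates:
        if available:
            return name


print(pick_backend())                                # numpy
print(pick_backend(has_rust=True, has_numba=True))   # rust
```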

Selecting a Backend

# Auto-select (recommended): picks the fastest available
dataset = GeneVectorDataset(adata, mi_backend="auto")

# Force a specific backend
dataset = GeneVectorDataset(adata, mi_backend="rust")
dataset = GeneVectorDataset(adata, mi_backend="numba")
dataset = GeneVectorDataset(adata, mi_backend="numpy")
dataset = GeneVectorDataset(adata, mi_backend="gpu", device="cuda")

Check Available Backends

from genevector.metrics import HAS_NUMBA

# Rust availability (if built)
try:
    from genevector._rust import compute_mi_pairs
    print("Rust backend: available")
except ImportError:
    print("Rust backend: not installed")

print(f"Numba backend: {'available' if HAS_NUMBA else 'not installed'}")

import torch
print(f"CUDA backend: {'available' if torch.cuda.is_available() else 'not available'}")

Practical Runtimes (Approximate)

For 2,000 genes × 5,000 cells (~2M gene pairs):

| Backend       | Time         |
|---------------|--------------|
| Legacy (v0.2) | ~26 hours    |
| NumPy         | ~3 hours     |
| Numba         | ~2-3 minutes |
| Rust          | ~1-2 minutes |

Scores are cached to ~/.genevector/cache/ after the first run. Subsequent runs with the same data load instantly.
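The "~2M gene pairs" figure above follows from counting unordered pairs, n(n - 1)/2. The short helper below (a hypothetical name, for illustration) also shows why gene subsetting matters: doubling the gene count roughly quadruples the number of pairs to score.

```python
def n_gene_pairs(n_genes):
    """Number of unordered gene pairs among n_genes genes."""
    return n_genes * (n_genes - 1) // 2


print(n_gene_pairs(2000))  # 1999000
print(n_gene_pairs(4000))  # 7998000
```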

Co-expression Targets

GeneVector can train on different co-expression metrics beyond mutual information:

# Default: signed mutual information
dataset = GeneVectorDataset(adata, target="mi", signed_mi=True)

# Pearson correlation
dataset = GeneVectorDataset(adata, target="pearson")

# Spearman rank correlation
dataset = GeneVectorDataset(adata, target="spearman")

# Jaccard index (binarized co-detection)
dataset = GeneVectorDataset(adata, target="jaccard")

# Cosine similarity between gene expression vectors
dataset = GeneVectorDataset(adata, target="cosine")

# Custom callable
def my_metric(X, gene_names, **kwargs):
    # must return dict[str, dict[str, float]]
    ...

dataset = GeneVectorDataset(adata, target=my_metric)

The mi_backend parameter applies only when target="mi". The matrix-based targets (Pearson, Spearman, Jaccard, cosine) are computed with BLAS-backed matrix operations and typically finish in seconds.
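As an illustration of the custom-callable contract, the sketch below fills in the `my_metric(X, gene_names, **kwargs)` signature with a binarized co-detection Jaccard index and returns the nested `dict[str, dict[str, float]]` shape. It assumes `X` is a dense cells x genes array-like (rows are cells) and that self-pairs can be omitted; neither detail is confirmed by this README, so check the library documentation before relying on it.

```python
def jaccard_metric(X, gene_names, **kwargs):
    """Binarized co-detection Jaccard index for every gene pair.

    Assumes X is a dense cells x genes array-like of counts (an assumption).
    """
    # Cells in which each gene is detected (count > 0)
    detected = {g: set() for g in gene_names}
    for cell_idx, row in enumerate(X):
        for g, value in zip(gene_names, row):
            if value > 0:
                detected[g].add(cell_idx)

    scores = {g: {} for g in gene_names}
    for i, g1 in enumerate(gene_names):
        for g2 in gene_names[i + 1:]:
            union = len(detected[g1] | detected[g2])
            score = len(detected[g1] & detected[g2]) / union if union else 0.0
            scores[g1][g2] = score  # store symmetrically
            scores[g2][g1] = score
    return scores


# Toy counts: 3 cells x 3 genes (made-up values)
toy_X = [[1, 0, 2],
         [0, 0, 3],
         [4, 5, 0]]
print(jaccard_metric(toy_X, ["A", "B", "C"])["A"]["B"])  # 0.5
```

This pure-Python version favors readability over speed; a production metric would normally be vectorized.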

Tutorials

See the example/ directory for comprehensive workflows:

  1. PBMC workflow: Identification of interferon stimulated metagene and cell type annotation
  2. TICA workflow: Cell type assignment
  3. SPECTRUM workflow: Vector arithmetic for site specific metagenes
  4. FITNESS workflow: Identifying increasing metagenes in time series

Detailed Usage Guide

1. Data Loading

GeneVector works on Scanpy AnnData objects and requires raw count data in the .X matrix. Subset the genes before training, using either Scanpy's highly variable gene selection (seurat_v3 flavor) or GeneVector's entropy-based quality control.

from genevector.data import GeneVectorDataset
from genevector.model import GeneVector
from genevector.embedding import GeneEmbedding, CellEmbedding
import scanpy as sc

# Load and preprocess data
adata = sc.read_h5ad("your_data.h5ad")

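# Choose ONE of the two gene-selection options below.

# Option A: Scanpy's variable gene selection
# (the seurat_v3 flavor expects raw counts and needs the scikit-misc package)
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()  # copy to avoid working on a view

# Option B: GeneVector's entropy-based gene filtering
# adata = GeneVectorDataset.quality_control(adata, entropy_threshold=1.0)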

# Create dataset
dataset = GeneVectorDataset(adata)
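For intuition about entropy-based gene filtering, the sketch below scores each gene by the Shannon entropy of its count distribution across cells and keeps genes at or above a threshold. This is an assumption about the general idea, not a description of what `GeneVectorDataset.quality_control` does internally; the function names, the use of bits, and the input format are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (bits) of the empirical distribution of count values."""
    n = len(counts)
    freqs = Counter(counts)
    return -sum((c / n) * math.log2(c / n) for c in freqs.values())


def filter_genes_by_entropy(expression, entropy_threshold=1.0):
    """Keep genes whose entropy across cells is at least the threshold.

    `expression` maps gene name -> list of raw counts, one per cell.
    """
    return [gene for gene, counts in expression.items()
            if shannon_entropy(counts) >= entropy_threshold]


# Toy counts for three genes across four cells (made-up values)
toy = {"FLAT": [0, 0, 0, 0], "VAR": [0, 1, 2, 3], "BIN": [0, 0, 1, 1]}
print(filter_genes_by_entropy(toy))  # ['VAR', 'BIN']
```

A gene with identical counts in every cell has zero entropy and carries no co-expression signal, which is the motivation for dropping it.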

2. Model Training

Creating a GeneVector object computes the co-expression target (MI by default) and prepares training batches. Training time varies with dataset size; for example, a 10k PBMC dataset trains in under 5 minutes. The emb_dimension parameter sets the length of each embedding vector (at least 50 is recommended).

# Initialize model (triggers MI computation + batch preparation)
model = GeneVector(
    dataset,
    output_file="genes.vec",
    emb_dimension=100,
    device="cpu"
)

# Train for 1000 iterations or until convergence
model.train(1000, threshold=1e-6)

# Visualize training progress
model.plot()
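The "iterations or until convergence" behavior can be pictured as a loop with an early-stopping test. The sketch below assumes convergence means that the loss changes by less than `threshold` between consecutive iterations; that interpretation, the `train_until_converged` helper, and the `step` callable are hypothetical and may differ from how `model.train` actually decides to stop.

```python
def train_until_converged(step, max_iterations=1000, threshold=1e-6):
    """Call step() until the loss changes by less than threshold.

    `step` is a hypothetical callable that runs one training iteration
    and returns the current loss as a float.
    """
    losses = []
    for _ in range(max_iterations):
        losses.append(step())
        if len(losses) > 1 and abs(losses[-2] - losses[-1]) < threshold:
            break  # loss has stopped changing meaningfully
    return losses


# Toy "training" whose loss halves on every iteration
state = {"loss": 1.0}


def toy_step():
    state["loss"] *= 0.5
    return state["loss"]


losses = train_until_converged(toy_step)
print(len(losses))  # 20 iterations, well short of the 1000 cap
```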

3. Gene Embeddings Analysis

Training produces two vector files (input and output weights). Using the average of both weights is recommended for best results.

# Load gene embeddings
gene_embed = GeneEmbedding("genes.vec", dataset, vector="average")

# Compute gene similarities
similarities_df = gene_embed.compute_similarities("CD8A")
gene_embed.plot_similarities("CD8A", n_genes=10)

# Generate metagenes through clustering
gene_adata = gene_embed.get_adata(resolution=40)
metagenes = gene_embed.get_metagenes(gene_adata)

# Visualize specific metagenes
gene_embed.plot_metagene(gene_adata, mg=metagenes[0])
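The two ideas in this section, averaging the input and output weights and ranking genes by similarity, can be sketched in plain Python. Cosine similarity is used here as the similarity measure, which is an assumption about `compute_similarities`; the helper names and the toy vectors are made up for illustration.

```python
import math


def average_vectors(input_vec, output_vec):
    """Element-wise mean of a gene's input and output weight vectors."""
    return [(a + b) / 2 for a, b in zip(input_vec, output_vec)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(gene, vectors, n_genes=10):
    """Rank all other genes by cosine similarity to `gene`, highest first."""
    scores = [(other, cosine_similarity(vectors[gene], vec))
              for other, vec in vectors.items() if other != gene]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n_genes]


# Toy 2-dimensional gene vectors (made-up values)
vectors = {
    "CD8A": average_vectors([1.0, 0.0], [1.0, 0.2]),
    "CD8B": average_vectors([0.9, 0.1], [1.1, 0.1]),
    "LYZ": average_vectors([0.0, 1.0], [0.1, 0.9]),
}
print(most_similar("CD8A", vectors, n_genes=2)[0][0])  # CD8B
```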

4. Cell Embeddings

Generate cell embeddings using the trained gene vectors. The embeddings are stored in AnnData format with automatic UMAP generation.

# Create cell embeddings
cell_embed = CellEmbedding(dataset, gene_embed)
adata_embedded = cell_embed.get_adata()

# Visualize with Scanpy
sc.pl.umap(adata_embedded)

# Optional: Batch correction
cell_embed.batch_correct(column="sample", reference="control")
adata_corrected = cell_embed.get_adata()
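One common way to derive a cell vector from gene vectors is an expression-weighted average of the vectors of the genes detected in that cell. Whether `CellEmbedding` uses exactly this weighting is an assumption here; the `cell_vector` helper and the toy values are hypothetical.

```python
def cell_vector(expression, gene_vectors):
    """Expression-weighted average of gene vectors for one cell.

    `expression` maps gene name -> expression value in this cell.
    Genes without a vector, or with zero expression, are ignored.
    """
    dim = len(next(iter(gene_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for gene, weight in expression.items():
        if gene in gene_vectors and weight > 0:
            for k, component in enumerate(gene_vectors[gene]):
                total[k] += weight * component
            weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total


# Toy gene vectors and one cell's counts (made-up values)
gene_vectors = {"CD3D": [1.0, 0.0], "LYZ": [0.0, 1.0]}
print(cell_vector({"CD3D": 3, "LYZ": 1}, gene_vectors))  # [0.75, 0.25]
```

Because cells and genes then live in the same vector space, differences between groups of cells can be expressed as vectors, which is the intuition behind arithmetic-style batch correction.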

5. Cell Type Assignment

# Define marker genes for cell types
markers = {
    "T Cell": ["CD3D", "CD3G", "CD3E", "TRAC", "IL32", "CD2"],
    "B/Plasma": ["CD79A", "CD79B", "MZB1", "CD19", "BANK1"],
    "Myeloid": ["LYZ", "CST3", "AIF1", "CD68", "C1QA", "C1QB", "C1QC"]
}

# Perform automated cell type assignment
annotated_adata = cell_embed.phenotype_probability(adata_embedded, markers)

# Visualize results
prob_cols = [col for col in annotated_adata.obs.columns if "Pseudo-probability" in col]
sc.pl.umap(annotated_adata, color=prob_cols + ["genevector"], size=25)
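The "Pseudo-probability" columns can be understood as a normalized score per cell type. The sketch below compares a cell vector with one representative vector per marker set by cosine similarity and passes the similarities through a softmax. This is an illustrative assumption about how such scores might be formed, not the confirmed internals of `phenotype_probability`; all names and values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax over a dict of label -> score."""
    peak = max(scores.values())
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def pseudo_probabilities(cell_vec, marker_vectors):
    """Softmax over the cell's similarity to each cell type's marker vector."""
    sims = {label: cosine_similarity(cell_vec, vec)
            for label, vec in marker_vectors.items()}
    return softmax(sims)


# Toy marker vectors, e.g. the mean vector of each marker gene set (made up)
marker_vectors = {"T Cell": [1.0, 0.0], "Myeloid": [0.0, 1.0]}
probs = pseudo_probabilities([0.9, 0.1], marker_vectors)
print(max(probs, key=probs.get))  # T Cell
```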

6. Caching

Computed co-expression scores are cached automatically. Control this with:

# Enable caching (default)
dataset = GeneVectorDataset(adata, use_cache=True)

# Disable caching (always recompute)
dataset = GeneVectorDataset(adata, use_cache=False)

# Clear the cache
from genevector.cache import clear_cache
clear_cache()

Cache location: ~/.genevector/cache/. Cache keys incorporate the expression matrix, gene list, target function, and all parameters, so different configurations never collide.
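To show why distinct configurations get distinct cache entries, here is a minimal sketch of a content-derived cache key. The `cache_key` function, its arguments, and the choice of SHA-256 are assumptions for illustration; GeneVector's actual key construction may differ.

```python
import hashlib
import json


def cache_key(matrix_bytes, gene_names, target_name, params):
    """Hash everything that affects the scores into one hex digest."""
    digest = hashlib.sha256()
    digest.update(matrix_bytes)  # raw bytes of the expression matrix
    metadata = {"genes": list(gene_names), "target": target_name, "params": params}
    # sort_keys makes the key independent of parameter ordering
    digest.update(json.dumps(metadata, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


# Toy inputs (made-up values)
data = bytes([1, 0, 2, 0, 0, 3])
genes = ["CD3D", "LYZ"]
key_signed = cache_key(data, genes, "mi", {"signed_mi": True})
key_unsigned = cache_key(data, genes, "mi", {"signed_mi": False})
print(key_signed != key_unsigned)  # True: changing a parameter changes the key
```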

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use GeneVector in your research, please cite:

Ceglia, N., Sethna, Z., Freeman, S.S. et al. Identification of transcriptional programs using dense vector representations defined by mutual information with GeneVector. Nat Commun 14, 4400 (2023). https://doi.org/10.1038/s41467-023-39985-2

@article{ceglia2023genevector,
  title={Identification of transcriptional programs using dense vector representations defined by mutual information with GeneVector},
  author={Ceglia, Nicholas and Sethna, Zachary and Freeman, Samuel S and others},
  journal={Nature Communications},
  volume={14},
  pages={4400},
  year={2023},
  doi={10.1038/s41467-023-39985-2}
}

Documentation

For detailed documentation and examples, visit: https://genevector.readthedocs.io
