CMV serostatus predictor using scRNA-seq data
CMVerify: CMV Serostatus Predictor
CMVerify is a tool designed to predict Cytomegalovirus (CMV) serostatus from single-cell RNA sequencing (scRNA-seq) data. It uses a machine learning model to analyze blood transcriptomic data and predict CMV status. The tool is built to be easy to use, with automatic data preprocessing, cell type annotation, and prediction generation.
Installation
To install CMVerify, you can use pip:
pip install CMVerify
Alternatively, you can install the package from source. The repository will be made public upon publication:
git clone https://github.com/UcarLab/CMVerify.git
cd CMVerify
pip install .
Requirements
- Python 3.8+
- scanpy
- matplotlib
- numpy
- pandas
- seaborn
- anndata
- celltypist
- ipython
- scikit-learn
- joblib
These dependencies are automatically installed when you install CMVerify.
R Users
For R users, please refer to this vignette for converting from Seurat to AnnData. Further support for R users is coming in version 2.
Usage
Here is how you can use CMVerify in your Python environment:
1. Importing CMVerify
To use CMVerify, start by importing the necessary module in your Python script:
from cmverify import predict, visualize
2. Data Preparation
Before running predictions, ensure your data is prepared correctly. CMVerify requires raw counts for accurate predictions.
If your raw counts are stored in .raw, move them into .X before running predictions. You can build a new AnnData object from .raw as follows:
import scanpy as sc
# Read the .h5ad file containing the raw data
adata_pre = sc.read_h5ad('path_to_data.h5ad')
# Use the raw counts for analysis (and retain metadata in .obs and .var)
adata = sc.AnnData(X=adata_pre.raw.X, obs=adata_pre.obs, var=adata_pre.raw.var)
Please ensure that the data used for predictions is based on raw counts, which is essential for the analysis.
3. Making Predictions
To make predictions, load your single-cell RNA-seq data as an AnnData object and provide the relevant parameters. The predict function handles data normalization, cell type annotation, and CMV status prediction.
Example:
import scanpy as sc
from cmverify import predict
# Load your single-cell RNA-seq data (AnnData object)
adata = sc.read('path_to_data.h5ad')
# Specify the columns for donor and longitudinal data
donor_obs_column = 'donor_id'
longitudinal_obs_column = 'timepoint'
# Predict CMV status
results = predict(adata, donor_obs_column, longitudinal_obs_column)
# Output the predictions
print(results)
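Because the predictions come back as a list of dictionaries, it can help to regroup them per donor before inspecting them. The sketch below uses only the documented keys 'donor_id_timepoint' and 'probability'. The mock values, the visit labels, and the helper name group_by_donor are assumptions made for illustration and are not part of CMVerify.

```python
from collections import defaultdict

# Mock output in the documented shape; the values are invented for illustration.
results = [
    {"donor_id_timepoint": ("D1", "Visit 1"), "probability": 0.91},
    {"donor_id_timepoint": ("D2", "Baseline"), "probability": 0.12},
    {"donor_id_timepoint": ("D1", "Baseline"), "probability": 0.88},
]

# Hypothetical visit labels, listed in chronological order.
visit_order = ["Baseline", "Visit 1"]


def group_by_donor(results, visit_order):
    """Return {donor_id: [(timepoint, probability), ...]} sorted by visit order."""
    rank = {visit: i for i, visit in enumerate(visit_order)}
    grouped = defaultdict(list)
    for entry in results:
        donor_id, timepoint = entry["donor_id_timepoint"]
        grouped[donor_id].append((timepoint, entry["probability"]))
    for visits in grouped.values():
        visits.sort(key=lambda pair: rank[pair[0]])
    return dict(grouped)


by_donor = group_by_donor(results, visit_order)
for donor_id, visits in by_donor.items():
    print(donor_id, visits)
```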
4. Return Fractions
You can also return the calculated cell type fractions along with the predictions by setting the return_frac parameter to True.
results, fractions_df = predict(adata, donor_obs_column, longitudinal_obs_column, return_frac=True)
# Display the first 5 rows of the cell type fractions
print(fractions_df.head())
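To make the idea of a cell type fraction concrete, the sketch below computes per-donor fractions from a list of annotated cells. It illustrates the arithmetic only and is not CMVerify's implementation; the cell type labels, donor IDs, and the helper name cell_type_fractions are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical per-cell annotations as (donor_id, cell_type) pairs.
cells = [
    ("D1", "CD8 T"), ("D1", "CD8 T"), ("D1", "NK"), ("D1", "B"),
    ("D2", "CD8 T"), ("D2", "B"),
]


def cell_type_fractions(cells):
    """Return {donor_id: {cell_type: fraction of that donor's cells}}."""
    counts = defaultdict(Counter)
    for donor_id, cell_type in cells:
        counts[donor_id][cell_type] += 1
    fractions = {}
    for donor_id, per_type in counts.items():
        total = sum(per_type.values())
        fractions[donor_id] = {ct: n / total for ct, n in per_type.items()}
    return fractions


fractions = cell_type_fractions(cells)
print(fractions["D1"])
```

Each donor's fractions sum to 1, so the values describe composition rather than absolute cell numbers.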
4b. Append Ground Truth CMV Status (Optional)
If you have true CMV status in a separate metadata file (not in adata.obs), you can use the append_status function to match it to the prediction output and append it. The function accepts either a DataFrame with patient ID and CMV status columns or a dict (key = patient ID, value = CMV status).
from cmverify import append_status
# Add true labels to predictions (cmv_metadata is your own DataFrame or dict of known CMV status)
append_status(results, cmv_metadata)
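The matching step can be pictured with a small standalone sketch for the dict input type. This is not the package's code: the function name append_true_labels_sketch, the donor IDs, and the status values are assumptions, and skipping donors without a known status is one possible choice rather than documented behavior.

```python
# Mock predictions in the documented shape; the values are invented.
results = [
    {"donor_id_timepoint": ("D1", "Baseline"), "probability": 0.88},
    {"donor_id_timepoint": ("D2", "Baseline"), "probability": 0.12},
    {"donor_id_timepoint": ("D3", "Baseline"), "probability": 0.40},
]

# Hypothetical ground truth: key = patient ID, value = CMV status (0 or 1).
cmv_metadata = {"D1": 1, "D2": 0}


def append_true_labels_sketch(predictions, cmv_by_donor):
    """Add a 'true_label' key in place wherever the donor has a known status."""
    for entry in predictions:
        donor_id = entry["donor_id_timepoint"][0]
        if donor_id in cmv_by_donor:
            entry["true_label"] = cmv_by_donor[donor_id]
    return predictions


append_true_labels_sketch(results, cmv_metadata)
print(results)
```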
5. Visualize Longitudinal Predictions (and optionally test the model with ground truth)
You can visualize longitudinal CMV prediction probabilities across timepoints using the visualize function.
from cmverify import visualize
visualize(results)
This will generate a figure connecting donor predictions across visits and mark the decision threshold.
You can also set metrics=True if you have ground truth CMV serostatus and wish to evaluate this model.
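For a quick check outside the plotting function, confusion-matrix counts can be computed directly from results that carry a true_label. The sketch below is a simplified stand-in, not the evaluation code behind metrics=True. The 0.5 threshold is an assumption (the model's actual decision threshold is not stated here), and the mock values are invented.

```python
def confusion_counts(results, threshold=0.5):
    """Count TP/FP/TN/FN over entries that carry a true_label (0 or 1)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for entry in results:
        if "true_label" not in entry:
            continue  # no ground truth available for this entry
        predicted = entry["probability"] >= threshold
        actual = entry["true_label"] == 1
        if predicted and actual:
            counts["TP"] += 1
        elif predicted and not actual:
            counts["FP"] += 1
        elif not predicted and actual:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Mock predictions with ground truth appended; the values are invented.
results = [
    {"donor_id_timepoint": ("D1", "Baseline"), "probability": 0.88, "true_label": 1},
    {"donor_id_timepoint": ("D2", "Baseline"), "probability": 0.12, "true_label": 0},
    {"donor_id_timepoint": ("D3", "Baseline"), "probability": 0.64, "true_label": 0},
    {"donor_id_timepoint": ("D4", "Baseline"), "probability": 0.40},
]

counts = confusion_counts(results)
labeled = sum(counts.values())
accuracy = (counts["TP"] + counts["TN"]) / labeled
print(counts, accuracy)
```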
Functions
predict(adata, donor_obs_column, longitudinal_obs_column=None, verbose=1, return_frac=False, true_status=None, norm=True, force_norm=False)
Predict CMV serostatus from single-cell RNA-seq data using CMVerify. This function handles normalization, transformation, annotation, cell type fraction calculation, and model inference.
Parameters:
- adata (AnnData): Single-cell RNA-seq data object.
- donor_obs_column (str): The column in adata.obs that contains the donor or sample identifiers.
- longitudinal_obs_column (str, optional): Column in adata.obs for timepoint or longitudinal visit labels, if applicable. Default is None.
- verbose (int, optional): Verbosity level for progress messages (0 = silent, 1 = standard output). Default is 1.
- return_frac (bool, optional): If True, return the cell type fraction DataFrame along with predictions. Default is False.
- true_status (str, optional): Column in adata.obs for true donor serostatus (ground truth) for evaluation. Default is None.
- norm (bool, optional): Whether to normalize to 10,000 counts per cell and log1p-transform. Disable only if raw counts are in adata.X. Default is True.
- force_norm (bool, optional): If adata has a log1p layer but has not been normalized, the celltypist annotation step may raise an error. Set force_norm=True to force normalization and resolve the issue. Default is False.
Returns:
- List[Dict]: List of dictionaries containing donor ID, timepoint (if applicable), predicted label, and probability of CMV seropositivity from CMVerify.
- If return_frac=True, also returns a DataFrame with cell type fractions and donor_id-timepoint metadata.
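The normalization described for the norm parameter (scale each cell to 10,000 total counts, then apply log1p) can be worked through by hand for a single cell. The sketch below shows only the arithmetic and is not the code path predict uses. The gene counts and the helper name normalize_log1p are made up for illustration.

```python
import math


def normalize_log1p(cell_counts, target_sum=10_000):
    """Scale one cell's counts to target_sum in total, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]  # empty cell: nothing to scale
    return [math.log1p(c * target_sum / total) for c in cell_counts]


# Hypothetical raw counts for four genes in one cell (10 counts in total).
cell = [3, 1, 0, 6]
normalized = normalize_log1p(cell)

# Undoing the log recovers the scaled counts, which sum to the target.
rescaled = [math.expm1(v) for v in normalized]
print(normalized)
print(sum(rescaled))
```

In scanpy, the equivalent operations on a whole AnnData object are sc.pp.normalize_total with target_sum=1e4 followed by sc.pp.log1p.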
predict_from_frac(fractions_df, verbose=1)
Apply CMVerify to a precomputed DataFrame of cell type fractions.
Parameters:
- fractions_df (DataFrame): A DataFrame where rows are donor-timepoints and columns are cell type fractions. The last column must contain the donor-timepoint info as a tuple (donor_id, timepoint).
- verbose (int, optional): Verbosity level for progress messages (0 = silent, 1 = standard output). Default is 1.
Returns:
List[Dict]: List of dictionaries containing donor ID, timepoint (if applicable), predicted label, and probability of CMV seropositivity from CMVerify.
visualize(results, visit_order=None, figWidth=8, figHeight=3, dpi_param=100, save=False, filename='cmverify_viz.png', metrics=False)
Visualize CMV prediction probabilities per donor and, if applicable, for each timepoint.
Parameters:
- results (List[Dict]): A list of dictionaries with keys 'donor_id_timepoint', i.e., a tuple (donor_id, timepoint), 'probability' (float), and optionally true_label.
- visit_order (List[str], optional): List specifying the order of visit labels (e.g., ["Baseline", "Visit 1", "Visit 2"]). Default is None.
- figWidth, figHeight (float, optional): Figure dimensions in inches. Default is 8 x 3.
- dpi_param (int, optional): Dots-per-inch resolution for the plot. Default is 100.
- save (bool, optional): If True, saves the plot as an image file. Default is False.
- filename (str, optional): Output filename if saving. Default is cmverify_viz.png. When multiple figures are generated, each is saved with a different prefix on this name.
- metrics (bool, optional): If True, outputs additional metrics such as a confusion matrix and ROC curve (requires true_label in results). Default is False.
append_status(intermed_cmv_predictions, cmv_df, patient_col='patientID', cmv_col='CMV')
This utility function appends true CMV status to the intermediate prediction output by matching donor IDs with a reference CMV status DataFrame. Use this if you have CMV status but it is not in the adata.
Parameters:
- intermed_cmv_predictions (List[Dict]): Output from predict, contains 'donor_id_timepoint' (tuple).
- cmv_df (DataFrame or dict): A pandas.DataFrame or dict containing known CMV serostatus for each donor.
- patient_col (str, optional): The name of the column in cmv_df that contains donor/patient IDs.
- cmv_col (str, optional): The name of the column in cmv_df that contains CMV status values (e.g., 0 for negative, 1 for positive).
Returns:
- Updates the intermediate predictions in place, adding a new key true_label to each dictionary entry.
Model Training
CMVerify uses a random forest classifier (rf_best_estimator) and a corresponding scaler (rf_scaler). These models have been trained on relevant single-cell RNA-seq data and are used to predict CMV serostatus based on cell type composition.
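The general pattern of a scaler followed by a random forest on cell type fractions can be sketched with scikit-learn. Everything below is synthetic and illustrative: the data, the labels, the number of cell types, and the hyperparameters are assumptions and do not reflect how rf_best_estimator or rf_scaler were actually trained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic cell type fractions: 40 samples x 5 cell types, each row sums to 1.
n_samples, n_cell_types = 40, 5
fractions = rng.dirichlet(np.ones(n_cell_types), size=n_samples)

# Synthetic labels: "positive" when the first cell type is above its median.
labels = (fractions[:, 0] > np.median(fractions[:, 0])).astype(int)

# Fit the scaler on the training fractions, then the classifier on scaled data.
scaler = StandardScaler().fit(fractions)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(scaler.transform(fractions), labels)

# Score a new synthetic sample with the same scaler and classifier.
new_sample = rng.dirichlet(np.ones(n_cell_types), size=1)
prob_positive = clf.predict_proba(scaler.transform(new_sample))[0, 1]
print(prob_positive)
```

Fitting the scaler once and reusing it at prediction time keeps new samples on the same scale as the training data, which is why the scaler is stored alongside the classifier.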
Many thanks to the Allen Institute for sharing their cohort data and enhancing reproducibility in science. Below are some helpful links to their preprint, analysis pipeline, scRNA-seq data, and celltypist models.
- Gong et al. 2024 preprint via bioRxiv
- Immune Health Atlas Analysis
- scRNA-seq Downloads – Dynamics of Immune Health with Age
- Model Downloads – Immune Health Atlas
Contributing
If you'd like to contribute to the development of CMVerify, please fork the repository and submit pull requests with proposed changes. All contributions are welcome!
License
CMVerify is released under the AGPL-3.0 License. See LICENSE for more information.
Thank you for using CMVerify! If you have any issues or questions, please feel free to open an issue on the repository or contact me at luke.trinity@jax.org
File details
Details for the file cmverify-1.0.3.tar.gz.
File metadata
- Download URL: cmverify-1.0.3.tar.gz
- Upload date:
- Size: 712.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7342362b3f68605d6e49c90acd98dd5ddd37a8c7d6d344a6ddb7ac1f0362ee65 |
| MD5 | 33de52f0adbf11cf61e5ec8ec3fe7699 |
| BLAKE2b-256 | bcea46c5e26b814b48c92339610b01e5219b19b55590450f34a88562796960cb |
File details
Details for the file cmverify-1.0.3-py3-none-any.whl.
File metadata
- Download URL: cmverify-1.0.3-py3-none-any.whl
- Upload date:
- Size: 717.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8932cbcaa0281a1a9fada4967dcaa60242bdb32636fed62ba01e4b8e4e7fb831 |
| MD5 | 9710b7de0d9ddde266dd26a94d52505f |
| BLAKE2b-256 | 27cadf84f23fa1511c8810bc98a0a323da04bde93235b55a93e61373b5504f69 |