scAgeClock: a single-cell transcriptome based human aging clock model using gated multi-head attention neural networks
Quick Start
making age prediction by scAgeClock (command-line version)
scAgeClock --model_file ${model_file} --testing_h5ad_files_dir ${h5ad_folder} --output_file ${out_file}
making age prediction by scAgeClock (python-script version)
from scageclock.evaluation import prediction
model_file="scAgeClock_GMA_model_state_dict.pth" ## pre-trained scAgeClock GMA model provided by scAgeClock
h5ad_folder="/path/to/h5adfiles/" ## scAgeClock formatted .h5ad files
results_df = prediction(model_file=model_file,
h5ad_dir=h5ad_folder)
Installation
install from PyPI
% conda create -n scAgeClock
% conda activate scAgeClock
% conda install python=3.12
% pip install scageclock
install from latest package
% conda create -n scAgeClock
% conda activate scAgeClock
% conda install python=3.12
% pip install scageclock-0.1.3.tar.gz # download the latest release
check the installation
check scAgeClock command
% scAgeClock --help
usage: scAgeClock [-h] [--model_file MODEL_FILE]
                  [--testing_h5ad_files_dir TESTING_H5AD_FILES_DIR]
                  [--output_file OUTPUT_FILE]

scAgeClock CLI tools

options:
  -h, --help            show this help message and exit
  --model_file MODEL_FILE
                        model file (eg: .pth file generated by scAgeClock GMA)
  --testing_h5ad_files_dir TESTING_H5AD_FILES_DIR
                        directory path to the .h5ad files used for model
                        prediction (format should be matched with scAgeClock
                        requirements)
  --output_file OUTPUT_FILE
                        output file with predicted results
check scAgeClock functions
#check the python imports
from scageclock.scAgeClock import training_pipeline
from scageclock.evaluation import prediction
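The two checks above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `check_scageclock_install` is hypothetical and not part of scAgeClock. It reports whether the `scageclock` package is importable, which version pip registered, and whether the `scAgeClock` command is on the PATH.

```python
import importlib.util
import shutil
from importlib import metadata


def check_scageclock_install(package="scageclock", command="scAgeClock"):
    """Report whether the package, its version, and the CLI entry point are visible.

    Hypothetical helper for illustration; it only inspects the current environment.
    """
    importable = importlib.util.find_spec(package) is not None
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # package metadata not found in this environment
    return {
        "package_importable": importable,
        "package_version": version,
        "cli_on_path": shutil.which(command) is not None,
    }


report = check_scageclock_install()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `package_importable` is False or `cli_on_path` is False, the conda environment is probably not activated or the pip install did not complete.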
Information About scAgeClock's Input Dataset
Input Dataset examples
- feature file: data/metadata/h5ad_var.tsv
- categorical index: data/metadata/categorical_features_index (assay, sex, tissue_general, and cell_type)
- h5ad example file: data/pytest_data/k_fold_mode/train_val/Fold1/Pytest_Fold1_200K_chunk27.h5ad (500 cells sampled)
- shape of anndata from h5ad file: N x 19238, where N is the number of cells
Anndata structure of scAgeClock's Input Dataset
## 19238 features, including 4 categorical features (the first four columns, in the order of assay, cell_type, tissue_general, and sex) and 19179 selected protein coding genes
AnnData object with n_obs × n_vars = 500 × 19238
obs: 'soma_joinid', 'age'
var: 'feature_id', 'feature_name'
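Before running prediction, it can help to compare your own file against the layout shown above. The sketch below is only an illustration: the helper names `check_layout` and `split_feature_columns` are hypothetical, and the checks (19238 features, an 'age' column in obs, four leading categorical columns) are taken from the example above rather than from scAgeClock's own validation code.

```python
N_FEATURES = 19238   # total number of columns reported for the example file
N_CATEGORICAL = 4    # leading categorical columns


def check_layout(n_vars, obs_columns, n_features=N_FEATURES):
    """Return a list of mismatches against the example layout (empty if none)."""
    problems = []
    if n_vars != n_features:
        problems.append(f"expected {n_features} features, found {n_vars}")
    if "age" not in obs_columns:
        problems.append("obs is missing the 'age' column")
    return problems


def split_feature_columns(n_vars, n_categorical=N_CATEGORICAL):
    """Return (categorical column positions, remaining column positions)."""
    return range(0, n_categorical), range(n_categorical, n_vars)


# Values copied from the example AnnData summary above.
print(check_layout(19238, ["soma_joinid", "age"]))
categorical_cols, gene_cols = split_feature_columns(19238)
print(list(categorical_cols), len(gene_cols))
```

With a real file, the same check could be called as `check_layout(adata.n_vars, adata.obs.columns)` after loading it with `sc.read_h5ad`.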
Formatting your data to scAgeClock's Input Format
data formatting example code
import scanpy as sc
import pandas as pd
import numpy as np
from scageclock.formatting import format_anndata_multiple
raw_h5ad_file = "/your/raw/inputfile/example.h5ad"
raw_adata_all = sc.read_h5ad(raw_h5ad_file,backed='r')
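meta_df = pd.read_parquet("example_meta.parquet") ## metadata for example.h5ad; its index must match the cell names (obs_names) of example.h5ad
filtered_meta_df = meta_df ## apply your own cell filtering here if needed; no filtering is applied in this example
split_dfs = np.array_split(filtered_meta_df, 10) ## split the cells into 10 chunks (to reduce memory usage while formatting)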
###load the matching table for the categorical features and update the .obs dataframe of the original anndata
meta_df = raw_adata_all.obs
cat_index_dict = {}
# matching table needs to be created based on your input anndata's .obs dataframe
# Example matching table files can be found in ./scageclock/data/example/data_formatting/obs_columns_matching_examples
for cat in ["assay","cell_type","tissue","sex"]:
    df = pd.read_excel(f"../{cat}_matching_table.xlsx")
    cat_index_dict[cat] = df
names_dict = {"platform":"assay",
              "cellType1":"cell_type",
              "tissue":"tissue",
              "sex":"sex"}
for original_colname in names_dict.keys():
    model_colname = names_dict[original_colname]
    cat_df = pd.DataFrame({"raw_id": meta_df[original_colname]})
    cat_df_with_index = pd.merge(cat_df,
                                 cat_index_dict[model_colname],
                                 left_on="raw_id",
                                 right_on="original_cat_name",
                                 how="left")
    meta_df[f"{model_colname}_index"] = list(cat_df_with_index["model_cat_index"])
## update original obs dataframe with scAgeClock index added
raw_adata_all.obs = meta_df
### loading the model's feature file
model_feature_df = pd.read_csv("./scageclock/data/metadata/h5ad_var.tsv",sep="\t")
model_genes = list(model_feature_df["h5ad_var"])[4:] #get the model's gene features
### reformat each chunk
chunk_id = 0
for chunk_df in split_dfs:
    chunk_id += 1
    adata_chunk = raw_adata_all[list(chunk_df.index)].to_memory()
    print(adata_chunk.obs_names[0])
    adata_formatted = format_anndata_multiple(adata_raw=adata_chunk,
                                              model_genes=model_genes,
                                              normalize=True,
                                              cat_cols=["assay_index", "cell_type_index", "tissue_index", "sex_index"])
    print(chunk_id)
    adata_formatted.write_h5ad(f"chunk{chunk_id}.h5ad")
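The categorical mapping step in the formatting example relies on a left merge, which leaves a missing value for any label that is absent from the matching table. The sketch below reproduces that step on toy data so that unmatched labels can be spotted before formatting. The column names `original_cat_name` and `model_cat_index` come from the example; the labels and index values are made up for illustration and are not the model's real indices.

```python
import pandas as pd

# Toy .obs column and toy matching table (made-up labels and index values).
obs = pd.DataFrame({"sex": ["male", "female", "unknown", "male"]},
                   index=["cell1", "cell2", "cell3", "cell4"])
matching = pd.DataFrame({"original_cat_name": ["female", "male"],
                         "model_cat_index": [0, 1]})

merged = pd.merge(obs[["sex"]].rename(columns={"sex": "raw_id"}),
                  matching,
                  left_on="raw_id",
                  right_on="original_cat_name",
                  how="left")

# Duplicate entries in the matching table would add rows and misalign the cells.
assert len(merged) == len(obs)

# Labels that are absent from the matching table end up with NaN as their index.
missing_mask = merged["model_cat_index"].isna()
unmatched = sorted(merged.loc[missing_mask, "raw_id"].unique())
print("labels without a model index:", unmatched)

# Assign by position, because the merge result carries a fresh 0..n-1 row index.
obs["sex_index"] = merged["model_cat_index"].to_numpy()
print(obs)
```

Any label reported as unmatched needs a row in the matching table (or the cells need to be filtered out) before the chunks are written.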
scAgeClock model training and age prediction examples
about example data and model
- example data can be found at "data/pytest_data" of this repository
- example GMA model file can be found at "data/trained_models" of this repository
current supported model types
- GMA (Gated Multi-head Attention Neural Networks; default and recommended)
- MLP (Multi-layer Perceptron)
- linear (Elastic Net based Linear regression model)
- xgboost
- catboost
making age prediction (python script version)
from scageclock.evaluation import prediction
model_file="./data/trained_models/scAgeClock_GMA_model_state_dict.pth"
h5ad_folder="./data/pytest_data/train_val_test_mode/test/"
results_df = prediction(model_file=model_file,
h5ad_dir=h5ad_folder)
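When the true ages of the test cells are known, the predictions can be summarized with the mean absolute error and the Pearson correlation. The column names of `results_df` are not documented here, so the following sketch works on plain lists of made-up ages; `mean_absolute_error` and `pearson_r` are hypothetical helpers written for this illustration, not scAgeClock functions.

```python
import math


def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages."""
    return sum(abs(t - p) for t, p in zip(true_ages, predicted_ages)) / len(true_ages)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up ages for illustration only.
true_ages = [25, 40, 60, 75]
predicted_ages = [30, 38, 55, 80]

mae = mean_absolute_error(true_ages, predicted_ages)
r = pearson_r(true_ages, predicted_ages)
print(f"MAE = {mae:.2f} years, Pearson r = {r:.3f}")
```

To use it with real output, pass the true-age and predicted-age columns of the results table as lists.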
making age prediction (command-line version)
#!/bin/bash
model_file="./data/trained_models/scAgeClock_GMA_model_state_dict.pth"
h5ad_folder="./data/pytest_data/train_val_test_mode/test/"
scAgeClock --model_file ${model_file} --testing_h5ad_files_dir ${h5ad_folder} --output_file './tmp/test_predicted.xlsx'
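The same command can be launched from Python, for example inside a larger pipeline. This is a minimal sketch using `subprocess`; the helper names are hypothetical, and only the three flags listed in `scAgeClock --help` are used. Creating the output folder beforehand is a precaution, since it is not stated whether the command creates it.

```python
import subprocess
from pathlib import Path


def build_prediction_command(model_file, h5ad_dir, output_file):
    """Assemble the scAgeClock command line as an argument list."""
    return ["scAgeClock",
            "--model_file", str(model_file),
            "--testing_h5ad_files_dir", str(h5ad_dir),
            "--output_file", str(output_file)]


def run_prediction(model_file, h5ad_dir, output_file):
    """Create the output folder if needed, then run the scAgeClock command."""
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_prediction_command(model_file, h5ad_dir, output_file),
                   check=True)  # raises CalledProcessError on a non-zero exit code


command = build_prediction_command(
    "./data/trained_models/scAgeClock_GMA_model_state_dict.pth",
    "./data/pytest_data/train_val_test_mode/test/",
    "./tmp/test_predicted.xlsx",
)
print(" ".join(command))
```

Calling `run_prediction(...)` with the same three arguments executes the command; it requires scAgeClock to be installed and on the PATH.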
get model feature importance (GMA model)
from scageclock.scAgeClock import load_GMA_model, get_feature_importance
model_file = "./data/trained_models/scAgeClock_GMA_model_state_dict.pth"
gma_model = load_GMA_model(model_file)
feature_file = "data/metadata/h5ad_var.tsv"
feature_importance = get_feature_importance(gma_model,feature_file=feature_file)
#sort by feature importance score
feature_importance = feature_importance.sort_values(by="feature_importance",ascending=False)
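Once the scores are sorted, a common next step is to keep only the highest-ranked features. The sketch below uses a toy pandas DataFrame as a stand-in for the returned table; the `feature_importance` column name comes from the sorting call above, while the `feature_name` column, the gene labels, and the scores are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the feature importance table (made-up names and scores).
toy_importance = pd.DataFrame({
    "feature_name": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "feature_importance": [0.10, 0.45, 0.05, 0.40],
})


def top_features(df, n=10, score_col="feature_importance"):
    """Return the n rows with the largest scores, highest first."""
    return df.nlargest(n, score_col).reset_index(drop=True)


top2 = top_features(toy_importance, n=2)
print(top2)
```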
model training with validation and testing
from scageclock.scAgeClock import training_pipeline
model_name = "GMA" # Gated Multihead Attention Neural Network, default model of scAgeClock
ad_dir_root = "data/pytest_data/train_val_test_mode/"
meta_file = "data/pytest_data/pytest_dataset_metadata.parquet"
dataset_folder_dict = {"training": "train", "validation": "val", "testing": "test"}
predict_dataset = "testing"
loader_method = "scageclock"
out_root_dir = "./tmp/"
results = training_pipeline(model_name=model_name,
ad_dir_root=ad_dir_root,
meta_file_path=meta_file,
dataset_folder_dict=dataset_folder_dict,
predict_dataset=predict_dataset,
validation_during_training=True,
loader_method=loader_method,
out_root_dir=out_root_dir)
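A quick check of the folder layout before training can save a failed run. The sketch below counts the .h5ad files in each subfolder named in `dataset_folder_dict`; it assumes the files sit directly inside each split folder, as in the example data, and the helper name `count_h5ad_files` is hypothetical. The demonstration builds a throwaway directory with empty placeholder files rather than real data.

```python
import tempfile
from pathlib import Path


def count_h5ad_files(ad_dir_root, dataset_folder_dict):
    """Count .h5ad files directly inside each split folder (0 if the folder is missing)."""
    root = Path(ad_dir_root)
    counts = {}
    for split, subfolder in dataset_folder_dict.items():
        folder = root / subfolder
        counts[split] = len(list(folder.glob("*.h5ad"))) if folder.is_dir() else 0
    return counts


# Demonstration on a temporary directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train/a.h5ad", "train/b.h5ad", "val/c.h5ad"):
        path = Path(tmp) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    counts = count_h5ad_files(
        tmp, {"training": "train", "validation": "val", "testing": "test"})

print(counts)
```

A count of 0 for any split means the corresponding folder is missing or empty.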
model training with cross-validation mode (one round)
from scageclock.scAgeClock import training_pipeline
model_name = "GMA" # Gated Multihead Attention Neural Network, default model of scAgeClock
k_fold_data_dir="data/pytest_data/k_fold_mode/" # h5ad files are located at train_val/Fold1; train_val/Fold2; train_val/Fold3
meta_file = "data/pytest_data/pytest_dataset_metadata.parquet"
dataset_folder_dict = {"training_validation": "train_val"}
predict_dataset = "validation" ## prediction based on the validation dataset
loader_method = "scageclock"
out_root_dir = "./tmp/"
results = training_pipeline(model_name=model_name,
ad_dir_root=k_fold_data_dir,
meta_file_path=meta_file,
dataset_folder_dict=dataset_folder_dict,
predict_dataset=predict_dataset,
K_fold_mode=True,
K_fold_train=("Fold1", "Fold2"),
K_fold_val="Fold3",
validation_during_training=False,
loader_method=loader_method,
out_root_dir=out_root_dir)
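The call above runs a single round, with Fold3 held out for validation. A full cross-validation repeats it with each fold held out in turn. The sketch below only generates the `K_fold_train` and `K_fold_val` values for every round; the helper name `k_fold_rounds` is hypothetical, and passing the values on to `training_pipeline` is left to the caller.

```python
def k_fold_rounds(folds):
    """Return one dict per round, holding out each fold once for validation."""
    rounds = []
    for val_fold in folds:
        train_folds = tuple(f for f in folds if f != val_fold)
        rounds.append({"K_fold_train": train_folds, "K_fold_val": val_fold})
    return rounds


rounds = k_fold_rounds(["Fold1", "Fold2", "Fold3"])
for i, fold_args in enumerate(rounds, start=1):
    print(i, fold_args)
    # Each dict could be passed as training_pipeline(..., K_fold_mode=True, **fold_args),
    # ideally with a separate out_root_dir per round so results are not overwritten.
```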
model training with validation and testing (catboost)
from scageclock.scAgeClock import training_pipeline
model_name = "catboost" # CatBoost model (GMA is the default model of scAgeClock)
ad_dir_root = "data/pytest_data/train_val_test_mode/"
meta_file = "data/pytest_data/pytest_dataset_metadata.parquet"
dataset_folder_dict = {"training": "train", "validation": "val", "testing": "test"}
predict_dataset = "testing"
loader_method = "scageclock"
out_root_dir = "./tmp/"
results = training_pipeline(model_name=model_name,
ad_dir_root=ad_dir_root,
meta_file_path=meta_file,
dataset_folder_dict=dataset_folder_dict,
predict_dataset=predict_dataset,
validation_during_training=True,
loader_method=loader_method,
train_dataset_fully_loaded=True, ##make sure the memory is enough
out_root_dir=out_root_dir)
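Because `train_dataset_fully_loaded=True` needs enough memory for the whole training set, a rough estimate is useful before starting. The sketch below assumes the data are held as a dense matrix of 4-byte values with 19238 features per cell; this is an assumption for illustration, and the real footprint depends on how scAgeClock stores the data internally. The helper name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(n_cells, n_features=19238, bytes_per_value=4):
    """Size in GiB of a dense n_cells x n_features matrix."""
    return n_cells * n_features * bytes_per_value / 2**30


for n_cells in (500, 200_000, 1_000_000):
    print(f"{n_cells:>9,} cells: {dense_matrix_gib(n_cells):7.2f} GiB")
```

Under these assumptions, 200,000 cells already take roughly 14 GiB, so larger training sets may call for a high-memory machine.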
About the author
- Author: Gangcai Xie (Medical School of Nantong University)