Skip to main content

scAgeClock: a single-cell transcriptome based human aging clock model using gated multi-head attention neural networks

Project description

scAgeClock

scAgeClock: a single-cell transcriptome based human aging clock model using gated multi-head attention neural networks

Quick Start

making age prediction by scAgeClock (command-line version)

scAgeClock --model_file ${model_file} --testing_h5ad_files_dir ${h5ad_folder} --output_file ${out_file}

making age prediction by scAgeClock (python-script version)

from scageclock.evaluation import prediction
model_file="scAgeClock_GMA.pth" ## pre-trained scAgeClock GMA model provided by scAgeClock
h5ad_folder="/path/to/h5adfiles/" ## scAgeClock formatted .h5ad files
results_df = prediction(model_file=model_file,
		    h5ad_dir=h5ad_folder)

Installation

install from package

% conda create -n scAgeClock
% conda activate scAgeClock
% conda install python=3.12
% pip install scageclock-0.1.2.tar.gz # download the latest release
#check the command-line version
% scAgeClock --help
usage: scAgeClock [-h] [--model_file MODEL_FILE]
                  [--testing_h5ad_files_dir TESTING_H5AD_FILES_DIR]
                  [--output_file OUTPUT_FILE]

scAgeClock CLI tools

options:
  -h, --help            show this help message and exit
  --model_file MODEL_FILE
                        model file (eg: .pth file generated by scAgeClock GMA)
  --testing_h5ad_files_dir TESTING_H5AD_FILES_DIR
                        directory path to the .h5ad files used for model
                        prediction (format should be matched with scAgeClock
                        requirements)
  --output_file OUTPUT_FILE
                        output file with predicted results
#check the python imports
from scageclock.scAgeClock import training_pipeline
from scageclock.evaluation import prediction

install from remote repository

% pip install scageclock

Information About scAgeClock's Input Dataset

Input Dataset examples

  • feature file: data/metadata/h5ad_var.tsv
  • categorical index: data/metadata/categorical_features_index (assay, sex, tissue_general, and cell_type)
  • h5ad example file: data/pytest_data/k_fold_mode/train_val/Fold1/Pytest_Fold1_200K_chunk27.h5ad (500 cells sampled)
  • shape of anndata from h5ad file: N x 19183, where N is the number of cells

Anndata structure of scAgeClock's Input Dataset

## 19183 features, including 4 categorical features (the first four columns, in the order of assay, cell_type, tissue_general, and sex) and 19179 selected protein coding genes
AnnData object with n_obs × n_vars = 500 × 19183
    obs: 'soma_joinid', 'age'
    var: 'feature_id', 'feature_name'

Formatting your data to scAgeClock's Inputs Format

Click to check the data formatting example code
import scanpy as sc
import pandas as pd
import numpy as np
from scageclock.formatting import format_anndata_multiple
raw_h5ad_file = "/your/raw/inputfile/example.h5ad"
raw_adata_all = sc.read_h5ad(raw_h5ad_file,backed='r')
meta_df = pd.read_parquet("example_meta.parquet") ## metadata for example.h5ad
split_dfs = np.array_split(filtered_meta_df, 10) ## split the cells into 10 chunks (to reduce memory loading while formatting)
###load the matching table for the categorical features and update the .obs dataframe of the original anndata
meta_df = raw_adata_all.obs
cat_index_dict = {}
# matching table needs to be created based on your input anndata's .obs dataframe
# Example matching table files can be found in ./scageclock/data/example/data_formatting/obs_columns_matching_examples
for cat in ["assay","cell_type","tissue","sex"]:
    df = pd.read_excel(f"../{cat}_matching_table.xlsx")
    cat_index_dict[cat] = df
names_dict = {"platform":"assay",
              "cellType1":"cell_type",
              "tissue":"tissue",
             "sex":"sex"}

for original_colname in names_dict.keys():
    model_colname = names_dict[original_colname]
    cat_df = pd.DataFrame({"raw_id": meta_df[original_colname]})
    cat_df_with_index = pd.merge(cat_df, 
                                   cat_index_dict[model_colname], 
                                   left_on="raw_id",
                                   right_on="original_cat_name",
                                  how="left")
    meta_df[f"{model_colname}_index"] = list(cat_df_with_index["model_cat_index"])

## update original obs dataframe with scAgeClock index added
raw_adata_all.obs = meta_df

### loading the model's feature file
model_feature_df = pd.read_csv("./scageclock/data/metadata/h5ad_var.tsv",sep="\t")
model_genes = list(model_feature_df["h5ad_var"])[4:] #get the model's gene features

### refomat for each chunks
chunk_id = 0
for chunk_df in split_dfs:
    chunk_id += 1
    adata_chunk = raw_adata_all[list(chunk_df.index)].to_memory()
    print(adata_chunk.obs_names[0])
    adata_formatted = format_anndata_multiple(adata_raw=adata_chunk,
                                             model_genes=model_genes,
                                             normalize=True,
                                             cat_cols=["assay_index", "cell_type_index", "tissue_index", "sex_index"])
    print(chunk_id)
    adata_formatted.write_h5ad(f"chunk{chunk_id}.h5ad")

scAgeClock model training and age prediction examples

about example data and model

  • example data can be found at "data/pytest_data" of this repository
  • example GMA model file can be found at "data/trained_models/GMA_models" of this repository

current supported model types

  • $${\color{red}GMA\space(Gated \space Multi-head \space Attention \space Neural \space Networks, default \space and \space recommended)}$$
  • MLP (Multi-layer Perceptron)
  • linear (Elastic Net based Linear regression model)
  • xgboost
  • catboost

making age prediction (python script version)

from scageclock.evaluation import prediction
model_file="./data/trained_models/GMA_models/GMA_celltype_balanced_basicRun.pth"
h5ad_folder="./data/pytest_data/train_val_test_mode/test/"
results_df = prediction(model_file=model_file,
		    h5ad_dir=h5ad_folder)

making age prediction (command-line version)

#!/bin/bash
model_file="./data/trained_models/GMA_models/GMA_celltype_balanced_basicRun.pth"
h5ad_folder="./data/pytest_data/train_val_test_mode/test/"
scAgeClock --model_file ${model_file} --testing_h5ad_files_dir ${h5ad_folder} --output_file './tmp/test_predicted.xlsx'

get model feature importance (GMA model)

from scageclock.scAgeClock import load_GMA_model, get_feature_importance
model_file = "./data/trained_models/GMA_models/GMA_celltype_balanced_basicRun.pth"
gma_model = load_GMA_model(model_file)
feature_file = "data/metadata/h5ad_var.tsv"
feature_importance = get_feature_importance(gma_model,feature_file=feature_file)
#sort by feature importance score
feature_importance = feature_importance.sort_values(by="feature_importance",ascending=False)

model training with validation and testing

from scageclock.scAgeClock import training_pipeline
model_name = "GMA" # Gated Multihead Attention Neural Network, default model of scAgeClock
ad_dir_root = "data/pytest_data/train_val_test_mode/"
meta_file = "data/pytest_data/pytest_dataset_metadata.parquet"
dataset_folder_dict = {"training": "train", "validation": "val", "testing": "test"}
predict_dataset = "testing"
loader_method = "scageclock"
out_root_dir = "./tmp/"
results = training_pipeline(model_name=model_name,
			    ad_dir_root=ad_dir_root,
			    meta_file_path=meta_file,
			    dataset_folder_dict=dataset_folder_dict,
			    predict_dataset=predict_dataset,
			    validation_during_training=True,
			    loader_method=loader_method,
			    out_root_dir=out_root_dir)

model training with cross-validation mode (one round)

from scageclock.scAgeClock import training_pipeline
model_name = "GMA" # Gated Multihead Attention Neural Network, default model of scAgeClock
k_fold_data_dir="data/pytest_data/k_fold_mode/" # h5ad files are located at train_val/Fold1; train_val/Fold2; train_val/Fold3
meta_file = "data/pytest_data/pytest_dataset_metadata.parquet"
dataset_folder_dict = {"training_validation": "train_val"}
predict_dataset = "validation" ## prediction based on the validation dataset
loader_method = "scageclock"
out_root_dir = "./tmp/"

results = training_pipeline(model_name=model_name,
			ad_dir_root=k_fold_data_dir,
			meta_file_path=meta_file,
			dataset_folder_dict=dataset_folder_dict,
			predict_dataset=predict_dataset,
			K_fold_mode=True,
			K_fold_train=("Fold1", "Fold2"),
			K_fold_val="Fold3",
			validation_during_training=False,
			loader_method=loader_method,
			out_root_dir=out_root_dir)

model training with cross-validation mode (one round for catboost)

from scageclock.scAgeClock import training_pipeline
model_name = "catboost" # Gated Multihead Attention Neural Network, default model of scAgeClock
ad_dir_root = "data/pytest_data/train_val_test_mode/"
meta_file = "data/pytest_data/pytest_dataset_metadata.parquet"
dataset_folder_dict = {"training": "train", "validation": "val", "testing": "test"}
predict_dataset = "testing"
loader_method = "scageclock"
out_root_dir = "./tmp/"
results = training_pipeline(model_name=model_name,
			    ad_dir_root=ad_dir_root,
			    meta_file_path=meta_file,
			    dataset_folder_dict=dataset_folder_dict,
			    predict_dataset=predict_dataset,
			    validation_during_training=True,
			    loader_method=loader_method,
			    train_dataset_fully_loaded=True, ##make sure the memory is enough
			    out_root_dir=out_root_dir)

About the author

  • Author: Gangcai Xie (Medical School of Nantong University);
  • ORCID

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scageclock-0.1.2.tar.gz (44.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

scageclock-0.1.2-py3-none-any.whl (56.7 kB view details)

Uploaded Python 3

File details

Details for the file scageclock-0.1.2.tar.gz.

File metadata

  • Download URL: scageclock-0.1.2.tar.gz
  • Upload date:
  • Size: 44.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.3 Linux/6.11.0-26-generic

File hashes

Hashes for scageclock-0.1.2.tar.gz
Algorithm Hash digest
SHA256 8a024c8d4641644263a69dfff7a0fbedcf5e883a5b96ca3221dfb34e038f5e00
MD5 b26132d3d93013be021858fb522809ca
BLAKE2b-256 609fc33e86ec1429a801ab3c88ee5ecc68b4c051a155895ddcbe273f175710b9

See more details on using hashes here.

File details

Details for the file scageclock-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: scageclock-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 56.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.3 Linux/6.11.0-26-generic

File hashes

Hashes for scageclock-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 ecf24a16af537c85f868fd701cae1741ee8fc41a0dda3ea226605eafa07784c4
MD5 47841fef47f0b7aff657eeb81d4941cf
BLAKE2b-256 f9b780351a3557588fc4dbff4f01dc06414546ddde8de03c7a29029910f2b985

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page