No project description provided
Project description
Via
VIA is a single-cell Trajectory Inference method that offers topology construction, pseudotimes, automated terminal state prediction and automated plotting of temporal gene dynamics along lineages. VIA combines lazy-teleporting random walks and Monte-Carlo Markov Chain simulations to overcome common challenges such as 1) accurate terminal state and lineage inference, 2) ability to capture combination of cyclic, disconnected and tree-like structures, 3) scalability in feature and sample space. It is also well-suited for multi-omic analysis. In addition to transcriptomic data, VIA works on scATAC-seq, flow and imaging cytometry data
Getting Started
install using pip
We recommend setting up a new conda environment
conda create --name ViaEnv pip
pip install pyVIA // tested on linux
install by cloning repository and running setup.py (ensure dependencies are installed)
git clone https://github.com/ShobiStassen/VIA.git
python3 setup.py install // cd into the directory of the cloned PARC folder containing setup.py and issue this command
install dependencies separately if needed (linux)
If the pip install doesn't work, it usually suffices to first install all the requirements (using pip) and subsequently install VIA (also using pip)
pip install python-igraph, leidenalg>=0.7.0, hnswlib, umap-learn, numpy>=1.17, scipy, pandas>=0.25, sklearn, termcolor, pygam, phate
pip install pyVIA
Examples
1.a Human Embryoid Bodies (wrapper function)
1.b Human Embryoid Bodies (Configuring VIA)
2.a Toy Data (multifurcation)
2.b Toy Data (disconnected)
3.a General input format and wrapper function
3.b General disconnected trajectories wrapper function
1.a Human Embryoid Bodies (wrapper function)
save the Raw data matrix as 'EBdata.mat'. The cells in this file have been filtered for too small/large libraries by Moon et al. 2019
The function main_EB_clean() preprocesses the cells (normalized by library size, sqrt transformation). It then calls VIA to: plot the pseudotimes, terminal states, lineage pathways and gene-clustermap. The visualization method used in this function is PHATE.
import pyVia.core as via
via.main_EB_clean(ncomps=30, knn=20, p0_random_seed=20, foldername = '') # Most reasonable parameters of ncomps (10-200) and knn (15-50) work well
1.b Human Embryoid Bodies (Configuring VIA)
If you wish to run the data using UMAP or TSNE (instead of PHATE), or require more control of the parameters/outputs, then use the following code:
import pyVia.core as via
#pre-process the data as needed and provide to via as a numpy array
#root_user is the index of the cell corresponding to a suitable start/root cell
v0 = VIA(input_data, time_labels, jac_std_global=0.15, dist_std_local=1, knn=knn,
too_big_factor=v0_too_big, root_user=1, dataset='EB', random_seed=v0_random_seed,
do_magic_bool=True, is_coarse=True, preserve_disconnected=True)
v0.run_VIA()
tsi_list = get_loc_terminal_states(v0, input_data) #translate the terminal clusters found in v0 to the fine-grained run in v1
v1 = VIA(input_data, time_labels, jac_std_global=0.15, dist_std_local=1, knn=knn,
too_big_factor=v1_too_big, super_cluster_labels=v0.labels, super_node_degree_list=v0.node_degree_list,
super_terminal_cells=tsi_list, root_user=1, is_coarse=False, full_neighbor_array=v0.full_neighbor_array,
full_distance_array=v0.full_distance_array, ig_full_graph=v0.ig_full_graph,
csr_array_locally_pruned=v0.csr_array_locally_pruned,
x_lazy=0.95, alpha_teleport=0.99, preserve_disconnected=True, dataset='EB',
super_terminal_clusters=v0.terminal_clusters, random_seed=21)
v1.run_VIA()
#Plot the true and inferred times and pseudotimes
#Replace Y_phate with UMAP, TSNE embedding
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.scatter(Y_phate[:, 0], Y_phate[:, 1], c=time_labels, s=5, cmap='viridis', alpha=0.5)
ax2.scatter(Y_phate[:, 0], Y_phate[:, 1], c=v1.single_cell_pt_markov, s=5, cmap='viridis', alpha=0.5)
ax1.set_title('Embyroid Data: Days')
ax2.set_title('Embyroid Data: VIA')
plt.show()
#obtain the single-cell locations of the terminal clusters to be used for visualization of trajectories/lineages
super_clus_ds_PCA_loc = via.sc_loc_ofsuperCluster_PCAspace(v0, v1, np.arange(0, len(v1.labels)))
#draw the overall lineage paths on the embedding
draw_trajectory_gams(Y_phate, super_clus_ds_PCA_loc, v1.labels, v0.labels, v0.edgelist_maxout,
v1.x_lazy, v1.alpha_teleport, v1.single_cell_pt_markov, time_labels, knn=v0.knn,
final_super_terminal=v1.revised_super_terminal_clusters,
sub_terminal_clusters=v1.terminal_clusters,
title_str='Pseudotime and path', ncomp=ncomps)
2D_knn_hnsw = via.make_knn_embeddedspace(Y_phate) #used to visualize the path obtained in the high-dimensional KNN
#draw the individual lineage paths and cell-fate probabilities at single-cell level
via.draw_sc_evolution_trajectory_dijkstra(v1, Y_phate, 2D_knn_hnsw, v0.full_graph_shortpath,
idx=np.arange(0, input_data.shape[0]))
plt.show()
2.a/b Toy data (Multifurcation and Disconnected)
Two examples toy datasets with annotations are generated using DynToy are provided.
import pyVia.core as via
#multifurcation
#the root is automatically set to root_user = 'M1'
via.main_Toy(ncomps=10, knn=30,dataset='Toy3', random_seed=2,foldername = ".../Trajectory/Datasets/") #multifurcation
#disconnected trajectory
#the root is automatically set as a list root_user = ['T1_M1', 'T2_M1'] # e.g. T2_M3 is a cell belonging to the 3rd Milestone (M3) of the second Trajectory (T2)
via.main_Toy(ncomps=10, knn=30,dataset='Toy4',random_seed=2,foldername =".../Trajectory/Datasets/") #2 disconnected trajectories
Output of Multifurcating toy dataset
Output of disconnected toy dataset
3.a General input format and wrapper function (uses example of pre-B cell differentiation)
Datasets and labels used in this example are provided in Datasets.
# Read the two files:
# 1) the first file contains 200PCs of the Bcell filtered and normalized data for the first 5000 HVG.
# 2)The second file contains raw count data for marker genes
data = pd.read_csv('./Bcell_200PCs.csv')
data_genes = pd.read_csv('./Bcell_markergenes.csv')
data_genes = data_genes.drop(['cell'], axis=1)
true_label = data['time_hour']
data = data.drop(['cell', 'time_hour'], axis=1)
adata = sc.AnnData(data_genes)
adata.obsm['X_pca'] = data.values
# use UMAP or PHate to obtain embedding that is used for single-cell level visualization
embedding = umap.UMAP(random_state=42, n_neighbors=15, init='random').fit_transform(data.values[:, 0:5])
# list marker genes or genes of interest if known in advance. otherwise marker_genes = []
marker_genes = ['Igll1', 'Myc', 'Slc7a5', 'Ldha', 'Foxo1', 'Lig4', 'Sp7'] # irf4 down-up
# call VIA. We identify an early (suitable) start cell root = [42]. Can also set an arbitrary value
via.via_wrapper(adata, true_label, embedding, knn=20, ncomps=20, jac_std_global=0.15, root=[42], dataset='',
random_seed=1,v0_toobig=0.3, v1_toobig=0.1, marker_genes=marker_genes)
3.b VIA wrapper for generic disconnected trajectory
#foldername corresponds to the location where you have saved the Toy Disconnected data (shown in example 2)
#Read in the data and labels
df_counts = pd.read_csv(foldername + "toy_disconnected_M9_n1000d1000.csv", 'rt', delimiter=",")
df_ids = pd.read_csv(foldername + "toy_disconnected_M9_n1000d1000_ids.csv", 'rt', delimiter=",")
# Make AnnData object for wrapper function to read-in data and do PCA
df_ids['cell_id_num'] = [int(s[1::]) for s in df_ids['cell_id']]
df_counts = df_counts.drop('Unnamed: 0', 1)
df_ids = df_ids.sort_values(by=['cell_id_num'])
df_ids = df_ids.reset_index(drop=True)
true_label = df_ids['group_id']
adata_counts = sc.AnnData(df_counts, obs=df_ids)
sc.tl.pca(adata_counts, svd_solver='arpack', n_comps=ncomps)
#Since there are 2 disconnected trajectories, we provide 2 arbitrary roots (start cells).If there are more disconnected paths, then VIA arbitrarily selects roots. #The root can also just be arbitrarily set as [1] and VIA can detect how many additional roots it must add
via_wrapper_disconnected(adata_counts, true_label, embedding=adata_counts.obsm['X_pca'][:, 0:2], root=[1,1], preserve_disconnected=True, knn=30, ncomps=10,cluster_graph_pruning_std = 1)
#in the case of connected data (i.e. only 1 graph component. e.g. Toy Data Multifurcating) then the wrapper function from example 3.a can be used:
#via_wrapper(adata_counts, true_label, embedding= adata_counts.obsm['X_pca'][:,0:2], root=[1], knn=30, ncomps=10,cluster_graph_pruning_std = 1)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.