JDtI – Python library for scRNAseq/RNAseq data analysis
Author: Jakub Kubiś
Polish Academy of Sciences
Laboratory of Single Cell Analyses
Description
JDtI covers basic quality control, such as checking the number of cells per cluster and the number of genes per cell, as well as more advanced tasks such as subclustering, integration, and a wide range of visualizations. Cell information is not dropped when sets are analyzed separately; instead, the cluster and cell lineage information from earlier analyses is used to integrate data based on cluster markers and data harmonization. After integration, cell interactions and relationships can be visualized in several ways, including cell distances and correlations.
In addition, JDtI can run differential expression (DEG) analysis between sets, selected cells, or grouped cells, and visualize the results on UMAP plots, volcano plots, and regression plots comparing pairs of cells. This is useful for more advanced analyses that focus on specific questions within the data and that basic analyses may miss.
JDtI also offers many functions for data processing and visualization with clean visual outputs, such as volcano plots, gene expression plots for different data types, clustering, heatmaps, and more.
It is compatible with various sequencing approaches, including scRNA-seq and bulk RNA-seq, and supports interoperability with tools such as Seurat, Scanpy, and other bioinformatics frameworks using the 10x sparse matrix format as input. More details about the available functions can be found in the Documentation and Example Usage section on GitHub.
Table of contents
Example usage:
1. Basic functions
2. Data clustering
3. Data integration
4. Data subclustering
Installation
pip install jdti
Documentation
Documentation for classes and functions is available here 👉 Documentation 📄
Example usage
1. Basic functions
1.1. Loading functions
from jdti import *
1.2. Loading data
import pandas as pd
# load a sparse matrix as a pd.DataFrame and create the matching metadata
data, metadata = load_sparse(path = 'data/set1', name = 'set1')
# load a data frame from another file type (.csv, .tsv, .txt)
data = pd.read_csv('example_data.csv')
# data from .h5 or other file types can also be used after conversion to a pandas data frame
- Data [features (eg. genes) x sample (eg. cells)]
- Metadata [columns['cell_names', 'sets']]:
- cell_names – sample names corresponding to the columns of Data
- sets – the assignment of each sample to a given dataset, aligned with Data
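The layout described above can be illustrated with a small toy example. This is a minimal sketch in plain pandas and does not call JDtI; the gene names and values are invented, and the use of cluster labels as column names is an assumption based on the examples that follow.

```python
import pandas as pd

# Toy expression matrix: rows are features (genes), columns are samples (cells).
# Several cells may share one column label (assumed here to be a cluster name).
data = pd.DataFrame(
    [[5, 0, 2, 1],
     [0, 3, 0, 4],
     [1, 1, 0, 0]],
    index=["KIT", "PAX3", "EDNRB"],
    columns=["0", "0", "1", "2"],
)

# Metadata with one row per column of `data`, in the same order.
metadata = pd.DataFrame({
    "cell_names": list(data.columns),
    "sets": ["set1"] * data.shape[1],
})

# The two objects are aligned when the labels match position by position.
aligned = list(metadata["cell_names"]) == list(data.columns)
print(data.shape, aligned)
```

Keeping the metadata rows in the same order as the data columns is what lets the set assignment follow each cell through later steps.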
1.3. Features finding
features = find_features(data, features =['KIT', 'MC1', 'EDNRB', 'PAX3'])
- The MC1 feature name was not found, so potential matching names are suggested
features = find_features(data, features =['KIT', 'MC1R', 'EDNRB', 'PAX3'])
- All feature names have been found
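The idea of suggesting close names for a missing feature can be sketched with the standard library. The helper `suggest_features` below is hypothetical and is not JDtI's `find_features`; only the `'included'` key is taken from the examples in this document, while the other keys and the matching rule are assumptions.

```python
from difflib import get_close_matches

def suggest_features(available, requested, n=3, cutoff=0.6):
    """Split requested names into found and missing, with close-match hints.

    Hypothetical helper for illustration; not part of JDtI.
    """
    lookup = {name.upper(): name for name in available}
    included, not_included, potential = [], [], {}
    for name in requested:
        key = name.upper()
        if key in lookup:
            included.append(lookup[key])
        else:
            not_included.append(name)
            hits = get_close_matches(key, list(lookup), n=n, cutoff=cutoff)
            potential[name] = [lookup[h] for h in hits]
    return {"included": included,
            "not_included": not_included,
            "potential": potential}

genes = ["KIT", "MC1R", "EDNRB", "PAX3", "MITF"]
result = suggest_features(genes, ["KIT", "MC1", "EDNRB", "PAX3"])
print(result["included"])
print(result["potential"])
```

Here the misspelled `MC1` is reported as missing and `MC1R` is offered as the likely intended name.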
1.4. Names finding
names = find_names(data, names = ['0', '1', '2','10', '1&'])
- Works the same way as 'Features finding'
1.5. Data reducing
# reduce the data to the found features and names
data_reduced = reduce_data(data,
features = features['included'],
names = names['included'])
- returns the data restricted to the selected features and names
1.6. Data averaging and occurrence counting
avg_reduced = average(data_reduced)
occ_reduced = occurrence(data_reduced)
- returns the average or occurrence values computed across all columns that share the same name
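Grouping columns that share a name can be sketched in plain pandas. This is not JDtI's implementation; in particular, treating occurrence as the fraction of cells with a positive value is an assumption, and the library may report it on a different scale.

```python
import pandas as pd

data_reduced = pd.DataFrame(
    [[4.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 3.0, 1.0]],
    index=["KIT", "PAX3"],
    columns=["0", "0", "1", "1"],
)

# Mean per feature within each group of same-named columns.
avg = data_reduced.T.groupby(level=0).mean().T

# Fraction of cells with a positive value per feature within each group
# (assumed definition of "occurrence").
occ = (data_reduced > 0).T.groupby(level=0).mean().T

print(avg)
print(occ)
```

Transposing before grouping avoids the deprecated column-wise `groupby` in recent pandas versions.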
1.7. Difference counting (DEG) and visualization
# create a group dict for comparing samples
compare_dict = {'g1':['0', '1'],
'g2':['2','10']}
deg_df = calc_DEG(data,
metadata_list = None,
entities = compare_dict,
sets = None,
min_exp = 0,
min_pct = 0.1,
n_proc =10)
# DEG visualization with volcano plot
fig = volcano_plot(deg_df,
p_adj = True,
top = 25,
p_val = 0.05,
lfc = 0.25,
standard_scale = False,
rescale_adj = True,
image_width = 12,
image_high = 12)
fig.savefig('volcano.jpeg', dpi=300, bbox_inches='tight')
- DEG data:
- feature – Name of the studied feature
- p_val – P-value (Mann–Whitney) for the studied feature comparing the valid_group to all other groups in the analysis
- pct_valid – Percentage of positive (>0) values for the studied feature in the valid_group
- pct_ctrl – Percentage of positive (>0) values for the studied feature in all other groups
- avg_valid – Average value of the studied feature in the valid_group
- avg_ctrl – Average value of the studied feature in the remaining groups
- sd_valid – Standard deviation of the studied feature in the valid_group
- sd_ctrl – Standard deviation of the studied feature in the remaining groups
- esm – Cohen's d effect size metric
- valid_group – Name of the sample or group belonging to the valid_group
- adj_pval – Benjamini–Hochberg adjusted p-value
- FC – Fold change between the averaged valid_group samples and the averaged remaining samples
- log(FC) – Log₂-transformed fold change
- norm_diff – Direct difference between the averaged valid_group value and the averaged value of the remaining groups
- Volcano plot – Visualization of differentially expressed genes (DEGs) between two groups
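Three of the DEG columns can be illustrated with the standard library alone: the effect size, the log₂ fold change, and the adjusted p-value. This is a sketch of common textbook definitions, not JDtI's code; the pooled-standard-deviation form of Cohen's d and the absence of a pseudo-count in the fold change are assumptions, and the input values are invented.

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (one common definition)."""
    n1, n2 = len(a), len(b)
    pooled = math.sqrt(
        ((n1 - 1) * stdev(a) ** 2 + (n2 - 1) * stdev(b) ** 2) / (n1 + n2 - 2)
    )
    return (mean(a) - mean(b)) / pooled

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

valid = [5.0, 6.0, 7.0, 8.0]   # values of one feature in the valid_group
ctrl = [1.0, 2.0, 3.0, 4.0]    # values of the same feature elsewhere

effect = cohens_d(valid, ctrl)
log_fc = math.log2(mean(valid) / mean(ctrl))
adj = bh_adjust([0.01, 0.04, 0.03, 0.005])
print(effect, log_fc, adj)
```

A control mean of zero would make the fold change undefined in this form, which is why real pipelines often add a small constant before taking the ratio.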
1.8. Features visualization
top_10 = deg_df.sort_values(
['p_val', 'esm', 'log(FC)'],
ascending=[True, False, False]).head(10)
data_scatter = reduce_data(data,
features = list(set(top_10['feature'])),
names = names['included'])
avg = average(data_scatter)
occ = occurrence(data_scatter)
fig = features_scatter(expression_data = avg,
occurence_data = occ,
features = None,
metadata_list = None,
colors = 'viridis',
hclust = 'complete',
img_width = 8,
img_high = 5,
label_size = 10,
size_scale = 100,
y_lab = 'Genes',
legend_lab = 'log(CPM + 1)',
bbox_to_anchor_scale = 25,
bbox_to_anchor_perc=(0.91, 0.55),
bbox_to_anchor_group=(1.01, 0.4))
fig.savefig('scatter.jpeg', dpi=300, bbox_inches='tight')
- Scatter plot – Displays expression relationships of DEGs across groups or individual samples
1.9. Relation visualization
fig = development_clust(data = avg,
method = 'ward',
img_width = 5,
img_high = 5)
fig.savefig('development.jpeg', dpi=300, bbox_inches='tight')
- Development plot – A dendrogram showing sample similarity based on the expression features generated using hierarchical clustering
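The dendrogram described above rests on hierarchical clustering of sample profiles. The sketch below uses SciPy directly rather than `development_clust`, with invented profiles; whether JDtI uses SciPy internally is not stated in this document, so treat the choice of functions as an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["0", "1", "2", "10"]

# Rows are samples here and columns are features, i.e. the transpose of the
# features x samples layout used for Data.
profiles = np.array([
    [1.0, 0.0, 5.0],
    [1.1, 0.1, 4.8],
    [6.0, 3.0, 0.2],
    [5.9, 3.2, 0.0],
])

# Ward linkage merges the pair of clusters that least increases variance.
tree = linkage(profiles, method="ward")

# no_plot=True returns the leaf order without drawing anything.
layout = dendrogram(tree, labels=labels, no_plot=True)
print(layout["ivl"])
```

Samples that merge early in `tree` sit next to each other in the leaf order, which is what the plot conveys visually.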
2. Data clustering
from jdti import Clustering, load_sparse
data, metadata = load_sparse(path = 'data/set2', name = 'set2')
clusters = Clustering.add_data_frame(data, metadata)
clusters.clustering_data
clusters.clustering_metadata
clusters.perform_PCA(pc_num=100, width=8, height=6)
clusters.knee_plot_PCA(width=8, height=6)
clusters.harmonize_sets(harmonize_type='harmony')
clusters.find_clusters_PCA(pc_num=0, eps=0.5, min_samples=10, width=8, height=6, harmonized=False)
clusters.perform_UMAP(factorize=False, umap_num=0, pc_num=5, harmonized=False)
clusters.knee_plot_umap(eps=0.5, min_samples=10)
clusters.find_clusters_UMAP(umap_n=5, eps=0.5, min_samples=10, width=8, height=6)
clusters.UMAP_vis(names_slot='cell_names', set_sep=True, point_size=0.6)
clusters.UMAP_feature(feature_name = 'KIT', features_data=None, point_size=0.6)
clusters.get_umap_data()
clusters.get_pca_data()
clusters.return_clusters(clusters='umap')
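The PCA and knee-plot steps above can be pictured with a NumPy-only sketch. It does not reproduce `perform_PCA` or `knee_plot_PCA`; the synthetic data, the SVD route, and the 90% cumulative-variance cut are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 cells x 50 genes of toy data with two dominant axes of variation.
latent = rng.normal(size=(200, 2)) * np.array([10.0, 5.0])
loadings = rng.normal(size=(2, 50))
cells = latent @ loadings + rng.normal(scale=0.5, size=(200, 50))

# PCA via SVD of the column-centered matrix.
centered = cells - cells.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)   # variance fraction per component
scores = u * s                    # PC coordinates per cell

# A knee plot shows `explained` by component; here a simple cumulative cut
# picks the smallest number of components reaching 90% of the variance.
n_keep = int(np.searchsorted(np.cumsum(explained), 0.9) + 1)
print(n_keep, explained[:3])
```

The number of components kept this way plays the role of a `pc_num` choice read off a knee plot.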
3. Data integration
from jdti import COMPsc, volcano_plot
jseq_object = COMPsc.project_dir('data', ['set1', 'set2'])
jseq_object.load_sparse_from_projects(normalized_data=True)
dt = jseq_object.get_partial_data(names=['10'], features=['KIT', 'PAX3', 'MITF'], name_slot='cell_names')
jseq_object.gene_histograme(bins=100)
jseq_object.gene_threshold(min_n = 50, max_n = 3000)
jseq_object.gene_histograme(bins=100)
jseq_object.reduce(reg = '5', inc_set = False)
jseq_object.gene_histograme(bins=100)
jseq_object.cell_histograme(name_slot = 'cell_names')
jseq_object.cluster_threshold(min_n = 20, name_slot = 'cell_names')
jseq_object.cell_histograme(name_slot = 'cell_names')
# returns
met = jseq_object.input_metadata
data = jseq_object.get_data(set_info=True)
metadata = jseq_object.get_metadata()
jseq_object.calculate_difference_markers(min_exp = 0,
min_pct = 0.25,
n_proc=10,
force = False)
jseq_object.estimating_similarity(method = 'pearson',
p_val = 0.05,
top_n = 10)
pl = jseq_object.similarity_plot(split_sets = True,
set_info = True,
cmap='seismic',
width = 16, height = 14)
# pl.savefig(f'sim_plot_top_{top}.svg', dpi=300, bbox_inches='tight')
pl2 = jseq_object.spatial_similarity(set_info= True,
bandwidth = 1,
n_neighbors = 5,
min_dist = 0.1,
legend_split = 2,
point_size = 20,
spread=1.0,
set_op_mix_ratio=1.0,
local_connectivity=1,
repulsion_strength=1.0,
negative_sample_rate=5,
width = 12,
height = 10)
# file name uses the top_n = 10 value passed to estimating_similarity above
pl2.savefig('sim_plot_map_top_10.svg', dpi=300, bbox_inches='tight')
sim_data = jseq_object.similarity
sim_data = sim_data[sim_data['set1'] != sim_data['set2']]
jseq_object.cell_regression( cell_x = '2', cell_y = '6', set_x = 'set1', set_y = 'set2', threshold = 6, image_width = 12, image_high = 7, color = 'black')
jseq_object.clustering_features(name_slot = 'cell_names', features_list = None, p_val = 0.05, top_n = 10, adj_mean = False, beta = 0.2)
jseq_object.perform_PCA(pc_num = 50)
jseq_object.knee_plot_PCA()
jseq_object.harmonize_sets(harmonize_type = 'harmony')
# jseq_object.find_clusters_PCA(pc_num = 100, eps = 0.5, min_samples = 10)
jseq_object.perform_UMAP(factorize=False, umap_num = 2, pc_num = 10, harmonized = True)
# jseq_object.knee_plot_umap(eps = 0.5, min_samples = 10)
# jseq_object.find_clusters_UMAP(umap_n = 6, eps = 1, min_samples = 20)
plu = jseq_object.UMAP_vis(
names_slot = 'cell_names',
set_sep = True,
point_size = 1,
font_size = 6,
legend_split_col = 2,
width = 8,
height = 6,
inc_num = True)
# plu.savefig(f'sim_umap_top.svg', dpi=300, bbox_inches='tight')
plu = jseq_object.UMAP_vis(
names_slot = 'sets',
set_sep = True,
point_size = 1,
font_size = 6,
legend_split_col = 1,
width = 8,
height = 6,
inc_num = False)
# plu.savefig(f'sim_umap_sets_top_.svg', dpi=300, bbox_inches='tight')
vis = jseq_object.UMAP_feature(
features_data = jseq_object.get_data(set_info = False) ,
feature_name = 'MAP1B',
point_size = 0.6,
font_size = 6,
width = 8,
height = 6,
palette = 'light')
# vis.savefig(f'sim_umap_sets_top_vis.svg', dpi=300, bbox_inches='tight')
jseq_object.var_data
# jseq_object.save_project(name = 'topola')
stats = jseq_object.statistic(cells=None, sets='All', min_exp=0, min_pct=0.025, n_proc=10)
stats_5 = stats.sort_values(['valid_group', 'esm', 'log(FC)'], ascending=[True, False, False]).groupby('valid_group').head(5)
fig = volcano_plot(stats)
jseq_object.scatter_plot(
names = None,
features = list(set(stats_5['feature'])),
name_slot = 'cell_names',
scale = False,
colors = 'viridis',
hclust = 'complete',
img_width = 15,
img_high = 3,
label_size = 10,
size_scale = 200,
x_lab = 'Genes',
legend_lab = 'log(CPM + 1)',
set_box_size = 5,
set_box_high = 0.1,
bbox_to_anchor_scale = 25,
bbox_to_anchor_perc=(0.90, 0.5),
bbox_to_anchor_group=(0.9, 0.3))
import re
jseq_object.data_composition(
features_count = list(set([re.sub(r' .*$', '',x) for x in list(set(jseq_object.input_metadata['cell_names']))])),
name_slot = 'cell_names',
set_sep = True
)
jseq_object.composition_pie(
width = 6,
height = 6,
font_size = 15,
cmap = "tab20",
legend_split_col = 1,
offset_labels = 0.5,
legend_bbox = (1.15, 0.95))
jseq_object.bar_composition(
cmap = 'tab20b',
width = 2,
height = 6,
font_size = 15,
legend_split_col = 1,
legend_bbox = (1.3, 1))
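The similarity step in this section uses `method = 'pearson'`, which can be pictured as correlating per-cluster marker profiles across sets. The sketch below is plain pandas and not `estimating_similarity`; the profile values and the `<cluster> # <set>` column labels are invented for the example.

```python
import pandas as pd

# Mean marker expression per cluster; labels are hypothetical.
profiles = pd.DataFrame(
    {
        "2 # set1": [5.0, 0.5, 3.0, 0.1],
        "6 # set2": [4.8, 0.7, 3.2, 0.0],
        "7 # set2": [0.2, 4.0, 0.1, 3.5],
    },
    index=["KIT", "PAX3", "MITF", "SOX4"],
)

# Pairwise Pearson correlation between cluster profiles.
corr = profiles.corr(method="pearson")

# Best cross-set match for the set1 cluster.
best = corr.loc["2 # set1", ["6 # set2", "7 # set2"]].idxmax()
print(corr.round(2))
print(best)
```

Restricting the comparison to clusters from different sets mirrors the `sim_data['set1'] != sim_data['set2']` filter used earlier in this section.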
4. Data subclustering
from jdti import COMPsc
jseq_object = COMPsc.project_dir('data', ['set2'])
jseq_object.load_sparse_from_projects(normalized_data=True)
jseq_object.subcluster_prepare(features = ['HMGCS1', 'MAP1B', 'SOX4'],
cluster='10')
jseq_object.define_subclusters(
umap_num = 5,
eps = 1,
min_samples = 5,
n_neighbors = 5,
min_dist = 0.1,
spread = 1.0,
set_op_mix_ratio = 1.0,
local_connectivity = 1,
repulsion_strength = 1.0,
negative_sample_rate = 5,
width = 8,
height = 6)
jseq_object.subcluster_features_scatter(
colors = 'viridis',
hclust = 'complete',
img_width = 3,
img_high = 5,
label_size = 6,
size_scale = 70,
x_lab = 'Genes',
legend_lab = 'normalized')
mapping = {
"old_name": ["-1", "1", "4"],
"new_name": ["1", "1", "1"]
}
jseq_object.rename_subclusters(mapping)
jseq_object.subcluster_DEG_scatter(
top_n = 3,
min_exp = 0,
min_pct = 0.1,
p_val = 0.05,
colors = 'viridis',
hclust = 'complete',
img_width = 3,
img_high = 5,
label_size = 6,
size_scale = 70,
x_lab = 'Genes',
legend_lab = 'normalized',
n_proc=10)
jseq_object.accept_subclusters()
l = set(jseq_object.input_metadata['cell_names'])
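The effect of the `old_name` / `new_name` mapping can be shown with plain Python. This sketch does not call `rename_subclusters`; the label list is invented, and reading `-1` as a label for unassigned cells is an assumption.

```python
from collections import Counter

mapping = {
    "old_name": ["-1", "1", "4"],
    "new_name": ["1", "1", "1"],
}

# Pair each old label with its replacement.
rename = dict(zip(mapping["old_name"], mapping["new_name"]))

# Hypothetical subcluster labels, one per cell.
subclusters = ["-1", "0", "1", "4", "2", "4"]

# Labels missing from the mapping are kept unchanged.
renamed = [rename.get(label, label) for label in subclusters]
counts = Counter(renamed)
print(renamed)
print(counts)
```

Mapping several old labels to the same new label merges those subclusters into one.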
Have fun JBS