JDtI – Python library for scRNAseq/RNAseq data analysis

Author: Jakub Kubiś

Institute of Bioorganic Chemistry
Polish Academy of Sciences
Laboratory of Single Cell Analyses

Description

JDtI (JDataIntegration) is a Python library for data integration and advanced post-processing of single-cell datasets.

JDtI supports basic quality control steps, such as checking the number of cells per cluster and the number of genes per cell, as well as more advanced tasks like subclustering, integration, and a wide range of visualizations. In this approach, cell information is not dropped when sets are analyzed separately; instead, the cluster and cell lineage information from those earlier analyses is used to integrate the data based on cluster markers and data harmonization. After integration, cell interactions and relationships can be visualized in several ways, including cell distances and correlations.

JDtI can also run DEG analysis between sets, selected cells, or grouped cells, and visualize the results on UMAP, volcano plots, and regression plots comparing pairs of cells. This makes it well suited to advanced analyses that target specific questions within the data and that basic analyses may miss.

Additionally, JDtI offers many functions for data processing and visualization with clean visual outputs, such as volcano plots, gene expression analysis of different data types, clustering, heatmaps, and more.

It is compatible with various sequencing approaches, including scRNA-seq and bulk RNA-seq, and supports interoperability with tools such as Seurat, Scanpy, and other bioinformatics frameworks using the 10x sparse matrix format as input. More details about the available functions can be found in the Documentation and Example Usage section on GitHub.


Table of contents

  • Installation
  • Documentation
  • Example usage:
    1. Basic functions
    2. Data clustering
    3. Data integration
    4. Data subclustering


Installation

pip install jdti

Documentation

Documentation for classes and functions is available here 👉 Documentation 📄


Example usage

1. Basic functions

1.1. Loading functions
from jdti import *
1.2. Loading data
import pandas as pd

# load a sparse matrix as a pd.DataFrame and create the matching metadata
data, metadata = load_sparse(path = 'data/set1', name = 'set1')

# load a data frame from another file type (.csv, .tsv, .txt)
data = pd.read_csv('example_data.csv')

# data from .h5 or other file types can also be used once converted to a pandas data frame with the layout below
  • Data [features (e.g. genes) x samples (e.g. cells)]
  • Metadata [columns['cell_names', 'sets']]:
    • cell_names – sample names corresponding to the columns of Data
    • sets – the assignment of each sample to a given dataset, aligned with Data
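A minimal sketch of the layout described above, built by hand with pandas. The gene names are taken from this README; the counts, cell labels, and set name are made-up toy values, and the alignment check is only illustrative, not part of JDtI.

```python
import pandas as pd

# Data: rows are features (genes), columns are samples (cells).
# Column labels may repeat, e.g. when several cells share a cluster name.
data = pd.DataFrame(
    [[5, 0, 2], [0, 3, 1], [7, 7, 0]],
    index=["KIT", "PAX3", "MITF"],
    columns=["0", "0", "1"],
)

# Metadata: one row per column of `data`, in the same order.
metadata = pd.DataFrame(
    {"cell_names": list(data.columns), "sets": ["set1"] * data.shape[1]}
)

# Sanity check: metadata must stay aligned with the columns of `data`.
aligned = list(metadata["cell_names"]) == list(data.columns)
```

Keeping `cell_names` in the same order as the columns of `data` is what allows the two objects to be filtered together later.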
1.3. Features finding
features = find_features(data, features =['KIT', 'MC1', 'EDNRB', 'PAX3'])
  • The feature name MC1 was not found, so potential matching names are suggested

features = find_features(data, features =['KIT', 'MC1R', 'EDNRB', 'PAX3'])
  • All feature names have been found
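How find_features builds its suggestions is not described here. The sketch below only illustrates the idea of proposing close matches for a missing name, using difflib from the Python standard library as a stand-in. The helper suggest_features and the 'suggestions' key are hypothetical; only the 'included' key appears in this README.

```python
from difflib import get_close_matches


def suggest_features(available, requested, n=3, cutoff=0.6):
    """Split requested names into found ones and close matches for the rest."""
    available = list(available)
    lookup = set(available)
    included = [name for name in requested if name in lookup]
    # For every missing name, propose up to `n` similar available names.
    suggestions = {
        name: get_close_matches(name, available, n=n, cutoff=cutoff)
        for name in requested
        if name not in lookup
    }
    return {"included": included, "suggestions": suggestions}


genes = ["KIT", "MC1R", "EDNRB", "PAX3", "MITF"]
result = suggest_features(genes, ["KIT", "MC1", "EDNRB", "PAX3"])
```

Here 'MC1' is not in the gene list, so it is left out of result['included'] and 'MC1R' is offered as a candidate in result['suggestions'].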

1.4. Names finding
names = find_names(data, names = ['0', '1', '2','10', '1&'])
  • Works the same way as 'Features finding', but for sample (column) names

1.5. Data reducing
# data reducing on found features and names

data_reduced = reduce_data(data,
                features = features['included'],
                names = names['included'])
  • returns the data restricted to the selected features and names
1.6. Data averaging and occurrence counting
avg_reduced = average(data_reduced)
occ_reduced = occurrence(data_reduced)
  • returns the average or occurrence values computed across all columns that share the same name
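The grouping over repeated column names can be sketched in plain pandas. This is not the JDtI implementation. It assumes that "occurrence" means the fraction of samples with a positive value, which mirrors the pct_* columns of the DEG output; the library may define or scale it differently. All values are toy data.

```python
import pandas as pd

data = pd.DataFrame(
    [[4.0, 0.0, 2.0], [0.0, 0.0, 6.0]],
    index=["KIT", "PAX3"],
    columns=["0", "0", "1"],
)

# Mean of each feature over the columns that share a label.
avg = data.T.groupby(level=0).mean().T

# Fraction of columns per label in which the feature is detected (> 0).
occ = (data > 0).astype(float).T.groupby(level=0).mean().T
```

Transposing before grouping keeps the features as rows in the result, so `avg` and `occ` have one column per unique name.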
1.7. Difference counting (DEG) and visualization
# create a dict of groups to compare
compare_dict = {'g1':['0', '1'],
                'g2':['2','10']}


deg_df = calc_DEG(data, 
             metadata_list  = None, 
             entities = compare_dict, 
             sets = None, 
             min_exp = 0, 
             min_pct = 0.1, 
             n_proc =10)

# DEG visualization with volcano plot
fig = volcano_plot(deg_df, 
                 p_adj = True, 
                 top = 25, 
                 p_val = 0.05, 
                 lfc = 0.25, 
                 standard_scale = False, 
                 rescale_adj = True, 
                 image_width = 12, 
                 image_high = 12)

fig.savefig('volcano.jpeg', dpi=300, bbox_inches='tight')
  • DEG data:
    • feature – Name of the studied feature
    • p_val – P-value (Mann–Whitney) for the studied feature comparing the valid_group to all other groups in the analysis
    • pct_valid – Percentage of positive (>0) values for the studied feature in the valid_group
    • pct_ctrl – Percentage of positive (>0) values for the studied feature in all other groups
    • avg_valid – Average value of the studied feature in the valid_group
    • avg_ctrl – Average value of the studied feature in the remaining groups
    • sd_valid – Standard deviation of the studied feature in the valid_group
    • sd_ctrl – Standard deviation of the studied feature in the remaining groups
    • esm – Cohen’s d effect size metric
    • valid_group – Name of the sample or group belonging to the valid_group
    • adj_pval – Benjamini–Hochberg adjusted p-value
    • FC – Fold change between the averaged valid_group samples and the averaged remaining samples
    • log(FC) – Log₂-transformed fold change
    • norm_diff – Direct difference between the averaged valid_group value and the averaged value of the remaining groups

  • Volcano plot – Visualization of differentially expressed genes (DEGs) between two groups
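To make the DEG columns concrete, the sketch below computes several of them for a single feature using only the standard library. It is an illustration under stated assumptions, not the code behind calc_DEG: Cohen's d uses a pooled standard deviation, the pct_* values are expressed as fractions, no pseudocount is added before the fold change, and the Mann–Whitney p-value is left out. The helpers describe_feature and benjamini_hochberg are hypothetical names.

```python
import math
from statistics import mean, stdev


def describe_feature(valid, ctrl):
    """Summary statistics for one feature: valid group versus the rest."""
    avg_valid, avg_ctrl = mean(valid), mean(ctrl)
    sd_valid, sd_ctrl = stdev(valid), stdev(ctrl)
    n1, n2 = len(valid), len(ctrl)
    # Cohen's d with a pooled standard deviation (one common variant).
    pooled = math.sqrt(
        ((n1 - 1) * sd_valid**2 + (n2 - 1) * sd_ctrl**2) / (n1 + n2 - 2)
    )
    fc = avg_valid / avg_ctrl  # assumes avg_ctrl > 0; no pseudocount here
    return {
        "pct_valid": sum(v > 0 for v in valid) / n1,  # fraction, not percent
        "pct_ctrl": sum(v > 0 for v in ctrl) / n2,
        "avg_valid": avg_valid,
        "avg_ctrl": avg_ctrl,
        "sd_valid": sd_valid,
        "sd_ctrl": sd_ctrl,
        "esm": (avg_valid - avg_ctrl) / pooled,
        "FC": fc,
        "log(FC)": math.log2(fc),
        "norm_diff": avg_valid - avg_ctrl,
    }


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


row = describe_feature(valid=[4.0, 6.0, 8.0, 0.0], ctrl=[1.0, 2.0, 0.0, 1.0])
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

In a volcano plot, log(FC) goes on the x-axis and the negative log10 of the (adjusted) p-value on the y-axis, with the p_val and lfc arguments acting as significance thresholds.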

1.8. Features visualization
top_10 = deg_df.sort_values(
    ['p_val', 'esm', 'log(FC)'], 
    ascending=[True, False, False]).head(10)

data_scatter = reduce_data(data,
                features = list(set(top_10['feature'])),
                names = names['included'])



avg = average(data_scatter)
occ = occurrence(data_scatter)


fig = features_scatter(expression_data = avg, 
                     occurence_data = occ,
                     features = None, 
                     metadata_list = None, 
                     colors = 'viridis', 
                     hclust = 'complete', 
                     img_width = 8, 
                     img_high = 5, 
                     label_size = 10, 
                     size_scale = 100,
                     y_lab = 'Genes', 
                     legend_lab = 'log(CPM + 1)',
                     bbox_to_anchor_scale = 25,
                     bbox_to_anchor_perc=(0.91, 0.55),
                     bbox_to_anchor_group=(1.01, 0.4))

fig.savefig('scatter.jpeg', dpi=300, bbox_inches='tight')
  • Scatter plot – Displays expression relationships of DEGs across groups or individual samples

1.9. Relation visualization
fig = development_clust(data = avg, 
                      method = 'ward',
                      img_width = 5,
                      img_high = 5)

fig.savefig('development.jpeg', dpi=300, bbox_inches='tight')
  • Development plot – A dendrogram showing sample similarity based on the expression features, generated using hierarchical clustering
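The idea behind such a dendrogram can be sketched with SciPy's hierarchical clustering. This is an illustration only, not a description of how development_clust works internally; the profiles are toy values and the group labels reuse the names from the examples above.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy averaged profiles: rows are features, columns are groups (made up).
labels = ["0", "1", "2", "10"]
profiles = np.array([
    [5.0, 5.2, 0.1, 0.0],
    [0.2, 0.0, 4.8, 5.1],
    [1.0, 1.1, 3.0, 3.2],
])

# Cluster the groups (columns), so transpose to one row per group.
tree = linkage(profiles.T, method="ward")

# no_plot=True returns the dendrogram layout without drawing anything.
layout = dendrogram(tree, labels=labels, no_plot=True)
leaf_order = layout["ivl"]
```

Groups with similar profiles (here '0' with '1', and '2' with '10') are merged first and therefore end up next to each other in `leaf_order`.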

2. Data clustering

from jdti import Clustering, load_sparse

# load the set and build the clustering object
data, metadata = load_sparse(path = 'data/set2', name = 'set2')
clusters = Clustering.add_data_frame(data, metadata)

# inspect the stored data and metadata
clusters.clustering_data
clusters.clustering_metadata

# PCA, knee plot, and harmonization of sets
clusters.perform_PCA(pc_num=100, width=8, height=6)
clusters.knee_plot_PCA(width=8, height=6)
clusters.harmonize_sets(harmonize_type='harmony')

# cluster on principal components, then compute the UMAP embedding
clusters.find_clusters_PCA(pc_num=0, eps=0.5, min_samples=10, width=8, height=6, harmonized=False)
clusters.perform_UMAP(factorize=False, umap_num=0, pc_num=5, harmonized=False)

# cluster on UMAP coordinates
clusters.knee_plot_umap(eps=0.5, min_samples=10)
clusters.find_clusters_UMAP(umap_n=5, eps=0.5, min_samples=10, width=8, height=6)

# visualize the clusters and the expression of a single feature on UMAP
clusters.UMAP_vis(names_slot='cell_names', set_sep=True, point_size=0.6)
clusters.UMAP_feature(feature_name = 'KIT', features_data=None, point_size=0.6)

# retrieve the UMAP data, PCA data, and cluster assignments
clusters.get_umap_data()
clusters.get_pca_data()
clusters.return_clusters(clusters='umap')
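A knee plot of a PCA shows how much variance each principal component explains, which helps when choosing how many components to keep. The NumPy sketch below computes those values on made-up data; it is a generic illustration and makes no claim about how perform_PCA or knee_plot_PCA are implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix: 200 cells x 20 features driven by two strong latent directions.
latent = rng.normal(size=(200, 2)) * np.array([10.0, 5.0])
loadings = rng.normal(size=(2, 20))
cells = latent @ loadings + rng.normal(scale=0.5, size=(200, 20))

# PCA via SVD of the centred matrix.
centred = cells - cells.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centred, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# A knee plot draws `explained` (or its cumulative sum) against the PC index;
# the curve flattens once extra components mostly capture noise.
cumulative = np.cumsum(explained)
```

In this toy case the first two components carry almost all of the variance, so the knee sits at two components.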

3. Data integration


from jdti import COMPsc, volcano_plot
jseq_object = COMPsc.project_dir('data', ['set1', 'set2'])

jseq_object.load_sparse_from_projects(normalized_data=True)
dt = jseq_object.get_partial_data(names=['10'], features=['KIT', 'PAX3', 'MITF'], name_slot='cell_names')
jseq_object.gene_histograme(bins=100)

jseq_object.gene_threshold(min_n = 50, max_n = 3000)

jseq_object.gene_histograme(bins=100)

jseq_object.reduce(reg = '5', inc_set = False)

jseq_object.gene_histograme(bins=100)
jseq_object.cell_histograme(name_slot = 'cell_names')

jseq_object.cluster_threshold(min_n = 20, name_slot = 'cell_names')

jseq_object.cell_histograme(name_slot = 'cell_names')
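The two quality control steps named in the description, genes per cell and cells per cluster, can be sketched in plain pandas. The semantics are assumptions: gene_threshold is taken to keep cells whose number of detected genes lies between min_n and max_n, and cluster_threshold to drop clusters with fewer than min_n cells. The data and the small thresholds are toy values.

```python
import pandas as pd

data = pd.DataFrame(
    [[1, 2, 3, 0], [2, 0, 1, 0], [0, 5, 4, 0]],
    index=["KIT", "PAX3", "MITF"],
    columns=["0", "0", "1", "2"],
)

# Genes per cell: number of features with a positive value in each column.
min_genes, max_genes = 1, 3
genes_per_cell = (data > 0).sum(axis=0).to_numpy()
keep_cells = (genes_per_cell >= min_genes) & (genes_per_cell <= max_genes)
filtered = data.loc[:, keep_cells]

# Cells per cluster: drop clusters left with fewer than `min_cells` columns.
min_cells = 2
counts = pd.Series(filtered.columns).value_counts()
kept_clusters = list(counts[counts >= min_cells].index)
filtered = filtered.loc[:, filtered.columns.isin(kept_clusters)]
```

The last cell has no detected genes and is removed by the first filter; cluster '1' is then left with a single cell and is removed by the second.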
# retrieve the input metadata, data, and metadata

met = jseq_object.input_metadata

data = jseq_object.get_data(set_info=True) 

metadata = jseq_object.get_metadata()
jseq_object.calculate_difference_markers(min_exp = 0, 
                                         min_pct = 0.25, 
                                         n_proc=10, 
                                         force = False)



jseq_object.estimating_similarity(method = 'pearson', 
                                  p_val = 0.05,
                                  top_n = 10)
    

pl = jseq_object.similarity_plot(split_sets = True, 
                                 set_info = True,
                                 cmap='seismic', 
                                 width = 16, height = 14)

   
# pl.savefig(f'sim_plot_top_{top}.svg', dpi=300, bbox_inches='tight')

pl2 = jseq_object.spatial_similarity(set_info= True, bandwidth = 1, n_neighbors = 5,
min_dist = 0.1, legend_split = 2, point_size = 20, spread=1.0,
set_op_mix_ratio=1.0,
local_connectivity=1,
repulsion_strength=1.0,
negative_sample_rate=5,
width = 12, height = 10)

top = 10  # same value as top_n passed to estimating_similarity above
pl2.savefig(f'sim_plot_map_top_{top}.svg', dpi=300, bbox_inches='tight')


sim_data = jseq_object.similarity
sim_data = sim_data[sim_data['set1'] != sim_data['set2']]
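As a rough picture of a Pearson-based similarity between clusters, the sketch below correlates toy marker profiles and keeps only the pairs that span two sets. The 'set1' and 'set2' column names follow the filter used above; the 'cell1', 'cell2', and 'r' columns, the profiles, and the overall table layout are assumptions, not the actual structure of jseq_object.similarity.

```python
import numpy as np
import pandas as pd

# Toy average marker profiles keyed by (cluster, set); all values are made up.
profiles = {
    ("2", "set1"): [5.0, 0.5, 3.0, 0.1],
    ("6", "set2"): [4.6, 0.7, 3.3, 0.0],
    ("4", "set2"): [0.2, 4.9, 0.3, 3.8],
}
keys = list(profiles)

# np.corrcoef treats each row as one variable, so rows are cluster profiles.
corr = np.corrcoef(np.array([profiles[k] for k in keys]))

rows = []
for i in range(len(keys)):
    for j in range(i + 1, len(keys)):
        rows.append({
            "cell1": keys[i][0], "set1": keys[i][1],
            "cell2": keys[j][0], "set2": keys[j][1],
            "r": corr[i, j],
        })
similarity = pd.DataFrame(rows)

# Keep only pairs that span two different sets.
cross_set = similarity[similarity["set1"] != similarity["set2"]]
```

Cluster '2' from the first set correlates strongly with cluster '6' from the second set and negatively with cluster '4', which is the kind of relationship a similarity plot is meant to expose.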

jseq_object.cell_regression( cell_x = '2', cell_y = '6', set_x = 'set1', set_y = 'set2', threshold = 6, image_width = 12, image_high = 7, color = 'black')


jseq_object.clustering_features(name_slot = 'cell_names', features_list = None, p_val = 0.05, top_n = 10, adj_mean = False, beta = 0.2)

jseq_object.perform_PCA(pc_num = 50)

jseq_object.knee_plot_PCA()

jseq_object.harmonize_sets(harmonize_type = 'harmony')

# jseq_object.find_clusters_PCA(pc_num = 100, eps = 0.5, min_samples = 10)

jseq_object.perform_UMAP(factorize=False, umap_num = 2, pc_num = 10, harmonized = True)


# jseq_object.knee_plot_umap(eps = 0.5, min_samples = 10)


# jseq_object.find_clusters_UMAP(umap_n = 6, eps = 1, min_samples = 20)


plu = jseq_object.UMAP_vis( 
             names_slot = 'cell_names', 
             set_sep = True,
             point_size = 1,
             font_size = 6,
             legend_split_col = 2,
             width = 8,
             height = 6,
             inc_num = True)

# plu.savefig(f'sim_umap_top.svg', dpi=300, bbox_inches='tight')


plu = jseq_object.UMAP_vis( 
             names_slot = 'sets', 
             set_sep = True,
             point_size = 1,
             font_size = 6,
             legend_split_col = 1,
             width = 8,
             height = 6,
             inc_num = False)

# plu.savefig(f'sim_umap_sets_top_.svg', dpi=300, bbox_inches='tight')

vis = jseq_object.UMAP_feature( 
             features_data = jseq_object.get_data(set_info = False) ,
             feature_name = 'MAP1B',
             point_size = 0.6,
             font_size = 6,
             width = 8,
             height = 6,
             palette = 'light')

# vis.savefig(f'sim_umap_sets_top_vis.svg', dpi=300, bbox_inches='tight')

jseq_object.var_data


# jseq_object.save_project(name = 'topola')

stats = jseq_object.statistic(cells=None, sets='All', min_exp=0, min_pct=0.025, n_proc=10)
stats_5 = stats.sort_values(['valid_group', 'esm', 'log(FC)'], ascending=[True, False, False]).groupby('valid_group').head(5)



fig = volcano_plot(stats)
jseq_object.scatter_plot(
                 names = None,
                 features = list(set(stats_5['feature'])),
                 name_slot = 'cell_names',
                 scale = False,
                 colors = 'viridis', 
                 hclust = 'complete', 
                 img_width  = 15, 
                 img_high  = 3, 
                 label_size = 10, 
                 size_scale = 200,
                 x_lab = 'Genes', 
                 legend_lab = 'log(CPM + 1)',
                 set_box_size = 5,
                 set_box_high = 0.1,
                 bbox_to_anchor_scale = 25,
                 bbox_to_anchor_perc=(0.90, 0.5),
                 bbox_to_anchor_group=(0.9, 0.3))

import re

jseq_object.data_composition( 
                     features_count = list(set([re.sub(r' .*$', '',x) for x in list(set(jseq_object.input_metadata['cell_names']))])),
                     name_slot = 'cell_names',
                     set_sep = True
                     )


jseq_object.composition_pie( 
                    width = 6, 
                    height = 6, 
                    font_size = 15,
                    cmap  = "tab20",
                    legend_split_col = 1,
                    offset_labels = 0.5,
                    legend_bbox = (1.15, 0.95))


jseq_object.bar_composition( 
                    cmap = 'tab20b', 
                    width = 2, 
                    height = 6, 
                    font_size = 15,
                    legend_split_col = 1,
                    legend_bbox = (1.3, 1))
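The composition shown in the pie and bar plots boils down to counting cells per name within each set. A minimal pandas sketch on toy metadata, assuming this is the quantity of interest; it does not reproduce the internals of data_composition.

```python
import pandas as pd

metadata = pd.DataFrame({
    "cell_names": ["0", "0", "1", "0", "1", "1", "1"],
    "sets": ["set1", "set1", "set1", "set2", "set2", "set2", "set2"],
})

# Absolute number of cells per name within each set.
counts = pd.crosstab(metadata["sets"], metadata["cell_names"])

# Same table as row-wise fractions, so every set sums to 1.
fractions = pd.crosstab(
    metadata["sets"], metadata["cell_names"], normalize="index"
)
```

Fractions rather than raw counts make sets of different sizes directly comparable.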


4. Data subclustering

from jdti import COMPsc
jseq_object = COMPsc.project_dir('data', ['set2'])
jseq_object.load_sparse_from_projects(normalized_data=True)
jseq_object.subcluster_prepare(features = ['HMGCS1', 'MAP1B', 'SOX4'], 
                               cluster='10')
jseq_object.define_subclusters( 
                          umap_num = 5,
                          eps = 1, 
                          min_samples = 5,
                          n_neighbors = 5,  
                          min_dist = 0.1, 
                          spread = 1.0,              
                          set_op_mix_ratio = 1.0,    
                          local_connectivity = 1,    
                          repulsion_strength = 1.0,  
                          negative_sample_rate = 5,  
                          width = 8, 
                          height = 6)
  
jseq_object.subcluster_features_scatter(
                                        colors = 'viridis', 
                                        hclust = 'complete', 
                                        img_width = 3, 
                                        img_high = 5, 
                                        label_size = 6, 
                                        size_scale = 70,
                                        x_lab = 'Genes', 
                                        legend_lab = 'normalized')
    
mapping = {
    "old_name": ["-1", "1", "4"],
    "new_name": ["1", "1", "1"]
}

jseq_object.rename_subclusters(mapping)
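The mapping pairs old and new names by position, so several subclusters can be merged by giving them the same new name. The plain-Python sketch below shows the effect on a hypothetical list of labels; it assumes that labels not listed in old_name stay unchanged, which may differ from what rename_subclusters does.

```python
mapping = {
    "old_name": ["-1", "1", "4"],
    "new_name": ["1", "1", "1"],
}

# Pair the two lists position by position.
rename = dict(zip(mapping["old_name"], mapping["new_name"]))

# Hypothetical subcluster labels for a handful of cells.
labels = ["-1", "0", "1", "4", "2", "4"]

# Labels without an entry in the mapping are kept as they are.
renamed = [rename.get(label, label) for label in labels]
```

After renaming, the former subclusters '-1', '1', and '4' all carry the label '1'.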

jseq_object.subcluster_DEG_scatter(
                                    top_n = 3,
                                    min_exp = 0, 
                                    min_pct = 0.1, 
                                    p_val = 0.05,
                                    colors = 'viridis', 
                                    hclust = 'complete', 
                                    img_width = 3, 
                                    img_high = 5, 
                                    label_size = 6, 
                                    size_scale = 70,
                                    x_lab = 'Genes', 
                                    legend_lab = 'normalized',
                                    n_proc=10)
    
 
jseq_object.accept_subclusters()
l = set(jseq_object.input_metadata['cell_names'])
        

Have fun JBS
