Skip to main content

Chen and Yang Lab Multi fork Development cell lineage tree alignment

Project description

mDELTA

logo
  • mDELTA is an algorithm for multifuricating Developmental cEll Lineage Tree Alignment. In essence, it compares two rooted, unordered, tip-labeled trees, and finds the best global / local correspondence between the nodes. The mDELTA program is designed for analyzing developmental cell lineage trees reconstructed through single-cell DNA barcoding (such as done by scGESTALT or SMALT, while greater cellular coverage is expected to yield more meaningful mDELTA alignments).
  • Except for dealing with cell lineage trees instead of biological sequences, mDELTA is conceptually similar to sequence alignment. It helps quantify similarity among different lineage trees, disentangle the consensus and variation, find recurrent motifs, and facilitate comparative/evolutionary analyses.
  • Also included in this repository are Python/R scripts for statistical analyses and visualization of mDELTA results, which facilitates their biological interpretation.
  • mDELTA was developed by Jingyu Chen under the supervision of Professor Jian-Rong Yang at the Zhongshan School of Medicine of Sun Yat-Sen University in China.

Quick start

Installation

pip install modelta

This will install mDELTA and its prerequisites

Running the program

mDELTA.py -h

Aligning a tree with itself, output top three local alignments

mDELTA.py ExampleFile/tree.nwk ExampleFile/tree.nwk -t 3

Required package

  • pandas: Score matrix architecture based on dataframe.
  • numpy: Many computing essential packages.
  • munkres: An algorithm for finding the maximum value of score matrix dynamic programming

Optional package

  • tqdm: Displays the progress during the calculation phase.
  • multiprocess: When calculating the p value, because it needs to disrupt the original sequence many times and perform multiple calculations, using multiple processes can effectively reduce the waiting time.

Source code install

(1) Offline
Step1: $git clone https://github.com/Chenjy0212/modelta.git
Step2: $cd modelta -> run "python setup.py install"
(2) Online
$pip install git+https://github.com/Chenjy0212/modelta.git@main

For python coder user ↓

You can use this package in your Python code. For example, run under Jupiter notebook:

import modelta
from pprint import pprint
#前两项为必选项,后者皆为可选
example = modelta.scoremat(TreeSeqFile = 'ExampleFile/tree.nwk',  
                       TreeSeqFile2 = 'ExampleFile/tree.nwk',      
                       Name2TypeFile = 'ExampleFile/Name2Type.csv',
                       Name2TypeFile2 ='ExampleFile/Name2Type.csv',
                       ScoreDictFile = 'ExampleFile/TypeTypeScore.csv',
                       top = 3,
                       notebook = 1,
                       overlap = 5,
                          )

Result

Matrix Node: |██████████| 121/121 100%
121/121 [00:00<00:00, 2573.11it/s]
{'TopScoreList': [{'Root1_label': 'root',
                   'Root1_match': ['0',
                                   '1',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root1_match_tree': '(((a,b,c),d,(e,f)),a);',
                   'Root1_node': '(((a,b,c),d,(e,f)),a)',
                   'Root1_prune': [],
                   'Root1_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Root2_label': 'root',
                   'Root2_match': ['0',
                                   '1',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root2_match_tree': '(((a,b,c),d,(e,f)),a);',
                   'Root2_node': '(((a,b,c),d,(e,f)),a)',
                   'Root2_prune': [],
                   'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Score': 21.0,
                   'col': 10,
                   'row': 10},
                  {'Root1_label': '0',
                   'Root1_match': ['0',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root1_match_tree': '(((a,b,c),d,(e,f)));',
                   'Root1_node': '((a,b,c),d,(e,f))',
                   'Root1_prune': [],
                   'Root1_seq': '((a1,a2,a3),a4,(a5,a6))',
                   'Root2_label': 'root',
                   'Root2_match': ['0',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root2_match_tree': '(((a,b,c),d,(e,f)));',
                   'Root2_node': '(((a,b,c),d,(e,f)),a)',
                   'Root2_prune': ['1'],
                   'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Score': 17.0,
                   'col': 10,
                   'row': 9},
                  {'Root1_label': '0,0',
                   'Root1_match': ['0,0', '0,0,0', '0,0,1', '0,0,2'],
                   'Root1_match_tree': '(((a,b,c),d,(e,f)),(a,b,c));',
                   'Root1_node': '(a,b,c)',
                   'Root1_prune': [],
                   'Root1_seq': '(a1,a2,a3)',
                   'Root2_label': '0',
                   'Root2_match': ['0,0', '0,0,0', '0,0,1', '0,0,2'],
                   'Root2_match_tree': '(((a,b,c),d,(e,f)),(a,b,c));',
                   'Root2_node': '((a,b,c),d,(e,f))',
                   'Root2_prune': ['0,1', '0,2,0', '0,2,1'],
                   'Root2_seq': '((a1,a2,a3),a4,(a5,a6))',
                   'Score': 6.0,
                   'col': 9,
                   'row': 7}],
 'matrix': Root2  0,0,0  0,0,1  0,0,2  0,1  0,2,0  0,2,1    1  0,0  0,2     0  root
Root1                                                                   
0,0,0    3.0    2.0    2.0  1.0    1.0    0.0  3.0  1.0  0.0  -1.0  -1.0
0,0,1    2.0    3.0    1.0 -1.0   -1.0   -1.0  2.0  1.0 -1.0  -1.0  -1.0
0,0,2    2.0    1.0    3.0 -1.0   -1.0   -1.0  2.0  1.0 -1.0  -1.0  -1.0
0,1      1.0   -1.0   -1.0  3.0    0.0   -1.0  1.0 -1.0 -1.0  -1.0  -1.0
0,2,0    1.0   -1.0   -1.0  0.0    3.0   -1.0  1.0 -1.0  2.0  -1.0  -1.0
0,2,1    0.0   -1.0   -1.0 -1.0   -1.0    3.0  0.0 -1.0  2.0  -1.0  -1.0
1        3.0    2.0    2.0  1.0    1.0    0.0  3.0  1.0  0.0  -1.0  -1.0
0,0      1.0    1.0    1.0 -1.0   -1.0   -1.0  1.0  9.0 -1.0   6.0   5.0
0,2      0.0   -1.0   -1.0 -1.0    2.0    2.0  0.0 -1.0  6.0   2.0   1.0
0       -1.0   -1.0   -1.0 -1.0   -1.0   -1.0 -1.0  6.0  2.0  18.0  17.0
root    -1.0   -1.0   -1.0 -1.0   -1.0   -1.0 -1.0  5.0  1.0  17.0  21.0,
 'score_dict': {'a_a': 3.0,
                'a_b': 2.0,
                'a_c': 2.0,
                'a_d': 1.0,
                'a_e': 1.0,
                'a_f': 0.0,
                'b_a': 2.0,
                'b_b': 3.0,
                'b_c': 1.0,
                'b_d': -1.0,
                'b_e': -1.0,
                'b_f': -1.0,
                'c_a': 2.0,
                'c_b': 1.0,
                'c_c': 3.0,
                'c_d': -1.0,
                'c_e': -1.0,
                'c_f': -1.0,
                'd_a': 1.0,
                'd_b': -1.0,
                'd_c': -1.0,
                'd_d': 3.0,
                'd_e': 0.0,
                'd_f': -1.0,
                'e_a': 1.0,
                'e_b': -1.0,
                'e_c': -1.0,
                'e_d': 0.0,
                'e_e': 3.0,
                'e_f': -1.0,
                'f_a': 0.0,
                'f_b': -1.0,
                'f_c': -1.0,
                'f_d': -1.0,
                'f_e': -1.0,
                'f_f': 3.0},
 'tree1_leaves_celltype': 'a;b;c;d;e;f;a',
 'tree1_leaves_label': '0,0,0;0,0,1;0,0,2;0,1;0,2,0;0,2,1;1',
 'tree1_leaves_nodename': 'a1;a2;a3;a4;a5;a6;a1',
 'tree2_leaves_celltype': 'a;b;c;d;e;f;a',
 'tree2_leaves_label': '0,0,0;0,0,1;0,0,2;0,1;0,2,0;0,2,1;1',
 'tree2_leaves_nodename': 'a1;a2;a3;a4;a5;a6;a1'}

Parameter analysis

If the parameter has an *, it is required; otherwise, it is optional

  • TreeSeqFile & TreeSeqFile2: [path/filename *] Cell lineage tree file with branch length information removed. The format of reference documents is as follows: ExampleFile/tree.nwk
  • mv: [float and default = 2.] The matching score between the same nodes, which is often used when the parameter ScoreDictFile is the default.
  • pv: [float and default = -1.] The prune score between the different nodes.
  • top: [int > 0 and default = 0] Select the top few meaningful scores in the score matrix. if it is default:
{'T1root_T2root': [{'Root1_label': 'root',
                    'Root1_match': ['0',
                                    '1',
                                    '0,0',
                                    '0,1',
                                    '0,2',
                                    '0,0,0',
                                    '0,0,1',
                                    '0,0,2',
                                    '0,2,0',
                                    '0,2,1'],
                    'Root1_match_tree': '(((a,b,c),d,(e,f)),a);',
                    'Root1_node': '(((a,b,c),d,(e,f)),a)',
                    'Root1_prune': [],
                    'Root1_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                    'Root2_label': 'root',
                    'Root2_match': ['0',
                                    '1',
                                    '0,0',
                                    '0,1',
                                    '0,2',
                                    '0,0,0',
                                    '0,0,1',
                                    '0,0,2',
                                    '0,2,0',
                                    '0,2,1'],
                    'Root2_match_tree': '(((a,b,c),d,(e,f)),a);',
                    'Root2_node': '(((a,b,c),d,(e,f)),a)',
                    'Root2_prune': [],
                    'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                    'Score': 21.0,
                    'col': 10,
                    'row': 10}],            
  • notebook: [bool and default=False] Is it written and run in the jupyter notebook environment.
  • Tqdm: [bool and default=True] Whether to display the operation progress bar.
  • overlap: [int > 0 and default = 0] In the local results, the later comparison results cannot have X% or more node pairs that duplicate the previous results.
  • merge: [int > 0 and default = 0] Merge internal node to prune.

if Qualitative calculation:

  • Name2TypeFile & Name2TypeFile2: [path/filename *] Convert tree node name to type. The format of reference documents is as follows: ExampleFile/Name2Type.csv
  • ScoreDictFile: [path/filename and default=''] Defines the score of matches between nodes. The format of reference documents is as follows: ExampleFile/socrefile.csv
The matching score between nodes is determined according to the "ScoreDictFile" file.
If the file is empty, only the same nodes are taken for pairing, and the default matching score is 2 (float)

node: a <-> a = 2.(custom)
      b <-> b = 3.(custom)
      a <-> b = ?(custom)
The higher the score, the stronger the similarity

If Quantitative calculation

  • ScoreDictFile: [path/filename *] Defines the score of matches between nodes. The format of reference documents is as follows: ExampleFile/Qscorefile.csv
  • Name2TypeFile & Name2TypeFile2: [path/filename or No input] Convert tree node name to type. The format of reference documents is as follows: ExampleFile/Name2Type.csv
The matching score between nodes is determined according to the "ScoreDictFile" file.
The file is required. You can modify the score of the same node by modifying parameter "mv"

   Gene0  Gene1  Gene2  
a    1      2      3  
b    2      3      4

node: (1-2)**2 + (2-3)**2 + (3-4)**2 #Euclidean distance
Then get the final score according to the smoothing function. 
The lower the score, the stronger the similarity

P-value calculation

modelta.pvalue(times = 3, 
               topscorelist = example['TopScoreList'], 
               ScoreDictFile='',
               CPUs = 50, 
               mv = 2, 
               pv = -1)

Result

 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.05it/s]
 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.38it/s]
 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.45it/s]
[[3.0, 4.0, 0.0, 14.0], [4.0, 5.0, 3.0, 11.0], [5.0, 0.0, 1.0, 11.0]]

The returned results represent times matching scores corresponding to the top maximum values

Parameter analysis

If the parameter has an *, it is required; otherwise, it is optional

  • times: [int > 0 *] The number of times the original sequence needs to be disrupted, such as:
times = 3 #Randomly disrupt the nodes, but the structure remains unchanged
(((a,b,c),d,(e,f)),a) -> (((a,b,c),d,(e,f)),a)
                      -> (((a,c,d),b,(a,f)),e)
                      -> (((e,f,a),d,(b,c)),a)
  • topscorelist: [example['TopScoreList'] *] The input parameter is the maximum value sequence obtained earlier.
  • CPUs: [int > 0 and default = 50] Multi process computing can greatly reduce the waiting time. The default process pool is 50, but limited by local computer resources, it can reach the maximum number of local CPU cores - 1.
  • mv & pv & notebook & Tqdm & overlap parameters have been described in detail before

For Ordinary user ↓

Quick Start

Running the program

mDELTA.py -h

Windows

mDELTA.py ./ExampleFile/tree.nwk ./ExampleFile/tree.nwk -t 3

Linux

./mDELTA.py ../ExampleFile/tree.nwk ../ExampleFile/tree.nwk -t 3

Help

Windows: $mDELTA.py -h
 Linux:  $./mDELTA.py -h
usage: mDELTA [-h] [-nt NAME2TYPEFILE] [-nt2 NAME2TYPEFILE2] [-sd SCOREDICTFILE] [-t TOP] [-ma MAV] [-mi MIV] [-p PV] [-T TQDM]
              [-n NOTEBOOK] [-P PERM] [-a ALG] [-c CPUS] [-o OUTPUT] [-x DIFF] [-mg MERGE]
              TreeSeqFile TreeSeqFile2

Multifuricating Developmental cEll Lineage Tree Alignment(mDELTA)

positional arguments:
  TreeSeqFile           [path/filename] A text file storing cell lineage tree #1 in newick format. Tips can be labeled by name or
                        cell type. Branch lengths should be removed.
  TreeSeqFile2          [path/filename] A text file storing cell lineage tree #2 in newick format. Tips can be labeled by name or
                        cell type. Branch lengths should be removed.

optional arguments:
  -h, --help            show this help message and exit
  -nt NAME2TYPEFILE, --Name2TypeFile NAME2TYPEFILE
                        [path/filename] List of correspondance between tip name and cell type for cell lineage tree #1.
  -nt2 NAME2TYPEFILE2, --Name2TypeFile2 NAME2TYPEFILE2
                        [path/filename] List of correspondance between tip name and cell type for cell lineage tree #2.
  -sd SCOREDICTFILE, --ScoreDictFile SCOREDICTFILE
                        [path/filename] A comma-delimited text file used to determine similarity scores between cells. If there
                        are exactly three columns, they will be interpreted as (1) the cell (name or type) in Tree #1, (2) the
                        cell in Tree #2, and (3) the similarity score. If otherwise, the first column will be interpreted as the
                        cell (name or type) and the remaining columns as features of the cell (e.g. expression of a gene). The
                        similarity scores will be estimated between all pairs of cells based on the Euclidean distance calculated
                        using all the features. Overrides `-ma` and `-mi`.
  -t TOP, --top TOP     [int > 0] Performs local (instead of global) alignment, and output the top NUM local alignments with the
                        highest score (e.g. `-t 10`). In the case of global alignment, this parameter should be omitted.
  -ma MAV, --mav MAV    [float]
  -mi MIV, --miv MIV    [float] Shorthand for a simple matching score scheme, where the matching score between a pair of the same
                        cell types is MAV and all other pairs are MIV. (e.g. `-ma 2 -mi -2`). Overridden by `-sd`.
  -p PV, --pv PV        [float] The score for pruning a tip of the tree (e.g. `-p -2`). Default to -1.
  -T TQDM, --Tqdm TQDM  [0(off) or 1(on)] Toggle for the jupyter notebook environment.
  -n NOTEBOOK, --notebook NOTEBOOK
                        [0(off) or 1(on)] Toggle for the jupyter notebook environment.
  -P PERM, --PERM PERM  [int > 0] Toggle for the statistical significance. For each observed alignment, the aligned trees will be
                        permuted PERM times to generate a null distribution of alignment scores, with which a P value can be
                        calculated for the observed alignment score.
  -a ALG, --Alg ALG     [KM / GA] Use Kuhn-Munkres or Greedy Algorithm to find the optimal alignment score.
  -c CPUS, --CPUs CPUS  [int > 0] Number of threads for multi-processing. Default to 50., it can reach the maximum number of local
                        CPU cores - 1.
  -o OUTPUT, --output OUTPUT
                        [path/filename] Output filename
  -x DIFF, --diff DIFF  [int > 0] Alignment must consist of a minimal of DIFF{'option_strings': ['-x', '--diff'], 'dest': 'diff',
                        'nargs': None, 'const': None, 'default': 0, 'type': 'int', 'choices': None, 'required': False, 'help': '
                        [int > 0] Alignment must consist of a minimal of DIFF% aligned cell pairs that are different from
                        previous(better) local alignments in order to be considered as another new alignment (e.g. `-x 20` means
                        20 persent).', 'metavar': None, 'container': <argparse._ArgumentGroup object at 0x0000024B6EEA3E80>,
                        'prog': 'mDELTA'}ligned cell pairs that are different from previous(better) local alignments in order to
                        be considered as another new alignment (e.g. `-x 20` means 20 persent).
  -mg MERGE, --merge MERGE
                        [float] This is the scaling factor for calculating the score of merging an internal node (e.g. -mg -1),
                        which is multiplied by the number of tips of the internal node to be merged. Default to 0.

More details on https://github.com/Chenjy0212/modelta

Citation

If you use this project in your research, please cite this project.

@misc{modelta2022,
    author = {Jingyu Chen},
    title = {mDELTA: Multifuricating Developmental cEll Lineage Tree Alignment},
    year = {2022},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/Chenjy0212/modelta}},
}

Introduction

Student of @SYSU. :school:

Undergraduate majoring in computer science, master majoring in bioinformatics. :man_technologist:

I hope my program can be helpful to your research. :heart:

How to contact the author has been written at the top. :eyes:

sysulogo

PyPI - Python Version PyPI license
Github Stars Bilibili Zhihu Weibo neteasy-mysic douyin instagram
QQ wechat mail gmail
sysu

Update

2022-07-27

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

modelta-1.0.30.tar.gz (59.7 kB view hashes)

Uploaded Source

Built Distribution

modelta-1.0.30-py3-none-any.whl (37.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page