Chen and Yang Lab Multi fork Development cell lineage tree alignment

mDELTA

  • Yang Lab Multifurcating Developmental cEll Lineage Tree Alignment (mDELTA) algorithm package and executable program.

  • You can use them to obtain a score matrix for analyzing the relationships between the nodes of lineage trees, or to test the correlation.

  • If this project is helpful to you, you can star the repository to keep track of it. Thank you for your support.

Install

Required package

  • pandas: The score matrix is built on the DataFrame structure.
  • numpy: Essential numerical computing routines.
  • munkres: An implementation of the Kuhn-Munkres (Hungarian) assignment algorithm, used to find the maximum value during the dynamic programming over the score matrix.
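
To illustrate what the assignment step computes, the sketch below finds the maximum-score one-to-one pairing between two sets of child nodes by brute force. It is a pure-Python stand-in written for illustration, not the package's code: `best_assignment` is a hypothetical helper, and munkres solves the same problem far more efficiently. The 3x3 scores are illustrative values built from the default mv = 2 and pv = -1.

```python
from itertools import permutations


def best_assignment(scores):
    """Return (total, pairs) for the one-to-one row/column pairing with the
    highest total score. Brute force; assumes len(scores) <= len(scores[0])."""
    n_rows, n_cols = len(scores), len(scores[0])
    best_total, best_pairs = None, None
    for cols in permutations(range(n_cols), n_rows):
        total = sum(scores[r][c] for r, c in enumerate(cols))
        if best_total is None or total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_total, best_pairs


# Illustrative pairwise scores between three children of one node (rows)
# and three children of another node (columns).
scores = [
    [2.0, -1.0, -1.0],
    [-1.0, 2.0, -1.0],
    [-1.0, -1.0, 2.0],
]
total, pairs = best_assignment(scores)
print(total, pairs)
```

Brute force grows factorially with the number of children, which is why a dedicated assignment algorithm is used in practice.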

Optional package

  • tqdm: Displays the progress during the calculation phase.
  • multiprocess: Calculating the p value requires shuffling the original sequence many times and repeating the calculation each time, so using multiple processes can effectively reduce the waiting time.

Pip install

$ pip install modelta

Source code install

(1) Offline
Step 1: $ git clone https://github.com/Chenjy0212/modelta.git
Step 2: $ cd modelta
Step 3: $ python setup.py install
(2) Online
$ pip install git+https://github.com/Chenjy0212/modelta.git@main

For Python users ↓

Quick Start

You can use this package in your Python code. For example, run the following in a Jupyter Notebook:

import modelta
from pprint import pprint

example = modelta.scoremat(TreeSeqFile = 'ExampleFile/tree.nwk',
                       TreeSeqFile2 = 'ExampleFile/tree.nwk',
                       Name2TypeFile = 'ExampleFile/Name2Type.csv',
                       Name2TypeFile2 ='ExampleFile/Name2Type.csv',
                       top = 3,
                       notebook = 1,
                       overlap = 5,)
pprint(example)

Result

Matrix Node: |██████████| 121/121 100%
121/121 [00:00<00:00, 2573.11it/s]
{'TopScoreList': [{'Root1_label': 'root',
                   'Root1_match': ['0',
                                   '1',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root1_node': '(((a,b,c),d,(e,f)),a)',
                   'Root1_prune': [],
                   'Root1_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Root2_label': 'root',
                   'Root2_match': ['0',
                                   '1',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root2_node': '(((a,b,c),d,(e,f)),a)',
                   'Root2_prune': [],
                   'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Score': 14.0,
                   'col': 10,
                   'row': 10},
                  {'Root1_label': '0',
                   'Root1_match': ['0',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root1_node': '((a,b,c),d,(e,f))',
                   'Root1_prune': ['1'],
                   'Root1_seq': '((a1,a2,a3),a4,(a5,a6))',
                   'Root2_label': 'root',
                   'Root2_match': ['0',
                                   '0,0',
                                   '0,1',
                                   '0,2',
                                   '0,0,0',
                                   '0,0,1',
                                   '0,0,2',
                                   '0,2,0',
                                   '0,2,1'],
                   'Root2_node': '(((a,b,c),d,(e,f)),a)',
                   'Root2_prune': ['1'],
                   'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                   'Score': 11.0,
                   'col': 10,
                   'row': 9},
                  {'Root1_label': '0,0',
                   'Root1_match': ['0,0', '0,0,0', '0,0,1', '0,0,2'],
                   'Root1_node': '(a,b,c)',
                   'Root1_prune': ['0,1', '0,2,0', '0,2,1'],
                   'Root1_seq': '(a1,a2,a3)',
                   'Root2_label': '0',
                   'Root2_match': ['0,0', '0,0,0', '0,0,1', '0,0,2'],
                   'Root2_node': '((a,b,c),d,(e,f))',
                   'Root2_prune': ['0,1', '0,2,0', '0,2,1', '1'],
                   'Root2_seq': '((a1,a2,a3),a4,(a5,a6))',
                   'Score': 3.0,
                   'col': 9,
                   'row': 7}],
 'matrix': Root2  0,0,0  0,0,1  0,0,2  0,1  0,2,0  0,2,1    1  0,0  0,2     0  root
Root1                                                                   
0,0,0    2.0   -1.0   -1.0 -1.0   -1.0   -1.0  2.0  0.0 -1.0  -1.0  -1.0
0,0,1   -1.0    2.0   -1.0 -1.0   -1.0   -1.0 -1.0  0.0 -1.0  -1.0  -1.0
0,0,2   -1.0   -1.0    2.0 -1.0   -1.0   -1.0 -1.0  0.0 -1.0  -1.0  -1.0
0,1     -1.0   -1.0   -1.0  2.0   -1.0   -1.0 -1.0 -1.0 -1.0  -1.0  -1.0
0,2,0   -1.0   -1.0   -1.0 -1.0    2.0   -1.0 -1.0 -1.0  1.0  -1.0  -1.0
0,2,1   -1.0   -1.0   -1.0 -1.0   -1.0    2.0 -1.0 -1.0  1.0  -1.0  -1.0
1        2.0   -1.0   -1.0 -1.0   -1.0   -1.0  2.0  0.0 -1.0  -1.0  -1.0
0,0      0.0    0.0    0.0 -1.0   -1.0   -1.0  0.0  6.0 -2.0   3.0   2.0
0,2     -1.0   -1.0   -1.0 -1.0    1.0    1.0 -1.0 -2.0  4.0   0.0  -1.0
0       -1.0   -1.0   -1.0 -1.0   -1.0   -1.0 -1.0  3.0  0.0  12.0  11.0
root    -1.0   -1.0   -1.0 -1.0   -1.0   -1.0 -1.0  2.0 -1.0  11.0  14.0}
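
The node labels in the result above ('root', '0', '0,0', '0,2,1', ...) read as child-index paths from the root. The sketch below reproduces that labelling for the example tree so the rows and columns of the matrix are easier to interpret. It is an illustration inferred from the printed output, not the package's own parser; `parse_newick`, `to_newick` and `label_nodes` are hypothetical helper names.

```python
def parse_newick(text):
    """Parse a branch-length-free Newick string into nested lists of leaf names."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return parse_node()


def to_newick(node):
    """Turn a nested list back into a Newick-style string."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def label_nodes(node, path=()):
    """Yield (label, subtree) pairs; a label is the path of child indices."""
    label = ",".join(str(i) for i in path) if path else "root"
    yield label, to_newick(node)
    if not isinstance(node, str):
        for index, child in enumerate(node):
            yield from label_nodes(child, path + (index,))


labels = dict(label_nodes(parse_newick("(((a,b,c),d,(e,f)),a)")))
for label, subtree in labels.items():
    print(label, subtree)
```

The example tree has 11 nodes under this labelling, which agrees with the 11 x 11 matrix and the 121 matrix entries reported by the progress bar.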

Parameter analysis

If a parameter is marked with an *, it is required; otherwise, it is optional.

  • TreeSeqFile & TreeSeqFile2: [path/filename *] Cell lineage tree files with branch length information removed. For the expected format, see ExampleFile/tree.nwk
  • mv: [float, default = 2.] The matching score between identical nodes. It is mainly used when the parameter ScoreDictFile is left at its default.
  • pv: [float, default = -1.] The prune score between different nodes.
  • top: [int >= 0, default = 0] Select the top few meaningful scores in the score matrix. With the default value, the output looks like this:
{'T1root_T2root': [{'Root1_label': 'root',
                    'Root1_match': ['0',
                                    '1',
                                    '0,0',
                                    '0,1',
                                    '0,2',
                                    '0,0,0',
                                    '0,0,1',
                                    '0,0,2',
                                    '0,2,0',
                                    '0,2,1'],
                    'Root1_node': '(((a,b,c),d,(e,f)),a)',
                    'Root1_prune': [],
                    'Root1_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                    'Root2_label': 'root',
                    'Root2_match': ['0',
                                    '1',
                                    '0,0',
                                    '0,1',
                                    '0,2',
                                    '0,0,0',
                                    '0,0,1',
                                    '0,0,2',
                                    '0,2,0',
                                    '0,2,1'],
                    'Root2_node': '(((a,b,c),d,(e,f)),a)',
                    'Root2_prune': [],
                    'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
                    'Score': 14.0,
                    'col': 10,
                    'row': 10}],
                    
                    ......

}                   
  • notebook: [bool, default = False] Whether the code is written and run in a Jupyter Notebook environment.
  • Tqdm: [bool, default = True] Whether to display the progress bar.
  • overlap: [int >= 0, default = 0] Overlap threshold X, in percent. Among the local results, a later comparison result may not share X% or more of its node pairs with a previous result.
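
One way to read the overlap rule is sketched below. This is an interpretation, not the package's implementation: `filter_overlapping` is a hypothetical helper, each result is represented simply as a collection of (tree 1 node, tree 2 node) pairs, a later result is compared against each earlier kept result separately, and a threshold of 0 is assumed to mean that no filtering is applied.

```python
def filter_overlapping(results, overlap_percent):
    """Keep ranked results, dropping any later result in which overlap_percent
    or more of its node pairs already occur in an earlier kept result."""
    if overlap_percent <= 0:  # assumption: 0 disables the filter
        return [set(pairs) for pairs in results]
    kept = []
    for pairs in results:
        pairs = set(pairs)
        too_similar = any(
            100 * len(pairs & earlier) >= overlap_percent * len(pairs)
            for earlier in kept
        )
        if not too_similar:
            kept.append(pairs)
    return kept


# Illustrative ranked results (best first), as (Root1 node, Root2 node) pairs.
ranked = [
    [("0", "0"), ("0,0", "0,0"), ("0,1", "0,1"), ("0,2", "0,2")],
    [("0,0", "0,0"), ("0,1", "0,1")],  # fully contained in the first result
    [("1", "1")],                      # shares nothing with the first result
]
kept = filter_overlapping(ranked, overlap_percent=50)
print(len(kept))
```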

If qualitative calculation:

  • Name2TypeFile & Name2TypeFile2: [path/filename *] Convert tree node names to types. For the expected format, see ExampleFile/Name2Type.csv
  • ScoreDictFile: [path/filename, default = ''] Defines the matching scores between nodes. For the expected format, see ExampleFile/socrefile.csv
The matching score between nodes is taken from the "ScoreDictFile" file.
If it is left empty, only identical nodes are paired, with a default matching score of 2 (float).

node: a <-> a = 2.(custom)
      b <-> b = 3.(custom)
      a <-> b = ?(custom)
The higher the score, the stronger the similarity
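
The lookup described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the package's code: `match_score` is a hypothetical helper, the score table is held in a plain dict rather than read from a CSV file, pairs missing from the table are assumed to fall back to mv for identical types and pv for different types, and the a <-> b value of 1.0 is made up for the example.

```python
def match_score(type1, type2, score_dict=None, mv=2.0, pv=-1.0):
    """Score a pair of node types; higher means more similar.

    score_dict maps frozenset({type1, type2}) to a custom score. Without an
    entry, identical types score mv and different types score pv (assumed)."""
    if score_dict:
        key = frozenset((type1, type2))
        if key in score_dict:
            return score_dict[key]
    return mv if type1 == type2 else pv


# Custom scores in the spirit of the example above.
custom = {
    frozenset(("a",)): 2.0,      # a <-> a
    frozenset(("b",)): 3.0,      # b <-> b
    frozenset(("a", "b")): 1.0,  # a <-> b (illustrative value)
}

print(match_score("a", "a"))          # no table: identical types use mv
print(match_score("a", "b"))          # no table: different types use pv
print(match_score("b", "b", custom))  # custom score from the table
print(match_score("b", "a", custom))  # the lookup is symmetric
```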

If quantitative calculation:

  • ScoreDictFile: [path/filename *] Defines the matching scores between nodes. For the expected format, see ExampleFile/Qscorefile.csv
  • Name2TypeFile & Name2TypeFile2: [path/filename or no input] Convert tree node names to types. For the expected format, see ExampleFile/Name2Type.csv
The matching score between nodes is computed from the "ScoreDictFile" file, so the file is required.
You can change the score between identical nodes through the parameter "mv".

   Gene0  Gene1  Gene2  
a    1      2      3  
b    2      3      4

node: (1-2)**2 + (2-3)**2 + (3-4)**2 = 3 # squared Euclidean distance between a and b
The final score is then obtained by applying the smoothing function.
The lower the score, the stronger the similarity.
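
The distance in the example above can be reproduced with a few lines of Python. The sketch only covers the distance step; the smoothing function is not specified here, so it is left out. `squared_distance` and the `expression` dict are hypothetical names used for illustration.

```python
def squared_distance(u, v):
    """Sum of squared differences between two equally long expression vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(u, v))


# Expression values of Gene0, Gene1, Gene2 for nodes a and b, as in the table.
expression = {
    "a": [1, 2, 3],
    "b": [2, 3, 4],
}

distance = squared_distance(expression["a"], expression["b"])
print(distance)
```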

P-value calculation

modelta.pvalue(times = 3, 
               topscorelist = example['TopScoreList'], 
               ScoreDictFile='',
               CPUs = 50, 
               mv = 2, 
               pv = -1)

Result

 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.05it/s]
 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.38it/s]
 Pvalue : 100%|██████████| 3/3 [00:00<00:00,  4.45it/s]
[[3.0, 4.0, 0.0, 14.0], [4.0, 5.0, 3.0, 11.0], [5.0, 0.0, 1.0, 11.0]]

The returned result contains the matching scores obtained over the "times" shuffled comparisons, with one list for each of the top-scoring results.
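
If you want to turn shuffled scores into an empirical p value yourself, a common permutation-test estimate is sketched below. This is a standard statistical convention, not something modelta.pvalue is documented to do: `empirical_pvalue` is a hypothetical helper, and the observed and shuffled scores are passed in separately with illustrative numbers.

```python
def empirical_pvalue(observed, shuffled_scores):
    """Share of shuffled scores at least as high as the observed score,
    with the usual +1 correction so the estimate is never exactly 0."""
    at_least = sum(1 for score in shuffled_scores if score >= observed)
    return (at_least + 1) / (len(shuffled_scores) + 1)


# Illustrative numbers: one observed alignment score and three shuffled scores.
p = empirical_pvalue(14.0, [3.0, 4.0, 0.0])
print(p)
```

With only three shuffles the smallest attainable value is 0.25, so a much larger "times" is needed for a meaningful estimate.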

Parameter analysis

If the parameter has an *, it is required; otherwise, it is optional

  • times: [int > 0 *] The number of times the original sequence is randomly shuffled, for example:
times = 3 # Randomly shuffle the nodes; the tree structure remains unchanged
(((a,b,c),d,(e,f)),a) -> (((a,b,c),d,(e,f)),a)
                      -> (((a,c,d),b,(a,f)),e)
                      -> (((e,f,a),d,(b,c)),a)
  • topscorelist: [example['TopScoreList'] *] The list of top-scoring results obtained earlier from modelta.scoremat.
  • CPUs: [int > 0, default = 50] Size of the process pool; multiprocessing can greatly reduce the waiting time. The default is 50, but the pool is limited by local computer resources to at most the number of local CPU cores minus 1.
  • The mv, pv, notebook, Tqdm and overlap parameters are described in detail above.
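
The shuffling illustrated for "times" can be sketched as follows: the leaf labels are permuted while the bracket structure stays untouched. This is an illustration only, not the package's code; `shuffle_leaves` is a hypothetical helper, and it assumes a branch-length-free tree string in which only leaves carry labels.

```python
import random
import re

# A label is any run of characters that is not a bracket or a comma.
LEAF = re.compile(r"[^(),]+")


def shuffle_leaves(tree, rng=random):
    """Return a copy of the tree string with its leaf labels randomly permuted."""
    labels = LEAF.findall(tree)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    replacements = iter(shuffled)
    return LEAF.sub(lambda match: next(replacements), tree)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = "(((a,b,c),d,(e,f)),a)"
shuffled_trees = [shuffle_leaves(original, rng) for _ in range(3)]
for tree in shuffled_trees:
    print(tree)
```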

For ordinary users ↓

Quick Start

We provide executable files that are run from the terminal with the corresponding parameters. Download the executable file for your operating system: [Windows] / [Linux]

Windows

mDELTA.exe ./ExampleFile/tree.nwk ./ExampleFile/tree.nwk -t 3

Linux

./mDELTA ../ExampleFile/tree.nwk ../ExampleFile/tree.nwk -t 3

Help

Windows: $ mDELTA.exe -h
Linux:   $ ./mDELTA -h
usage: MODELTA [-h] [-nt NAME2TYPEFILE] [-nt2 NAME2TYPEFILE2] [-sd SCOREDICTFILE] [-t TOP] [-m MV] [-p PV] [-T TQDM] [-n NOTEBOOK]
               [-P PVALUE] [-a ALG] [-c CPUS]
               TreeSeqFile TreeSeqFile2

Multi fork Development cell lineage tree alignment

positional arguments:
  TreeSeqFile           [path/filename] Cell lineage tree file with branch length information removed.
  TreeSeqFile2          [path/filename] Cell lineage tree file with branch length information removed.

optional arguments:
  -h, --help            show this help message and exit
  -nt NAME2TYPEFILE, --Name2TypeFile NAME2TYPEFILE
                        [path/filename] Convert tree node name to type.
  -nt2 NAME2TYPEFILE2, --Name2TypeFile2 NAME2TYPEFILE2
                        [path/filename] Convert tree node name to type.
  -sd SCOREDICTFILE, --ScoreDictFile SCOREDICTFILE
                        [path/filename] Defines the score of matches between types.
  -t TOP, --top TOP     [int > 0] Select the top few meaningful scores in the score matrix.
  -m MV, --mv MV        [float] The matching score between the same nodes.
  -p PV, --pv PV        [float] The prune score between the different nodes.
  -T TQDM, --Tqdm TQDM  [0(off) or 1(on)] Whether to display the operation progress bar.
  -n NOTEBOOK, --notebook NOTEBOOK
                        [0(off) or 1(on)] Is it written and run in the jupyter notebook environment.
  -P PVALUE, --Pvalue PVALUE
                        [int > 0] The number of times the original sequence needs to be disrupted.
  -a ALG, --Alg ALG     [KM / GA] Represent KM algorithm and GA algorithm respectively to find the maximum value of each node of
                        the score matrix
  -c CPUS, --CPUs CPUS  [int > 0] Multi process computing can greatly reduce the waiting time. The default process pool is 50, but
                        limited by local computer resources, it can reach the maximum number of local CPU cores - 1.
  -x overlap, --overlap overlap
                        [int > 0] 


Developer: Yang Lab(https://www.labxing.com/profile/10413), Details: https://github.com/Chenjy0212/modelta

Citation

If you use this project in your research, please cite this project.

@misc{modelta2022,
    author = {Jingyu Chen},
    title = {mDELTA: Multifuricating Developmental cEll Lineage Tree Alignment},
    year = {2022},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/Chenjy0212/modelta}},
}

Introduction

Student at @SYSU.

Undergraduate major in computer science, master's major in bioinformatics.

I hope my program can be helpful to your research.

Contact details for the author are listed at the top of the project page.

Update

2022-05-25

Add internal node correspondence to the output results: Root_match and Root_prune.

Add a new parameter -x / --overlap. For example, if the value is x, then among the local results, a later comparison result cannot have x% or more node pairs that duplicate a previous result.
