Chen and Yang Lab multifurcating developmental cell lineage tree alignment
mDELTA
The Yang Lab Multifurcating Developmental cEll Lineage Tree Alignment (mDELTA) algorithm, provided as a Python package and as an executable program.
Use either one to compute a score matrix for analyzing the relationships between nodes of two lineage trees, or to test the correlation between them.
If you find the project helpful, you can star the repository to keep track of it. Thank you for your support.
Install
Required package
- pandas: Provides the DataFrame that holds the score matrix.
- numpy: Core numerical computing routines.
- munkres: Implementation of the Kuhn-Munkres (Hungarian) algorithm, used to find the maximum-score assignment while filling the score matrix by dynamic programming.
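To make "maximum-score assignment" concrete, here is a minimal sketch that finds the best one-to-one pairing between the rows and columns of a small score matrix by brute force. It is only a stand-in for what the Kuhn-Munkres algorithm computes efficiently; the function name `best_assignment` and the sample values are illustrative assumptions, not part of modelta.

```python
from itertools import permutations


def best_assignment(scores):
    """Return (total, pairs) for the one-to-one row/column pairing with the highest sum.

    Brute force over all column permutations, so only suitable for tiny
    square matrices. Kuhn-Munkres reaches the same optimum in polynomial time.
    """
    n = len(scores)
    best_total, best_pairs = float("-inf"), []
    for cols in permutations(range(n)):
        total = sum(scores[row][col] for row, col in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_total, best_pairs


# Illustrative 3x3 matrix: 2.0 where the two nodes agree, -1.0 elsewhere.
leaf_scores = [
    [2.0, -1.0, -1.0],
    [-1.0, 2.0, -1.0],
    [-1.0, -1.0, 2.0],
]
total, pairs = best_assignment(leaf_scores)
print(total, pairs)
```

For anything larger than a handful of children per node, a real Hungarian-algorithm implementation is the practical choice, since the brute-force search grows factorially.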
Optional package
- tqdm: Displays the progress during the calculation phase.
- multiprocess: Computing the p-value requires shuffling the original sequence many times and repeating the calculation each time; running these in multiple processes can substantially reduce the waiting time.
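Before installing, it can be handy to see which of the packages listed above are already importable. The following is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

REQUIRED = ("pandas", "numpy", "munkres")
OPTIONAL = ("tqdm", "multiprocess")


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


print("missing required:", missing_packages(REQUIRED))
print("missing optional:", missing_packages(OPTIONAL))
```

An empty list for the required packages means the pip install step should have nothing extra to pull in for them.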
Pip install
$ pip install modelta
Source code install
(1) Offline
Step 1: $ git clone https://github.com/Chenjy0212/modelta.git
Step 2: $ cd modelta && python setup.py install
(2) Online
$pip install git+https://github.com/Chenjy0212/modelta.git@main
For Python users ↓
Quick Start
You can use this package in your Python code. For example, in a Jupyter Notebook:
import modelta
from pprint import pprint
example = modelta.scoremat(TreeSeqFile='ExampleFile/tree.nwk',
                           TreeSeqFile2='ExampleFile/tree.nwk',
                           Name2TypeFile='ExampleFile/Name2Type.csv',
                           Name2TypeFile2='ExampleFile/Name2Type.csv',
                           top=3,
                           notebook=1,
                           overlap=5)
pprint(example)
Result
Matrix Node: |██████████| 121/121 100%
121/121 [00:00<00:00, 2573.11it/s]
{'TopScoreList': [{'Root1_label': 'root',
'Root1_match': ['0',
'1',
'0,0',
'0,1',
'0,2',
'0,0,0',
'0,0,1',
'0,0,2',
'0,2,0',
'0,2,1'],
'Root1_node': '(((a,b,c),d,(e,f)),a)',
'Root1_prune': [],
'Root1_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
'Root2_label': 'root',
'Root2_match': ['0',
'1',
'0,0',
'0,1',
'0,2',
'0,0,0',
'0,0,1',
'0,0,2',
'0,2,0',
'0,2,1'],
'Root2_node': '(((a,b,c),d,(e,f)),a)',
'Root2_prune': [],
'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
'Score': 14.0,
'col': 10,
'row': 10},
{'Root1_label': '0',
'Root1_match': ['0',
'0,0',
'0,1',
'0,2',
'0,0,0',
'0,0,1',
'0,0,2',
'0,2,0',
'0,2,1'],
'Root1_node': '((a,b,c),d,(e,f))',
'Root1_prune': ['1'],
'Root1_seq': '((a1,a2,a3),a4,(a5,a6))',
'Root2_label': 'root',
'Root2_match': ['0',
'0,0',
'0,1',
'0,2',
'0,0,0',
'0,0,1',
'0,0,2',
'0,2,0',
'0,2,1'],
'Root2_node': '(((a,b,c),d,(e,f)),a)',
'Root2_prune': ['1'],
'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
'Score': 11.0,
'col': 10,
'row': 9},
{'Root1_label': '0,0',
'Root1_match': ['0,0', '0,0,0', '0,0,1', '0,0,2'],
'Root1_node': '(a,b,c)',
'Root1_prune': ['0,1', '0,2,0', '0,2,1'],
'Root1_seq': '(a1,a2,a3)',
'Root2_label': '0',
'Root2_match': ['0,0', '0,0,0', '0,0,1', '0,0,2'],
'Root2_node': '((a,b,c),d,(e,f))',
'Root2_prune': ['0,1', '0,2,0', '0,2,1', '1'],
'Root2_seq': '((a1,a2,a3),a4,(a5,a6))',
'Score': 3.0,
'col': 9,
'row': 7}],
'matrix':
Root2   0,0,0  0,0,1  0,0,2    0,1  0,2,0  0,2,1      1    0,0    0,2      0   root
Root1
0,0,0     2.0   -1.0   -1.0   -1.0   -1.0   -1.0    2.0    0.0   -1.0   -1.0   -1.0
0,0,1    -1.0    2.0   -1.0   -1.0   -1.0   -1.0   -1.0    0.0   -1.0   -1.0   -1.0
0,0,2    -1.0   -1.0    2.0   -1.0   -1.0   -1.0   -1.0    0.0   -1.0   -1.0   -1.0
0,1      -1.0   -1.0   -1.0    2.0   -1.0   -1.0   -1.0   -1.0   -1.0   -1.0   -1.0
0,2,0    -1.0   -1.0   -1.0   -1.0    2.0   -1.0   -1.0   -1.0    1.0   -1.0   -1.0
0,2,1    -1.0   -1.0   -1.0   -1.0   -1.0    2.0   -1.0   -1.0    1.0   -1.0   -1.0
1         2.0   -1.0   -1.0   -1.0   -1.0   -1.0    2.0    0.0   -1.0   -1.0   -1.0
0,0       0.0    0.0    0.0   -1.0   -1.0   -1.0    0.0    6.0   -2.0    3.0    2.0
0,2      -1.0   -1.0   -1.0   -1.0    1.0    1.0   -1.0   -2.0    4.0    0.0   -1.0
0        -1.0   -1.0   -1.0   -1.0   -1.0   -1.0   -1.0    3.0    0.0   12.0   11.0
root     -1.0   -1.0   -1.0   -1.0   -1.0   -1.0   -1.0    2.0   -1.0   11.0   14.0}
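Judging from the output above, node labels such as '0,0' appear to be paths of child indices starting from the root (for instance, label '0,0' is paired with the subtree '(a,b,c)'). The sketch below resolves such a label against a branch-length-free Newick string under that assumption. The helpers `parse_newick`, `to_newick` and `subtree` are hypothetical and not part of modelta, and the parser does not handle internal node names or a missing closing parenthesis.

```python
def parse_newick(text):
    """Parse a branch-length-free Newick string into nested lists of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


def to_newick(node):
    """Serialize nested lists back into a Newick string."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def subtree(text, label):
    """Return the subtree addressed by a label such as 'root', '0' or '0,2,1'."""
    node = parse_newick(text)
    if label != "root":
        for index in label.split(","):
            node = node[int(index)]
    return to_newick(node)


tree = "(((a,b,c),d,(e,f)),a)"
for label in ("root", "0", "0,0", "0,2,1"):
    print(label, "->", subtree(tree, label))
```

Applied to the example tree, the labels 'root', '0' and '0,0' give the same strings that the output lists under Root1_node, which supports the child-index reading.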
Parameter analysis
Parameters marked with * are required; all others are optional.
- TreeSeqFile & TreeSeqFile2: [path/filename*] Cell lineage tree file with branch length information removed. For the expected format, see ExampleFile/tree.nwk.
- mv: [float, default = 2.] The matching score between identical nodes, typically used when the parameter ScoreDictFile is left at its default.
- pv: [float, default = -1.] The prune score between different nodes.
- top: [int > 0, default = 0] Select the top few meaningful scores in the score matrix. With the default, the output is:
{'T1root_T2root': [{'Root1_label': 'root',
'Root1_match': ['0',
'1',
'0,0',
'0,1',
'0,2',
'0,0,0',
'0,0,1',
'0,0,2',
'0,2,0',
'0,2,1'],
'Root1_node': '(((a,b,c),d,(e,f)),a)',
'Root1_prune': [],
'Root1_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
'Root2_label': 'root',
'Root2_match': ['0',
'1',
'0,0',
'0,1',
'0,2',
'0,0,0',
'0,0,1',
'0,0,2',
'0,2,0',
'0,2,1'],
'Root2_node': '(((a,b,c),d,(e,f)),a)',
'Root2_prune': [],
'Root2_seq': '(((a1,a2,a3),a4,(a5,a6)),a1)',
'Score': 14.0,
'col': 10,
'row': 10}],
......
}
- notebook: [bool, default = False] Whether the code is written and run in a Jupyter Notebook environment.
- Tqdm: [bool, default = True] Whether to display the progress bar.
- overlap: [int > 0, default = 0] In the local results, a later comparison result cannot share X% or more of its node pairs with any earlier result.
- merge: [int > 0, default = 0] Merge internal node to prune.
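The overlap rule can be pictured as a filter over ranked results. The sketch below is one possible reading, not modelta's actual code: it assumes each result is a set of (tree 1 label, tree 2 label) pairs, that the shared fraction is measured against the later result's own pair count, and that a value of 0 disables the filter. The function name `filter_by_overlap` is hypothetical.

```python
def filter_by_overlap(results, overlap_percent):
    """Keep results in rank order, dropping those too similar to an earlier kept one.

    `results` is a list of sets of (tree1_label, tree2_label) pairs, best first.
    Assumptions: the shared fraction is relative to the later result's size,
    and overlap_percent <= 0 means "no filtering".
    """
    if overlap_percent <= 0:
        return list(results)
    kept = []
    for pairs in results:
        size = max(len(pairs), 1)  # avoid dividing by zero for empty results
        too_similar = any(
            100.0 * len(pairs & earlier) / size >= overlap_percent
            for earlier in kept
        )
        if not too_similar:
            kept.append(pairs)
    return kept


r1 = {("0", "0"), ("0,0", "0,0"), ("0,1", "0,1"), ("0,2", "0,2")}
r2 = {("0,0", "0,0"), ("0,1", "0,1"), ("0,2", "0,2")}  # fully contained in r1
r3 = {("1", "1"), ("1,0", "1,0"), ("0,0", "0,0")}      # shares 1 of 3 pairs with r1
print(len(filter_by_overlap([r1, r2, r3], 50)))
```

With a threshold of 50, r2 is dropped because all of its pairs already appear in r1, while r3 is kept because only a third of its pairs do.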
For qualitative calculation:
- Name2TypeFile & Name2TypeFile2: [path/filename*] Convert tree node name to type. For the expected format, see ExampleFile/Name2Type.csv.
- ScoreDictFile: [path/filename, default = ''] Defines the score of matches between nodes. For the expected format, see ExampleFile/socrefile.csv.
The matching score between nodes is determined by the "ScoreDictFile" file.
If the file is empty, only identical nodes are paired, and the default matching score is 2 (float).
node: a <-> a = 2.(custom)
b <-> b = 3.(custom)
a <-> b = ?(custom)
The higher the score, the stronger the similarity
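The lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than modelta's implementation: the function `match_score`, the frozenset-keyed dictionary, and the choice to return None for undefined pairings are all hypothetical, and the a <-> b value of 1.0 is made up for the example.

```python
def match_score(type_a, type_b, score_dict=None, mv=2.0):
    """Look up the match score for two node types.

    `score_dict` maps unordered type pairs (frozensets) to custom scores.
    Without a custom entry, identical types score `mv`; anything else
    returns None, meaning "no pairing defined" (an assumption of this sketch).
    """
    score_dict = score_dict or {}
    key = frozenset((type_a, type_b))
    if key in score_dict:
        return score_dict[key]
    if type_a == type_b:
        return mv
    return None


# Hypothetical custom scores in the spirit of the example above.
custom = {
    frozenset(("a",)): 2.0,      # a <-> a
    frozenset(("b",)): 3.0,      # b <-> b
    frozenset(("a", "b")): 1.0,  # a <-> b, illustrative value only
}
print(match_score("a", "a"), match_score("b", "b", custom), match_score("b", "a", custom))
```

Keying on frozensets makes the lookup symmetric, so a <-> b and b <-> a resolve to the same entry.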
For quantitative calculation:
- ScoreDictFile: [path/filename*] Defines the score of matches between nodes. For the expected format, see ExampleFile/Qscorefile.csv.
- Name2TypeFile & Name2TypeFile2: [path/filename or no input] Convert tree node name to type. For the expected format, see ExampleFile/Name2Type.csv.
The matching score between nodes is determined by the "ScoreDictFile" file.
The file is required. You can change the score for identical nodes through the parameter "mv".
     Gene0  Gene1  Gene2
a        1      2      3
b        2      3      4
node: (1-2)**2 + (2-3)**2 + (3-4)**2 #Euclidean distance
Then get the final score according to the smoothing function.
The lower the score, the stronger the similarity
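The arithmetic in the example above can be checked with a few lines of Python. The expression shown is the sum of squared differences; taking its square root gives the Euclidean distance. The smoothing function is not specified here, so this sketch stops at the distance, and the names `expression` and `squared_distance` are illustrative.

```python
import math

# Expression values for nodes a and b across Gene0, Gene1, Gene2.
expression = {
    "a": [1, 2, 3],
    "b": [2, 3, 4],
}


def squared_distance(x, y):
    """Sum of squared per-gene differences between two expression vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


sq = squared_distance(expression["a"], expression["b"])
dist = math.sqrt(sq)  # Euclidean distance
print(sq, dist)
```

Identical expression vectors give a distance of zero, which matches the statement that lower values mean stronger similarity.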
P-value calculation
modelta.pvalue(times=3,
               topscorelist=example['TopScoreList'],
               ScoreDictFile='',
               CPUs=50,
               mv=2,
               pv=-1)
Result
Pvalue : 100%|██████████| 3/3 [00:00<00:00, 4.05it/s]
Pvalue : 100%|██████████| 3/3 [00:00<00:00, 4.38it/s]
Pvalue : 100%|██████████| 3/3 [00:00<00:00, 4.45it/s]
[[3.0, 4.0, 0.0, 14.0], [4.0, 5.0, 3.0, 11.0], [5.0, 0.0, 1.0, 11.0]]
The returned result contains the "times" matching scores corresponding to each of the "top" maximum values.
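How the returned lists map to a p-value is not spelled out here. One possible reading, assumed in the sketch below, is that each inner list holds the shuffled scores followed by the observed score as its last element. Under that assumption, an empirical p-value with the common add-one correction can be computed as follows; the function `empirical_p` is hypothetical and not part of modelta.

```python
def empirical_p(shuffled_scores, observed):
    """Add-one empirical p-value: share of shuffles scoring at least the observed value."""
    hits = sum(1 for score in shuffled_scores if score >= observed)
    return (hits + 1) / (len(shuffled_scores) + 1)


# Result copied from the example above.
# Assumption: the last entry of each row is the observed score.
rows = [[3.0, 4.0, 0.0, 14.0], [4.0, 5.0, 3.0, 11.0], [5.0, 0.0, 1.0, 11.0]]
p_values = [empirical_p(row[:-1], row[-1]) for row in rows]
print(p_values)
```

With only three shuffles the smallest attainable value is 0.25, so a much larger "times" would be needed for a meaningful estimate.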
Parameter analysis
Parameters marked with * are required; all others are optional.
- times: [int > 0*] The number of times the original sequence is shuffled, for example:
times = 3 #Randomly disrupt the nodes, but the structure remains unchanged
(((a,b,c),d,(e,f)),a) -> (((a,b,c),d,(e,f)),a)
-> (((a,c,d),b,(a,f)),e)
-> (((e,f,a),d,(b,c)),a)
- topscorelist: [example['TopScoreList']*] The list of maximum values obtained earlier.
- CPUs: [int > 0, default = 50] Multi-process computing can greatly reduce the waiting time. The default process pool size is 50, but it is limited by local computer resources to at most the number of local CPU cores minus 1.
- mv & pv & notebook & Tqdm & overlap: These parameters are described in detail above.
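The shuffling described for "times" (leaf nodes are randomly reassigned while the tree structure stays unchanged) can be sketched with the standard library. This is an illustration, not modelta's own routine; `shuffle_leaves` and the `LEAF` pattern are hypothetical names, and the sketch assumes a Newick string without branch lengths or internal node names.

```python
import random
import re

# A leaf name is any run of characters that is not Newick punctuation.
LEAF = re.compile(r"[^(),;]+")


def shuffle_leaves(newick, rng=random):
    """Randomly permute leaf names while keeping the parenthesis structure intact."""
    names = LEAF.findall(newick)
    shuffled = names[:]
    rng.shuffle(shuffled)
    replacements = iter(shuffled)
    # Substitute leaves left to right with the permuted names.
    return LEAF.sub(lambda _match: next(replacements), newick)


tree = "(((a,b,c),d,(e,f)),a)"
shuffled_tree = shuffle_leaves(tree, random.Random(0))
print(shuffled_tree)
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing p-values across parameter settings.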
For ordinary users ↓
Quick Start
We provide executable files that produce results when run from the terminal with the corresponding parameters. Download the executable for your operating system: [Windows] / [Linux]
Windows
mDELTA.exe ./ExampleFile/tree.nwk ./ExampleFile/tree.nwk -t 3
Linux
./mDELTA ../ExampleFile/tree.nwk ../ExampleFile/tree.nwk -t 3
Help
Windows: $mDELTA.exe -h
Linux: $./mDELTA -h
usage: MODELTA [-h] [-nt NAME2TYPEFILE] [-nt2 NAME2TYPEFILE2] [-sd SCOREDICTFILE] [-t TOP] [-m MV] [-p PV] [-T TQDM] [-n NOTEBOOK]
[-P PVALUE] [-a ALG] [-c CPUS]
TreeSeqFile TreeSeqFile2
Yang Lab Multifuricating Developmental cEll Lineage Tree Alignment (mDELTA) algorithm
positional arguments:
TreeSeqFile [path/filename] Cell lineage tree file with branch length information removed.
TreeSeqFile2 [path/filename] Cell lineage tree file with branch length information removed.
optional arguments:
-h, --help show this help message and exit
-nt NAME2TYPEFILE, --Name2TypeFile NAME2TYPEFILE
[path/filename] Convert tree node name to type.
-nt2 NAME2TYPEFILE2, --Name2TypeFile2 NAME2TYPEFILE2
[path/filename] Convert tree node name to type.
-sd SCOREDICTFILE, --ScoreDictFile SCOREDICTFILE
[path/filename] Defines the score of matches between types.
-t TOP, --top TOP [int > 0] Select the top few meaningful scores in the score matrix.
-m MV, --mv MV [float] The matching score between the same nodes.
-p PV, --pv PV [float] The prune score between the different nodes.
-T TQDM, --Tqdm TQDM [0(off) or 1(on)] Whether to display the operation progress bar.
-n NOTEBOOK, --notebook NOTEBOOK
[0(off) or 1(on)] Is it written and run in the jupyter notebook environment.
-P PVALUE, --Pvalue PVALUE
[int > 0] The number of times the original sequence needs to be disrupted.
-a ALG, --Alg ALG [KM / GA] Represent KM algorithm and GA algorithm respectively to find the maximum value of each node of
the score matrix
-c CPUS, --CPUs CPUS [int > 0] Multi process computing can greatly reduce the waiting time. The default process pool is 50, but
limited by local computer resources, it can reach the maximum number of local CPU cores - 1.
-x overlap, --overlap overlap
[int > 0]
-mg merge, --merge merge internal node to prune.
[0(off) or 1(on)]
Developer: Yang Lab(https://www.labxing.com/profile/10413), Details: https://github.com/Chenjy0212/modelta
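When scripting several runs, it can help to assemble the command line programmatically. The sketch below only builds the argument list from flags that appear in the help text above (-t, -P, -c); the helper `build_command` is hypothetical, and the actual call is left commented out because it requires the downloaded executable.

```python
def build_command(executable, tree1, tree2, top=None, pvalue=None, cpus=None):
    """Build an argument list for the mDELTA executable; None values are skipped."""
    cmd = [executable, tree1, tree2]
    for flag, value in (("-t", top), ("-P", pvalue), ("-c", cpus)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_command("./mDELTA", "../ExampleFile/tree.nwk", "../ExampleFile/tree.nwk", top=3)
print(" ".join(cmd))

# To actually run it once the executable is in place:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.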
Citation
If you use this project in your research, please cite this project.
@misc{modelta2022,
  author = {Jingyu Chen},
  title = {mDELTA: Multifuricating Developmental cEll Lineage Tree Alignment},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Chenjy0212/modelta}},
}
Introduction
Student at @SYSU. :school:
Undergraduate degree in computer science; master's degree in bioinformatics. :man_technologist:
I hope my program can be helpful to your research. :heart:
The author's contact information is given at the top. :eyes:
Update
2022-05-25
Added internal node correspondence. New output fields: Root_match and Root_prune.
Added a new parameter, -x & --overlap. For example, if the value is x%, a later comparison result in the local results cannot share x% or more of its node pairs with a previous result.