Large-scale Inference of Cell Lineage Trees and Genotype Calling from Noisy Single-Cell Data Using Efficient Local Search
Project description
Fast cell lineage tree reconstruction and genotype calling for large single cell DNA sequencing data
Software accompaniment for "ScisTree2 enables large-scale inference of cell lineage trees and genotype calling using efficient local search", Haotian Zhang, Yiming Zhang, Teng Gao and Yufeng Wu, Genome Research, in press, 2025 (https://genome.cshlp.org/content/early/2025/09/03/gr.280542.125.abstract). A preprint is available under a slightly different title: "Large-scale Inference of Cell Lineage Trees and Genotype Calling from Noisy Single-Cell Data Using Efficient Local Search", Haotian Zhang, Yiming Zhang, Teng Gao and Yufeng Wu, manuscript, 2025. The work was presented at the RECOMB 2025 conference under the title "ScisTree2: An Improved Method for Large-scale Inference of Cell Lineage Trees and Genotype Calling from Noisy Single Cell Data".
This is an enhanced version of ScisTree (Wu, "Accurate and efficient cell lineage tree inference from noisy single cell data: the maximum likelihood perfect phylogeny approach", Bioinformatics, Volume 36, Issue 3, Pages 742–750, 2020).
Documentation
You can refer to our documentation for more details, or simply follow the instructions below.
Required Tools
To use ScisTree2, you will need the following tools and libraries installed:
- python & pip: Version 3.6 or higher.
- g++: A C++ compiler. (from source only)
- make: For building the backend. (from source only)
*We have successfully tested it on Linux, macOS, and Windows (via WSL).
Installation
1. Install from PyPI:
- Upgrade pip:
python -m pip install --upgrade pip
- Install scistree2:
pip install scistree2
2. Install from source:
- Clone the repository:
git clone https://github.com/yufengwudcs/ScisTree2.git
cd ScisTree2
- Install the Python package (includes C++ backend compilation): You can install the scistree2 package using pip:
pip install .
This command also compiles the C++ backend automatically. Once built, the executable binary file can be found in scistree2/bin.
*We recommend that users create a virtual environment using either conda or venv to comply with PEP 668.
- (Optional) Manual C++ backend build (for testing/development): If you want to build or test the C++ backend (scistree) independently, navigate to the src directory and compile it using make:
cd src
make
# You can then test it directly, e.g.,
./scistree example_input.txt
This step is not required for the Python package installation if you use pip install . as described above. See more details below.
Tutorial
ScisTree2 offers both Python and C++ interfaces. We recommend using the Python version because it provides a wider variety of supported input formats and evaluation tools, and it is more easily integrated into the broader Python ecosystem.
Using ScisTree2 in Python
A detailed tutorial on how to use ScisTree2 in Python is available as a Jupyter Notebook in the tutorials/ directory.
The tutorial covers:
- Getting started with ScisTree2.
- Running inference with probabilistic genotype matrices (CSV supported).
- Running inference with raw read data (CSV supported).
- Running inference with a VCF file.
- Visualizing trees.
- Evaluating results using various metrics.
- Bootstrapping for branch (clade) confidence estimates (added September 27, 2025).
The example data used in the tutorial can be found in the tutorials/data/ directory.
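As a rough illustration of what clade confidence from bootstrapping means, here is a minimal sketch. It assumes the standard phylogenetic bootstrap (resample sites with replacement, re-infer a tree per replicate, count how often each clade of the reference tree reappears); whether the tutorial does exactly this is an assumption. Trees are represented as nested tuples of cell names, and the helper names `resample_sites`, `clades`, and `clade_support` are hypothetical and not part of the scistree2 package.

```python
import random
from collections import Counter


def resample_sites(rows, rng):
    """Draw len(rows) sites with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]


def clades(tree):
    """Return the leaf sets of all internal nodes except the root."""
    found = set()

    def walk(node, is_root):
        if isinstance(node, str):  # a leaf is just a cell name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child, False) for child in node))
        if not is_root:
            found.add(leaves)
        return leaves

    walk(tree, True)
    return found


def clade_support(reference, replicate_trees):
    """Fraction of replicate trees that contain each clade of the reference."""
    counts = Counter(c for t in replicate_trees for c in clades(t))
    return {c: counts[c] / len(replicate_trees) for c in clades(reference)}


# Toy usage: the replicate trees would normally come from re-running
# inference on resample_sites(...) output; here they are written by hand.
reference = ((("c1", "c3"), ("c2", "c4")), "c5")
replicates = [reference, ((("c1", "c3"), "c2"), ("c4", "c5"))]
support = clade_support(reference, replicates)
```

In this toy case the clade {c1, c3} appears in both replicates, while {c2, c4} appears in only one of them.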
Using ScisTree2 in C++
To run ScisTree2 directly from the console, first build the C++ backend as described in the optional manual build step of the installation guide above.
The executable is called scistree.
To check that ScisTree2 is ready to run, type ./scistree. You should see some output about the basic usage of ScisTree2.
Now type:
./scistree example_input.txt
You should see the following output:
*** SCISTREE ver. 2.2.3.0, August 14, 2025 ***
#cells: 5, #sites: 6
List of cell names: c1 c2 c3 c4 c5
Called genotypes output to file: example_input.txt.genos.imp
**** Maximum log-likelihood: -6.27126, number of changed genotypes: 2
Computed log-lielihood from changed genotypes: -6.27126
Constructed single cell phylogeny: (((c1,c3),(c2,c4)),c5)
Elapsed time = 0 seconds.
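The constructed phylogeny in the output above is printed as a parenthesized (Newick-style) string. For downstream processing in Python, a minimal sketch of a parser for such topology-only strings could look like this. It assumes no branch lengths or internal labels, as in the line shown above; `parse_newick` and `leaves` are hypothetical helper names, not functions of the scistree2 package.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    s = text.strip().rstrip(";")
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in "(),":
            pos += 1
        if start == pos:
            raise ValueError(f"empty leaf name at position {pos}")
        return s[start:pos].strip()

    tree = parse_node()
    if pos != len(s):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return tree


def leaves(tree):
    """List leaf names in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_newick("(((c1,c3),(c2,c4)),c5)")
print(tree)
print(leaves(tree))
```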
Options:
- -e: Output a mutation tree (which may not be a binary tree) with branch labels from the called genotypes.
- -e0: Output a mutation tree without branch labels, which is useful for visualizing large trees.
- -q: Use NNI (Nearest Neighbor Interchange) for local tree search. NNI is faster but less accurate. By default, ScisTree2 uses SPR (Subtree Pruning and Regrafting) local search, which we have found to be very fast.
- -T <num-of-threads>: Specify the number of threads for multi-threading support.
- -s <num-of-iterations>: Set the maximum number of iterations to control the running time. A smaller number (e.g., 5) will reduce the running time but may also reduce accuracy. Default: 1,000 iterations.
You may also read the ScisTree2 User Manual, which is in PDF format and is distributed as part of ScisTree2.
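If you want to drive the command-line binary from a script, the options listed above can be assembled into an argument list. The following is a minimal sketch, not part of the scistree2 package: the function names and the default binary path ./scistree are hypothetical, and placing the options before the input file is an assumption (check the usage message printed by ./scistree).

```python
import subprocess


def build_scistree_command(input_path, binary="./scistree", threads=None,
                           max_iterations=None, use_nni=False,
                           mutation_tree=None):
    """Assemble an argument list using only the options documented above.

    mutation_tree: None, "labeled" (-e) or "unlabeled" (-e0).
    """
    cmd = [binary]
    if mutation_tree == "labeled":
        cmd.append("-e")
    elif mutation_tree == "unlabeled":
        cmd.append("-e0")
    elif mutation_tree is not None:
        raise ValueError("mutation_tree must be None, 'labeled' or 'unlabeled'")
    if use_nni:
        cmd.append("-q")  # NNI instead of the default SPR search
    if threads is not None:
        cmd += ["-T", str(threads)]
    if max_iterations is not None:
        cmd += ["-s", str(max_iterations)]
    cmd.append(str(input_path))  # assumption: input file comes last
    return cmd


def run_scistree(input_path, **options):
    """Run the binary and return its standard output as text."""
    cmd = build_scistree_command(input_path, **options)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


print(build_scistree_command("example_input.txt", threads=4, max_iterations=5))
```

Keeping command construction separate from execution makes it easy to inspect or log the exact command before running it.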
Data format of ScisTree2 in C++?
First, you should understand some basics about ScisTree2. I would recommend reading the user manual of the original ScisTree.
The first step in using ScisTree2 is to prepare the input. Here is the content of an example (example_input.txt):
c1 c2 c3 c4 c5
s1 0.01 0.6 0.08 0.8 0.7
s2 0.8 0.02 0.7 0.01 0.3
s3 0.02 0.8 0.02 0.8 0.9
s4 0.9 0.9 0.8 0.8 0.02
s5 0.01 0.8 0.01 0.8 0.9
s6 0.05 0.02 0.7 0.05 0.9
Explanations:
- Specify the cell names in the first row, for example "c1 c2 c3 c4 c5". Do not use HAPLOID or HAPLOTYPES as cell names: these two words are reserved keywords in ScisTree2.
- Each following row starts with the row (site) identifier, followed by the probability that each of the five cells has genotype 0 (wild type). For example, the second row says that, at the first site, the first cell has probability 0.01 of being wild type, the second cell has probability 0.6 of being wild type, and so on.
Be careful: the rows are for the SNV sites and the columns are for the cells. Don't get this wrong.
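To make the layout concrete, here is a minimal sketch that parses the example shown above and derives the most likely genotype for each entry independently. It assumes whitespace-separated fields exactly as listed, and it assumes that the log-likelihood of a genotype matrix is the sum of log p for entries called 0 and log(1 - p) for entries called 1. The function names are hypothetical and not part of ScisTree2, and ties at 0.5 are arbitrarily called 0.

```python
import math

EXAMPLE = """\
c1 c2 c3 c4 c5
s1 0.01 0.6 0.08 0.8 0.7
s2 0.8 0.02 0.7 0.01 0.3
s3 0.02 0.8 0.02 0.8 0.9
s4 0.9 0.9 0.8 0.8 0.02
s5 0.01 0.8 0.01 0.8 0.9
s6 0.05 0.02 0.7 0.05 0.9
"""


def parse_matrix(text):
    """Return (cell names, site names, rows of P(genotype == 0))."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    cells = lines[0]
    sites, probs = [], []
    for fields in lines[1:]:
        if len(fields) != len(cells) + 1:
            raise ValueError(f"row {fields[0]!r} has the wrong number of columns")
        sites.append(fields[0])
        probs.append([float(x) for x in fields[1:]])
    return cells, sites, probs


def call_genotypes(probs):
    """Call 0 (wild type) when P(0) >= 0.5, otherwise 1 (mutant)."""
    return [[0 if p >= 0.5 else 1 for p in row] for row in probs]


def log_likelihood(probs, genotypes):
    """Sum of log P(0) for 0 calls and log(1 - P(0)) for 1 calls."""
    return sum(math.log(p if g == 0 else 1.0 - p)
               for prow, grow in zip(probs, genotypes)
               for p, g in zip(prow, grow))


cells, sites, probs = parse_matrix(EXAMPLE)
calls = call_genotypes(probs)
upper_bound = log_likelihood(probs, calls)
print(cells, sites)
print(calls[0])
print(upper_bound)
```

Because these independent calls ignore the tree, their log-likelihood is only an upper bound: the calls ScisTree2 reports must also be consistent with a tree, so some entries may be flipped and the reported maximum log-likelihood can be lower.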
ScisTree2 is essentially a faster and somewhat more accurate ScisTree. Some features of the original ScisTree (version 1) are not supported in the current implementation of ScisTree2. These include: (i) ternary data input (ScisTree2 only supports binary data as of now); (ii) parameter imputation and doublet imputation. I haven't had a chance to upgrade these features. For the moment, ScisTree2 is dedicated to cell lineage tree inference.
What is new about ScisTree2 over ScisTree?
The main changes concern speed and accuracy. ScisTree2 is an order of magnitude faster than ScisTree. ScisTree2 supports multi-threading while ScisTree doesn't. More importantly, ScisTree2 implements faster and possibly more accurate tree search algorithms. By default, ScisTree2 performs the subtree prune and regraft (SPR) local search, while ScisTree performs the nearest neighbor interchange (NNI) search. The SPR local search is usually more accurate than the NNI search. Our tests show that ScisTree2 can infer a cell lineage tree from data with 10,000 cells (and, say, 10,000 single nucleotide variant or SNV sites) while being more accurate in both cell lineage tree inference and genotype calling.
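One way to see why SPR can find better trees than NNI is that each SPR step can reach more candidate trees. The toy sketch below enumerates both neighborhoods for a small rooted binary tree written as nested tuples. It only illustrates the two move types and is not how ScisTree2 implements its search; all function names are hypothetical.

```python
def canon(tree):
    """Order children deterministically so equal topologies compare equal."""
    if isinstance(tree, str):
        return tree
    left, right = canon(tree[0]), canon(tree[1])
    return (left, right) if repr(left) <= repr(right) else (right, left)


def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if not isinstance(tree, str):
        for i in (0, 1):
            yield from subtrees(tree[i], path + (i,))


def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    kids = list(tree)
    kids[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(kids)


def nni_neighbors(tree):
    """Swap the sibling of an internal node with one of that node's children."""
    out = set()
    for path, node in subtrees(tree):
        if isinstance(node, str):
            continue
        for i in (0, 1):
            child, other = node[i], node[1 - i]
            if isinstance(child, str):
                continue
            for j in (0, 1):
                moved = ((child[j], other), child[1 - j])
                out.add(canon(replace(tree, path, moved)))
    out.discard(canon(tree))
    return out


def spr_neighbors(tree):
    """Prune any subtree and regraft it onto any edge of the remaining tree."""
    out = set()
    for path, pruned in subtrees(tree):
        if not path:
            continue  # the whole tree cannot be pruned
        sibling = get(tree, path[:-1])[1 - path[-1]]
        rest = replace(tree, path[:-1], sibling)  # sibling takes the parent's place
        for target_path, target in subtrees(rest):
            out.add(canon(replace(rest, target_path, (target, pruned))))
    out.discard(canon(tree))
    return out


tree = ((("c1", "c3"), ("c2", "c4")), "c5")
nni, spr = nni_neighbors(tree), spr_neighbors(tree)
print(len(nni), len(spr))
```

For this five-cell tree every NNI neighbor is also an SPR neighbor, and SPR reaches additional trees, for example by moving c5 next to c1 in a single step.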
Data Availability
All simulated data, experimental data (HGSOC), and scripts used to reproduce the results in the ScisTree2 paper are released at Zenodo.
Contact
If you have questions or run into problems, please post an issue in the GitHub repository.