Phylogenetic new sample placement software.
Bases dependent Rapid Phylogenetic Clustering (Bd-RPC)
- Bases dependent Rapid Phylogenetic Clustering (Bd-RPC) is an efficient tool for phylogenetic cluster identification and new sample placement. Bd-RPC contains two major modules: Make Database and Clustering New Sequences.
- In the Make Database module, the aligned sequences are recoded according to the conversion matrix. Optionally, the principal components of the recoded sequence matrix are extracted with principal component analysis (PCA) to accelerate distance calculation. The distances among sequences are then estimated from the recoded sequence matrix using the Minkowski distance. Finally, using various hierarchical clustering methods and the Simulated Annealing Search algorithm, the distance matrix is matched to the background information (taxonomy information or phylogenetic tree) to create the database.
- Once the Bd-RPC database is established, the new sequences are added to the aligned sequences using MAFFT, and their indel characters are counted to recognize foreign sequences, that is, to determine whether each new sequence belongs to the database. The remaining sequences are then classified against the Bd-RPC database using the Matching Identity cutoff and the clusters' density. Finally, for a phylogenetic database, the new sequences can be placed into the phylogenetic tree based on the clustering results.
- The online toolkit is available on www.bd-rpc.xyz
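The PCA and distance steps of the Make Database module can be sketched as follows. This is a minimal illustration under stated assumptions, not the Bd-RPC implementation: the function names `pca_reduce` and `minkowski_matrix` and the toy `recoded` matrix are hypothetical, and PCA is done here with a plain NumPy SVD.

```python
import numpy as np


def pca_reduce(matrix, n_components):
    """Project the rows of `matrix` onto their first principal components."""
    centered = matrix - matrix.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def minkowski_matrix(points, p=2.0):
    """Pairwise Minkowski distances between the rows of `points`."""
    diff = np.abs(points[:, None, :] - points[None, :, :])
    return (diff ** p).sum(axis=-1) ** (1.0 / p)


# Toy recoded alignment: 4 sequences, 6 numeric columns (made-up values).
recoded = np.array([
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1],
], dtype=float)

reduced = pca_reduce(recoded, n_components=2)
distances = minkowski_matrix(reduced, p=2.0)
print(distances.round(3))
```

Working on a few principal components instead of the full recoded matrix shrinks each distance computation; in this toy case two components already capture all of the variation, so the distances are unchanged.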
Installation
OS Requirements
This package supports macOS and Linux. It has been tested on the following systems:
- macOS: Mojave (10.14.1)
- Linux: Ubuntu (18.04.5)
Python dependencies
Python 3+
- numpy
- scipy
- pandas
- biopython
- scikit-learn
- csv (part of the Python standard library; no separate installation needed)
If you have difficulty installing the required scientific Python packages, we recommend using the conda package/environment manager.
conda create -n bd_rpc python=3
conda activate bd_rpc
conda install numpy scipy pandas biopython scikit-learn
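To confirm that the environment is complete before running the scripts, a quick check along these lines may help. It is only a sketch: `missing_modules` is a hypothetical helper, not part of Bd-RPC, and it relies on the fact that biopython and scikit-learn are imported as `Bio` and `sklearn`.

```python
import importlib.util

# Distribution name -> import name. The two differ for biopython and
# scikit-learn; csv ships with Python itself.
REQUIRED_MODULES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "pandas": "pandas",
    "biopython": "Bio",
    "scikit-learn": "sklearn",
    "csv": "csv",
}


def missing_modules(required=REQUIRED_MODULES):
    """Return the distribution names whose module cannot be found."""
    return sorted(
        dist for dist, module in required.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_modules()
    print("All dependencies found." if not missing else f"Missing: {missing}")
```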
Download
git clone https://github.com/Bin-Ma/bd-rpc.git
cd bd-rpc/bin
Manual
This program recodes the aligned sequences into lists of numbers and matches them to the background information or phylogenetic tree through hierarchical clustering. To increase speed, the PCA improvement module can be selected for calculating the distances between sequences.
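The recoding idea can be illustrated with a short sketch. The mapping below is a made-up example, not the Bd-RPC default conversion matrix; `TOY_MATRIX`, `UNKNOWN`, and `recode` are hypothetical names introduced only for this illustration.

```python
# Hypothetical conversion table: each alignment character maps to a short
# numeric vector. This is NOT the Bd-RPC default matrix.
TOY_MATRIX = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "G": (0.0, 0.0, 1.0, 0.0),
    "T": (0.0, 0.0, 0.0, 1.0),
    "-": (0.0, 0.0, 0.0, 0.0),  # gap
}
UNKNOWN = (0.25, 0.25, 0.25, 0.25)  # fallback for ambiguous characters such as N


def recode(aligned_seq, matrix=TOY_MATRIX):
    """Flatten one aligned sequence into a list of numbers."""
    numbers = []
    for char in aligned_seq.upper():
        numbers.extend(matrix.get(char, UNKNOWN))
    return numbers


alignment = {"seq1": "ACG-T", "seq2": "ACGTT", "seq3": "TCG-N"}
recoded = {name: recode(seq) for name, seq in alignment.items()}

# Every recoded sequence has 4 numbers per alignment column, so the rows
# can be stacked into one matrix for distance calculation.
for name, numbers in recoded.items():
    print(name, len(numbers))
```

Because all sequences come from the same alignment, every recoded row has the same length, which is what allows the later PCA and distance steps to treat them as rows of one matrix.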
Part 1 -- Make Database
BdRPC_MD.py
Usage:
BdRPC_MD.py [options] -align <location> -o <location>
Output: Bd-RPC database [cluster_location/identity/density/seq_location]
Basic options:
Option | Description |
---|---|
-align | Location of aligned sequences. (required) [no punctuation mark: '/' or ','] |
-o | Directory to store the result. (required) |
-seq_convert | Location of the conversion matrix. By default, the script uses (1-pi,0,0,0,1-pi,0) (method 1). |
-PCA [on or off] | Whether to use PCA to increase speed. (default: 'on') |
-PCAcomponents | Number of PCA components, used when '-PCA' is on. (<= number of sequences and <= length of the recoded sequences) (default: max) |
-dis_exponent | The exponent of the Minkowski distance. (default: 2) |
-Cmethod | The hierarchical clustering method. (single, average, complete, ward) (default: single) |
-tax_information | Location of the sequences' taxonomy information. (csv file) [seq_id,clade,subclade,sub-subclade....] [no punctuation mark: '/' or ','] |
-phy_information | Location of the tree in Newick format. [no punctuation mark: '/' or ','] |
-Cnumber | If neither '-tax_information' nor '-phy_information' is provided, the number of clusters will be calculated without identity. (default: 5) |
-bootstrap_cutoff | The cutoff value to stop the tree traversal. (default: 90) |
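To give a feel for what '-dis_exponent', '-Cmethod single', and '-Cnumber' control, here is a small standard-library sketch. It is an illustration under assumptions, not the code in BdRPC_MD.py: `minkowski` and `single_linkage` are hypothetical helpers, and the naive merging loop stands in for a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` followed by `fcluster`.

```python
def minkowski(u, v, p=2.0):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def single_linkage(points, n_clusters, p=2.0):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters (single linkage) until only `n_clusters` remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(minkowski(points[i], points[j], p)
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return [sorted(c) for c in clusters]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(minkowski(points[0], points[2], p=1))  # exponent 1
print(minkowski(points[0], points[2], p=2))  # exponent 2 (the default)
print(single_linkage(points, n_clusters=3))
```

The other linkage methods listed for '-Cmethod' differ only in how the distance between two clusters is defined, for example the average or the largest pairwise distance instead of the smallest.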
Part 2 -- Clustering New Sequences
BdRPC_CNS.py
Usage:
BdRPC_CNS.py [options] -align <location> -new <location> -o <location> -db <location>
Output: gap-t-test result [seq_id in/out] / clustering result [seq_id cluster_name/tree_location] / combined tree
Basic options:
Option | Description |
---|---|
-align | Location of aligned sequences. (required) [no punctuation mark: '/' or ','] |
-new | Location of new sequences. (required) |
-o | Directory to store the result. (required) |
-db | Location of Bd-RPC database. (required) |
-IDfold | Fold of the median value used in the Indel Test. (default: 1.1) |
-phy_information | Location of the phylogenetic tree. (if the tree is available, the new sequences will be inserted into the phylogenetic tree) |
-identity_cutoff | The cutoff value of clusters' identity. (0~1, default: 0.8) |
-density_fold | Fold of the clusters' density used when clustering new samples. (default: 1.5) |
-threads | Number of threads for MAFFT alignment and IQ-TREE. (int, default: 1) |
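The role of '-IDfold' can be pictured with the sketch below. The decision rule is an assumption made for illustration and is not taken from BdRPC_CNS.py, whose gap t-test is not reproduced here: a new sequence is labelled 'out' when its gap count exceeds `id_fold` times the median gap count of the database sequences. The names `indel_count` and `indel_screen` are hypothetical.

```python
from statistics import median


def indel_count(aligned_seq):
    """Number of gap characters in one aligned sequence."""
    return aligned_seq.count("-")


def indel_screen(database, new_seqs, id_fold=1.1):
    """Label each new sequence 'in' or 'out'.

    Assumed rule (not the Bd-RPC source): 'out' when the gap count exceeds
    id_fold times the median gap count of the database sequences.
    """
    threshold = id_fold * median(indel_count(s) for s in database.values())
    return {
        name: "out" if indel_count(seq) > threshold else "in"
        for name, seq in new_seqs.items()
    }


# Toy alignment: database sequences and new sequences share the same columns.
database = {"ref1": "ACGT-ACGT-", "ref2": "ACGT-ACGTA", "ref3": "ACG--ACGT-"}
new_seqs = {"new1": "ACGT-ACGT-", "new2": "A----A---T"}
print(indel_screen(database, new_seqs))
```

Under this assumed rule, a larger `id_fold` tolerates more gaps before a sequence is treated as foreign.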
File details
Details for the file BdRPCpackage-1.3.tar.gz.
File metadata
- Download URL: BdRPCpackage-1.3.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | b2823b5645934701c13b1569ca20dd864baf4cedcd1c04154e64824f264da101 |
MD5 | e841b86370c9aa79b51e392e8f26a925 |
BLAKE2b-256 | 0d2d3c34c00a0de8044e5f259dfcbe3ea6e92f36c815e6f3af870230aef4399d |
File details
Details for the file BdRPCpackage-1.3-py3-none-any.whl.
File metadata
- Download URL: BdRPCpackage-1.3-py3-none-any.whl
- Upload date:
- Size: 11.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 8df67b9c85f9ed1af1804ca0fccc2b6fcd5d963bfd7cd21e96b7d0dea5e4d36e |
MD5 | 1fc63cefc6948f7df5c08f190c888e86 |
BLAKE2b-256 | 1e7bfa9a1f042ce1067a8b24b1b447626ff0adb94e3486bfa17efb7cadab091b |