Phylogenetic new sample placement software.

Project description

Bases dependent Rapid Phylogenetic Clustering (Bd-RPC)

  • Bases dependent Rapid Phylogenetic Clustering (Bd-RPC) is an efficient tool for phylogenetic cluster identification and new sample placement. Bd-RPC contains two major modules: Make Database and Clustering New Sequences.

  • In the Make Database module, the aligned sequences are recoded according to the convert matrix. Optionally, principal component analysis (PCA) extracts the components of the recoded sequence matrix to accelerate distance calculation. The distances among sequences are then estimated from the recoded sequence matrix using the Minkowski distance, and the distance matrix is chosen to match the background information (taxonomy information or a phylogenetic tree) for database creation, using various hierarchical clustering methods and the simulated annealing search algorithm.

  • After the Bd-RPC database is established, new sequences are added to the aligned sequences using MAFFT, and their indel characters are counted to recognize foreign sequences, that is, to determine whether each new sequence belongs to the database. The remaining sequences are then classified against the Bd-RPC database using the Matching Identity cutoff and the clusters' density. Finally, for a phylogenetic database, the new sequences can be placed into the phylogenetic tree based on the clustering results.

  • The online toolkit is available at www.bd-rpc.xyz
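
The Make Database steps described above (recode, optional PCA, Minkowski distance, hierarchical clustering) can be sketched with NumPy, SciPy and scikit-learn. This is a minimal illustration, not the Bd-RPC implementation: the one-hot table `BASE_CODES` and the functions `recode` and `cluster_alignment` are hypothetical names, the recoding does not reproduce the Bd-RPC convert matrix, and the matching against background information with simulated annealing is left out.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

# Hypothetical recoding table (NOT the Bd-RPC convert matrix): one-hot
# columns for A, C, G, T; gaps and ambiguous bases map to all zeros.
BASE_CODES = {
    "A": (1.0, 0.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0, 0.0),
    "G": (0.0, 0.0, 1.0, 0.0),
    "T": (0.0, 0.0, 0.0, 1.0),
}
UNKNOWN = (0.0, 0.0, 0.0, 0.0)


def recode(aligned_seqs):
    """Turn equal-length aligned sequences into a numeric matrix."""
    rows = []
    for seq in aligned_seqs:
        row = []
        for base in seq.upper():
            row.extend(BASE_CODES.get(base, UNKNOWN))
        rows.append(row)
    return np.asarray(rows, dtype=float)


def cluster_alignment(aligned_seqs, n_components=None, exponent=2,
                      method="single", n_clusters=5):
    """Recode, optionally reduce with PCA, then cluster hierarchically."""
    matrix = recode(aligned_seqs)
    if n_components is not None:
        # PCA shrinks the matrix so the pairwise distances are cheaper.
        matrix = PCA(n_components=n_components).fit_transform(matrix)
    # Condensed pairwise Minkowski distances (exponent 2 is Euclidean).
    distances = pdist(matrix, metric="minkowski", p=exponent)
    tree = linkage(distances, method=method)
    # Cut the dendrogram into at most n_clusters flat clusters.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy alignment made up for this example.
seqs = ["ACGTACGT", "ACGTACGA", "TTGTCCGT", "TTGTCCGA", "ACGTACGT"]
labels = cluster_alignment(seqs, n_components=2, n_clusters=2)
print(labels)
```

The defaults mirror the documented option defaults (exponent 2, single linkage, 5 clusters); everything else is a simplification chosen for readability.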

Installation

OS Requirements

This package is supported on macOS and Linux. The package has been tested on the following systems:

  • macOS: Mojave (10.14.1)
  • Linux: Ubuntu (18.04.5)

Python dependencies

Python 3+

  • numpy
  • scipy
  • pandas
  • biopython
  • scikit-learn
  • csv (part of the Python standard library; no separate installation needed)

If you have difficulty installing the required scientific Python packages, we recommend using the conda package/environment manager.

conda create -n bd_rpc python=3
conda activate bd_rpc
conda install numpy scipy pandas biopython scikit-learn
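
After setting up the environment, a quick check can confirm that the dependencies are importable. The sketch below is only an illustration: the helper names `missing_modules` and `missing_programs` are hypothetical, and it assumes the MAFFT executable is called `mafft` on your PATH (the IQ-TREE executable name varies between installations, so it is not checked by default).

```python
import importlib.util
import shutil

# Import names differ from package names for two dependencies:
# biopython is imported as "Bio" and scikit-learn as "sklearn".
REQUIRED_MODULES = ["numpy", "scipy", "pandas", "Bio", "sklearn", "csv"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_programs(programs=("mafft",)):
    """Return the external programs that are not on the PATH."""
    return [prog for prog in programs if shutil.which(prog) is None]


print("Missing Python modules:", missing_modules())
print("Missing external programs:", missing_programs())
```

Two empty lists mean that the modules and MAFFT were found; the check does not verify versions.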

Download

git clone https://github.com/Bin-Ma/bd-rpc.git
cd bd-rpc/bin

Manual

This program recodes the aligned sequences into lists of numbers and matches them to the background information or phylogenetic tree through hierarchical clustering. To increase speed, the PCA module can be enabled when calculating the distances between sequences.

Part 1 -- Make Database

BdRPC_MD.py

Usage:

BdRPC_MD.py [options] -align <location> -o <location>

Output: Bd-RPC database [cluster_location/identity/density/seq_location]

Basic options:
-align Location of aligned sequences. (required) [no punctuation mark: '/' or ',']
-o Directory to store the result. (required)
-seq_convert Location of convert matrix, the script will use (1-pi,0,0,0,1-pi,0) as default (method 1).
-PCA [on or off] Use PCA program to increase the speed or not. (default: 'on')
-PCAcomponents If '-PCA' is on, the number of PCA components. (<= number of sequences and <= length of the recoded sequences) (default: max)
-dis_exponent The exponent of the Minkowski distance. (default: 2)
-Cmethod The method of hierarchical clustering. (single, average, complete, ward) (default: single)
-tax_information The location of sequences taxonomy information. (csv file) [seq_id,clade,subclade,sub-subclade....] [no punctuation mark: '/' or ',']
-phy_information The location of tree with newick format. [no punctuation mark: '/' or ',']
-Cnumber The number of clusters, used when neither '-tax_information' nor '-phy_information' is supplied; the clusters are then calculated without identity. (default: 5)
-bootstrap_cutoff The cutoff value to stop the tree traversal. (default: 90)
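
As an illustration of the usage line above, the following sketch assembles a `BdRPC_MD.py` command from Python. The flags are the ones documented in the option list; the helper names `build_md_command` and `run_md`, and the file names `aligned.fasta`, `result` and `taxonomy.csv`, are hypothetical and only serve the example. It assumes the script is run from the directory that contains `BdRPC_MD.py`.

```python
import subprocess
import sys

CLUSTER_METHODS = ("single", "average", "complete", "ward")


def build_md_command(align, out_dir, pca="on", dis_exponent=2,
                     cmethod="single", tax_information=None,
                     phy_information=None, script="BdRPC_MD.py"):
    """Build the argument list for the Make Database script."""
    if pca not in ("on", "off"):
        raise ValueError("pca must be 'on' or 'off'")
    if cmethod not in CLUSTER_METHODS:
        raise ValueError(f"cmethod must be one of {CLUSTER_METHODS}")
    cmd = [sys.executable, script,
           "-align", align, "-o", out_dir,
           "-PCA", pca,
           "-dis_exponent", str(dis_exponent),
           "-Cmethod", cmethod]
    # Background information is optional; add only what was supplied.
    if tax_information is not None:
        cmd += ["-tax_information", tax_information]
    if phy_information is not None:
        cmd += ["-phy_information", phy_information]
    return cmd


def run_md(**kwargs):
    """Run the Make Database script; raises CalledProcessError on failure."""
    return subprocess.run(build_md_command(**kwargs), check=True)


# Hypothetical file names, shown without running anything.
cmd = build_md_command("aligned.fasta", "result",
                       tax_information="taxonomy.csv")
print(" ".join(cmd[1:]))
```

Passing the arguments as a list avoids shell quoting problems; call `run_md(align=..., out_dir=...)` once the input files exist.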

Part 2 -- Clustering new sequences

BdRPC_CNS.py

Usage:
BdRPC_CNS.py [options] -align <location> -new <location> -o <location> -db <location>

Output: gap-t-test result [seq_id in/out] / clustering result [seq_id cluster_name/tree_location] / combined tree

Basic options:
-align Location of aligned sequences. (required) [no punctuation mark: '/' or ',']
-new Location of new sequences. (required)
-o Directory to store the result. (required)
-db Location of Bd-RPC database. (required)
-IDfold The fold of median value in Indel Test. (default: 1.1)
-phy_information Location of the phylogenetic tree. (if the tree is available, the new sequences will be inserted into the phylogenetic tree)
-identity_cutoff The cutoff value of clusters' identity. (0~1, default: 0.8)
-density_fold The fold of clusters' density for new samples clustering. (default: 1.5)
-threads Number of threads for the MAFFT alignment and IQ-TREE. (int, default: 1)
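
To give an intuition for the indel test controlled by '-IDfold', the sketch below flags a new sequence as foreign when its gap count exceeds a fold of the median gap count of the database sequences. This is an assumption about how "the fold of median value" is applied, not the Bd-RPC code; the actual test may use a different statistic. The function names `count_indels` and `flag_foreign` and the toy sequences are hypothetical.

```python
from statistics import median


def count_indels(aligned_seq, gap_chars="-"):
    """Count gap characters in one aligned sequence."""
    return sum(aligned_seq.count(ch) for ch in gap_chars)


def flag_foreign(db_seqs, new_seqs, id_fold=1.1):
    """Label each new sequence 'in' or 'out' of the database.

    Assumed rule: 'out' when the gap count is above
    id_fold * median(gap counts of the database sequences).
    """
    threshold = id_fold * median([count_indels(s) for s in db_seqs])
    return {
        seq_id: "out" if count_indels(seq) > threshold else "in"
        for seq_id, seq in new_seqs.items()
    }


# Toy aligned sequences: database gap counts are 2, 2 and 3 (median 2).
db = ["AC-GT-AC", "A-CGT-AC", "A--GT-AC"]
new = {"sample_1": "ACG-GT-A", "sample_2": "AC--GT--"}
result = flag_foreign(db, new, id_fold=1.1)
print(result)
```

With the default fold of 1.1 the threshold here is 2.2 gaps, so `sample_1` (2 gaps) is kept and `sample_2` (4 gaps) is flagged.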

Download files

Source Distribution

BdRPCpackage-1.3.tar.gz (12.8 kB)

Uploaded Source

Built Distribution

BdRPCpackage-1.3-py3-none-any.whl (11.0 kB)

Uploaded Python 3

File details

Details for the file BdRPCpackage-1.3.tar.gz.

File metadata

  • Download URL: BdRPCpackage-1.3.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for BdRPCpackage-1.3.tar.gz
Algorithm Hash digest
SHA256 b2823b5645934701c13b1569ca20dd864baf4cedcd1c04154e64824f264da101
MD5 e841b86370c9aa79b51e392e8f26a925
BLAKE2b-256 0d2d3c34c00a0de8044e5f259dfcbe3ea6e92f36c815e6f3af870230aef4399d

File details

Details for the file BdRPCpackage-1.3-py3-none-any.whl.

File metadata

  • Download URL: BdRPCpackage-1.3-py3-none-any.whl
  • Upload date:
  • Size: 11.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.13

File hashes

Hashes for BdRPCpackage-1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 8df67b9c85f9ed1af1804ca0fccc2b6fcd5d963bfd7cd21e96b7d0dea5e4d36e
MD5 1fc63cefc6948f7df5c08f190c888e86
BLAKE2b-256 1e7bfa9a1f042ce1067a8b24b1b447626ff0adb94e3486bfa17efb7cadab091b
