Global-Global genetic database search.
vpsearch - Fast Vantage-Point Tree Search for Sequence Databases
This is a package for indexing and querying a sequence database for fast nearest-neighbor search by means of vantage point trees. For reasonably large databases, such as RDP, this results in sequence lookups that are typically 5-10 times faster than other alignment-based lookup methods.
Vantage-point tree search compares sequences using global-to-global alignment, rather than the approximate seed-and-extend methods used by tools such as BLAST.
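To illustrate the idea, here is a minimal sketch of a vantage point tree in pure Python. It is not the implementation used by this package: it substitutes plain Levenshtein edit distance for the Parasail global alignment, and the names `edit_distance`, `Node`, `build`, and `nearest` are made up for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; a stand-in metric, not the package's alignment."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


@dataclass
class Node:
    vantage: str                      # the vantage point stored at this node
    radius: int = 0                   # median distance from vantage to the rest
    inside: Optional["Node"] = None   # sequences with distance <= radius
    outside: Optional["Node"] = None  # sequences with distance > radius


def build(seqs: List[str], rng: random.Random) -> Optional[Node]:
    """Pick a random vantage point and split the rest at the median distance."""
    if not seqs:
        return None
    seqs = list(seqs)
    node = Node(seqs.pop(rng.randrange(len(seqs))))
    if seqs:
        dists = [edit_distance(node.vantage, s) for s in seqs]
        node.radius = sorted(dists)[len(dists) // 2]
        node.inside = build([s for s, d in zip(seqs, dists) if d <= node.radius], rng)
        node.outside = build([s for s, d in zip(seqs, dists) if d > node.radius], rng)
    return node


def nearest(node: Optional[Node], query: str,
            best: Tuple[Optional[str], float] = (None, float("inf"))):
    """Return (sequence, distance) of the closest stored sequence."""
    if node is None:
        return best
    d = edit_distance(query, node.vantage)
    if d < best[1]:
        best = (node.vantage, d)
    if d <= node.radius:
        best = nearest(node.inside, query, best)
        # The triangle inequality rules out the outside branch
        # unless d + best distance exceeds the radius.
        if d + best[1] > node.radius:
            best = nearest(node.outside, query, best)
    else:
        best = nearest(node.outside, query, best)
        if d - best[1] <= node.radius:
            best = nearest(node.inside, query, best)
    return best
```

The speed-up comes from the pruning tests in `nearest`: whenever a branch is skipped, none of the sequences below it need to be aligned against the query.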
Usage
Given a sequence database (in FASTA format), vpsearch build constructs an optimized vantage point search tree. Building the tree is a one-time operation and doesn't have to be done again unless the database changes. As an illustration, we build a vantage point tree for the RDP database of bacterial 16S sequences. This database contains 281261 sequences of which 39237 are duplicates. After removing these duplicates, we are left with 242024 unique sequences. Building a tree for these sequences is done with:
$ vpsearch build rdp_download_281261seqs_dedup.fa
Building for 242024 sequences...
done.
Linearizing...done.
Database created in rdp_download_281261seqs_dedup.db
For the RDP database of full-length sequences, this takes about 20 minutes on a standard machine. When only selected regions of the sequences are considered, the time needed to build a tree is much shorter. For example, vantage point trees for the v1-v2 hypervariable region (350 base pairs) or the v3-v4 region (250 base pairs) of the RDP 16S sequences can be built in 30 seconds to 1 minute.
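The duplicate removal mentioned above is not part of the commands shown here. One way to do it, as a rough sketch, is to keep the first record for every distinct sequence. The helpers `read_fasta` and `dedup` are hypothetical, and treating two records as duplicates when their sequences match case-insensitively is an assumption.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedup(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    for header, seq in records:
        key = seq.upper()  # assumption: case differences are not significant
        if key not in seen:
            seen.add(key)
            yield header, seq


if __name__ == "__main__":
    example = [">a", "ACGT", "AC", ">b", "acgtac", ">c", "GGGG"]
    for header, seq in dedup(read_fasta(example)):
        print(f">{header}\n{seq}")
```

For a real file, the lines of an open file object can be passed to `read_fasta` in place of the example list.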
Once a tree has been built, unknown sequences can be looked up using the vpsearch query command. Here we supply a query file with a single sequence:
$ vpsearch query rdp_download_281261seqs_dedup.db query.fa
query S000143715 99.54 1529 0 0 1 1524 1 1529 0 7546
query S004085923 99.08 1529 0 0 1 1524 1 1526 0 7481
query S004085922 99.08 1529 0 0 1 1524 1 1526 0 7481
query S004085925 98.50 1531 0 0 1 1524 1 1527 0 7386
By default, the vpsearch query command outputs the best four matches in the database per query sequence (the number of matches can be changed with the -k parameter). Lookup is done one query sequence at a time, but multiple queries can be considered in parallel by enabling multiple threads; use the -j option to specify the number of threads.
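When calling the tool from a script, the -k and -j options can be assembled programmatically. The sketch below is only an illustration: `query_command` and `run_query` are hypothetical helpers, placing the options before the positional arguments is an assumption, and so are the default of one thread and the expectation that results arrive on standard output.

```python
import subprocess
from typing import List


def query_command(database: str, queries: str, k: int = 4, threads: int = 1) -> List[str]:
    """Build the argument list for a vpsearch query invocation.

    k=4 mirrors the documented default of four matches; threads=1 is an assumption.
    """
    return ["vpsearch", "query", "-k", str(k), "-j", str(threads), database, queries]


def run_query(database: str, queries: str, k: int = 4, threads: int = 1) -> str:
    """Run the query and return whatever the tool prints (assumed to be stdout)."""
    result = subprocess.run(
        query_command(database, queries, k, threads),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command instead of running it, so no installation is needed.
    print(" ".join(query_command("database.db", "query.fa", k=10, threads=4)))
```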
The vpsearch query command attempts to output its results in the standard BLAST tabular format. The interpretation of the columns is as follows:
Column name | Example | Notes |
---|---|---|
query ID | query | |
subject ID | S000143715 | |
% identity | 99.54 | |
alignment length | 1529 | |
mismatches | 0 | currently not implemented |
gap openings | 0 | currently not implemented |
query start | 1 | |
query end | 1524 | |
subject start | 1 | |
subject end | 1529 | |
E-value | 0 | N/A (always 0) |
bit score | 7546 | interpreted as the alignment score |
Note that the numbers of mismatches and gap openings are not yet implemented, so these two columns do not carry meaningful values in the result output. This will be addressed in a future version of the package.
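The column layout above maps naturally onto a small record type. The following sketch parses one line of output; the `Hit` type and `parse_hit` function are hypothetical and not part of the package, and the assumption is that fields are separated by whitespace, as in the example output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int        # currently not implemented by the tool
    gap_openings: int      # currently not implemented by the tool
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    e_value: float         # N/A, always 0
    score: float           # the bit score column holds the alignment score


def parse_hit(line: str) -> Hit:
    """Parse one whitespace-separated result line into a Hit."""
    f = line.split()
    if len(f) != len(Hit._fields):
        raise ValueError(f"expected {len(Hit._fields)} columns, got {len(f)}")
    return Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]), float(f[10]), float(f[11]))


if __name__ == "__main__":
    hit = parse_hit("query S000143715 99.54 1529 0 0 1 1524 1 1529 0 7546")
    print(hit.subject_id, hit.percent_identity, hit.score)
```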
Installation
Using EDM
Users of the Enthought Deployment Manager (EDM) can install the necessary prerequisites (Click, Cython, NumPy, and Parasail) by importing an EDM environment from the bundle file shipped with this repository:
edm env import -f <bundle.json> vpsearch
where <bundle.json> is one of vpsearch_py3.6_osx-x86_64.json or vpsearch_py3.6_rh6-x86_64.json, depending on your platform.
When this is done, activate the environment, and install this package. From the root of this repository, run
edm shell -e vpsearch
pip install -e .
Using Pip, Conda, etc.
Users of other package installation tools, such as Pip or Conda, need to install the Parasail library following the instructions on the Parasail web page. Once that is done, the Python dependencies can be installed using the appropriate command for your package manager. For pip, for example, this can be done with
pip install -r requirements.txt
Once that is done, activate your virtual environment, and install this package via
pip install -e .
Using Docker
It is possible to build a Docker image that contains vpsearch as well as all of its dependencies. This is useful, for example, when integrating vpsearch into a workflow manager, like Snakemake, CWL, or WDL.
To build the image, run the following command from the root of this repository:
docker build . -t vpsearch-image
Once the image has been built, vpsearch can then be run from within a container. Assuming you have a FASTA file of target sequences in the file database.fasta in the current directory, run the following to build a vpsearch index:
docker run -it -v $PWD:/data -t vpsearch-image vpsearch build /data/database.fasta
To query the index for a given FASTA file query.fasta of query sequences, run:
docker run -it -v $PWD:/data -t vpsearch-image vpsearch query /data/database.db /data/query.fasta
Troubleshooting
The vpsearch package relies on the Parasail C library for alignment. If building the package fails because the Parasail library cannot be found, you can manually specify the location of the Parasail include files and shared object libraries by setting the PARASAIL_INCLUDE_DIR and PARASAIL_LIB_DIR environment variables before building the package:
export PARASAIL_INCLUDE_DIR=/location/of/parasail/include/files
export PARASAIL_LIB_DIR=/location/of/parasail/lib/files
pip install -e .
Note that if Parasail is installed in a non-standard location, you may have to set the LD_LIBRARY_PATH variable at runtime.
Implementation notes
The tree construction operates in two phases. The first phase builds the tree as a tree of Python object nodes, because a dynamic data structure is easier to build. The second phase linearizes the topology of the nodes into a few integer arrays that are easy to serialize and fast to look up. The object that represents the linearized tree can only query the database, not build the tree. The slower tree-of-nodes implementation can both build and query (albeit with more overhead).
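The linearization step can be sketched as follows. The array layout shown here (one entry per node, with -1 marking a missing child) is an assumption made for illustration; the actual layout used by the package is not described above, and the names `Node`, `LinearTree`, and `linearize` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    seq_index: int                    # index of the vantage sequence in the database
    radius: int                       # split radius at this node
    inside: Optional["Node"] = None
    outside: Optional["Node"] = None


@dataclass
class LinearTree:
    seq_index: List[int]
    radius: List[int]
    inside: List[int]                 # slot of the inside child, or -1
    outside: List[int]                # slot of the outside child, or -1


def linearize(root: Optional[Node]) -> LinearTree:
    """Flatten a tree of nodes into parallel integer arrays (pre-order)."""
    flat = LinearTree([], [], [], [])

    def visit(node: Optional[Node]) -> int:
        if node is None:
            return -1
        slot = len(flat.seq_index)
        flat.seq_index.append(node.seq_index)
        flat.radius.append(node.radius)
        flat.inside.append(-1)        # placeholders, filled in once children are placed
        flat.outside.append(-1)
        flat.inside[slot] = visit(node.inside)
        flat.outside[slot] = visit(node.outside)
        return slot

    visit(root)                       # recursive; fine for reasonably balanced trees
    return flat


if __name__ == "__main__":
    root = Node(5, 3, inside=Node(2, 1), outside=Node(7, 4, inside=Node(9, 0)))
    print(linearize(root))
```

Arrays like these contain no Python object references, so they can be written to disk directly and traversed by following integer offsets.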
Building wheels
Wheels for this package can be built in a platform-independent way using cibuildwheel, running under GitHub Actions. As an administrator, you can start a workflow to build wheels by selecting the "Build wheels" action from the GitHub Actions menu, and clicking the "Run workflow" button. When the workflow completes, wheels for Linux and macOS will be available as a zipped artifact.
It is possible to run cibuildwheel locally, but only to build wheels for Linux. In a clean Python environment, run pip install cibuildwheel to install the tool, followed by e.g.
CIBW_BUILD_VERBOSITY=1 \
CIBW_BUILD=cp38-manylinux_x86_64 \
CIBW_BEFORE_BUILD="./ci/build-parasail.sh" \
python -m cibuildwheel --output-dir wheelhouse --platform linux
to build Python 3.8 wheels for Linux. By varying the build tag, wheels for other Python versions can be built.
License
This package is licensed under the BSD license.
References
Vantage point trees were introduced in
Uhlmann, Jeffrey (1991). "Satisfying General Proximity/Similarity Queries with Metric Trees". Information Processing Letters. 40 (4): 175–179. doi:10.1016/0020-0190(91)90074-r.
Yianilos (1993). "Data structures and algorithms for nearest neighbor search in general metric spaces". Fourth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 311–321.
The Parasail library is described in
Daily, Jeff. (2016). Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 17(1), 1-11. doi:10.1186/s12859-016-0930-z