Global-Global genetic database search.

vpsearch - Fast Vantage-Point Tree Search for Sequence Databases

This is a package for indexing and querying a sequence database for fast nearest-neighbor search by means of vantage point trees. For reasonably large databases, such as RDP, this results in sequence lookups that are typically 5-10 times faster than other alignment-based lookup methods.

Vantage-point tree search compares sequences using global-to-global alignment, rather than the approximate seed-and-extend methods used, for example, by BLAST.
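To make the idea of a vantage point tree concrete, here is a minimal sketch in Python. It is not the vpsearch implementation: it uses plain Levenshtein edit distance as a stand-in metric (vpsearch relies on Parasail alignments), and the names `edit_distance`, `VPNode`, `build`, and `nearest` are hypothetical. Each node stores one vantage point and a radius; sequences closer than the radius go to one subtree and the rest to the other, and the triangle inequality lets a query skip subtrees that cannot contain a better match.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance, a true metric used here as a stand-in."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


class VPNode:
    def __init__(self, point, radius=0, inside=None, outside=None):
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # split distance around the vantage point
        self.inside = inside    # subtree with distance < radius
        self.outside = outside  # subtree with distance >= radius


def build(points, rng):
    """Recursively build a vantage point tree from a list of sequences."""
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage)
    dists = sorted((edit_distance(vantage, p), p) for p in points)
    radius = dists[len(dists) // 2][0]  # median distance as the split
    inside = [p for d, p in dists if d < radius]
    outside = [p for d, p in dists if d >= radius]
    return VPNode(vantage, radius, build(inside, rng), build(outside, rng))


def nearest(node, query, best=None):
    """Return (distance, sequence) for the closest sequence in the tree."""
    if node is None:
        return best
    d = edit_distance(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    if d < node.radius:
        best = nearest(node.inside, query, best)
        # The outside subtree can only help if the search ball crosses the radius.
        if d + best[0] >= node.radius:
            best = nearest(node.outside, query, best)
    else:
        best = nearest(node.outside, query, best)
        # The inside subtree can only help if the search ball reaches into it.
        if d - best[0] < node.radius:
            best = nearest(node.inside, query, best)
    return best
```

Calling `build(sequences, random.Random(0))` once and then `nearest(tree, query)` per query mirrors the build-once, query-many workflow described in the Usage section.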

Usage

Given a sequence database (in FASTA format), vpsearch build constructs an optimized vantage point search tree. Building the tree is a one-time operation and doesn't have to be done again unless the database changes. As an illustration, we build a vantage point tree for the RDP database of bacterial 16S sequences. This database contains 281261 sequences of which 39237 are duplicates. After removing these duplicates, we are left with 242024 unique sequences. Building a tree for these sequences is done with:

  $ vpsearch build rdp_download_281261seqs_dedup.fa
  Building for 242024 sequences...
  done.
  Linearizing...done.
  Database created in rdp_download_281261seqs_dedup.db
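The text above does not say how the duplicate sequences were removed before building. One possible approach, shown here only as a sketch, is to keep the first record for each distinct sequence. The helper names `read_fasta` and `dedup` are hypothetical and not part of vpsearch, and treating sequences case-insensitively is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedup(records):
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    for header, seq in records:
        key = seq.upper()  # assumption: case differences are not meaningful
        if key not in seen:
            seen.add(key)
            yield header, seq
```

With an open file handle `fh`, `dedup(read_fasta(fh))` yields the unique records, which can then be written back out in FASTA format and passed to vpsearch build.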

For the RDP database of full-length sequences, this takes about 20 minutes on a standard machine. When only selected regions of the sequences are considered, the build time is much shorter. For example, vantage point trees for the v1-v2 hypervariable region (350 base pairs) or the v3-v4 region (250 base pairs) of the RDP 16S sequences can be built in 30 seconds to 1 minute.

Once a tree has been built, unknown sequences can be looked up using the vpsearch query command. Here we supply a query file with a single sequence:

  vpsearch query rdp_download_281261seqs_dedup.fa query.fa
  query	S000143715	99.54	1529	0	0	1	1524	1	1529	0	7546
  query	S004085923	99.08	1529	0	0	1	1524	1	1526	0	7481
  query	S004085922	99.08	1529	0	0	1	1524	1	1526	0	7481
  query	S004085925	98.50	1531	0	0	1	1524	1	1527	0	7386

By default, the vpsearch query command outputs the best four matches in the database per query sequence (the number of matches can be changed with the -k parameter). Lookup is done one query sequence at a time, but multiple queries can be considered in parallel by enabling multiple threads; use the -j option to specify the number of threads.

The vpsearch query command attempts to output its results in the standard BLAST tabular format. The interpretation of the columns is as follows:

  Column name       Example     Notes
  query ID          query
  subject ID        S000143715
  % identity        99.54
  alignment length  1529
  mismatches        0           currently not implemented
  gap openings      0           currently not implemented
  query start       1
  query end         1524
  subject start     1
  subject end       1529
  E-value           0           N/A (always 0)
  bit score         7546        interpreted as the alignment score

Note that the numbers of mismatches and gap openings are not yet computed, so these two columns are placeholders (0 in the example above). This will be addressed in a future version of the package.
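For downstream processing, the twelve tab-separated columns can be read into a small record type. The following is a sketch only: the `Hit` class and `parse_hit` function are hypothetical names, not part of vpsearch, and the field types are inferred from the example output above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    pct_identity: float
    alignment_length: int
    mismatches: int        # placeholder column, see the note above
    gap_openings: int      # placeholder column, see the note above
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    e_value: float         # always 0 according to the table above
    score: float           # alignment score, reported in the bit score column


def parse_hit(line):
    """Parse one tab-separated result line into a Hit record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return Hit(
        fields[0],
        fields[1],
        float(fields[2]),
        *map(int, fields[3:10]),  # alignment length through subject end
        float(fields[10]),
        float(fields[11]),
    )
```

Applied to each line of the query output, this gives named access such as `hit.subject_id` and `hit.pct_identity`, which is convenient for filtering matches by identity.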

Installation

Using EDM

Users of the Enthought Deployment Manager (EDM) can install the necessary prerequisites (Click, Cython, Numpy, and Parasail) by importing an EDM environment from the bundle file shipped with this repository:

  edm env import -f <bundle.json> vpsearch

where <bundle.json> is one of vpsearch_py3.6_osx-x86_64.json or vpsearch_py3.6_rh6-x86_64.json, depending on your platform.

When this is done, activate the environment, and install this package. From the root of this repository, run

  edm shell -e vpsearch
  pip install -e .

Using Pip, Conda, etc.

Users of other package installation tools, such as Pip or Conda, need to install the Parasail library following the instructions on the Parasail web page. Once that is done, the Python dependencies can be installed using the appropriate command for your package manager. For pip, for example, this can be done with

  pip install -r requirements.txt

Once that is done, activate your virtual environment, and install this package via

  pip install -e .

Using Docker

It is possible to build a Docker image that contains vpsearch as well as all of its dependencies. This is useful, for example, when integrating vpsearch into a workflow manager, like Snakemake, CWL, or WDL.

To build the image, run the following command from the root of this repository:

  docker build . -t vpsearch-image

Once the image has been built, vpsearch can then be run from within a container. Assuming you have a FASTA file of target sequences in the file database.fasta in the current directory, run the following to build a vpsearch index:

  docker run -it -v "$PWD":/data vpsearch-image vpsearch build /data/database.fasta

To query the index for a given FASTA file query.fasta of query sequences, run:

  docker run -it -v "$PWD":/data vpsearch-image vpsearch query /data/database.db /data/query.fasta

Troubleshooting

The vpsearch package relies on the Parasail C library for alignment. If building the package fails because the Parasail library cannot be found, you can manually specify the location of the Parasail include files and shared object libraries by setting the PARASAIL_INCLUDE_DIR and PARASAIL_LIB_DIR environment variables before building the package:

  export PARASAIL_INCLUDE_DIR=/location/of/parasail/include/files
  export PARASAIL_LIB_DIR=/location/of/parasail/lib/files
  pip install -e .

Note that if Parasail is installed in a non-standard location, you may have to set the LD_LIBRARY_PATH variable at runtime.
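As a quick diagnostic before building, it can help to see what the current environment exposes. The sketch below uses only the Python standard library; the function name `parasail_status` is hypothetical. Note that `ctypes.util.find_library` follows its own lookup rules, which differ from those of the build system and the dynamic linker, so a result of `None` is a hint rather than proof that Parasail is missing.

```python
import ctypes.util
import os


def parasail_status():
    """Report what this environment knows about the Parasail library."""
    return {
        # Name or path of the shared library if found, otherwise None.
        "library": ctypes.util.find_library("parasail"),
        # Build-time and runtime variables mentioned in this section.
        "PARASAIL_INCLUDE_DIR": os.environ.get("PARASAIL_INCLUDE_DIR"),
        "PARASAIL_LIB_DIR": os.environ.get("PARASAIL_LIB_DIR"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
    }
```

Printing the returned dictionary shows at a glance which of the variables are set in the shell that will run pip.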

Implementation notes

The tree construction operates in two phases. First, the tree is built as a tree of Python object nodes, because a dynamic data structure is easier to build. Then the topology of the nodes is linearized into a few integer arrays that are easy to serialize and fast to look up. The object that represents the linearized tree can only query the database, not build the tree. The slower tree-of-nodes implementation can both build and query, albeit with more overhead.
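The linearization step can be illustrated with a short sketch. This is not the actual vpsearch code: the `Node` class, the `linearize` function, and the four array names are hypothetical, and plain Python lists stand in for the integer arrays. Nodes are numbered in pre-order, and child links become positions in the arrays, with -1 marking a missing child.

```python
class Node:
    """Dynamic tree node: convenient to build, slower to traverse."""

    def __init__(self, item, threshold, left=None, right=None):
        self.item = item            # index of the vantage point sequence
        self.threshold = threshold  # split distance at this node
        self.left = left
        self.right = right


def linearize(root):
    """Flatten a node tree into four parallel lists (pre-order)."""
    items, thresholds, lefts, rights = [], [], [], []

    def visit(node):
        if node is None:
            return -1
        pos = len(items)
        items.append(node.item)
        thresholds.append(node.threshold)
        lefts.append(-1)   # filled in once the children are placed
        rights.append(-1)
        lefts[pos] = visit(node.left)
        rights[pos] = visit(node.right)
        return pos

    visit(root)
    return items, thresholds, lefts, rights
```

A query over the flattened form follows `lefts` and `rights` by position instead of chasing object references, and the four lists can be written to disk as flat arrays, which matches the serialize-and-look-up motivation given above.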

Building wheels

Wheels for this package can be built in a platform-independent way using cibuildwheel, running under GitHub Actions. As an administrator, you can start a workflow to build wheels by selecting the "Build wheels" action from the GitHub Actions menu, and clicking the "Run workflow" button. When the workflow completes, wheels for Linux and macOS will be available as a zipped artifact.

It is possible to run cibuildwheel locally, but only to build wheels for Linux. In a clean Python environment, run pip install cibuildwheel to install the tool, followed by e.g.

  CIBW_BUILD_VERBOSITY=1 \
  CIBW_BUILD=cp38-manylinux_x86_64 \
  CIBW_BEFORE_BUILD="./ci/build-parasail.sh" \
  python -m cibuildwheel --output-dir wheelhouse --platform linux

to build Python 3.8 wheels for Linux. By varying the build tag, wheels for other Python versions can be built.

License

This package is licensed under the BSD license.

References

Vantage point trees were introduced in

Uhlmann, Jeffrey (1991). "Satisfying General Proximity/Similarity Queries with Metric Trees". Information Processing Letters. 40 (4): 175–179. doi:10.1016/0020-0190(91)90074-r.

Yianilos, Peter N. (1993). "Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces". Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA. pp. 311–321.

The Parasail library is described in

Daily, Jeff. (2016). Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 17(1), 1-11. doi:10.1186/s12859-016-0930-z
