Skip to main content

Global-Global genetic database search.

Project description


vpsearch - Fast Vantage-Point Tree Search for Sequence Databases

This is a package for indexing and querying a sequence database for fast nearest-neighbor search by means of vantage point trees. For reasonably large databases, such as RDP, this results in sequence lookups that are typically 5-10 times faster than other alignment-based lookup methods.

Vantage-point tree search uses global-to-global alignment to compare sequences, rather than seed-and-extend approximative methods as used for example by BLAST.

Usage

Given a sequence database (in FASTA format), vpsearch build constructs an optimized vantage point search tree. Building the tree is a one-time operation and doesn't have to be done again unless the database changes. As an illustration, we build a vantage point tree for the RDP database of bacterial 16S sequences. This database contains 281261 sequences of which 39237 are duplicates. After removing these duplicates, we are left with 242024 unique sequences. Building a tree for these sequences is done with:

  $ vpsearch build rdp_download_281261seqs_dedup.fa
  Building for 242024 sequences...
  done.
  Linearizing...done.
  Database created in rdp_download_281261seqs_dedup.db

For the RDP database of full length sequences, this takes about 20 minutes on a standard machine. When only selected regions of the sequences are considered, the time needed to build a tree can be much reduced. For example, vantage point trees for the v1-v2 hypervariable region (350 base pairs) or the v3-v4 region (250 base pairs) of the RDP 16S sequencese can be built in 30 seconds to 1 minute.

Once a tree has been built, unknown sequences can be looked up using the vpsearch query command. Here we supply a query file with a single sequence

  vpsearch query rdp_download_281261seqs_dedup.fa query.fa
  query	S000143715	99.54	1529	0	0	1	1524	1	1529	0	7546
  query	S004085923	99.08	1529	0	0	1	1524	1	1526	0	7481
  query	S004085922	99.08	1529	0	0	1	1524	1	1526	0	7481
  query	S004085925	98.50	1531	0	0	1	1524	1	1527	0	7386

By default, the vpsearch query command outputs the best four matches in the database per query sequence (the number of matches can be changed with the -k parameter). Lookup is done one query sequence at a time, but multiple queries can be considered in parallel by enabling multiple threads; use the -j option to specify the number of threads.

The vpsearch query command attempts to output its results in the standard BLAST tabular format. The interpretation of the columns is as follows:

Column name Example Notes
query ID query
subject ID S000143715
% identity 99.54
alignment length 1529
mismatches 0 currently not implemented
gap openings 0 currently not implemented
query start 1
query end 1524
subject start 1
subject end 1529
E-value 0 N/A (always 0)
bit score 7546 interpreted as the alignment score

Note that the number of mismatches and gap openings are currently not displayed in the result output. This will be addressed in a future version of the package.

Installation

Using EDM

Users of the Enthought Deployment Manager(EDM) can install the necessary prerequisites (Click, Cython, Numpy, and Parasail) by importing an EDM environment from the bundle file shipped with this repository

  edm env import -f <bundle.json> vpsearch

where <bundle.json> is one of vpsearch_py3.6_osx-x86_64.json or vpsearch_py3.6_rh6-x86_64.json, depending on your platform.

When this is done, activate the environment, and install this package. From the root of this repository, run

  edm shell -e vpsearch
  pip install -e .

Using Pip, Conda, etc.

Users of other package installation tools, such as Pip or Conda, need to install the Parasail library following the instructions on the Parasail web page. Once that is done, the Python dependencies can be installed using the appropriate command for your package manager. For pip, for example, this can be done with

  pip install -r requirements.txt

Once that is done, activate your virtual environment, and install this package via

  pip install -e .

Troubleshooting

The vpsearch package relies on the Parasail C library for alignment. If building the package fails because the Parasail library cannot be found, you can manually specify the location of the Parasail include files and shared object libraries by setting the PARASAIL_INCLUDE_DIR and PARASAIL_LIB_DIR environment variables before building the package:

  export PARASAIL_INCLUDE_DIR=/location/of/parasail/include/files
  export PARASAIL_LIB_DIR=/location/of/parasail/lib/files
  pip install -e .

Note that if Parasail is installed in a non-standard location, you may have to set the LD_LIBRARY_PATH variable at runtime.

Implementation notes

The tree construction operates in two phases. We first build the tree as a tree of Python object nodes because it's easier to build with a dynamic data structure. Then it linearizes the topology of the nodes into a few integer arrays that are easy to serialize and fast to look up. The object that represents the linearized tree can only query the database, not build the tree. The slower tree-of-nodes implementation can build and query (albeit with more overhead).

License

This package is licensed under the BSD license.

References

Vantage point trees were introduced in

Uhlmann, Jeffrey (1991). "Satisfying General Proximity/Similarity Queries with Metric Trees". Information Processing Letters. 40 (4): 175–179. doi:10.1016/0020-0190(91)90074-r.

Yianilos (1993). Data structures and algorithms for nearest neighbor search in general metric spaces (PDF). Fourth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics Philadelphia, PA, USA. pp. 311–321. pny93.

The Parasail library is described in

Daily, Jeff. (2016). Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 17(1), 1-11. doi:10.1186/s12859-016-0930-z

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for vpsearch, version 0.1.1
Filename, size File type Python version Upload date Hashes
Filename, size vpsearch-0.1.1.tar.gz (219.0 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page