Compact Bit-Sliced Signature Index (COBS)
COBS (COmpact Bit-sliced Signature index) is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and to process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain false positives; their number decreases exponentially with the query length and with the false positive rate of the index, which is determined at construction time. COBS' compact but simple data structure outperforms other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. However, unlike Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
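To make the idea of combining Bloom filters with an inverted-index layout more concrete, here is a minimal Python sketch. It is a toy illustration only and not the COBS implementation: the class name `ToyBitSlicedIndex`, the SHA-256 based hashing, and all parameter defaults are assumptions made for this example. Each document gets one Bloom filter, the filters are stored row by row so that one row holds the same bit position for every document, and a query counts per document how many of its q-grams appear to be present.

```python
import hashlib


def qgrams(text, q):
    """Return the set of all length-q substrings of text."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def positions(term, num_rows, num_hashes):
    """Map a term to num_hashes row indices (the Bloom filter hash functions)."""
    rows = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
        rows.append(int.from_bytes(digest[:8], "big") % num_rows)
    return rows


class ToyBitSlicedIndex:
    """One Bloom filter per document, stored row-wise (bit-sliced).

    Row r is an int whose bit d is set if document d's filter has bit r set.
    Hypothetical class for illustration; not part of COBS.
    """

    def __init__(self, documents, q=4, num_rows=256, num_hashes=2):
        self.q, self.num_rows, self.num_hashes = q, num_rows, num_hashes
        self.num_docs = len(documents)
        self.rows = [0] * num_rows
        for doc_id, text in enumerate(documents):
            for term in qgrams(text, q):
                for r in positions(term, num_rows, num_hashes):
                    self.rows[r] |= 1 << doc_id

    def query(self, pattern, threshold=0.8):
        """Return ids of documents matching at least threshold of the q-grams."""
        terms = qgrams(pattern, self.q)
        if not terms:
            return []
        counts = [0] * self.num_docs
        all_docs = (1 << self.num_docs) - 1
        for term in terms:
            # AND the rows selected by the hash functions: a document's bit
            # survives only if all of its filter bits for this term are set.
            hits = all_docs
            for r in positions(term, self.num_rows, self.num_hashes):
                hits &= self.rows[r]
            for d in range(self.num_docs):
                counts[d] += (hits >> d) & 1
        needed = threshold * len(terms)
        return [d for d in range(self.num_docs) if counts[d] >= needed]
```

A query such as `ToyBitSlicedIndex(["ACGTACGTTTGACCA", "GGGGCCCCAAAATTTT"]).query("ACGTACGTTTGA")` always reports document 0, because Bloom filters have no false negatives. Other documents can only appear if enough of the query's q-grams collide by chance, which becomes less likely as the query gets longer.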
COBS has two interfaces:
- a command line tool in C++ called cobs (see below)
- a Python interface to the C++ library (see https://bingmann.github.io/cobs-python-docs/)
More information about COBS is presented in our current research paper: Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. "COBS: a Compact Bit-Sliced Signature Index". In: 26th International Symposium on String Processing and Information Retrieval (SPIRE). Pages 285-303. Springer. October 2019. Preprint arXiv:1905.09624.
Installation and First Steps
Installation
COBS requires CMake and either a C++17 compiler or the Boost.Filesystem library.
To download and install COBS run:
git clone --recursive https://github.com/bingmann/cobs.git
mkdir cobs/build
cd cobs/build
cmake ..
make -j4
and optionally run make test to check the build.
Building an Index
COBS can read FASTA files (*.fa, *.fasta, *.fa.gz, *.fasta.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), McCortex files (*.ctx), or text files (*.txt).
You can either recursively scan a directory for all files matching any of these extensions, or pass a *.list file which lists all paths COBS should index.
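If you want to control exactly which files are indexed, a *.list file can be generated with a short script. The following Python sketch walks a directory recursively and writes the matching paths to a file. The helper names `find_documents` and `write_list_file` are hypothetical, and the one-path-per-line layout is an assumption about the *.list format that you should check against the output of doc-list.

```python
from pathlib import Path

# File extensions named in the text above.
EXTENSIONS = (".fa", ".fasta", ".fa.gz", ".fasta.gz",
              ".fq", ".fastq", ".fq.gz", ".fastq.gz", ".ctx", ".txt")


def find_documents(root):
    """Recursively collect files under root whose name ends in a known extension."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.name.endswith(EXTENSIONS))


def write_list_file(root, list_path):
    """Write one path per line (assumed *.list layout) and return the paths."""
    paths = find_documents(root)
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For example, `write_list_file("tests/data/fasta/", "example.list")` would collect the example documents from the repository checkout into `example.list`.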
To check the document list to be indexed, run for example
src/cobs doc-list tests/data/fasta/
To construct a compact COBS index from these seven example documents run
src/cobs compact-construct tests/data/fasta/ example.cobs_compact
Check --help for many options.
Query an Index
COBS has a simple command line query tool:
src/cobs query -i example.cobs_compact AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT
or pass a FASTA file of queries with
src/cobs query -i example.cobs_compact -f query.fa
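When queries are produced by another program, it can be convenient to write the FASTA file and assemble the command from a script. The Python sketch below is one way to do that. The helper names `write_fasta` and `query_command` and the record names are hypothetical; the command uses only the `query`, `-i`, and `-f` arguments shown above.

```python
from pathlib import Path


def write_fasta(queries, path, width=80):
    """Write {name: sequence} as FASTA records, wrapping sequences at width columns."""
    lines = []
    for name, seq in queries.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    Path(path).write_text("\n".join(lines) + "\n")


def query_command(index_path, fasta_path, cobs="src/cobs"):
    """Build the argument list for the file-based query shown above."""
    return [cobs, "query", "-i", str(index_path), "-f", str(fasta_path)]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the build directory once the index has been constructed.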
Python Interface
COBS also has a Python frontend interface which can be used to construct and query an index. See https://bingmann.github.io/cobs-python-docs/ for a tutorial.
Experimental Results
In our paper we compare COBS against seven other k-mer indexing software packages. The main results are reported in two diagrams: the first scales by the number of documents in the index, and the second shows the same results per document.