Compact Bit-Sliced Signature Index (COBS)

COBS (COmpact Bit-sliced Signature index) is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain false positives; their number decreases exponentially with the query length and depends on the false positive rate of the index, which is fixed at construction time. COBS' compact but simple data structure outperforms other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. However, unlike Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
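
To make the idea of combining Bloom filters with a bit-sliced layout more concrete, here is a minimal toy sketch in Python. It is not the COBS implementation: the class name ToyBitSlicedIndex, the salted SHA-256 hashing, and the parameters k, num_hashes and num_rows are all assumptions chosen for illustration. Each document gets one Bloom filter of its k-mers, the filters are stored row by row so that one row holds one bit per document, and a query reports every document that contains at least the threshold fraction of the query's k-mers.

```python
import hashlib


def kmers(text, k):
    """Return all overlapping k-mers (q-grams) of text."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def positions(term, num_hashes, num_rows):
    """Map a term to num_hashes row numbers via salted SHA-256 (illustrative choice)."""
    rows = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
        rows.append(int.from_bytes(digest[:8], "big") % num_rows)
    return rows


class ToyBitSlicedIndex:
    """One Bloom filter per document, stored row-wise:
    bit d of rows[r] is set if document d has bit r set in its filter."""

    def __init__(self, documents, k=5, num_hashes=3, num_rows=4096):
        self.k, self.num_hashes, self.num_rows = k, num_hashes, num_rows
        self.names = list(documents)
        self.rows = [0] * num_rows  # Python ints act as bit vectors over documents
        for doc_id, name in enumerate(self.names):
            for term in set(kmers(documents[name], k)):
                for r in positions(term, num_hashes, num_rows):
                    self.rows[r] |= 1 << doc_id

    def query(self, pattern, threshold=0.8):
        """Return {name: fraction of query k-mers found} for documents
        whose fraction is at least threshold."""
        terms = kmers(pattern, self.k)
        if not terms:
            return {}
        counts = [0] * len(self.names)
        all_docs = (1 << len(self.names)) - 1
        for term in terms:
            hit = all_docs
            for r in positions(term, self.num_hashes, self.num_rows):
                hit &= self.rows[r]  # keep documents where every hashed bit is set
            for doc_id in range(len(self.names)):
                if (hit >> doc_id) & 1:
                    counts[doc_id] += 1
        return {
            name: counts[i] / len(terms)
            for i, name in enumerate(self.names)
            if counts[i] / len(terms) >= threshold
        }
```

For example, ToyBitSlicedIndex({"a": "ACGTACGTTGCAAGCT"}).query("ACGTACGTTG") reports document "a", because a Bloom filter never misses a term that was inserted. A document can also be reported although it lacks some query k-mers, which is the false positive effect described above.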

[Figure: COBS architecture]

COBS has two interfaces: a command line tool and a Python interface, both introduced below.

More information about COBS is presented in our current research paper: Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. "COBS: a Compact Bit-Sliced Signature Index". In: 26th International Symposium on String Processing and Information Retrieval (SPIRE), pages 285-303. Springer, October 2019. Preprint: arXiv:1905.09624.

Installation and First Steps

Installation

COBS requires CMake and either a C++17 compiler or the Boost.Filesystem library.

To download and install COBS run:

git clone --recursive https://github.com/bingmann/cobs.git
mkdir cobs/build
cd cobs/build
cmake ..
make -j4

and optionally run make test to check the build.

Building an Index

COBS can read FASTA files (*.fa, *.fasta, *.fa.gz, *.fasta.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), McCortex files (*.ctx), or text files (*.txt).

You can either recursively scan a directory for all files matching any of these file types, or pass a *.list file which lists all paths COBS should index.
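
If you prefer to control exactly which files are indexed, a *.list file can be generated with a few lines of Python. The sketch below is only an illustration: the function names find_documents and write_list_file are hypothetical, and it assumes that a *.list file holds one path per line, which the text above does not spell out.

```python
from pathlib import Path

# File name endings taken from the supported formats listed above.
SUFFIXES = (
    ".fa", ".fasta", ".fa.gz", ".fasta.gz",
    ".fq", ".fastq", ".fq.gz", ".fastq.gz",
    ".ctx", ".txt",
)


def find_documents(root):
    """Recursively collect files under root whose name ends in a supported suffix."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name.endswith(SUFFIXES)
    )


def write_list_file(root, list_path):
    """Write the matching paths to list_path, one per line (assumed *.list layout)."""
    paths = find_documents(root)
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

Calling write_list_file("tests/data/fasta/", "example.list") would write the matching paths to example.list, which you can inspect or edit before handing it to COBS.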

To check the document list to be indexed, run for example

src/cobs doc-list tests/data/fasta/

To construct a compact COBS index from these seven example documents run

src/cobs compact-construct tests/data/fasta/ example.cobs_compact

Run the command with --help to see the many available options.

Query an Index

COBS has a simple command line query tool:

src/cobs query -i example.cobs_compact AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT

or pass a FASTA file of queries with

src/cobs query -i example.cobs_compact -f query.fa
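
A FASTA file of queries such as query.fa can be written by hand or generated. The following Python sketch uses the standard FASTA layout, a header line starting with ">" followed by the sequence; the function name write_fasta and the query names are hypothetical, and how COBS uses the header names in its output is not covered here.

```python
def write_fasta(queries, path):
    """Write a {name: sequence} mapping as FASTA records:
    a '>' header line followed by the sequence on the next line."""
    with open(path, "w") as out:
        for name, sequence in queries.items():
            out.write(f">{name}\n{sequence}\n")
```

For instance, write_fasta({"query1": "AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT"}, "query.fa") creates a file with a single query record.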

Python Interface

COBS also has a Python frontend interface which can be used to construct and query an index. See https://bingmann.github.io/cobs-python-docs/ for a tutorial.

Experimental Results

In our paper we compare COBS against seven other k-mer indexing software packages. The main results are shown below: the first diagram scales by the number of documents in the index, and the second shows the results per document.

[Figures: COBS experiments, scaling by number of documents; COBS experiments, scaling per document]
