Compact Bit-Sliced Signature Index (COBS)

COBS (COmpact Bit-sliced Signature index) is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain false positives; their number decreases exponentially with the query length and depends on the false positive rate of the index, which is set at construction time. COBS' compact but simple data structure outperforms other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. However, unlike Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
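
The effect of query length on false positives can be illustrated with a small calculation. This is a simplified sketch, not the analysis from the paper: it assumes each query k-mer is independently a false positive with probability fpr, and that a document is reported when at least a fraction threshold of the query's k-mers match. The function name and the example parameters are hypothetical.

```python
from math import ceil, comb


def false_positive_probability(num_kmers: int, fpr: float, threshold: float) -> float:
    """Probability that an unrelated document reaches the coverage threshold.

    Assumption (illustration only): every one of the num_kmers query k-mers
    is an independent false positive with probability fpr.
    """
    needed = ceil(threshold * num_kmers)  # matches required to report the document
    # Binomial tail: P(X >= needed) for X ~ Binomial(num_kmers, fpr).
    return sum(
        comb(num_kmers, i) * fpr**i * (1.0 - fpr) ** (num_kmers - i)
        for i in range(needed, num_kmers + 1)
    )


if __name__ == "__main__":
    # Longer queries contain more k-mers, so the tail probability shrinks quickly.
    for n in (10, 20, 40, 80):
        print(n, false_positive_probability(n, fpr=0.3, threshold=0.8))
```

Under this model, doubling the number of query k-mers lowers the probability by orders of magnitude, which is why even a fairly high per-k-mer false positive rate can be acceptable for long queries.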

[Figure: COBS architecture]

COBS has two interfaces: a command line tool called cobs (see below) and a Python interface (see https://bingmann.github.io/cobs-python-docs/).

More information about COBS is presented in our current research paper: Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. "COBS: a Compact Bit-Sliced Signature Index". In: 26th International Symposium on String Processing and Information Retrieval (SPIRE). Pages 285-303. Springer. October 2019. Preprint arXiv:1905.09624.

If you use COBS in an academic context or publication, please cite our paper:

@InProceedings{bingmann2019cobs,
  author =       {Timo Bingmann and Phelim Bradley and Florian Gauger and Zamin Iqbal},
  title =        {{COBS}: a Compact Bit-Sliced Signature Index},
  booktitle =    {26th International Symposium on String Processing and Information Retrieval (SPIRE)},
  year =         2019,
  series =       {LNCS},
  pages =        {285--303},
  month =        oct,
  organization = {Springer},
  note =         {preprint arXiv:1905.09624},
}

Installation and First Steps

Installation

COBS requires CMake and either a C++17 compiler or the Boost.Filesystem library.

Linux

To download and install COBS run:

git clone --recursive https://github.com/iqbal-lab-org/cobs.git
mkdir cobs/build
cd cobs/build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j4

and optionally run make test to check the build. Note that to download submodules, --recursive has to be provided.

OS X compilation

Using clang:

  1. Install boost-1.76: brew install boost@1.76
  2. Compile COBS with boost: cmake ..

Troubleshooting

Several issues might arise from your specific configuration.

Problems with OpenMP on Mac OS X

If installing OpenMP does not work, add the -DNOOPENMP=1 argument to the cmake command.

Problems with Python bindings

Skip compilation of the Python bindings by adding the -DSKIP_PYTHON=1 argument to the cmake command.

Problems with finding boost

Define the BOOST_ROOT environment variable and then compile:

export BOOST_ROOT="/usr/local/opt/boost@1.76"  # use your own Boost root path; this is the path when Boost is installed with brew on Mac OS X
cmake ..

Building an Index

COBS can read FASTA files (*.fa, *.fasta, *.fna, *.ffn, *.faa, *.frn, *.fa.gz, *.fasta.gz, *.fna.gz, *.ffn.gz, *.faa.gz, *.frn.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), or text files (*.txt). See below for details on how they are parsed.

You can either recursively scan a directory for all files matching any of these file types, or pass a *.list file which lists all paths COBS should index.

To check the document list to be indexed, run for example

src/cobs doc-list tests/data/fasta/

To construct a compact COBS index from these seven example documents run

src/cobs compact-construct tests/data/fasta/ example.cobs_compact

Or construct a compact COBS index from a list of documents by running

src/cobs compact-construct tests/data/fasta_files.list example.cobs_compact

The paths in the file list can be absolute or relative to the file list's path. Note that *.txt files are read as verbatim text files. You can force COBS to read a .txt file as a file list using --file-type list.
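
A file list can be generated with a few lines of Python. The sketch below assumes the list is a plain text file with one path per line (the format is not spelled out above, so treat this as an assumption) and writes paths relative to the location of the list file. The helper name write_file_list and the output name docs.list are hypothetical.

```python
import os
from pathlib import Path

# Uncompressed FASTA extensions named above; a trailing ".gz" is also accepted.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna", ".ffn", ".faa", ".frn")


def write_file_list(root: Path, list_path: Path) -> list[str]:
    """Write every FASTA file below root to list_path and return the entries.

    Assumption: the list format is one path per line. Paths are written
    relative to the directory that contains the list file.
    """
    entries = []
    for path in sorted(root.rglob("*")):
        # Strip a ".gz" suffix before checking the FASTA extension.
        name = path.name[:-3] if path.name.endswith(".gz") else path.name
        if path.is_file() and name.endswith(FASTA_SUFFIXES):
            entries.append(os.path.relpath(path, start=list_path.parent))
    list_path.write_text("".join(entry + "\n" for entry in entries))
    return entries


if __name__ == "__main__":
    for entry in write_file_list(Path("tests/data/fasta"), Path("docs.list")):
        print(entry)
```

Using the .list extension avoids the ambiguity with *.txt files mentioned above.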

See --help for the many available options.

Query an Index

COBS has a simple command line query tool:

src/cobs query -i example.cobs_compact AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT

or pass a FASTA file of queries with

src/cobs query -i example.cobs_compact -f query.fa

Multiple indices can be queried at once by adding more -i parameters.
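
Conceptually, a query with a coverage threshold splits the pattern into k-mers and reports each document that contains a sufficient share of them. The following sketch shows that idea with exact Python sets in place of the COBS index, so it has no false positives and does not canonicalize k-mers. It does not use the COBS API; the function names, k, and the threshold values are illustrative assumptions.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return all overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def query(documents: dict[str, str], pattern: str, k: int,
          threshold: float) -> list[tuple[str, int]]:
    """Report (document name, matching k-mer count) for documents that
    contain at least threshold * (number of query k-mers) query k-mers."""
    query_kmers = kmers(pattern, k)
    results = []
    for name, text in documents.items():
        doc_kmers = set(kmers(text, k))  # exact set, stands in for the index
        hits = sum(1 for kmer in query_kmers if kmer in doc_kmers)
        if query_kmers and hits >= threshold * len(query_kmers):
            results.append((name, hits))
    # Best-matching documents first.
    return sorted(results, key=lambda item: -item[1])


if __name__ == "__main__":
    docs = {"d1": "ACGTACGTAA", "d2": "TTTTTTTTTT"}
    print(query(docs, "ACGTAC", k=4, threshold=0.8))
```

Lowering the threshold admits documents that share only part of the pattern, which is what makes the matching approximate.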

Python Interface

COBS also has a Python frontend interface which can be used to construct and query an index. See https://bingmann.github.io/cobs-python-docs/ for a tutorial.

Experimental Results

In our paper we compare COBS against seven other k-mer indexing software packages. The main results are shown in two diagrams: the first scales by the number of documents in the index, the second shows the same results per document.

[Figures: cobs-experiments-scaling, cobs-experiments-scaling-per-documents]

More Details

File Types and How They Are Parsed

COBS can read FASTA files (*.fa, *.fasta, *.fa.gz, *.fasta.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), or text files (*.txt). Each file type is parsed slightly differently into q-grams or k-mers.

FASTA files are parsed as one document each. If a FASTA file contains multiple sequences or reads then they are combined into one document. Multiple sequences (separated by comments) are NOT concatenated trivially, instead the k-mers are extracted separately from each sequence. This means there are no erroneous k-mers from the beginning or end of crossing sequences. All newlines within a sequence are removed.
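
The per-sequence extraction described above can be sketched as follows. This is an illustration of the parsing rule, not the COBS parser: the function name fasta_kmers is hypothetical, header lines are assumed to start with ">", and canonicalization is left out.

```python
def fasta_kmers(fasta_text: str, k: int) -> set[str]:
    """Collect the k-mers of one FASTA document, sequence by sequence."""
    sequences, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A header line closes the previous sequence.
            if current:
                sequences.append("".join(current))
            current = []
        elif line.strip():
            # Joining the lines removes the newlines inside a sequence.
            current.append(line.strip())
    if current:
        sequences.append("".join(current))

    result = set()
    for seq in sequences:
        # k-mers never span two sequences because each is processed alone.
        for i in range(len(seq) - k + 1):
            result.add(seq[i:i + k])
    return result


if __name__ == "__main__":
    print(sorted(fasta_kmers(">s1\nACG\nTA\n>s2\nGGC\n", k=3)))
```

In the example, the k-mers TAG and AGG that a naive concatenation of ACGTA and GGC would produce are absent from the result.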

The k-mers from DNA sequences are automatically canonicalized (the lexicographically smaller of a k-mer and its reverse complement is indexed). By adding the flag --no-canonicalize this process can be skipped. With canonicalization only ACGT letters are indexed; every other letter is mapped to binary zeros and indexed with the other data. A warning per FASTA/FASTQ file containing a non-ACGT letter is printed, but processing continues. With the flag --no-canonicalize any letters or text can be indexed.
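
As a minimal sketch of canonicalization, assuming the usual definition (the smaller of a k-mer and its reverse complement under string comparison), the helper names below are hypothetical. The handling of non-ACGT letters described above is not modeled here.

```python
# Translation table for the complement of each DNA base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


if __name__ == "__main__":
    for kmer in ("TTTT", "CCA", "TGG", "ACGT"):
        print(kmer, canonical(kmer))
```

Both strands of the same DNA fragment map to the same canonical k-mer, so a query matches regardless of which strand was sequenced.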

FASTQ files are also parsed as one document each. The quality information is dropped and effectively everything is parsed identically to FASTA files.

Multi-FASTA or Multi-FASTQ files are parsed as many documents. Each sequence in the FASTA or FASTQ file is considered a separate document in the COBS index. Their names are appended with _### where ### is the index of the subdocument.

McCortex files (*.ctx) contain a list of k-mers and these k-mers are indexed individually. The graph information is ignored. Only k=31 is currently supported.

Text files (*.txt) are parsed as verbatim binary documents. All q-grams are extracted, including newlines and other whitespace.
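
For text files this means whitespace is part of the q-grams. A small sketch, with a hypothetical function name and q chosen only for the example:

```python
def text_qgrams(data: bytes, q: int) -> list[bytes]:
    """Return every window of q bytes, keeping newlines and spaces as they are."""
    return [data[i:i + q] for i in range(len(data) - q + 1)]


if __name__ == "__main__":
    # The newline byte appears inside the extracted q-grams.
    print(text_qgrams(b"ab\ncd", q=3))
```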
