Skip to main content

A BloomFilterTrie implementation to be generally applicable for genomic applications.

Project description

kcollections

Build Status

A fast and efficient library for storing k-mers in c++ and python.

About

kcollections is a python-like dict and set that efficiently store kmers as keys or elements of the collection. It is designed for building/prototyping Bioinformatics tools that rely on kmers but where the included dict and set consume too much memory for use.

It implements the Bloom Filter Trie algorithm. This implementation differs from Guillaume et al. by allowing kmers of arbitrary size and by providing a generic dictionary/map data structure for associating arbitrary values with kmers.

kcollections is currently only available for Python version 2.7.

Installation

We provide some pre-compiled binaries for python 2/3 and Linux and MacOS:

pip install kcollections

Alternatively, you can build from source.

Build from source

Prerequisites include:

  • CMake, must be installed manually in order to build the code.
  • boost, at least version 1.65.1
  • uint256_t, included in the repository.
  • pybind11, included in the repository.

To build and install the python module from source:

git clone https://github.com/masakistan/kcollections.git
cd kcollections

# python 3
python3 setup.py install
# or
pip3 install .

# python 2
python2 setup.py install
# or
pip2 install .

Additionally, if you would like to access the functions from a C++ program, you can build static libraries with the following steps:

git clone https://github.com/masakistan/kcollections.git
cd kcollections

mkdir build
cd build

cmake ..
make

Example Usage

Using Kset

Serial Insertion, writing/reading to/from disk

Kmers can be added one at a time with add, but the fastest way to add kmers to a set is to add a DNA sequence using add_seq.

import kcollections
ks = kcollections.Kset(27)

# add single kmer
ks.add('AAACTGTCTTCCTTTATTTGTTCAGGG')

# sequence insertion
seq = 'AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA'
ks.add_seq(seq, len(seq))

assert 'AAACTGTCTTCCTTTATTTGTTCAGGG' in ks
# iteration
for kmer in ks:
    print kmer

ks.write("data.bs")
del ks

# sometime later

ks = Kcollections.Kset()
ks.read("data.bs")

Parallel Insertion

The fastest way to use Kset is to use multithreaded insertion. Multithreaded approach is best used when all kmers are loaded upfront. Kmers not accessible until the threads have been joined using parallel_add_join. See the example below on how parallel and serial insertions can be used.

import kcollections
ks = kcollections.Kset(27)

# multithreaded sequence insertion
# nthreads must be a power of 2.
# nthreads = 4 or 16 work well
ks.parallel_add_init(16)

# insert a sequence of kmers
ks.parallel_add_seq(seq)

# insert a single kmer
ks.parallel_add('AAACTGTCTTCCTTTATTTGTTCACAG')

# merge threads together
# no parallel add methods can be used after joining
ks.parallel_add_join()

# serial add is permissible after threads have joined
ks.add('AAACTGTCTTCCTTTATTTGTTCACAG')

# iteration
for kmer in ks:
    print(kmer)
print len(ks)

Using Kdict

Serial Insertion

import kcollections
kd = kcollections.Kdict(str, 27)

# insertion and value assignment
kd['AAACTGTCTTCCTTTATTTGTTCAGGG'] = 'banana'
kd['AAACTGTCTTCCTTTATTTGTTCAGGT'] = 'phone'
assert kd['AAACTGTCTTCCTTTATTTGTTCAGGG'] == 'banana'
assert kd['AAACTGTCTTCCTTTATTTGTTCAGGT'] == 'phone'

# iteration
for kmer, val in kd.items():
    print(kmer, val)

# removal
del kd['AAACTGTCTTCCTTTATTTGTTCAGGT']

Parallel Insertion

Parallel insertion for Kdict requires the inclusion of a merging function to resolve different values for the same key. The following snippet adds 27mers from a string of DNA using a provided lambda function to merge value conflicts. This merge function simply keeps the newest value associated with the kmer. More examples of merging functions with Kdict can be found here.

kd = kcollections.Kdict(int, 27)
kd.parallel_add_init(4)
kd.set_merge_func(lambda prev_val, new_val: new_val)
kd.parallel_add_seq(dna, generate_idx(len(dna)))
kd.parallel_add_join()

Using Kcounter

Kcounter is an implementation of the Python collection's Counter, but the keys must be kmers, of course! Like Kdict, kmers can be added to Kcounter one at a time, but the fastest ways to add kmers to a set is to add an DNA sequence using add_seq (or parallel_add_seq for multithreaded inserts).

Serial Insertion

from kcollections import Kcounter
kc = Kcounter(27)

# add single kmer
kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] += 1

# sequence insertion
seq = 'AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA'
kc.add_seq(seq)

assert kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] == 2

# iteration
for kmer, count in kc.items():
    print(kmer, count)

Parallel Insertion

from kcollections import Kcounter
kc = Kcounter(27)

# multithreaded sequence insertion
# nthreads must be a power of 2.
# nthreads = 4 or 16 work well
kc.parallel_add_init(16)

# insert a sequence of kmers
kc.parallel_add_seq(seq)

# insert a single kmer
kc.parallel_add('AAACTGTCTTCCTTTATTTGTTCACAG')

# merge threads together
# no parallel add methods can be used after joining
kc.parallel_add_join()

# updates can be made after the join
kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] += 1

# iteration
for kmer, count in kc.items():
    print(kmer, count)

Performance

kcollections is quite a bit slower than the dict or set but is much more memory-efficient. We measured memory usage and running time using /usr/bin/time -v on aIntel(R) Xeon(R) E5-2650v4 @2.20GHz with 256 GB RAM. 27mers used for testing were taken from the human genome, no repetitive kmers appear in our dataset providing a worst case scenario where no insertions or queries are pruned before traversing the entire data structure.

Memory Usage

# kmers kset set
1 million 25.32 MB 105.82 MB
10 million 63.74 MB 906.96 MB
100 million 0.56 GB 11.98 GB
500 million 2.42 GB 48.54 GB
1 billion 4.43 GB 97.07 GB
1.5 billion 6.44 GB 191.61 GB
2 billion 8.44 GB 194.14 GB
2.4 billion 10.08 GB 220.06 GB

Figure of memory usage

Insertion Time

Insertion time comparisons using built-in Python set, kcollections serial and parallel insert.

Figure of insertion time

Read Mapper and Assembler

An example read mapping algorithm and assembler are provided using kcollections in the applications directory.

Citation

Acknowledgements

kcollections was built at the Computational Science Laboratory at Brigham Young University by Stanley Fujimoto (@masakistan) and Cole Lyman (@colelyman).

Funding was provided by the Utah NASA Space Grant Consortium and the BYU Graduate Research Fellowship.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

kcollections-1.0.5-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.7m

kcollections-1.0.5-cp37-cp37m-macosx_10_13_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.7mmacOS 10.13+ x86-64

kcollections-1.0.5-cp36-cp36m-manylinux1_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.6m

kcollections-1.0.5-cp36-cp36m-macosx_10_13_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.6mmacOS 10.13+ x86-64

kcollections-1.0.5-cp35-cp35m-manylinux1_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.5m

kcollections-1.0.5-cp35-cp35m-macosx_10_13_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.5mmacOS 10.13+ x86-64

kcollections-1.0.5-cp34-cp34m-manylinux1_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.4m

kcollections-1.0.5-cp34-cp34m-macosx_10_13_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.4mmacOS 10.13+ x86-64

kcollections-1.0.5-cp27-cp27m-manylinux1_x86_64.whl (1.1 MB view details)

Uploaded CPython 2.7m

kcollections-1.0.5-cp27-cp27m-macosx_10_13_x86_64.whl (1.1 MB view details)

Uploaded CPython 2.7mmacOS 10.13+ x86-64

File details

Details for the file kcollections-1.0.5-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.9.1 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/2.7.12

File hashes

Hashes for kcollections-1.0.5-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 905d968a36ac5bad1d0a9122945d4a5e5ab350d7511530b2f210ffcab796e1a8
MD5 4ce3e2a929a77fa6b9be77fa0f8715aa
BLAKE2b-256 4e9381513fa3887891074c9c3484608236f8f8579cd4e3f7f69ea77dfff58ff3

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp37-cp37m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp37-cp37m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.7m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.5

File hashes

Hashes for kcollections-1.0.5-cp37-cp37m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 2a066252f125e9f191ac56698ad7363ee5a43748d96f759cc58ca8449ee3a8e1
MD5 f665df36117c8b4f67094eb9d4cb7d5d
BLAKE2b-256 cc30c36c91be8496a1a00aed25f3a6a80bc3239d4fb504cf57a463495fee9dad

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.9.1 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/2.7.12

File hashes

Hashes for kcollections-1.0.5-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 d90afb7ff8dedbf125edbeb128ceb3303625f8a7729cba097d686fd4a49b954f
MD5 d92ed946929bf6bb072d12aa9ab71b5a
BLAKE2b-256 1f4a5da571afa2d53f7c61b3f42debaea038a83a8aaf58e4f2ba3f3353c1f58c

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp36-cp36m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp36-cp36m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.6m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.6.9

File hashes

Hashes for kcollections-1.0.5-cp36-cp36m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 dfb232bfe4ad13df0fc4f3f24b6f77dc78951c40278856abc0dd6be5d7c24d5a
MD5 c68061a19981dc7a87a1e1f3a081dcdc
BLAKE2b-256 3fe99457406d876d3eb86234c6d6991a902a42c7ccd158199ad43b9a06fcbc3f

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.9.1 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/2.7.12

File hashes

Hashes for kcollections-1.0.5-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 8c51cce33891b07669fc8c8ce27a7cf826e2dd63d30408aa00aa7bbd535f9c7a
MD5 0ba8dc193e3bab2ba0331458b454e5be
BLAKE2b-256 79a2787bf3f5026fd11e4010a56f68579220bbcd9d5588a3977c0621eb7e0297

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp35-cp35m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp35-cp35m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.5m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.5.8

File hashes

Hashes for kcollections-1.0.5-cp35-cp35m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 9ed3bc1d7697d1e3517328b1f9258a2fa7e7911391c68159102f21960af34ccb
MD5 5a31c706271311b8c9ecadd8bffdfac7
BLAKE2b-256 5f354f93ffd20d7567cf5098375bf8ac71e0cbaf82d8a1325be2df750795d4e7

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp34-cp34m-manylinux1_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp34-cp34m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.4m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.9.1 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/2.7.12

File hashes

Hashes for kcollections-1.0.5-cp34-cp34m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 0d06158bdc79a75ef6ced73a705cc1585fc354a7e9fb58c21ceb47a138722f6f
MD5 70ac0aca0102bff91089ea736734324e
BLAKE2b-256 9c45911c45590771a5d3c963f6215e385c10914a26f5dcdd5642437526f4e95c

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp34-cp34m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp34-cp34m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.4m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/43.0.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.4.10

File hashes

Hashes for kcollections-1.0.5-cp34-cp34m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 b186b0dbcb6b2afc6ccc1055f7cd808363e7caf95bcc35985b3bc36a8d163e27
MD5 303ae03ba1b514775d0fb7dbf0c3482c
BLAKE2b-256 51700db95a68d902af0f77b5c9aee4e019b93914147e2e1cf3fffddb3699c5bc

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp27-cp27m-manylinux1_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp27-cp27m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 2.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.9.1 setuptools/40.8.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/2.7.12

File hashes

Hashes for kcollections-1.0.5-cp27-cp27m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 800da35c3d195fed124b71382432b03b016d80a7fe1759f61523e6771e2d0b65
MD5 69a4fc558e14a0179b90d69f4b534fe3
BLAKE2b-256 8cda7b99b0324a31df5df46095caa29b1e61ff08fd29ed8d0c8443bd22377fae

See more details on using hashes here.

File details

Details for the file kcollections-1.0.5-cp27-cp27m-macosx_10_13_x86_64.whl.

File metadata

  • Download URL: kcollections-1.0.5-cp27-cp27m-macosx_10_13_x86_64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 2.7m, macOS 10.13+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/2.7.17

File hashes

Hashes for kcollections-1.0.5-cp27-cp27m-macosx_10_13_x86_64.whl
Algorithm Hash digest
SHA256 97d91fd04997218796fd2b1baf3d7844cb56a7bc3be8c6cc35275a0d4ce6ed19
MD5 ea804feee8c24336e4d47adbb48ccd00
BLAKE2b-256 9d99275299aac7724e41fec452ba46126e181ec24b67d934be7b1a448c92a10d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page