A BloomFilterTrie implementation to be generally applicable for genomic applications.
Project description
kcollections
A fast and efficient library for storing k-mers in c++ and python.
About
kcollections
is a python-like dict
and set
that efficiently store kmers as keys or elements of the collection.
It is designed for building/prototyping Bioinformatics tools that rely on kmers but where the included dict
and set
consume too much memory for use.
It implements the Bloom Filter Trie algorithm. This implementation differs from Guillaume et al. by allowing kmers of arbitrary size and by providing a generic dictionary/map data structure for associating arbitrary values with kmers.
kcollections
is currently only available for Python version 2.7.
Installation
We provide some pre-compiled binaries for python 2/3 and Linux and MacOS:
pip install kcollections
Alternatively, you can build from source.
Build from source
Prerequisites include:
- CMake, must be installed manually in order to build the code.
- boost, at least version 1.65.1
- uint256_t, included in the repository.
- pybind11, included in the repository.
To build and install the python module from source:
git clone https://github.com/masakistan/kcollections.git
cd kcollections
# python 3
python3 setup.py install
# or
pip3 install .
# python 2
python2 setup.py install
# or
pip2 install .
Additionally, if you would like to access the functions from a C++ program, you can build static libraries with the following steps:
git clone https://github.com/masakistan/kcollections.git
cd kcollections
mkdir build
cd build
cmake ..
make
Example Usage
Using Kset
Serial Insertion, writing/reading to/from disk
Kmers can be added one at a time with add
, but the fastest way to add kmers to a set is
to add a DNA sequence using add_seq
.
import kcollections
ks = kcollections.Kset(27)
# add single kmer
ks.add('AAACTGTCTTCCTTTATTTGTTCAGGG')
# sequence insertion
seq = 'AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA'
ks.add_seq(seq, len(seq))
assert 'AAACTGTCTTCCTTTATTTGTTCAGGG' in ks
# iteration
for kmer in ks:
print kmer
ks.write("data.bs")
del ks
# sometime later
ks = Kcollections.Kset()
ks.read("data.bs")
Parallel Insertion
The fastest way to use Kset
is to use multithreaded insertion.
Multithreaded approach is best used when all kmers are loaded upfront.
Kmers not accessible until the threads have been joined using
parallel_add_join
.
See the example below on how parallel and serial insertions can be used.
import kcollections
ks = kcollections.Kset(27)
# multithreaded sequence insertion
# nthreads must be a power of 2.
# nthreads = 4 or 16 work well
ks.parallel_add_init(16)
# insert a sequence of kmers
ks.parallel_add_seq(seq)
# insert a single kmer
ks.parallel_add('AAACTGTCTTCCTTTATTTGTTCACAG')
# merge threads together
# no parallel add methods can be used after joining
ks.parallel_add_join()
# serial add is permissible after threads have joined
ks.add('AAACTGTCTTCCTTTATTTGTTCACAG')
# iteration
for kmer in ks:
print(kmer)
print len(ks)
Using Kdict
Serial Insertion
import kcollections
kd = kcollections.Kdict(str, 27)
# insertion and value assignment
kd['AAACTGTCTTCCTTTATTTGTTCAGGG'] = 'banana'
kd['AAACTGTCTTCCTTTATTTGTTCAGGT'] = 'phone'
assert kd['AAACTGTCTTCCTTTATTTGTTCAGGG'] == 'banana'
assert kd['AAACTGTCTTCCTTTATTTGTTCAGGT'] == 'phone'
# iteration
for kmer, val in kd.items():
print(kmer, val)
# removal
del kd['AAACTGTCTTCCTTTATTTGTTCAGGT']
Parallel Insertion
Parallel insertion for Kdict
requires the inclusion of a merging function to resolve different values for the same key.
The following snippet adds 27mers from a string of DNA using a provided lambda function to merge value conflicts.
This merge function simply keeps the newest value associated with the kmer.
More examples of merging functions with Kdict
can be found here.
kd = kcollections.Kdict(int, 27)
kd.parallel_add_init(4)
kd.set_merge_func(lambda prev_val, new_val: new_val)
kd.parallel_add_seq(dna, generate_idx(len(dna)))
kd.parallel_add_join()
Using Kcounter
Kcounter
is an implementation of the Python collection's
Counter,
but the keys must be kmers, of course!
Like Kdict
, kmers can be added to Kcounter
one at a time, but the
fastest ways to add kmers to a set is to add an DNA sequence using add_seq
(or
parallel_add_seq
for multithreaded inserts).
Serial Insertion
from kcollections import Kcounter
kc = Kcounter(27)
# add single kmer
kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] += 1
# sequence insertion
seq = 'AAACTGTCTTCCTTTATTTGTTCAGGGATCGTGTCAGTA'
kc.add_seq(seq)
assert kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] == 2
# iteration
for kmer, count in kc.items():
print(kmer, count)
Parallel Insertion
from kcollections import Kcounter
kc = Kcounter(27)
# multithreaded sequence insertion
# nthreads must be a power of 2.
# nthreads = 4 or 16 work well
kc.parallel_add_init(16)
# insert a sequence of kmers
kc.parallel_add_seq(seq)
# insert a single kmer
kc.parallel_add('AAACTGTCTTCCTTTATTTGTTCACAG')
# merge threads together
# no parallel add methods can be used after joining
kc.parallel_add_join()
# updates can be made after the join
kc['AAACTGTCTTCCTTTATTTGTTCAGGG'] += 1
# iteration
for kmer, count in kc.items():
print(kmer, count)
Performance
kcollections
is quite a bit slower than the dict
or set
but is much more memory-efficient.
We measured memory usage and running time using /usr/bin/time -v
on aIntel(R) Xeon(R) E5-2650v4 @2.20GHz
with 256 GB RAM.
27mers used for testing were taken from the human genome, no repetitive kmers appear in our dataset providing a worst case scenario where no insertions or queries are pruned before traversing the entire data structure.
Memory Usage
# kmers | kset |
set |
---|---|---|
1 million | 25.32 MB | 105.82 MB |
10 million | 63.74 MB | 906.96 MB |
100 million | 0.56 GB | 11.98 GB |
500 million | 2.42 GB | 48.54 GB |
1 billion | 4.43 GB | 97.07 GB |
1.5 billion | 6.44 GB | 191.61 GB |
2 billion | 8.44 GB | 194.14 GB |
2.4 billion | 10.08 GB | 220.06 GB |
Insertion Time
Insertion time comparisons using built-in Python set, kcollections
serial and parallel insert.
Read Mapper and Assembler
An example read mapping algorithm and assembler are provided using kcollections
in the applications
directory.
Citation
Acknowledgements
kcollections
was built at the Computational Science Laboratory at Brigham Young University by Stanley Fujimoto (@masakistan) and Cole Lyman (@colelyman).
Funding was provided by the Utah NASA Space Grant Consortium and the BYU Graduate Research Fellowship.