Interval intersection for sorted query and target intervals
Project description
Fast intersection of sorted intervals and sorted queries A common task in bioinformatics is to check if an interval or point overlaps a set of reference intervals. An interval tree is often used for this purpose for queries in O(log n) time. However, if both the reference intervals and query intervals are sorted ahead of time, then query time can be reduced. For example, this situation arises when processing a sorted alignment file and checking against a sorted reference interval set.
Installation
To build from source:
git clone https://github.com/kcleal/sortedintersect cd sortedintersect pip install .
Overview
Common usage is to check if a point overlaps a reference set:
from sortedintersect import IntervalSet
# intervals without data
itv = IntervalSet(False)
itv.add(0, 10)
itv.search_point(1)
# >>> True
# intervals with python object as data
itv = IntervalSet(True)
itv.add(0, 10, 'interval1')
itv.add(20, 30, {'a': 1})
itv.search_point(1)
# >>> 'interval1'
itv.search_point(20)
# >>> {'a': 1}
Note, both reference and query intervals must be added and queried in sorted order otherwise a ValueError will be raised:
# intervals without data
itv = IntervalSet(False)
itv.add(10, 11)
itv.add(0, 1)
# >>> ValueError
itv.search_point(10) # True
itv.search_point(9)
# >>> ValueError
Intervals can also be queried:
# intervals without data
itv = IntervalSet(False)
itv.add(10, 11)
itv.add(50, 60)
itv.search_interval(9, 10) # True
itv.search_interval(20, 30) # False
itv.search_interval(50, 51) # True
Benchmarks
sortedintersect was compared to popular python implementations based on interval trees, see tests folder. A major advantage of comparison tools is that queries can be performed in non-sorted order, which is not the case for sortedintersect:
N intersections |
sortedintersect (s) |
quicksect (s) |
ncls (s) |
ailist (s) |
---|---|---|---|---|
100 k |
0.015715 |
0.033503 |
0.123314 |
0.053892 |
1 million |
0.141112 |
0.334860 |
1.230206 |
0.570837 |
10 million |
1.499679 |
3.962883 |
12.819762 |
5.696887 |
100 million |
14.734016 |
40.743349 |
127.128570 |
56.149942 |
Limitations
Both reference and queries must be assessed in sorted order
Only the first overlapping interval is currently returned
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for sortedintersect-0.1.0-py3.8-macosx-10.9-x86_64.egg
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2c4cac477ba594f3c11a8ad4ed7f264092ac9383e80a21615dc46a3ecfcbc7cf |
|
MD5 | 661c1e0b06e39130caa3b9506da3a145 |
|
BLAKE2b-256 | c747b0dc037b808bbc47e29fcde155586484511d6474f123c0f7523c53883b1d |