Merge and intersect sorted numpy arrays.
Project description
Sortednp
Numpy and Numpy arrays are a really great tool. However, intersecting and merging multiple sorted numpy arrays is rather less performant. The current numpy implementation concatenates the two arrays and sorts the combination. If you want to merge or intersect multiple numpy arrays, there is a much faster way, by using the property, that the resulting array is sorted.
Sortednp (sorted numpy) operates on sorted numpy arrays to calculate the intersection or the union of two numpy arrays in an efficient way. The resulting array is again a sorted numpy array, which can be merged or intersected with the next array. The intended use case is that sorted numpy arrays are sorted as the basic data structure and merged or intersected at request. Typical applications include information retrieval and search engines in particular.
It is also possible to implement a kway merging or intersecting algorithm,
which operates on an arbitrary number of arrays at the same time. This package
is intended to deal with arrays with $10^6
$ or $10^{10}
$ items. Usually, these
arrays are too large to keep more than two of them in memory at the same
time. This package implements methods to merge and intersect multiple arrays,
which can be loaded ondemand.
Installation
There are two different methods to install sortednp
.
Using pip
(recommended)
You can install the package directly from PyPI using pip
(here pip3
). There are
precompiled wheels for linux
32 and 64bit.
$ pip3 install sortednp
Using setuptools
Alternatively, you can clone the git repository and run the setup script.
$ git clone https://gitlab.sauerburger.com/frank/sortednp.git
$ cd sortednp
$ python3 setup.py install
Numpy Dependency
The installation fails in some cases, because of a buildtime dependency on
numpy. Usually, the problem can be solved by manually installing a recent numpy
version via pip3 install U numpy
.
Usage
The package provides two different kinds of methods. The first class is intended to operate on two arrays. The second class operates on two or more arrays and calls the first class of methods internally.
Twoway methods
Two numpy sorted arrays can be merged with the merge
method, which takes two
numpy arrays and returns the sorted union of the two arrays.
## merge.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([1, 2, 3, 5, 7, 9])
m = snp.merge(a, b)
print(m)
If you run this, you should see the union of both arrays as a sorted numpy array.
$ python3 merge.py
[0 1 2 3 3 4 5 6 7 7 9]
Two sorted numpy arrays can be intersected with the intersect
method, which takes two
numpy arrays and returns the sorted intersection of the two arrays.
## intersect.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([1, 2, 3, 5, 7, 9])
i = snp.intersect(a, b)
print(i)
If you run this, you should see the intersection of both arrays as a sorted numpy array.
$ python3 intersect.py
[3 7]
Since version 0.4.0, the library provides the issubset(a, b)
method which
checks if the array a
is a subset of b
, and the isitem(v, a)
method which
checks if value
is contained in array a
.
## set.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 5, 10])
b = np.array([1, 2, 3, 4, 5, 6, 10, 11])
print(snp.issubset(a, b)) # a is subset of b
print(snp.issubset(b, a)) # b is not a subset of a
print(snp.isitem(4, a)) # 4 is an item of a
print(snp.isitem(3, a)) # 3 is not an item of a
If you execute this example, you get the expected result: a
is a subset ob
b
, 4
is a member of a
.
$ python3 set.py
True
False
True
False
Returning array indices
The intersect
method takes an optional argument indices
which is False
by default. If this is set to True
, the return value consists of the
intersection array and a tuple with the indices of the common values for both
arrays. The index arrays have the length of the output. The indices show the
position in the input from which the value was copied.
## intersect_indices.py
import numpy as np
import sortednp as snp
a = np.array([2,4,6,8,10])
b = np.array([1,2,3,4])
intersection, indices = snp.intersect(a,b, indices=True)
print(intersection)
print(indices)
The above example gives:
$ python3 intersect_indices.py
[2 4]
(array([0, 1]), array([1, 3]))
The first line shows the intersection of the two arrays. The second line
prints a tuple with the indices where the common values appeared in the input
arrays. For example, the value 4
is at position 1
in array a
and at position
3
in array b
.
Since version 0.3.0, the merge
has to indices
argument too. The returned
indices have the length of the inputs. The indices show the position in the
output to which an input value was copied.
## merge_indices.py
import numpy as np
import sortednp as snp
a = np.array([2,4])
b = np.array([3,4,5])
merged, indices = snp.merge(a,b, indices=True)
print(merged)
print(indices)
The above example gives:
$ python3 merge_indices.py
[2 3 4 4 5]
(array([0, 2]), array([1, 3, 4]))
The first line shows that the two arrays have been merged. The second line
prints a tuple with the indices. For example, the value 3
from array b
can
be found at position 1
in the output.
Duplicate treatment
Since version 0.3.0, sortednp supported multiple different strategies to deal with duplicated entries.
Duplicates during intersecting
There are three different duplicate treatments for the intersect method:

sortednp.DROP
: Ignore any duplicated entries. The output will contain only unique values. 
sortednp.KEEP_MIN_N
: If an entry occursn
times in one input array andm
times in the other input array, the output will contain the entrymin(n, m)
times. 
sortednp.KEEP_MAX_N
: If an entry occursn
times in one input array andm
times in the other input array, the output will contain the entrymax(n, m)
times (assuming the entry occurs at least once in both arrays, i.e.n > 0
andm > 0
).
The strategy can be selected with the optional duplicates
argument of
intersect
. The default is sortednp.KEEP_MIN_N
. Consider the following example.
## intersect_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
intersect_drop = snp.intersect(a, b, duplicates=snp.DROP)
print(intersect_drop) # Contains a single 4
intersect_min = snp.intersect(a, b, duplicates=snp.KEEP_MIN_N)
print(intersect_min) # Contains 4 twice
intersect_max = snp.intersect(a, b, duplicates=snp.KEEP_MAX_N)
print(intersect_max) # Contains 4 three times
The above example gives:
$ python3 intersect_duplicates.py
[4 5]
[4 4 5]
[4 4 4 5]
Duplicates during merging
The merge
method offers three different duplicates treatment strategies:

sortednp.DROP
: Ignore any duplicated entries. The output will contain only unique values. 
sortednp.DROP_IN_INPUT
: Ignores duplicated entries in the input arrays separately. This is the same as ensuring that each input array unique values. The output contains every value at most twice. 
sortednp.KEEP
: Keep all duplicated entries. If an item occursn
times in one input array andm
times in the other input array, the output contains the itemn + m
times.
The strategy can be selected with the optional duplicates
.
The default is sortednp.KEEP
. Consider the following example.
## merge_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
merge_drop = snp.merge(a, b, duplicates=snp.DROP)
print(merge_drop) # Contains a single 4
merge_dii = snp.merge(a, b, duplicates=snp.DROP_IN_INPUT)
print(merge_dii) # Contains 4 twice
merge_keep = snp.merge(a, b, duplicates=snp.KEEP)
print(merge_keep) # Contains 4 five times
The above example gives:
$ python3 merge_duplicates.py
[2 3 4 5]
[2 3 4 4 5 5]
[2 3 4 4 4 4 4 5 5]
Duplicates during subset checks
The issubset
method offers two different duplicates treatment strategies:

sortednp.IGNORE
: Ignore any duplications. The method returns True if each value in the first array is contained at least once in the second array. Duplicated entries in the first array do not change the return value. 
sortednp.REPEAT
: For each duplicated item in the first array, require at least as many items in the second array. If for one value the first array contains more duplicated entries than the second array, the method returns False.
The strategy can be selected with the optional duplicates
.
The default is sortednp.IGNORE
. Consider the following example.
## subset_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([3, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
# Number of occurances ignored
print(snp.issubset(a, b, duplicates=snp.IGNORE)) # is subset
print(snp.issubset(b, a, duplicates=snp.IGNORE)) # is subset
# Number of in subset must be smaller or equal
print(snp.issubset(a, b, duplicates=snp.REPEAT)) # is subset
# three 4s not subset of two 4s
print(snp.issubset(b, a, duplicates=snp.REPEAT))
The above example gives:
$ python3 subset_duplicates.py
True
True
True
False
Index tracking and duplicates
Tracking indices with the indices=True
argument is possible while selecting a
nondefault duplicate treatment strategy. For merging the indices point to the
position in the output array. If the input has duplicates that were skipped, the
index is simply repeated. For example with snp.DROP
, if the input is [9, 9, 9, 9]
, the index array for this input contains four times the position where
9
is found in the output.
Similarly, with snp.KEEP_MAX_N
and intersect
, the index of the last item in
the array with less occurrences is duplicates.
## duplicates_index.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
# Merge
merge_drop, (index_a, index_b) = snp.merge(a, b,
duplicates=snp.DROP,
indices=True)
print(index_b)
# Intersect
intersect_max, (index_a, index_b) = snp.intersect(a, b,
duplicates=snp.KEEP_MAX_N,
indices=True)
print(index_a)
The above example gives:
$ python3 duplicates_index.py
[1 2 2 2 3]
[1 2 2 3]
For merging, this means that the three 4
s from the input all appear at same position
in the output, namely position 2
.
For the intersect, this means that the second and third occurrence of 4
in the
output, both came from item at position 2
in the input.
kway methods
Similarly, the kway intersect and merge methods take two or more arrays and perform the merge or intersect operation on its arguments.
## kway_intersect.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([0, 3, 5, 7, 9])
c = np.array([1, 2, 3, 5, 7, 9])
d = np.array([2, 3, 6, 7, 8])
i = snp.kway_intersect(a, b, c, d)
print(i)
If you run this, you should see the intersection of all four arrays as a sorted numpy array.
$ python3 kway_intersect.py
[3 7]
The kway merger sortednp.kway_merge
works analogously. However, the native
numpy
implementation is faster compared to the merge provided by this package.
The kway merger has been added for completeness. The package heapq
provides
efficient methods to merge multiple arrays simultaneously.
The methods kway_merge
and kway_intersect
accept the optional keyword
argument assume_sorted
. By default, it is set to True
. If it is set to False
,
the method calls sort()
on the input arrays before performing the operation.
The default should be kept if the arrays are already sorted to save the time it
takes to sort the arrays.
Since the arrays might be too large to keep all of them in memory at the same
time, it is possible to pass a callable
instead of an array to the methods.
The callable
is expected to return the actual array. It is called immediately
before the array is required. This reduces the memory consumption.
Algorithms
Intersections are calculated by iterating both arrays. For a given element in
one array, the method needs to search the other and check if the element is
contained. In order to make this more efficient, we can use the fact that the
arrays are sorted. There are three search methods, which can be selected via the
optional keyword argument algorithm
.
sortednp.SIMPLE_SEARCH
: Search for an element by linearly iterating over the array elementbyelement. More Information.sortednp.BINARY_SEARCH
: Slice the remainder of the array in halves and repeat the procedure on the slice which contains the searched element. More Information.sortednp.GALLOPING_SEARCH
: First, search for an element linearly, doubling the step size after each step. If a step goes beyond the search element, perform a binary search between the last two positions. More Information.
The default is sortednp.GALLOPING_SEARCH
. The performance of all three
algorithms is compared in the next section. The methods issubset()
and
isitem()
also support the algorithm keyword.
Performance
The performance of the package can be compared with the default implementation of numpy, the intersect1d` method. The ratio of the execution time between sortednp and numpy is shown for various different benchmark tests.
The merge or intersect time can be estimated under two different assumptions. If
the arrays, which are merged or intersected, are already sorted, one should not
consider the time it takes to sort the random arrays in the benchmark. On the
other hand, if one considers a scenario in which the arrays are not sorted, one
should take the sorting time into account. The benchmarks here on this page,
assume that the arrays are already sorted. If you would like to benchmark the
package and include the sorting time, have a look at the methods defined in
ci/benchmark.py
.
The random scattering of the points indicates the uncertainty caused by random load fluctuations on the benchmark machine (Spikes of serveral orders of magnitude usualy mean that there was a shortage of memory and large chunks had to be moved to SWAP.)
Intersect
The performance of the intersection operation depends on the sparseness of the two arrays. For example, if the first element of one of the arrays is larger than all elements in the other array, only the other array has to be searched (linearly, binarily, or exponentially). Similarly, if the common elements are far apart in the arrays (sparseness), large chunks of the arrays can be skipped. The arrays in the benchmark contain random (unique) integers. The sparseness is defined as the average difference between two consecutive elements in one array.
The first set of tests studies the performance dependence on the size of the arrays. The second set of tests studies the dependence on the sparseness of the array for a fixed size of array. Every shows a colorcoded comparison of the performance of intersecting more than two arrays.
Test  Simple Search  Binary Search  Galloping Search 

Intersect  
Sparseness 
Merge
The following chart shows the performance of merging 2 or more arrays as a function of the array size. It is assumed that the arrays are already sorted.
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for sortednp0.4.1cp38cp38manylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  e7654f981ae4f2e9526a68de290ced852245c91dcd9d5c6648ee208fa7aa318c 

MD5  9ff28472a80418471ed0d7be8e4f189e 

BLAKE2b256  cf72a304c9c9f0869b2cc2b27d8641fc3cdbc1e731112e05f182901d910d3a69 
Hashes for sortednp0.4.1cp38cp38manylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  588f2715a4c3ef9f639ae05b47971bd6d43815dff10ec1d3045ae8f7bbcacd72 

MD5  e8eb3fbc73eba8c9dab6293421bb23b2 

BLAKE2b256  e90a839386a3f667e9f487db1dd01f3409bcda9e6373b64f219e63cbebcd76a9 
Hashes for sortednp0.4.1cp38cp38manylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  e9fc48474dbcc10062fd54d29af17c744acfbff3fc95374ba238503780ece6e4 

MD5  e73f1e0f7d3955e93803fd030bc593e3 

BLAKE2b256  f2dc42772d7ed777b5ada8d7e3c02fe7a08a4a0ef0be8e0678091d2a8c50099c 
Hashes for sortednp0.4.1cp38cp38manylinux1_i686.whl
Algorithm  Hash digest  

SHA256  81904bf01ca3391ede329b88e6dc59245a572f2be58cc2020bfb8f14602672aa 

MD5  82d82d64f1152b0562c0d1e51ac20cc7 

BLAKE2b256  691f77844a168cb66e7eed0227064eb09561f9c1ea0c230974281f5b38915f83 
Hashes for sortednp0.4.1cp37cp37mmanylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  5157805e2cd7fd44530aceda9728aaadb2241c69e767e76ff604f3d70f25d7bd 

MD5  809cde55d453690f857e14d4c96f85ff 

BLAKE2b256  97313745de32b5647e6df57adc209932cb1cad612cfa77a73032654038782d7d 
Hashes for sortednp0.4.1cp37cp37mmanylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  5c8d9b495505e5fb14a8b91918b23a16ca0ef45824a0f1a306c04ad377a0924c 

MD5  c5676e2fe25ad7966975487d494e7305 

BLAKE2b256  1f3d8f1ca6ece128fb91a2c6b1d6257cb70146e79ba01ba702f670cc31f1399a 
Hashes for sortednp0.4.1cp37cp37mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  cc8fff82362f160556ad041aa00ee840f537635bc0037047a44d348135b92f09 

MD5  ff9b74ba65d7237c75f50c03b42681bb 

BLAKE2b256  6d17900314dff7580bc15b2bfc2ee66d8e892fbbe31b80960d8df35de98e8d99 
Hashes for sortednp0.4.1cp37cp37mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  75da8f95da2d51b98fe3428b03acec279fbfda47c355b499f1e2359cc8af8930 

MD5  bc6ff6aa6e65a8bb0a4171c5801503ab 

BLAKE2b256  b2be9f2beae0b6cba510f5f1bbfff706933c35d3938e7a93f96d6be2baa400b1 
Hashes for sortednp0.4.1cp36cp36mmanylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  d034e4200a0cc72cfbf31f911228a02250feb4e9c3be8b0387f0b52179a3e03c 

MD5  49985aacfbc350ce661bffd6c304be0b 

BLAKE2b256  d03cea3e17ded5e3d0a3dd514599c464d6469ed18fdac2ab1e2bd80b85a9dfe7 
Hashes for sortednp0.4.1cp36cp36mmanylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  cbc2518eed5123226a44d3f77237c6f79b09039082be879f56a6adb316b439fd 

MD5  3dc0b01a242859acd355be531cef2440 

BLAKE2b256  f953f1d40b4fdd610462591efad3887fbc150d7dff6baba5993ae26934fe25e1 
Hashes for sortednp0.4.1cp36cp36mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  d35550629c33358b627da744892d50b758b0bfbff2aeaf9c6e332b6dbe395d80 

MD5  ce1d3ac650d187629e33386d4459f338 

BLAKE2b256  aab40c9c91af9cdf288f109a6979700dcc49c021f1576483c26fe9e0d9c50c65 
Hashes for sortednp0.4.1cp36cp36mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  3b86c09addbb96cc39539c12f371e864f1c9d11ce8d0ebe97527c363e1878832 

MD5  9749dd324b32b697a83ae4db6fdc4302 

BLAKE2b256  04afeace4d749ad563992681627aa89881876bd11ba683479b96138ed7d0d3dd 
Hashes for sortednp0.4.1cp35cp35mmanylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  cc30d5e857f50206c33768e233476d4a634fbbfa68d059ac9943499dc3b5e5c5 

MD5  21c742bb8fd7a566921075e6a9317084 

BLAKE2b256  29e482d700c79717fa570bb8fbb7ce88e3172e7f42d575ebec168ef5a5aa0ffb 
Hashes for sortednp0.4.1cp35cp35mmanylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  f5e923121a8b91f8f58efae795fa0c2fcdae7d214555e3c4724bbacf8cafca0c 

MD5  0252ec9bac28cc45b107d6cf3e9a8708 

BLAKE2b256  73c9316cd5ce15383f7516edff96fa63a9f30755b1fd300760a701649646bd11 
Hashes for sortednp0.4.1cp35cp35mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  09a289e2cc26df026048596a024954fb9cfd35b01398d11bd7ded1fad4f113a7 

MD5  03895843d3dc7bd8ef30fb8cf9fa77ac 

BLAKE2b256  a812e2b856b822aa91921b303edabc64b91489913d7dc9a5f2c1fda8abe3f434 
Hashes for sortednp0.4.1cp35cp35mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  55b0586a816d42631879aea121e3a1e2d64db8ba5d38c497ecdb895dc46aab2b 

MD5  4c5bbe87450fc445b878c55ca5809bc1 

BLAKE2b256  4a63b58c345d59b819e734d34e36cd0d3945934046767fcc82607b4921988ab0 