Merge and intersect sorted numpy arrays.
Project description
Sortednp
Numpy and Numpy arrays are a really great tool. However, intersecting and merging multiple sorted numpy arrays is rather less performant. The current numpy implementation concatenates the two arrays and sorts the combination. If you want to merge or intersect multiple numpy arrays, there is a much faster way, by using the property, that the resulting array is sorted.
Sortednp (sorted numpy) operates on sorted numpy arrays to calculate the intersection or the union of two numpy arrays in an efficient way. The resulting array is again a sorted numpy array, which can be merged or intersected with the next array. The intended use case is that sorted numpy arrays are sorted as the basic data structure and merged or intersected at request. Typical applications include information retrieval and search engines in particular.
It is also possible to implement a kway merging or intersecting algorithm,
which operates on an arbitrary number of arrays at the same time. This package
is intended to deal with arrays with $10^6
$ or $10^{10}
$ items. Usually, these
arrays are too large to keep more than two of them in memory at the same
time. This package implements methods to merge and intersect multiple arrays,
which can be loaded ondemand.
Installation
There are two different methods to install sortednp
.
Using pip
(recommended)
You can install the package directly from PyPI using pip
(here pip3
). There are
precompiled wheels for linux
32 and 64bit.
$ pip3 install sortednp
Using setuptools
Alternatively, you can clone the git repository and run the setup script.
$ git clone https://gitlab.sauerburger.com/frank/sortednp.git
$ cd sortednp
$ python3 setup.py install
Numpy Dependency
The installation fails in some cases, because of a buildtime dependency on
numpy. Usually, the problem can be solved by manually installing a recent numpy
version via pip3 install U numpy
.
Usage
The package provides two different kinds of methods. The first class is intended to operate on two arrays. The second class operates on two or more arrays and calls the first class of methods internally.
Twoway methods
Two numpy sorted arrays can be merged with the merge
method, which takes two
numpy arrays and returns the sorted union of the two arrays.
## merge.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([1, 2, 3, 5, 7, 9])
m = snp.merge(a, b)
print(m)
If you run this, you should see the union of both arrays as a sorted numpy array.
$ python3 merge.py
[0 1 2 3 3 4 5 6 7 7 9]
Two sorted numpy arrays can be intersected with the intersect
method, which takes two
numpy arrays and returns the sorted intersection of the two arrays.
## intersect.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([1, 2, 3, 5, 7, 9])
i = snp.intersect(a, b)
print(i)
If you run this, you should see the intersection of both arrays as a sorted numpy array.
$ python3 intersect.py
[3 7]
Since version 0.4.0, the library provides the issubset(a, b)
method which
checks if the array a
is a subset of b
, and the isitem(v, a)
method which
checks if value
is contained in array a
.
## set.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 5, 10])
b = np.array([1, 2, 3, 4, 5, 6, 10, 11])
print(snp.issubset(a, b)) # a is subset of b
print(snp.issubset(b, a)) # b is not a subset of a
print(snp.isitem(4, a)) # 4 is an item of a
print(snp.isitem(3, a)) # 3 is not an item of a
If you execute this example, you get the expected result: a
is a subset ob
b
, 4
is a member of a
.
$ python3 set.py
True
False
True
False
Returning array indices
The intersect
method takes an optional argument indices
which is False
by default. If this is set to True
, the return value consists of the
intersection array and a tuple with the indices of the common values for both
arrays. The index arrays have the length of the output. The indices show the
position in the input from which the value was copied.
## intersect_indices.py
import numpy as np
import sortednp as snp
a = np.array([2,4,6,8,10])
b = np.array([1,2,3,4])
intersection, indices = snp.intersect(a,b, indices=True)
print(intersection)
print(indices)
The above example gives:
$ python3 intersect_indices.py
[2 4]
(array([0, 1]), array([1, 3]))
The first line shows the intersection of the two arrays. The second line
prints a tuple with the indices where the common values appeared in the input
arrays. For example, the value 4
is at position 1
in array a
and at position
3
in array b
.
Since version 0.3.0, the merge
has to indices
argument too. The returned
indices have the length of the inputs. The indices show the position in the
output to which an input value was copied.
## merge_indices.py
import numpy as np
import sortednp as snp
a = np.array([2,4])
b = np.array([3,4,5])
merged, indices = snp.merge(a,b, indices=True)
print(merged)
print(indices)
The above example gives:
$ python3 merge_indices.py
[2 3 4 4 5]
(array([0, 2]), array([1, 3, 4]))
The first line shows that the two arrays have been merged. The second line
prints a tuple with the indices. For example, the value 3
from array b
can
be found at position 1
in the output.
Duplicate treatment
Since version 0.3.0, sortednp supported multiple different strategies to deal with duplicated entries.
Duplicates during intersecting
There are three different duplicate treatments for the intersect method:

sortednp.DROP
: Ignore any duplicated entries. The output will contain only unique values. 
sortednp.KEEP_MIN_N
: If an entry occursn
times in one input array andm
times in the other input array, the output will contain the entrymin(n, m)
times. 
sortednp.KEEP_MAX_N
: If an entry occursn
times in one input array andm
times in the other input array, the output will contain the entrymax(n, m)
times (assuming the entry occurs at least once in both arrays, i.e.n > 0
andm > 0
).
The strategy can be selected with the optional duplicates
argument of
intersect
. The default is sortednp.KEEP_MIN_N
. Consider the following example.
## intersect_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
intersect_drop = snp.intersect(a, b, duplicates=snp.DROP)
print(intersect_drop) # Contains a single 4
intersect_min = snp.intersect(a, b, duplicates=snp.KEEP_MIN_N)
print(intersect_min) # Contains 4 twice
intersect_max = snp.intersect(a, b, duplicates=snp.KEEP_MAX_N)
print(intersect_max) # Contains 4 three times
The above example gives:
$ python3 intersect_duplicates.py
[4 5]
[4 4 5]
[4 4 4 5]
Duplicates during merging
The merge
method offers three different duplicates treatment strategies:

sortednp.DROP
: Ignore any duplicated entries. The output will contain only unique values. 
sortednp.DROP_IN_INPUT
: Ignores duplicated entries in the input arrays separately. This is the same as ensuring that each input array unique values. The output contains every value at most twice. 
sortednp.KEEP
: Keep all duplicated entries. If an item occursn
times in one input array andm
times in the other input array, the output contains the itemn + m
times.
The strategy can be selected with the optional duplicates
.
The default is sortednp.KEEP
. Consider the following example.
## merge_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
merge_drop = snp.merge(a, b, duplicates=snp.DROP)
print(merge_drop) # Contains a single 4
merge_dii = snp.merge(a, b, duplicates=snp.DROP_IN_INPUT)
print(merge_dii) # Contains 4 twice
merge_keep = snp.merge(a, b, duplicates=snp.KEEP)
print(merge_keep) # Contains 4 five times
The above example gives:
$ python3 merge_duplicates.py
[2 3 4 5]
[2 3 4 4 5 5]
[2 3 4 4 4 4 4 5 5]
Duplicates during subset checks
The issubset
method offers two different duplicates treatment strategies:

sortednp.IGNORE
: Ignore any duplications. The method returns True if each value in the first array is contained at least once in the second array. Duplicated entries in the first array do not change the return value. 
sortednp.REPEAT
: For each duplicated item in the first array, require at least as many items in the second array. If for one value the first array contains more duplicated entries than the second array, the method returns False.
The strategy can be selected with the optional duplicates
.
The default is sortednp.IGNORE
. Consider the following example.
## subset_duplicates.py
import numpy as np
import sortednp as snp
a = np.array([3, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
# Number of occurances ignored
print(snp.issubset(a, b, duplicates=snp.IGNORE)) # is subset
print(snp.issubset(b, a, duplicates=snp.IGNORE)) # is subset
# Number of in subset must be smaller or equal
print(snp.issubset(a, b, duplicates=snp.REPEAT)) # is subset
# three 4s not subset of two 4s
print(snp.issubset(b, a, duplicates=snp.REPEAT))
The above example gives:
$ python3 subset_duplicates.py
True
True
True
False
Index tracking and duplicates
Tracking indices with the indices=True
argument is possible while selecting a
nondefault duplicate treatment strategy. For merging the indices point to the
position in the output array. If the input has duplicates that were skipped, the
index is simply repeated. For example with snp.DROP
, if the input is [9, 9, 9, 9]
, the index array for this input contains four times the position where
9
is found in the output.
Similarly, with snp.KEEP_MAX_N
and intersect
, the index of the last item in
the array with less occurrences is duplicates.
## duplicates_index.py
import numpy as np
import sortednp as snp
a = np.array([2, 4, 4, 5]) # Twice
b = np.array([3, 4, 4, 4, 5]) # Three times
# Merge
merge_drop, (index_a, index_b) = snp.merge(a, b,
duplicates=snp.DROP,
indices=True)
print(index_b)
# Intersect
intersect_max, (index_a, index_b) = snp.intersect(a, b,
duplicates=snp.KEEP_MAX_N,
indices=True)
print(index_a)
The above example gives:
$ python3 duplicates_index.py
[1 2 2 2 3]
[1 2 2 3]
For merging, this means that the three 4
s from the input all appear at same position
in the output, namely position 2
.
For the intersect, this means that the second and third occurrence of 4
in the
output, both came from item at position 2
in the input.
kway methods
Similarly, the kway intersect and merge methods take two or more arrays and perform the merge or intersect operation on its arguments.
## kway_intersect.py
import numpy as np
import sortednp as snp
a = np.array([0, 3, 4, 6, 7])
b = np.array([0, 3, 5, 7, 9])
c = np.array([1, 2, 3, 5, 7, 9])
d = np.array([2, 3, 6, 7, 8])
i = snp.kway_intersect(a, b, c, d)
print(i)
If you run this, you should see the intersection of all four arrays as a sorted numpy array.
$ python3 kway_intersect.py
[3 7]
The kway merger sortednp.kway_merge
works analogously. However, the native
numpy
implementation is faster compared to the merge provided by this package.
The kway merger has been added for completeness. The package heapq
provides
efficient methods to merge multiple arrays simultaneously.
The methods kway_merge
and kway_intersect
accept the optional keyword
argument assume_sorted
. By default, it is set to True
. If it is set to False
,
the method calls sort()
on the input arrays before performing the operation.
The default should be kept if the arrays are already sorted to save the time it
takes to sort the arrays.
Since the arrays might be too large to keep all of them in memory at the same
time, it is possible to pass a callable
instead of an array to the methods.
The callable
is expected to return the actual array. It is called immediately
before the array is required. This reduces the memory consumption.
Algorithms
Intersections are calculated by iterating both arrays. For a given element in
one array, the method needs to search the other and check if the element is
contained. In order to make this more efficient, we can use the fact that the
arrays are sorted. There are three search methods, which can be selected via the
optional keyword argument algorithm
.
sortednp.SIMPLE_SEARCH
: Search for an element by linearly iterating over the array elementbyelement. More Information.sortednp.BINARY_SEARCH
: Slice the remainder of the array in halves and repeat the procedure on the slice which contains the searched element. More Information.sortednp.GALLOPING_SEARCH
: First, search for an element linearly, doubling the step size after each step. If a step goes beyond the search element, perform a binary search between the last two positions. More Information.
The default is sortednp.GALLOPING_SEARCH
. The performance of all three
algorithms is compared in the next section. The methods issubset()
and
isitem()
also support the algorithm keyword.
Performance
The performance of the package can be compared with the default implementation of numpy, the intersect1d` method. The ratio of the execution time between sortednp and numpy is shown for various different benchmark tests.
The merge or intersect time can be estimated under two different assumptions. If
the arrays, which are merged or intersected, are already sorted, one should not
consider the time it takes to sort the random arrays in the benchmark. On the
other hand, if one considers a scenario in which the arrays are not sorted, one
should take the sorting time into account. The benchmarks here on this page,
assume that the arrays are already sorted. If you would like to benchmark the
package and include the sorting time, have a look at the methods defined in
ci/benchmark.py
.
The random scattering of the points indicates the uncertainty caused by random load fluctuations on the benchmark machine (Spikes of serveral orders of magnitude usualy mean that there was a shortage of memory and large chunks had to be moved to SWAP.)
Intersect
The performance of the intersection operation depends on the sparseness of the two arrays. For example, if the first element of one of the arrays is larger than all elements in the other array, only the other array has to be searched (linearly, binarily, or exponentially). Similarly, if the common elements are far apart in the arrays (sparseness), large chunks of the arrays can be skipped. The arrays in the benchmark contain random (unique) integers. The sparseness is defined as the average difference between two consecutive elements in one array.
The first set of tests studies the performance dependence on the size of the arrays. The second set of tests studies the dependence on the sparseness of the array for a fixed size of array. Every shows a colorcoded comparison of the performance of intersecting more than two arrays.
Test  Simple Search  Binary Search  Galloping Search 

Intersect  
Sparseness 
Merge
The following chart shows the performance of merging 2 or more arrays as a function of the array size. It is assumed that the arrays are already sorted.
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
Hashes for sortednp0.4.0cp39cp39manylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  0f8a4c4c140de4c0375c6525d993fcfececc59e805b0c17057dd0c27900cdf19 

MD5  6afce12dca5604cf99758697ce45dcc5 

BLAKE2b256  e35081f7855023b7d0dc84ab0c9ff298b2eced587e06a241509c5e7cc40523a8 
Hashes for sortednp0.4.0cp39cp39manylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  e469c372e3ec4a8fcd2811734b5e9939a7fcefdf698584b36078b9df45d7228a 

MD5  2a19cbda0915740d64f7b744025da7ec 

BLAKE2b256  406d1f4ecebbd97e9f3673ea9be31bebf136a5f5a62548d987828802042c0426 
Hashes for sortednp0.4.0cp39cp39manylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  e53d90de98ea8cc72b319797197ae11a798a5d4ed59c238fa4eb8b83cb387748 

MD5  55a7f73fecc688068420baa506cd639c 

BLAKE2b256  d8426addf231477f3b18ce134290ab5ca6496a41b8bf9eeff6af8993af801655 
Hashes for sortednp0.4.0cp39cp39manylinux1_i686.whl
Algorithm  Hash digest  

SHA256  e99c74f40319072089b19ce668d2737334ff5a3bcd06a8551eecfe6ee9b68d28 

MD5  160f7eb04b514a4d9b61c8afb0e7ad46 

BLAKE2b256  5806bf198be48c899a50f42641e69d84032b71dc98fce4db8703f7f71106e007 
Hashes for sortednp0.4.0cp38cp38manylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  28b499fafc5f14a74924c0e8b78038d46d58824308b27437a277434c79efa4de 

MD5  35d222d13d8dbae946583a41e8d722dc 

BLAKE2b256  a86c7de689d2632a093fb4d4dfb6fc9fab53b15a9640e39489342386dd26e13e 
Hashes for sortednp0.4.0cp38cp38manylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  6366d6aad08052122dafc371d7f9b9b4524998a83fc6e850a0d7470e5ae64339 

MD5  96fc16b0cd89084967dccda38348bfbc 

BLAKE2b256  69b04e4aa109b5dd0fcb2e1a8164a206500b6a966888085de508cd2d429444b8 
Hashes for sortednp0.4.0cp38cp38manylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  1d6d7be8161845962f18890cfb7ad6c3c15a2e7ae2b26d0bdafe1f17b5cc9b8a 

MD5  47cfb30ec9d9719c185e459453f51302 

BLAKE2b256  a3528aaae29a481edd242a27ad12cff44b134e3a26103043648e7b6dbc770396 
Hashes for sortednp0.4.0cp38cp38manylinux1_i686.whl
Algorithm  Hash digest  

SHA256  ba2e5c76fd6fbb75176edf16bf88aa483fe58bfa7a33ff0bfca3c52074504bbc 

MD5  5be515b8e51680553207606f80bc93cd 

BLAKE2b256  27b914c02f217c492dccfe49e691cff521d774843ffd27a3f3a6dc8a091740e6 
Hashes for sortednp0.4.0cp37cp37mmanylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  64c3df93e0323b8b26f07c4b1514da74f59a9cbd93b1725eb66217a21043bbaa 

MD5  b94ddde818c30ae191bc1144bf022a61 

BLAKE2b256  5107c5fb2f6990d20a4e88f875912bf7663581cb5d81a941514ad9877550a9a4 
Hashes for sortednp0.4.0cp37cp37mmanylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  8d396c4b9b34b672ee15988891520c95463fd6a062784f1fff06739c2dad3bdd 

MD5  4069c68dbe52940b2b21cbed7acdfdc2 

BLAKE2b256  e9db87e1a5330b7e02527b53d3964ee09e0280681dce75d992f6e6d7979c8393 
Hashes for sortednp0.4.0cp37cp37mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  418a4db80f494c7d1b5ed45da266306a930b789fbb5b81e2f451e68bdb4ea060 

MD5  aeb2aa73406209e8bdf141d59a64d82a 

BLAKE2b256  cbd9fe96fb6af6871afdfebff4db6ed26a8fddfbcf82ada830f0ab657c50d23a 
Hashes for sortednp0.4.0cp37cp37mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  595ccf3b96c8788ff230db7e730a8530eaa41e3db5f7ecc5a8e125724dafb6df 

MD5  6fd47be9fac10017541f675ab4d1e454 

BLAKE2b256  ad3ec898bd79877b281c70cf96e0421ac9072f7e456282f9b561fa1bbfbb89eb 
Hashes for sortednp0.4.0cp36cp36mmanylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  7e9591b3b237b10ad8c7dc85b49c6c9dfa1a316ebd79a6994f6b1e09f6517566 

MD5  0a0b83cc137193e1a7aed022edabacf2 

BLAKE2b256  bf4e8e284588e070978472d300e3325a94d0a9f724f622bf523c65ca06cb6fe7 
Hashes for sortednp0.4.0cp36cp36mmanylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  fe3cf250ff32bb4eb7b79351c6e9309655c41f885e3af0ceb0edf5811d265908 

MD5  13f0b9e7a89c267a2c3d4280c9f485cc 

BLAKE2b256  a8e577ea7d58690d8cd92b93b9d10a6f37fa1081e4a1772d60cb6a0ff74a72fb 
Hashes for sortednp0.4.0cp36cp36mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  8cccf8f54de06cc21036ab530da0e4b1e3c749504ac33bb8c1f984820f708b97 

MD5  2981c57525814cd0f0d43ca434423bdc 

BLAKE2b256  0fb0c62d3581ba55fff14eaf10e4ced20f44b25940d440f1c13f4c26bd88326c 
Hashes for sortednp0.4.0cp36cp36mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  b916e97dec8fc2f1032c24505e5f6dd3ab3798d2a21c0c40a65f47c6157b0374 

MD5  eb6b01ac02bbaa841fb16129cb3ec2d2 

BLAKE2b256  e3f39c4ef2cd5cb59b3df9872e5e70cf153be448f91a141352e13d4a5b069890 
Hashes for sortednp0.4.0cp35cp35mmanylinux2010_x86_64.whl
Algorithm  Hash digest  

SHA256  43b50e77a0f5ddbce3e0864507ec6b8fb94b6eddc481a2846609499d6b022ba7 

MD5  a16fa515c21708c167877bc3620921a3 

BLAKE2b256  2424b603b373eb3718a8bf209f3aac44aed1432bb8e2ccbcf4251a4c5f54ec17 
Hashes for sortednp0.4.0cp35cp35mmanylinux2010_i686.whl
Algorithm  Hash digest  

SHA256  f9fd982771f9a053d4e89160219b9a9622eb90690b89233000702bab32a605a3 

MD5  d4e7c52580ead5f2615fe3ac0ab5a19b 

BLAKE2b256  39a40a93fe7b436620e8b04cd3700ed5e67b3c453bf7993015b351437f8a6f89 
Hashes for sortednp0.4.0cp35cp35mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  dada9f29b7dfd589e9a3f12cdc8484898a49e9c6e4d182d46798a9e8eabfddea 

MD5  9e9d6835e83a9acd382ec1848f353d0f 

BLAKE2b256  af30936172f91d5fec4dfb6433763c54ed6d8ae3b06a3d18a25eb428ed3ab6ad 
Hashes for sortednp0.4.0cp35cp35mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  af83cb46c4fa35288d16f1550a6ec40bd865a909640e92b4d3d16c0a7d7d8529 

MD5  a626af2df50c00f3717b690281936c7e 

BLAKE2b256  51cbd198456cae06cfcd5e17c7b60929019d6d09516e960de5bf9e0fa0a06238 