Merge and intersect sorted numpy arrays.
Project description
Numpy and Numpy arrays are a really great tool. However, intersecting and merging multiple sorted numpy arrays is rather less performant. The current numpy implementation concatenates the two arrays and sorts the combination. If you want to merge or intersect multiple numpy arrays, there is a much faster way, by using the property, that the resulting array is sorted.
Sortednp (sorted numpy) operates on sorted numpy arrays to calculate the intersection or the union of two numpy arrays in an efficient way. The resulting array is again a sorted numpy array, which can be merged or intersected with the next array. The intended use case is that sorted numpy arrays are sorted as the basic data structure and merged or intersected at request. Typical applications include information retrieval and search engines in particular.
It is also possible to implement a kway merging or intersecting algorithm,
which operates on an arbitrary number of arrays at the same time. This package
is intended to deal with arrays with $10^6
$ or $10^{10}
$ items. Usually, these
arrays are too large to keep more than two of them in memory at the same
time. This package implements methods to merge and intersect multiple arrays,
which can be loaded ondemand.
Installation
There are two different methods to install sortednp
.
Using pip
(recommended)
You can install the package directly from PyPI using pip
(here pip3
). There are
precompiled wheels for linux
32 and 64bit.
$ pip3 install sortednp
Using setuptools
Alternatively, you can clone the git repository and run the setup script.
$ git clone https://gitlab.sauerburger.com/frank/sortednp.git
$ cd sortednp
$ python3 setup.py install
Numpy Dependency
The installation fails in some cases, because of a buildtime dependency on
numpy. Usually, the problem can be solved by manually installing a recent numpy
version via pip3 install U numpy
.
Usage
The package provides two different kinds of methods. The first class is intended to operate on two arrays. The second class operates on two or more arrays and calls the first class of methods internally.
Twoway methods
Two numpy sorted arrays can be merged with the merge
method, which takes two
numpy arrays and returns the sorted union of the two arrays.
# merge.py import numpy as np import sortednp as snp a = np.array([0, 3, 4, 6, 7]) b = np.array([1, 2, 3, 5, 7, 9]) m = snp.merge(a, b) print(m)
If you run this, you should see the union of both arrays as a sorted numpy array.
$ python3 merge.py [0 1 2 3 3 4 5 6 7 7 9]
Two sorted numpy arrays can be intersected with the intersect
method, which takes two
numpy arrays and returns the sorted intersection of the two arrays.
# intersect.py import numpy as np import sortednp as snp a = np.array([0, 3, 4, 6, 7]) b = np.array([1, 2, 3, 5, 7, 9]) i = snp.intersect(a, b) print(i)
If you run this, you should see the intersection of both arrays as a sorted numpy array.
$ python3 intersect.py [3 7]
Returning array indices
The intersect
method takes an optional argument indices
which is False
by default. If this is set to True
, the return value consists of the
intersection array and a tuple with the indices of the common values for both
arrays.
# intersect_indices.py import numpy as np import sortednp as snp a = np.array([2,4,6,8,10]) b = np.array([1,2,3,4]) intersection, indices = snp.intersect(a,b, indices=True) print(intersection) print(indices)
The above example gives:
$ python3 intersect_indices.py [2 4] (array([0, 1]), array([1, 3]))
The first line shows the intersection of the two arrays. The second line
prints a tuple with the indices where the common values appeared in the input
arrays. For example, the value 4
is at position 1
in array a
and at position
3
in array b
.
kway methods
Similarly, the kway intersect and merge methods take two or more arrays and perform the merge or intersect operation on its arguments.
# kway_intersect.py import numpy as np import sortednp as snp a = np.array([0, 3, 4, 6, 7]) b = np.array([0, 3, 5, 7, 9]) c = np.array([1, 2, 3, 5, 7, 9]) d = np.array([2, 3, 6, 7, 8]) i = snp.kway_intersect(a, b, c, d) print(i)
If you run this, you should see the intersection of all four arrays as a sorted numpy array.
$ python3 kway_intersect.py [3 7]
The kway merger sortednp.kway_merge
works analogously. However, the native
numpy
implementation is faster compared to the merge provided by this package.
The kway merger has been added for completeness. The package heapq
provides
efficient methods to merge multiple arrays simultaneously.
The methods kway_merge
and kway_intersect
accept the optional keyword
argument assume_sorted
. By default, it is set to True
. If it is set to False
,
the method calls sort()
on the input arrays before performing the operation.
The default should be kept if the arrays are already sorted to save the time it
takes to sort the arrays.
Since the arrays might be too large to keep all of them in memory at the same
time, it is possible to pass a callable
instead of an array to the methods.
The callable
is expected to return the actual array. It is called immediately
before the array is required. This reduces the memory consumption.
Algorithms
Intersections are calculated by iterating both arrays. For a given element in
one array, the method needs to search the other and check if the element is
contained. In order to make this more efficient, we can use the fact that the
arrays are sorted. There are three search methods, which can be selected via the
optional keyword argument algorithm
.
sortednp.SIMPLE_SEARCH
: Search for an element by linearly iterating over the array elementbyelement. More Information.sortednp.BINARY_SEARCH
: Slice the remainder of the array in halves and repeat the procedure on the slice which contains the searched element. More Information.sortednp.GALLOPING_SEARCH
: First, search for an element linearly, doubling the step size after each step. If a step goes beyond the search element, perform a binary search between the last two positions. More Information.
The default is sortednp.GALLOPING_SEARCH
. The performance of all three
algorithms is compared in the next section.
Performance
The performance of the package can be compared with the default implementation of numpy, the intersect1d` method. The ratio of the execution time between sortednp and numpy is shown for various different benchmark tests.
The merge or intersect time can be estimated under two different assumptions. If the arrays, which are merged or intersected, are already sorted, one should not consider the time it takes to sort the random arrays in the benchmark. On the other hand, if one considers a scenario in which the arrays are not sorted, one should take the sorting time into account.
Intersect
The performance of the intersection operation depends on the sparseness of the two arrays. For example, if the first element of one of the arrays is larger than all elements in the other array, only the other array has to be searched (linearly, binarily, or exponentially). Similarly, if the common elements are far apart in the arrays (sparseness), large chunks of the arrays can be skipped. The arrays in the benchmark contain random (unique) integers. The sparseness is defined as the average difference between two consecutive elements in one array.
The first set of tests studies the performance dependence on the size of the arrays. The second set of tests studies the dependence on the sparseness of the array.
Assume sorted arrays
The following table summarizes the performance compared to numpy if one ignores the time it takes to sort the initial arrays.
Test  Simple Search  Binary Search  Galloping Search 

Intersect  
Intersect Sparseness 
Include sorting time
The following table summarizes the performance compared to numpy if one takes the time it takes to sort the initial arrays into account.
Test  Simple Search  Binary Search  Galloping Search 

Intersect  
Intersect Sparseness 
Merge
Assume sorted  Include sorting time 

Project details
Release history Release notifications
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size sortednp0.2.1cp34cp34mmanylinux1_i686.whl (75.8 kB)  File type Wheel  Python version cp34  Upload date  Hashes View hashes 
Filename, size sortednp0.2.1cp34cp34mmanylinux1_x86_64.whl (72.9 kB)  File type Wheel  Python version cp34  Upload date  Hashes View hashes 
Filename, size sortednp0.2.1cp35cp35mmanylinux1_i686.whl (76.0 kB)  File type Wheel  Python version cp35  Upload date  Hashes View hashes 
Filename, size sortednp0.2.1cp35cp35mmanylinux1_x86_64.whl (73.1 kB)  File type Wheel  Python version cp35  Upload date  Hashes View hashes 
Filename, size sortednp0.2.1cp36cp36mmanylinux1_i686.whl (76.1 kB)  File type Wheel  Python version cp36  Upload date  Hashes View hashes 
Filename, size sortednp0.2.1cp36cp36mmanylinux1_x86_64.whl (73.2 kB)  File type Wheel  Python version cp36  Upload date  Hashes View hashes 
Filename, size sortednp0.2.1cp37cp37mmanylinux1_i686.whl (76.3 kB)  File type Wheel  Python version cp37  Upload date  Hashes View hashes 
Filename, size sortednp0.2.1cp37cp37mmanylinux1_x86_64.whl (73.4 kB)  File type Wheel  Python version cp37  Upload date  Hashes View hashes 
Filename, size sortednp0.2.1.tar.gz (21.9 kB)  File type Source  Python version None  Upload date  Hashes View hashes 
Hashes for sortednp0.2.1cp34cp34mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  4256d50ec5e16f1f881c0c53bfd0ad7e4a2db006d8abdb9d4b147fa0455e04c0 

MD5  0d3e3834a1f38d6f85f34e8d15fe9d4c 

BLAKE2256  a72b3947f42f95cb5e542e4dc29223576e425977d2ed810037003ba7f13a7ca9 
Hashes for sortednp0.2.1cp34cp34mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  e0e028dba34467f60c090dc36c1511551c7ce81f272f5ba776130f800839df30 

MD5  27f6a73bfb1fb3f1f3db66340fde50e7 

BLAKE2256  e099986e5eda216410450a51fdb8c1544f19dc791d0fbb3c8fc339a22675932f 
Hashes for sortednp0.2.1cp35cp35mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  aede1b5dcc494414a480a83d3198e420437005d2814e76fea7194c23f79abafa 

MD5  459d7116aa8ded752db521bffa364ebd 

BLAKE2256  bf8f6d5e4e2586281b6806255d27b5b8ae73e8d94cf4e8fe194fb5f4d807c55f 
Hashes for sortednp0.2.1cp35cp35mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  e379635630b73201b2b72beab12ebef6d21f5e8eedd2e2e054adc0a9fff5798b 

MD5  58d9b2468cbfd2f0af18b307456af34b 

BLAKE2256  9423fd3c2bf5119d67dc182346eb00f5927a6cda19514e57ef8b2ef4841668b0 
Hashes for sortednp0.2.1cp36cp36mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  8cea8bdfb854bdffcc525aad98bb123821d585a7599d4ab25e873f9f531eb2e9 

MD5  fcab51eb64b333d4fc4a3abdb83b37e2 

BLAKE2256  c13e632d3bf5e6aa6f82ca042a35c9d4522a2c7885111f2cf10b5ebbb498821e 
Hashes for sortednp0.2.1cp36cp36mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  03743e5ea112e15b1924e59a12d1b91b2c6e61aa4aa844c8fb0ce1662d92ed08 

MD5  7b1652baeadeec82ba35a5a2ce9be6c0 

BLAKE2256  728179660d0c690e60020c97678d35c501d7d59534ac9a9a5a3610bfcef7b7d9 
Hashes for sortednp0.2.1cp37cp37mmanylinux1_i686.whl
Algorithm  Hash digest  

SHA256  1f2d2ab92edff8821485372a6d69dde1096c5216e471d858a3f85d023a650be8 

MD5  c188c7a43390d762b4671350be462f10 

BLAKE2256  3e5f22820e50529d6962875acdc85a439baf000cd361acd853f12acb6a75fd2f 
Hashes for sortednp0.2.1cp37cp37mmanylinux1_x86_64.whl
Algorithm  Hash digest  

SHA256  97838da3589a16e34841ad2b1a1a8b5e7da74c4a7999b8a7ee48857b11ef82af 

MD5  c3d2e7b48fa3eb0fbf4b61407aca4d86 

BLAKE2256  e95c8e28dc1e5b1fc69f8d787a4721c1eea3523bb28bc9349d13049699d3f568 