Skip to main content

a k-way merge algorithm to replace heapq.merge

Project description

Multimerge

Multimerge is a Python package that implements an algorithm for lazily combining several sorted iterables into one longer sorted iterator. It is a drop-in replacement for heapq.merge in the Python standard library.

The API (from heapq.merge())

def merge(*iterables, key=None, reverse=False):
    '''Merge multiple sorted inputs into a single sorted output.

    Similar to sorted(itertools.chain(*iterables)) but returns a generator,
    does not pull the data into memory all at once, and assumes that each of
    the input streams is already sorted (smallest to largest).

    >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
    [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]
    
    If *key* is not None, applies a key function to each element to determine
    its sort order.
    
    >>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len))
    ['dog', 'cat', 'fish', 'horse', 'kangaroo']
    
    '''
    ...

Comparing the Algorithms

heapq.merge()

The standard-library implementation of merge, as its location suggests, involves maintaining a heap (priority queue) data structure. In this algorithm, each node of the heap stores both a unique item from an iterator as well as the index of the iterator from which the item came. This is how sort stability is maintained. At each step of the merge, the root of the heap is yielded, then a new item from that root's source iterator is found, which then replaces the root with a call to heapreplace.

multimerge.merge()

Multimerge uses a different data structure: a linked binary tree known as a "tournament tree of winners". It works as follows:

  • Each leaf node remembers a particular iterator, as well as the most recent item yielded from that iterator.

  • Each non-leaf node stores the highest-priority value among all of its descendent leaves' values, as well as which leaf that highest-priority value came from.

  • At a typical step of the process, the root of the tree stored the highest-priority value that was just yielded. To replace it, look at the root's stored reference to the leaf of the item most recently yielded. We will replace this leaf's value with a new value from its iterator. Finally, to restore the invariant about descendent values, we need to re-evaluate several "games of the tournament"; at each ancestor of the leaf, re-evaluate which of its two children should have the higher priority, then acquire that child's value and leaf references. When this is complete, the value at the root is the highest-priority, and can be yielded next.

  • When sorting values not by direct comparison but by some key function, it is cheaper to only bring the keys up the tree, keeping the values stored only at the leaves.

  • When a leaf's iterator is exhausted, it can be deleted so that its sibling can be promoted, shrinking the data structure as the problem reduces and maintaining the invariant that each non-leaf node has exactly two children.

Benefits of multimerge.merge()

  • There are fewer comparisons required on average, especially in the case where the input data has long runs where one particular iterator should "win".

  • In the heap model, during a heapreplace call, the (it_index, key, value) tuples are compared lexicographically. This involves identifying the first place where two tuples differ, which will require evaluating key1 == key2, followed then by evaluating key1 < key2. This is not necessary in multimerge.merge()'s tournament tree approach, since the order of the input iterators part of the tree structure, so sort stability comes naturally.

Both of the above are demonstrated below:

class Int(int):
    lt = eq = 0
    def __lt__(self, other):
        __class__.lt += 1
        return int.__lt__(self, other)

    def __eq__(self, other):
        __class__.eq += 1
        return int.__eq__(self, other)

def comparisons(mergefunc, iterables):
    Int.lt = Int.eq = 0
    for _ in mergefunc(*iterables):
        pass
    return Int.lt, Int.eq

no_overlap = [
    # (0..999), (1_000..1_999), (2_000..2_999), ...
    list(map(Int, range(x, x+1_000)))
    for x in range(0, 16_000, 1_000)
]

interleaved = [
    # (0,16,32,...), (1,17,33,...), (2,18,34,...), ...
    list(map(Int, range(x, 16_000, 16)))
    for x in range(16)
]

def test_merge_func(mergefunc):
    print("No overlap: {:,} lt; {:,} eq".format(
        *comparisons(mergefunc, no_overlap)))
    print("Interleaved: {:,} lt; {:,} eq".format(
        *comparisons(mergefunc, interleaved)))

if __name__ == "__main__":
    import heapq
    print("======= heapq.merge ======")
    test_merge_func(heapq.merge)
    print()

    import multimerge
    print("==== multimerge.merge ====")
    test_merge_func(multimerge.merge)
    print()

Result:

======= heapq.merge ======
No overlap: 65,004 lt; 65,004 eq
Interleaved: 64,004 lt; 64,004 eq

==== multimerge.merge ====
No overlap: 32,000 lt; 0 eq
Interleaved: 63,968 lt; 0 eq

  • These theoretical improvements, coupled with a fast C implementation, make multimerge up to 5 times faster than heapq for basic benchmarks:
py -m pyperf timeit -s "from random import random; from collections import deque; from heapq import merge;  iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Mean +- std dev: 80.8 ms +- 5.6 ms

py -m pyperf timeit -s "from random import random; from collections import deque; from multimerge import merge;  iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Mean +- std dev: 17.1 ms +- 1.4 ms

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

multimerge-0.2.0.tar.gz (10.0 kB view details)

Uploaded Source

Built Distributions

multimerge-0.2.0-cp311-cp311-win_amd64.whl (11.9 kB view details)

Uploaded CPython 3.11 Windows x86-64

multimerge-0.2.0-cp311-cp311-win32.whl (11.2 kB view details)

Uploaded CPython 3.11 Windows x86

multimerge-0.2.0-cp311-cp311-musllinux_1_1_x86_64.whl (32.0 kB view details)

Uploaded CPython 3.11 musllinux: musl 1.1+ x86-64

multimerge-0.2.0-cp311-cp311-musllinux_1_1_i686.whl (31.7 kB view details)

Uploaded CPython 3.11 musllinux: musl 1.1+ i686

multimerge-0.2.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 kB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

multimerge-0.2.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (26.1 kB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ i686 manylinux: glibc 2.5+ i686

multimerge-0.2.0-cp311-cp311-macosx_10_9_x86_64.whl (9.1 kB view details)

Uploaded CPython 3.11 macOS 10.9+ x86-64

multimerge-0.2.0-cp310-cp310-win_amd64.whl (12.0 kB view details)

Uploaded CPython 3.10 Windows x86-64

multimerge-0.2.0-cp310-cp310-win32.whl (11.3 kB view details)

Uploaded CPython 3.10 Windows x86

multimerge-0.2.0-cp310-cp310-musllinux_1_1_x86_64.whl (32.4 kB view details)

Uploaded CPython 3.10 musllinux: musl 1.1+ x86-64

multimerge-0.2.0-cp310-cp310-musllinux_1_1_i686.whl (31.8 kB view details)

Uploaded CPython 3.10 musllinux: musl 1.1+ i686

multimerge-0.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28.8 kB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

multimerge-0.2.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (28.0 kB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ i686 manylinux: glibc 2.5+ i686

multimerge-0.2.0-cp310-cp310-macosx_10_9_x86_64.whl (9.3 kB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

multimerge-0.2.0-cp39-cp39-win_amd64.whl (12.0 kB view details)

Uploaded CPython 3.9 Windows x86-64

multimerge-0.2.0-cp39-cp39-win32.whl (11.3 kB view details)

Uploaded CPython 3.9 Windows x86

multimerge-0.2.0-cp39-cp39-musllinux_1_1_x86_64.whl (32.2 kB view details)

Uploaded CPython 3.9 musllinux: musl 1.1+ x86-64

multimerge-0.2.0-cp39-cp39-musllinux_1_1_i686.whl (31.6 kB view details)

Uploaded CPython 3.9 musllinux: musl 1.1+ i686

multimerge-0.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28.6 kB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64 manylinux: glibc 2.5+ x86-64

multimerge-0.2.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl (27.7 kB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ i686 manylinux: glibc 2.5+ i686

multimerge-0.2.0-cp39-cp39-macosx_10_9_x86_64.whl (9.3 kB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

File details

Details for the file multimerge-0.2.0.tar.gz.

File metadata

  • Download URL: multimerge-0.2.0.tar.gz
  • Upload date:
  • Size: 10.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for multimerge-0.2.0.tar.gz
Algorithm Hash digest
SHA256 a2af38eddb0a7a1190e0566ade7b75e11922b15bb4280ccbdf04cd7291349b14
MD5 8d48ff244bd714a5fde4c48d7829c26a
BLAKE2b-256 34923058842d382ac2afb98d54fd76c2ced737765e0ff874e818c0ddf03ef7b7

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 1d8e39c80baa669c0e331db1777636d6548cf7a87d15498f8314ea8ea3c64930
MD5 a7a6f0bcf06412525f47627c755a593f
BLAKE2b-256 4e2e7e151fa28cf1f56ca9fcf01ef08584cd5560b2d73ea0db720436d065a1ee

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp311-cp311-win32.whl.

File metadata

  • Download URL: multimerge-0.2.0-cp311-cp311-win32.whl
  • Upload date:
  • Size: 11.2 kB
  • Tags: CPython 3.11, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for multimerge-0.2.0-cp311-cp311-win32.whl
Algorithm Hash digest
SHA256 2cd59b40f4456746a011108af20dbaa5102aa14a9311f52dd3a1bd54aa515008
MD5 93a20da4c4f6fe504070bd994250d4fd
BLAKE2b-256 84ed067f9e91cdc69b6a01603e35613a80e35f6497912422092ef3df42beb49d

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp311-cp311-musllinux_1_1_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp311-cp311-musllinux_1_1_x86_64.whl
Algorithm Hash digest
SHA256 469300371cf04ce29f866627a70da5cd9beee8640992dd14090bad000c5c583f
MD5 be1b0503cb02ef61096002a8aefd2fee
BLAKE2b-256 bc735f62cc1165eb1b37d15b043f8fb4c6466482e5a81ccaeef6bf5dcde94b96

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp311-cp311-musllinux_1_1_i686.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp311-cp311-musllinux_1_1_i686.whl
Algorithm Hash digest
SHA256 4b930b8cd9423ed000d22d47769c0bc45bcd6bafac08b922f2e25086266f28a7
MD5 4b5560636879a3b8205ca9835e6aeba7
BLAKE2b-256 e6081d119e242bdb1d4d0146ef4ab15f3657ea4ffc62ee25a6bc214b8920409a

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 aff5d0e1729640c849eac3b3956eebf454f3c0ace181887dfed6a6bc80c1a9a5
MD5 d5d6682504448c14957b5dc39495095e
BLAKE2b-256 22cb7ac12912df9e44122d66e74c12b1f899b229747925ccdb368908e708658b

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 362fa4f8ad376350a2db4486102e14f56158032e6976f1edc6ac5c041fc1223c
MD5 f66a40624d3529874a817252159f4bfe
BLAKE2b-256 d7c100d3fb4ebaebb2480c86974b2d4a1b2ac0d605b43e605d0157ef314e505c

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 6cc1d5236df21ef4d9dc4dd31a96589c6551926600ca9fc8f0d0e40cf32058df
MD5 5072af782a0d55f3f8237f03f7b02602
BLAKE2b-256 d34c31e7b6bdcab9d0dd03b3807e401b9262bc3ad8a7fe795c923457de9bcc9a

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 7d8a35b22da7e28c1e79c5177929464c6d5ceac272c6589cadfc485c4224fffb
MD5 ca3af31055a25ba1c50daa5de958ec0b
BLAKE2b-256 2b6b61fe9b81d5b234253d17d981eabdac6ec949b118584b162109f06ff79758

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp310-cp310-win32.whl.

File metadata

  • Download URL: multimerge-0.2.0-cp310-cp310-win32.whl
  • Upload date:
  • Size: 11.3 kB
  • Tags: CPython 3.10, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for multimerge-0.2.0-cp310-cp310-win32.whl
Algorithm Hash digest
SHA256 af7ee57218b89243267a959852df7ff43598456f995a057133c4e6fece059289
MD5 a55168e4b18f85d5d26b0f7c374b455c
BLAKE2b-256 6a84f03151a8e5c53b2282aa63d44754b1bd1bd989e991e57b639f5ec6041bad

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp310-cp310-musllinux_1_1_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp310-cp310-musllinux_1_1_x86_64.whl
Algorithm Hash digest
SHA256 2900664e1df2196894d17e5e5a00f24d8382a400e68e4a72ec81833d61b32f4f
MD5 0e3238c8f434efb031cf56b3e2c6f579
BLAKE2b-256 2792386f5b09e6b2884a663f01a63d73e624b6b492ac22fcf11f1506cb6a5ea1

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp310-cp310-musllinux_1_1_i686.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp310-cp310-musllinux_1_1_i686.whl
Algorithm Hash digest
SHA256 68ea2ba516afe503a5e0e15f17b1aca76ae68288fdb1852e2405f60e64b9bf69
MD5 48ac62e5263a5dfb56b42fde2242f3f8
BLAKE2b-256 4325d21a60021f69d3f199cf20a650f172e359f58f2d43e9e283ab5e57f81010

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0162a1983e65a5d10f547c970bb5ebdc88bd2d2e0f26ba13736784e8a8101126
MD5 8ea4677ae3b4d94b54da5e8b3f75f2ea
BLAKE2b-256 e5e72d0430cf5f20900a59351387f18e4b3981849e3072611aef5d78cdb14f85

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 552334c55fb3c56c51f4762547822b3d938932be69780fa28816ed73a3de627c
MD5 d9dd00ba1222d9f50e8bbff703bc2389
BLAKE2b-256 50455c790d34f15fda64aedff4cc6c8a61674f0bc76e97ec9c88b01cb314cc87

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 c0691adc2709d6d7c0a3dd8857684bb1b3c4905c55a42496ecc95e80bf49fa80
MD5 a0d4fb64bc26e1f83145b76ea853b6ca
BLAKE2b-256 cab6256346fe275e982cbb9201d3a2cecb333cfc55adba11595ecde03e5a5cd5

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp39-cp39-win_amd64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 6adf03ab9e27dc5e123b069d92b407a21ea50541a79ec3726d4e9f399c82f2ed
MD5 e8ed1aca1b81851552579b4008b4084e
BLAKE2b-256 edb3ef5e4e30b3119594c9b418bcee0340a97ad5a73c2fc3d61f109672f0b8d6

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp39-cp39-win32.whl.

File metadata

  • Download URL: multimerge-0.2.0-cp39-cp39-win32.whl
  • Upload date:
  • Size: 11.3 kB
  • Tags: CPython 3.9, Windows x86
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.17

File hashes

Hashes for multimerge-0.2.0-cp39-cp39-win32.whl
Algorithm Hash digest
SHA256 a2bd06d4c2906ac1c2cc55964a8e8464097194f1d960607a27ddaf74a3c476c2
MD5 78cc2a17707c2703cbb35225660716ad
BLAKE2b-256 86382f2e22dab9e570e62f2dec516854f44d8166874c8bda74afc26d73181eba

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp39-cp39-musllinux_1_1_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp39-cp39-musllinux_1_1_x86_64.whl
Algorithm Hash digest
SHA256 e7d999d467c402ddc2f968bf5bdc7eea9669f739bb106bbbe6a7c5852ee89ae8
MD5 62abe782c277eaaba377288a77f54fd6
BLAKE2b-256 6c45f0583d8a342bfbe2292ea82482c8f605bc06ba283b65c51e708919f591bb

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp39-cp39-musllinux_1_1_i686.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp39-cp39-musllinux_1_1_i686.whl
Algorithm Hash digest
SHA256 424f9ecb7a18196136ce4e2ae066d7ed8306310377d72cc119b1256efe3c7142
MD5 efa73beb2637808ab53f4d03961fee88
BLAKE2b-256 848ce8b044ef51283541a924b4088996c8e35626772b69d0e5c81573c06521e4

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 a9c368d6ac66ccb889fbb37b6a3601655f59d3d8a9a437c2d1d33e6ef6c7814b
MD5 b12070521d2e4d6337a46e5cee03bebf
BLAKE2b-256 a17b76748a30534174cbf0e61a929c55ee65bc2cd0e89a5dfa9727bdb30790d4

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl
Algorithm Hash digest
SHA256 6117a686571fea2a10592296fef3f9fc3ede4914209738e38ef02e3bb1a1650c
MD5 2edd30fb0bd94c058cfb90e4a1af3720
BLAKE2b-256 83daa7eee6829a7bb12b4b9d37d3837ab8441d3561ca0df3846a96165e335ba1

See more details on using hashes here.

File details

Details for the file multimerge-0.2.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for multimerge-0.2.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 6a5d972135566369f2c26762b8bec68728482e383d251494343e574d73dd0747
MD5 e61bf8633ee94a065daebe8f695048b2
BLAKE2b-256 fc2df937de540720e1027d59497fb7df84fb4b91b9c7d95d1ed8c61e50b3a39e

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page