Skip to main content

Lazy iterables for full outer join, inner join, and left and right join

Project description

full_outer_join

Lazy iterator implementations of a full outer join, inner join, and left and right joins of Python iterables.

This implements the sort-merge join, better known as the merge join, to join iterables in O(n) time with respect to the length of the longest iterable.

Note that the algorithm requires input to be sorted by the join key.

Example

(whitespace to make things explicit)

>>> list(full_outer_join.full_outer_join(
    [{"id": 1, "val": "foo"}                         ],
    [{"id": 1, "val": "bar"}, {"id": 2, "val": "baz"}],
    key=lambda x: x["id"]
))

[
    (1, ([{'id': 1, 'val': 'foo'}], [{'id': 1, 'val': 'bar'}])),
    (2, ([                       ], [{'id': 2, 'val': 'baz'}]))
]

To consume the output, your business logic might look like:

for group_key, key_batches in full_outer_join.full_outer_join(left, right):
    left_rows, right_rows = key_batches
    
    if left_rows and right_rows:
        # This is the inner join case.
        pass
    elif left_rows and not right_rows:
        # This is the left join case (no matching right rows)
        pass
    elif not left_rows and right_rows:
        # This is the right join case (no matching left rows)
        pass
    elif not left_rows and not right_rows:
        raise Exception("Unreachable")

Functions

name description
full_outer_join(*iterables, key=lambda x: x) Do a full outer join on any number of iterables, returning (key, (list[row], ...)) for each key across all iterables.
inner_join(*iterables, key=lambda x: x) Do an inner join across all iterables, returning (key, (list[row], ...)) for keys only in all iterables
left_join(left_iterable, right_iterable, key=lambda x: x) Do a left join on both iterables, returning keys for each unique key in left_iterable
right_join(left_iterable, right_iterable, key=lambda x: x) Do a right join on both iterables, returning keys for each unique key in right_iterable
cross_join(join_output, null=None) Do the cross (Cartesian) join on the output of full_outer_join or inner_join, yielding (key, (iter1_row, ...)) for each row. This is implemented for completeness and is probably not useful. Iterables lacking any rows for key are replaced with null in the output.

Why?

  1. Your input is already sorted and you don't want to consume your input iterators.
  2. Your business logic that consumes the joined output benefits from explicitly handling the match and no-match cases from each input iterable.
  3. You're insane. Your brain is irreparably broken by the relational model.

More examples

See test_insanity.py for a silly example of a SQL query hand-compiled into iterators.

Thanks

This was originally a PR to the more_itertools project who gave some excellent feedback on the design but ultimately did not want to merge it in.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

full_outer_join-1.0.0.tar.gz (7.9 kB view details)

Uploaded Source

Built Distribution

full_outer_join-1.0.0-py3-none-any.whl (5.3 kB view details)

Uploaded Python 3

File details

Details for the file full_outer_join-1.0.0.tar.gz.

File metadata

  • Download URL: full_outer_join-1.0.0.tar.gz
  • Upload date:
  • Size: 7.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.5

File hashes

Hashes for full_outer_join-1.0.0.tar.gz
Algorithm Hash digest
SHA256 f8a22b73e6d30070f48522d2a01d56f64c6c0411d4468cc88854ce969c43e8e1
MD5 48417c518a19b649f12a3d55545cf21d
BLAKE2b-256 fe02ff8350797577527ed7868070f8a689d7a3ad7a259f7752820f72d85edba9

See more details on using hashes here.

File details

Details for the file full_outer_join-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for full_outer_join-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f5c02dc0b11f4380f6c5645c2737190c49e4e6e5b9fafec2ea01c72d367b92d8
MD5 c0463c5591fb84dd900484c35c2e766c
BLAKE2b-256 a70bf71655666bd21eec2256a3a9c265e5491d43f6697caf313b9795301256ec

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page