Skip to main content

Groups list items using a compare function

Project description

grplist

Group list or dict items using an arbitrary comparison function. Items that satisfy the predicate are placed in the same group, with transitivity applied automatically — if A matches B and B matches C, all three end up together even if A and C would not match directly.

import grplist as gl

groups = gl.groupList([1, 3, 6, 10, 12, 14, 21, 35], lambda a, b: abs(a-b) <= 3, vals=True)
# => [[1, 3, 6], [10, 12, 14], [21], [35]]

Table of contents


Install

pip install grplist

Quick start

import grplist as gl

# Group numbers that are within 3 of each other
gl.groupList([1, 3, 6, 10, 12, 14, 21, 35], lambda a, b: abs(a-b) <= 3, vals=True)
# => [[1, 3, 6], [10, 12, 14], [21], [35]]

# Return indices instead of values (vals=False is the default)
gl.groupList([1, 3, 6, 10, 12, 14, 21, 35], lambda a, b: abs(a-b) <= 3)
# => [[0, 1, 2], [3, 4, 5], [6], [7]]

# Group a dict — returns groups of keys by default
d = {'a': 1, 'b': 3, 'c': 10, 'd': 12}
gl.groupDict(d, lambda a, b: abs(a-b) <= 3)
# => [['a', 'b'], ['c', 'd']]

API reference

All functions share the same signature pattern:

groupList(arr, fnc, vals=False)  -> list[list]
groupList2(arr, fnc, vals=False) -> list[list]
groupList3(arr, fnc, vals=False) -> list[list]
groupList4(arr, fnc, vals=False) -> list[list]
groupDict(obj, fnc, vals=False)  -> list[list]
groupDict2(obj, fnc, vals=False) -> list[list]
groupDict3(obj, fnc, vals=False) -> list[list]
groupDict4(obj, fnc, vals=False) -> list[list]

Parameters

Parameter Type Description
arr / obj list / dict The collection to group
fnc callable(a, b) -> bool Returns True if two items belong in the same group
vals bool True → return item values; False (default) → return indices (list) or keys (dict)

Return value

A list of groups. Each group is a list of either values (when vals=True) or indices/keys (when vals=False).

Choosing between the variants

All variants produce equivalent results for symmetric, transitive predicates. The differences are performance characteristics:

Variant Predicate calls Merge cost When to use
groupList All n·(n−1)/2 pairs O(n) per merge Baseline; simplest code
groupList2 ~5–30% of pairs (chain-walk) O(chain length) per merge Good general choice
groupList3 All pairs (lazy find) O(α(n)) — Union-Find Very large n where merge cost dominates
groupList4 ~5–30% of pairs (same as groupList2) O(1) — tail pointer Best general-purpose default

groupList4 is the recommended default. It combines groupList2's chain-walk comparison (far fewer predicate calls than a full scan) with an O(1) merge step using a tail pointer per group — eliminating the only remaining inefficiency in groupList2. It is never slower than groupList2 and marginally faster whenever groups are merged.

groupDict, groupDict2, groupDict3, and groupDict4 are thin wrappers that extract a dict's values, delegate to the corresponding groupList variant, then remap indices back to keys.


Examples

Numeric proximity

import grplist as gl

# Items within 3 of each other (values)
gl.groupList([1, 3, 6, 10, 12, 14, 21, 35], lambda a, b: abs(a-b) <= 3, vals=True)
# => [[1, 3, 6], [10, 12, 14], [21], [35]]

# Note: 7 bridges what would otherwise be two separate groups
gl.groupList([1, 3, 6, 10, 12, 14, 21, 35, 7, 23], lambda a, b: abs(a-b) <= 3, vals=True)
# => [[1, 3, 6, 10, 12, 14, 7], [21, 23], [35]]

Grouping by shared characters

import grplist as gl

def shares_a_letter(a, b):
    return any(c in b for c in a)

words = ['on', 'tw', 'th', 'fo', 'fi', 'si', 'te', 'zk']
gl.groupList4(words, shares_a_letter, vals=True)
# => [['on', 'fo', 'fi', 'si'], ['tw', 'te', 'th'], ['zk']]

Grouping dict items by value

import grplist as gl

scores = {'alice': 91, 'bob': 88, 'carol': 74, 'dave': 76, 'eve': 95}

# Group people whose scores are within 5 points of each other
gl.groupDict(scores, lambda a, b: abs(a-b) <= 5)
# => [['alice', 'bob', 'eve'], ['carol', 'dave']]

# Return the score values instead of names
gl.groupDict(scores, lambda a, b: abs(a-b) <= 5, vals=True)
# => [[91, 88, 95], [74, 76]]

Grouping objects with a custom predicate — overlapping time ranges

import grplist as gl

tracks = [
    {'start': 2,  'end': 10},
    {'start': 8,  'end': 17},
    {'start': 4,  'end': 7},
    {'start': 20, 'end': 25},
    {'start': 22, 'end': 28},
    {'start': 30, 'end': 45},
]

def overlaps(a, b):
    return a['start'] <= b['end'] and a['end'] >= b['start']

groups = gl.groupList4(tracks, overlaps, vals=True)
# Group 1: tracks 0, 1, 2  (2-10, 8-17, 4-7 all overlap)
# Group 2: tracks 3, 4     (20-25, 22-28 overlap)
# Group 3: track 5         (30-45, no overlap with others)

How it works

The underlying problem: connected components

grplist solves the connected components problem from graph theory. Given your list of items, think of each item as a node in an undirected graph. Your comparison function fnc(a, b) defines the edges: an edge is drawn between two nodes whenever fnc returns True. The groups returned by grplist are the connected components of that graph — the clusters of nodes that are reachable from each other through any chain of edges.

This naturally handles transitivity. If item A connects to item B, and item B connects to item C, then A, B, and C all end up in the same group — even if fnc(A, C) would return False.

A classic illustration: grouping integers where any two values within 3 of each other are connected.

Items:  1 — 3 — 6     10 — 12 — 14     21     35
                  \   /
              (gap > 3)

Components:  [1, 3, 6]   [10, 12, 14]   [21]   [35]

If we add 7 to the list, it sits within 3 of both 6 and 10, acting as a bridge that merges the first two components:

Items:  1 — 3 — 6 — 7 — 10 — 12 — 14     21 — 23     35

Components:  [1, 3, 6, 7, 10, 12, 14]   [21, 23]   [35]

groupList — group-ID scan

groupList performs a complete O(n²) pairwise scan. Each item is assigned a group ID, and when two items with different IDs are found to match, their groups are merged by renumbering all affected entries. Conceptually similar to Union-Find but without path compression.

groupList2 — chain-walk comparison

groupList2 maintains a linked list per group and processes items one at a time. Each new item is compared only against existing group members (not all pairs), and is inserted adjacent to its match. This reduces predicate calls to ~5–30% of the full n² scan on typical inputs. The merge step walks the chain to find its tail, which is O(chain length).

groupList3 — Union-Find merge

groupList3 uses a full O(n²) comparison scan (like groupList) but replaces the O(n) merge step with Union-Find path compression and union-by-rank, making each merge O(α(n)) — effectively constant. find() is called lazily, only on True predicate results. Useful when n is very large and the merge cost dominates.

groupList4 — chain-walk + tail-pointer merge (recommended)

groupList4 combines the two independent optimisations: groupList2's chain-walk comparison (same predicate call count and insertion order) with a t[] tail-pointer array that makes the merge step O(1) instead of O(chain length). It is strictly better than or equal to groupList2 on every input.

Performance guide

grplist performance guide

The three panels above show:

  1. Predicate call countsgroupList and groupList3 check every pair (flat lines at 100%). groupList2 and groupList4 overlap — both use the same chain-walk comparison and make identical predicate calls, dropping to ~5–30% of all pairs as connectivity grows.

  2. Wall time by scenario — the star marks the fastest function. groupList4 (lavender) wins in most conditions, matching groupList2's call count while eliminating the chain-end walk during merges. On very sparse inputs with a cheap predicate, groupList edges ahead due to its simpler inner loop.

  3. Winner heatmap — which function is fastest at every combination of graph connectivity and predicate cost. groupList4 (lavender) dominates for medium and dense graphs at all predicate costs. groupList3 (green) wins the sparse

    • cheap corner: when almost no pairs match, its simple if not fnc: continue hot path has slightly less per-iteration overhead than the other variants.

To regenerate the chart after changing the code:

pip install matplotlib numpy
python bench/generate_charts.py

Further reading


Comparison to similar projects

Several other libraries solve the same underlying problem. The right choice depends on your data types, scale, and what else you need to do with the results.

NetworkX — nx.connected_components(G)

NetworkX is the standard Python graph library. To replicate grplist, you build a Graph object, call your predicate for every pair of items to add edges, then call nx.connected_components(G). This gives you full connected-component detection backed by a mature, well-tested library. The result is a generator of sets of node labels rather than grouped sub-lists, so a small amount of post-processing is needed to get the same output shape. NetworkX carries substantial dependencies (NumPy, and optionally SciPy) and is heavyweight for this single task, but it pays off when you need additional graph analysis — shortest paths, centrality, community detection, etc.

Best for: Projects already using NetworkX, or any case where connected components is one step in a broader graph analysis pipeline.

SciPy — scipy.sparse.csgraph.connected_components

SciPy's graph module provides a fast, C-backed connected-components implementation that operates on a sparse adjacency matrix. For large numerical datasets it significantly outperforms grplist, but you must first express your predicate as a numeric matrix, which requires your items to be representable as indices. Non-numeric or heterogeneous objects need a manual encoding step. The result is a NumPy label array rather than grouped sub-lists.

Best for: Large numerical datasets (thousands of items or more) where performance matters and items can be naturally represented as a matrix.

scikit-learn — AgglomerativeClustering(linkage='single')

scikit-learn's single-linkage clustering is mathematically equivalent to connected components when the distance threshold is set to match your grouping criterion. However, it requires a numeric distance metric (or a precomputed distance matrix) rather than an arbitrary boolean predicate — you cannot pass in a function that operates on strings, dicts, or custom objects without first embedding them in a numeric space. Results come back as a flat integer label array. It also requires specifying the number of clusters or a distance threshold in advance, and pulls in the full scikit-learn dependency.

Best for: Metric-based clustering problems that are already part of a machine learning or data science pipeline using scikit-learn.

disjoint-set — PyPI package disjoint-set

The disjoint-set package is a clean, lightweight Union-Find / Disjoint Set Union implementation with proper path compression and union-by-rank. It is algorithmically more efficient than grplist for large inputs and supports incremental / streaming use (you can call union at any time). The tradeoff is that it exposes a data structure rather than a one-call function — you write the iteration loop yourself:

from disjoint_set import DisjointSet
ds = DisjointSet()
for i, a in enumerate(items):
    for j, b in enumerate(items):
        if i < j and predicate(a, b):
            ds.union(a, b)
groups = ds.itersets()

This gives full control, but grplist is more convenient when you just want to hand over a list and get groups back.

Best for: Performance-critical code, incremental/streaming grouping, or when you want to manage the data structure directly.

Summary table

grplist NetworkX SciPy csgraph scikit-learn disjoint-set
One-call API (list in, groups out) Yes No No No No
Accepts arbitrary boolean predicate Yes With a loop With a loop No With a loop
Works with any Python object Yes Yes No No Yes
Returns grouped sub-lists directly Yes No (sets) No (label array) No (label array) No (itersets)
No external dependencies Yes No No No Yes
Efficient at n > 10,000 No (O(n²)) Yes Yes Yes Yes
Full graph analysis capabilities No Yes Partial No No

In summary: grplist is the right choice when you want a single function call, your predicate is arbitrary (non-numeric, custom objects, strings, etc.), your list has at most a few thousand items, and you do not want to pull in external dependencies. For larger inputs, or when you are already working within NetworkX, SciPy, or scikit-learn, those tools will serve you better.


References

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

grplist-0.1.5.tar.gz (21.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

grplist-0.1.5-py3-none-any.whl (11.9 kB view details)

Uploaded Python 3

File details

Details for the file grplist-0.1.5.tar.gz.

File metadata

  • Download URL: grplist-0.1.5.tar.gz
  • Upload date:
  • Size: 21.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for grplist-0.1.5.tar.gz
Algorithm Hash digest
SHA256 56d0a8bcce6e45efd7d155c439b62f882e183451ebdffa3c9782bb42d3208dfb
MD5 303e64c9fe2a063b1fd6fc25a734dedb
BLAKE2b-256 675a7ad49dd7303a56252e3d270615eb4e471a2bcf9c88daf37e2cd10aff4d45

See more details on using hashes here.

File details

Details for the file grplist-0.1.5-py3-none-any.whl.

File metadata

  • Download URL: grplist-0.1.5-py3-none-any.whl
  • Upload date:
  • Size: 11.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for grplist-0.1.5-py3-none-any.whl
Algorithm Hash digest
SHA256 94794c9e9361db29ffdf7dffd0cc570efd55d47f514d376952a89f64c664b4ec
MD5 914660c4dce81e9f826759f9d2e5aed6
BLAKE2b-256 9eac9c3e0c482f9cd75bb97568254300cd0ed679a3a2e1e10cb3538517bb2936

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page