Groups list items using a compare function

Project description

grplist

Group list or dict items using an arbitrary comparison function. Items that satisfy the predicate are placed in the same group, with transitivity applied automatically — if A matches B and B matches C, all three end up together even if A and C would not match directly.

import grplist as gl

groups = gl.groupList([1, 3, 6, 10, 12, 14, 21, 35], lambda a, b: abs(a-b) <= 3, vals=True)
# => [[1, 3, 6], [10, 12, 14], [21], [35]]

Install
Quick start
API reference
Examples
How it works
Comparison to similar projects
References

Install

pip install grplist

Quick start

import grplist as gl

# Group numbers that are within 3 of each other
gl.groupList([1, 3, 6, 10, 12, 14, 21, 35], lambda a, b: abs(a-b) <= 3, vals=True)
# => [[1, 3, 6], [10, 12, 14], [21], [35]]

# Return indices instead of values (vals=False is the default)
gl.groupList([1, 3, 6, 10, 12, 14, 21, 35], lambda a, b: abs(a-b) <= 3)
# => [[0, 1, 2], [3, 4, 5], [6], [7]]

# Group a dict — returns groups of keys by default
d = {'a': 1, 'b': 3, 'c': 10, 'd': 12}
gl.groupDict(d, lambda a, b: abs(a-b) <= 3)
# => [['a', 'b'], ['c', 'd']]

API reference

All functions share the same signature pattern:

groupList(arr, fnc, vals=False)  -> list[list]
groupList2(arr, fnc, vals=False) -> list[list]
groupList3(arr, fnc, vals=False) -> list[list]
groupList4(arr, fnc, vals=False) -> list[list]
groupDict(obj, fnc, vals=False)  -> list[list]
groupDict2(obj, fnc, vals=False) -> list[list]
groupDict3(obj, fnc, vals=False) -> list[list]
groupDict4(obj, fnc, vals=False) -> list[list]

Parameters

Parameter	Type	Description
`arr` / `obj`	`list` / `dict`	The collection to group
`fnc`	`callable(a, b) -> bool`	Returns `True` if two items belong in the same group
`vals`	`bool`	`True` → return item values; `False` (default) → return indices (list) or keys (dict)

Return value

A list of groups. Each group is a list of either values (when vals=True) or indices/keys (when vals=False).

Choosing between the variants

All variants produce equivalent results for symmetric, transitive predicates. The differences are performance characteristics:

Variant	Predicate calls	Merge cost	When to use
`groupList`	All n·(n−1)/2 pairs	O(n) per merge	Baseline; simplest code
`groupList2`	~5–30% of pairs (chain-walk)	O(chain length) per merge	Good general choice
`groupList3`	All pairs (lazy find)	O(α(n)) — Union-Find	Very large n where merge cost dominates
`groupList4`	~5–30% of pairs (same as groupList2)	O(1) — tail pointer	Best general-purpose default

groupList4 is the recommended default. It combines groupList2's chain-walk comparison (far fewer predicate calls than a full scan) with an O(1) merge step using a tail pointer per group — eliminating the only remaining inefficiency in groupList2. It is never slower than groupList2 and marginally faster whenever groups are merged.

groupDict, groupDict2, groupDict3, and groupDict4 are thin wrappers that extract a dict's values, delegate to the corresponding groupList variant, then remap indices back to keys.

Examples

Numeric proximity

import grplist as gl

# Items within 3 of each other (values)
gl.groupList([1, 3, 6, 10, 12, 14, 21, 35], lambda a, b: abs(a-b) <= 3, vals=True)
# => [[1, 3, 6], [10, 12, 14], [21], [35]]

# Note: 7 bridges what would otherwise be two separate groups
gl.groupList([1, 3, 6, 10, 12, 14, 21, 35, 7, 23], lambda a, b: abs(a-b) <= 3, vals=True)
# => [[1, 3, 6, 10, 12, 14, 7], [21, 23], [35]]

Grouping by shared characters

import grplist as gl

def shares_a_letter(a, b):
    return any(c in b for c in a)

words = ['on', 'tw', 'th', 'fo', 'fi', 'si', 'te', 'zk']
gl.groupList4(words, shares_a_letter, vals=True)
# => [['on', 'fo', 'fi', 'si'], ['tw', 'te', 'th'], ['zk']]

Grouping dict items by value

import grplist as gl

scores = {'alice': 91, 'bob': 88, 'carol': 74, 'dave': 76, 'eve': 95}

# Group people whose scores are within 5 points of each other
gl.groupDict(scores, lambda a, b: abs(a-b) <= 5)
# => [['alice', 'bob', 'eve'], ['carol', 'dave']]

# Return the score values instead of names
gl.groupDict(scores, lambda a, b: abs(a-b) <= 5, vals=True)
# => [[91, 88, 95], [74, 76]]

Grouping objects with a custom predicate — overlapping time ranges

import grplist as gl

tracks = [
    {'start': 2,  'end': 10},
    {'start': 8,  'end': 17},
    {'start': 4,  'end': 7},
    {'start': 20, 'end': 25},
    {'start': 22, 'end': 28},
    {'start': 30, 'end': 45},
]

def overlaps(a, b):
    return a['start'] <= b['end'] and a['end'] >= b['start']

groups = gl.groupList4(tracks, overlaps, vals=True)
# Group 1: tracks 0, 1, 2  (2-10, 8-17, 4-7 all overlap)
# Group 2: tracks 3, 4     (20-25, 22-28 overlap)
# Group 3: track 5         (30-45, no overlap with others)

How it works

The underlying problem: connected components

grplist solves the connected components problem from graph theory. Given your list of items, think of each item as a node in an undirected graph. Your comparison function fnc(a, b) defines the edges: an edge is drawn between two nodes whenever fnc returns True. The groups returned by grplist are the connected components of that graph — the clusters of nodes that are reachable from each other through any chain of edges.

This naturally handles transitivity. If item A connects to item B, and item B connects to item C, then A, B, and C all end up in the same group — even if fnc(A, C) would return False.

A classic illustration: grouping integers where any two values within 3 of each other are connected.

Items:  1 — 3 — 6     10 — 12 — 14     21     35
                  \   /
              (gap > 3)

Components:  [1, 3, 6]   [10, 12, 14]   [21]   [35]

If we add 7 to the list, it sits within 3 of both 6 and 10, acting as a bridge that merges the first two components:

Items:  1 — 3 — 6 — 7 — 10 — 12 — 14     21 — 23     35

Components:  [1, 3, 6, 7, 10, 12, 14]   [21, 23]   [35]

`groupList` — group-ID scan

groupList performs a complete O(n²) pairwise scan. Each item is assigned a group ID, and when two items with different IDs are found to match, their groups are merged by renumbering all affected entries. Conceptually similar to Union-Find but without path compression.

`groupList2` — chain-walk comparison

groupList2 maintains a linked list per group and processes items one at a time. Each new item is compared only against existing group members (not all pairs), and is inserted adjacent to its match. This reduces predicate calls to ~5–30% of the full n² scan on typical inputs. The merge step walks the chain to find its tail, which is O(chain length).

`groupList3` — Union-Find merge

groupList3 uses a full O(n²) comparison scan (like groupList) but replaces the O(n) merge step with Union-Find path compression and union-by-rank, making each merge O(α(n)) — effectively constant. find() is called lazily, only on True predicate results. Useful when n is very large and the merge cost dominates.

`groupList4` — chain-walk + tail-pointer merge (recommended)

groupList4 combines the two independent optimisations: groupList2's chain-walk comparison (same predicate call count and insertion order) with a t[] tail-pointer array that makes the merge step O(1) instead of O(chain length). It is strictly better than or equal to groupList2 on every input.

Performance guide

grplist performance guide

The three panels above show:

Predicate call counts — groupList and groupList3 check every pair (flat lines at 100%). groupList2 and groupList4 overlap — both use the same chain-walk comparison and make identical predicate calls, dropping to ~5–30% of all pairs as connectivity grows.
Wall time by scenario — the star marks the fastest function. groupList4 (lavender) wins in most conditions, matching groupList2's call count while eliminating the chain-end walk during merges. On very sparse inputs with a cheap predicate, groupList edges ahead due to its simpler inner loop.
Winner heatmap — which function is fastest at every combination of graph connectivity and predicate cost. groupList4 (lavender) dominates for medium and dense graphs at all predicate costs. groupList3 (green) wins the sparse
- cheap corner: when almost no pairs match, its simple if not fnc: continue hot path has slightly less per-iteration overhead than the other variants.

To regenerate the chart after changing the code:

pip install matplotlib numpy
python bench/generate_charts.py

Comparison to similar projects

Several other libraries solve the same underlying problem. The right choice depends on your data types, scale, and what else you need to do with the results.

NetworkX — `nx.connected_components(G)`

NetworkX is the standard Python graph library. To replicate grplist, you build a Graph object, call your predicate for every pair of items to add edges, then call nx.connected_components(G). This gives you full connected-component detection backed by a mature, well-tested library. The result is a generator of sets of node labels rather than grouped sub-lists, so a small amount of post-processing is needed to get the same output shape. NetworkX carries substantial dependencies (NumPy, and optionally SciPy) and is heavyweight for this single task, but it pays off when you need additional graph analysis — shortest paths, centrality, community detection, etc.

Best for: Projects already using NetworkX, or any case where connected components is one step in a broader graph analysis pipeline.

SciPy — `scipy.sparse.csgraph.connected_components`

SciPy's graph module provides a fast, C-backed connected-components implementation that operates on a sparse adjacency matrix. For large numerical datasets it significantly outperforms grplist, but you must first express your predicate as a numeric matrix, which requires your items to be representable as indices. Non-numeric or heterogeneous objects need a manual encoding step. The result is a NumPy label array rather than grouped sub-lists.

Best for: Large numerical datasets (thousands of items or more) where performance matters and items can be naturally represented as a matrix.

scikit-learn — `AgglomerativeClustering(linkage='single')`

scikit-learn's single-linkage clustering is mathematically equivalent to connected components when the distance threshold is set to match your grouping criterion. However, it requires a numeric distance metric (or a precomputed distance matrix) rather than an arbitrary boolean predicate — you cannot pass in a function that operates on strings, dicts, or custom objects without first embedding them in a numeric space. Results come back as a flat integer label array. It also requires specifying the number of clusters or a distance threshold in advance, and pulls in the full scikit-learn dependency.

Best for: Metric-based clustering problems that are already part of a machine learning or data science pipeline using scikit-learn.

disjoint-set — PyPI package `disjoint-set`

The disjoint-set package is a clean, lightweight Union-Find / Disjoint Set Union implementation with proper path compression and union-by-rank. It is algorithmically more efficient than grplist for large inputs and supports incremental / streaming use (you can call union at any time). The tradeoff is that it exposes a data structure rather than a one-call function — you write the iteration loop yourself:

from disjoint_set import DisjointSet
ds = DisjointSet()
for i, a in enumerate(items):
    for j, b in enumerate(items):
        if i < j and predicate(a, b):
            ds.union(a, b)
groups = ds.itersets()

This gives full control, but grplist is more convenient when you just want to hand over a list and get groups back.

Best for: Performance-critical code, incremental/streaming grouping, or when you want to manage the data structure directly.

Summary table

	grplist	NetworkX	SciPy csgraph	scikit-learn	disjoint-set
One-call API (list in, groups out)	Yes	No	No	No	No
Accepts arbitrary boolean predicate	Yes	With a loop	With a loop	No	With a loop
Works with any Python object	Yes	Yes	No	No	Yes
Returns grouped sub-lists directly	Yes	No (sets)	No (label array)	No (label array)	No (itersets)
No external dependencies	Yes	No	No	No	Yes
Efficient at n > 10,000	No (O(n²))	Yes	Yes	Yes	Yes
Full graph analysis capabilities	No	Yes	Partial	No	No

In summary: grplist is the right choice when you want a single function call, your predicate is arbitrary (non-numeric, custom objects, strings, etc.), your list has at most a few thousand items, and you do not want to pull in external dependencies. For larger inputs, or when you are already working within NetworkX, SciPy, or scikit-learn, those tools will serve you better.

References

Project details

Release history Release notifications | RSS feed

This version

0.1.5

May 11, 2026

0.1.4

Aug 8, 2021

0.1.3

Jul 23, 2021

0.1.1

Jul 23, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

grplist-0.1.5.tar.gz (21.5 kB view details)

Uploaded May 11, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

grplist-0.1.5-py3-none-any.whl (11.9 kB view details)

Uploaded May 11, 2026 Python 3

File details

Details for the file grplist-0.1.5.tar.gz.

File metadata

Download URL: grplist-0.1.5.tar.gz
Upload date: May 11, 2026
Size: 21.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for grplist-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`56d0a8bcce6e45efd7d155c439b62f882e183451ebdffa3c9782bb42d3208dfb`
MD5	`303e64c9fe2a063b1fd6fc25a734dedb`
BLAKE2b-256	`675a7ad49dd7303a56252e3d270615eb4e471a2bcf9c88daf37e2cd10aff4d45`

See more details on using hashes here.

File details

Details for the file grplist-0.1.5-py3-none-any.whl.

File metadata

Download URL: grplist-0.1.5-py3-none-any.whl
Upload date: May 11, 2026
Size: 11.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.11

File hashes

Hashes for grplist-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`94794c9e9361db29ffdf7dffd0cc570efd55d47f514d376952a89f64c664b4ec`
MD5	`914660c4dce81e9f826759f9d2e5aed6`
BLAKE2b-256	`9eac9c3e0c482f9cd75bb97568254300cd0ed679a3a2e1e10cb3538517bb2936`

See more details on using hashes here.

grplist 0.1.5

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

grplist

Table of contents

Install

Quick start

API reference

Parameters

Return value

Choosing between the variants

Examples

Numeric proximity

Grouping by shared characters

Grouping dict items by value

Grouping objects with a custom predicate — overlapping time ranges

How it works

The underlying problem: connected components

groupList — group-ID scan

groupList2 — chain-walk comparison

groupList3 — Union-Find merge

groupList4 — chain-walk + tail-pointer merge (recommended)

Performance guide

Further reading

Comparison to similar projects

NetworkX — nx.connected_components(G)

SciPy — scipy.sparse.csgraph.connected_components

scikit-learn — AgglomerativeClustering(linkage='single')

disjoint-set — PyPI package disjoint-set

Summary table

References

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`groupList` — group-ID scan

`groupList2` — chain-walk comparison

`groupList3` — Union-Find merge

`groupList4` — chain-walk + tail-pointer merge (recommended)

NetworkX — `nx.connected_components(G)`

SciPy — `scipy.sparse.csgraph.connected_components`

scikit-learn — `AgglomerativeClustering(linkage='single')`

disjoint-set — PyPI package `disjoint-set`