Find the Smallest Unique Subset (SUS)
Project description
Find the Smallest Unique Subset (SUS), fast
Current version:
0.5.2
Please note that this software is still in development. It is supposed to work as intended in the current state and you are welcome to use it. The main changes expected before version
1.0.0
should involve documentation, code cleanup, and more flexible input options.
Installation
PyPI
pip install triesus
From source
python -m build
pip install dist/triesus-*.whl
Usage
Consider the following collection of sets. The first field denotes the set id, the following fields include the elements of the set. Fields are tab-separated:
1 C
2 A B D
3 B E
4 A D
The smallest unique subset (SUS) for each set in this collection can be found running:
triesus tests/examples/sets2.tsv
This will print in STDOUT
:
1 C
2 D B
3 E
4
Note that the SUS for set 4
does not exist.
Performance
Below it's displayed the performance of TrieSUS compared with the brute force baseline on randomly generated set collections. This analysis can be reproduced by cloning the TrieSUS benchmark repository.
The input size parameter is a number corresponding to three quantities: the number of sets in the collection, the number of items in the universe, and the maximum number of items in each set.
Algorithm
Given a collection of sets, TrieSUS maps each set to the smallest possible combination of its elements that uniquely identifies the set, here called a smallest unique subset, or SUS.
Brute force approach
To find the SUS of a set, a naive brute force approach would involve enumerating the non-empty subsets of the set from the smallest to the largest (for example using the Gosper's Hack, as implemented in naive_sus.find_sus()
), until finding one that is a solution, i.e. that is not a subset of any other set in the collection. This approach, even after taking some relatively obvious precautions such as making sure that potential solutions containing the least frequent elements in the collection are tested first, scales very badly with input size (see Performance). In fact, it has exponential time complexity of $O(2^n*m)$, where $n$ is the number of elements in each set (assuming all sets have equal size) and $m$ is the number of sets in the collection.
TrieSUS
TrieSUS implements a series of linear-time operations to first greatly reduce the problem size, and to eventually transform it into the equivalent of a set cover problem. The set cover is an NP-hard problem, but because of the reduced size of the input it will be treatable. TrieSUS uses Google's OR-Tools constraint programming to solve it.
The algorithm starts by first ranking all the elements found in the sets of the collection from the most frequent to the least frequent, and sorting the elements in the sets according to these ranks. Each set is then treated as a sorted list to build a prefix tree (trie), where each leaf corresponds to a set. The construction of such a trie is not a strictly necessary step, but it may have advantages depending on the specific input. It can marginally reduce the number of operations in the following steps and it allows to immediately identify sets of the collection for which a solution doesn't exist, i.e. sets that don't have a SUS. Regardless, the construction of the trie is linear in time complexity and therefore doesn't significantly impact the performance of the algorithm.
The main part of the algorithm then performs a series of operations on the trie to find a SUS, if it exists, for each one of the sets, as implemented in triesus.TrieSUS.find_sus()
. This method leverages the trie structure to efficiently track unique symbols among different words and applies a cover set solution to determine the smallest set of symbols that cover these unique items across different subsets within the trie.
The following is a high-level breakdown of the steps taken within the find_sus()
method to identify the SUS for a set, here represented by a given "word" of ordered symbols:
-
Identify the end node for the given word:
- Determine the end node within the trie structure that represents the given word.
-
Gather the other end nodes:
- Collect all other end nodes within the trie that don't represent the given word. These nodes correspond to other words (sets) stored in the trie.
-
Check conditions for SUS existence:
- Verify conditions to determine if a SUS exists for the given word:
- If the end node of the given word has children or if there's more than one identical word in the trie, then a SUS doesn't exist. The method returns an empty list in this case.
- Verify conditions to determine if a SUS exists for the given word:
-
Identify unique symbols for each other end node:
- For each of the other end nodes collected earlier:
- Trace back from the end node of the given word to the common ancestor node shared with the other end node
- Along this path, collect symbols unique to the given word that are not present in the other word.
- For each of the other end nodes collected earlier:
-
Assemble candidate symbols:
- Compile the sets of unique symbols obtained from the previous step into a list of sets.
-
Construct a dictionary of sets to cover:
- Transform the list of sets into a dictionary where keys are items (symbols) of the sets and values are indexes of the sets. This prepares the data structure to solve the cover set problem. Finding the cover set on the indexes ensures that a minimum amount of candidate unique symbols is used to discriminate the given set from all the other sets of the collection.
-
Solve the cover set problem:
- Use the OR-Tools constraint programming solver to find a solution. The solution [...] is a SUS of the given set.
-
Return the result:
- Return the identified SUS or an empty list if no unique subset satisfying the conditions is identified.
The unique subset problem in the wild
The algorithmic problem that TrieSUS is intended to solve is discussed in several occasions, of which some are highlighted here:
-
In this Stack Overflow question (Aug 21, 2020), a solution was proposed to find the SUS of a set
s
by first counting how many times each element ins
appears in the other sets. It then sorts the elements ins
based on this count, and considers all possible sub-arrays ofs
, returning the first one that satisfies the condition that it appears in exactly one set in the collection. If no such subset is found, the function returnsnull
. This procedure is repeated for every set in the collection. -
The same problem also appears in this Stack Exchange question (Sep 19, 2017), where no answer was given at the time of writing.
-
The code implementing a brute force approach that compares all the powersets was shared in an answer to this Stack Overflow question (Jan 26, 2018).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file triesus-0.5.2.tar.gz
.
File metadata
- Download URL: triesus-0.5.2.tar.gz
- Upload date:
- Size: 476.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 37aa1c6223751e45f3b7d9fc2b96fb90e8cb1870b994ed174d147e2fa00e09c6 |
|
MD5 | 1d2bb0fc8c08e71c38bce28f85f1cf68 |
|
BLAKE2b-256 | 28d727033d25c4837ac2d01e062fe09e2f3e701ae2f82bbd9c573e679763665d |
File details
Details for the file triesus-0.5.2-py3-none-any.whl
.
File metadata
- Download URL: triesus-0.5.2-py3-none-any.whl
- Upload date:
- Size: 472.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fcac4328b0db0200b0c61e2d6718468962a5b036f01c49540b557b28347751c3 |
|
MD5 | 5e1fbaf1d2943f4d8ed2396793a563b8 |
|
BLAKE2b-256 | 25ac408314d0bb0ee4fdccfba4c017e70aaf60cbb6d65c1da6e98b74392495d1 |