Compute similarity between netsted set based trees.
Project description
treesimi: Shingling for measuring tree similarity
Compute similarity between trees, e.g. dependency trees
Convert an Adjacency List into a Nested Set Table
For example, CoNLL-U's ['id', 'head']
fields form an adjacency list of a dependency tree.
Traversing an adjacency list is slower than reading a nested set.
Thus, converting a adjacency list to a nested set table once, makes sense if we need to read the three several times lateron.
import treesimi as ts
adjac = [(1, 0), (2, 1), (3, 1), (4, 2)]
nested = ts.adjac_to_nested(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
Demo: Query a nested set table
To extract a subtree we just need to iterate through the list ($O(n)$)
_, lft0, rgt0, _ = nested[1]
subtree = [(i, l, r, d) for i, l, r, d in nested if (l >= lft0) and (r <= rgt0)]
# [[2, 2, 5, 1], [4, 3, 4, 2]]
or ts.get_subtree(nested, sub_id=2)
Set node attributes
import treesimi as ts
nested = [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
attrs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
nested = ts.set_attr(nested, attrs)
# columns: node id, left, right, depth, attributes
# [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
Convert Adjacency List with attributes
import treesimi as ts
adjac = [(1, 0, 'dat1'), (2, 1, 'dat2'), (3, 1, 'dat3'), (4, 2, 'dat3')]
nested = ts.adjac_to_nested_with_attr(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0, 'dat1'], [2, 2, 5, 1, 'dat2'], [4, 3, 4, 2, 'dat2'], [3, 6, 7, 1, 'dat4']]
Extract subtree patterns
We can extract the following patterns from one tree:
- Depth dimension
- Full subtrees
- Truncate leaves
- Sibling dimension
- All siblings
- Drop siblings (and their subtree)
- Placeholder attribute field
Full subtrees
The function extract_subtrees
returns all subtrees of a tree.
The depth information is adjusted accordingly for each subtree.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.extract_subtrees(nested)
# [
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 4, 0, 'b'], [2, 3, 1, 'd']],
# [[1, 2, 0, 'd']],
# [[1, 2, 0, 'c']]
# ]
Truncate leaves
In the first step, the function trunc_leaves
removes leaves of the largest depth level.
The result is always an incomplete tree, and the lft
and rgt
values are not adjusted to indicate that there is a missing node.
In the next steps, the depth level is further removed down to depth=1
.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.trunc_leaves(nested)
# [
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [6, 7, 1, 'c']]
# ]
Hint: Run trunc_leaves
for each subtree extracted by extract_subtrees
. Call unique_trees
after each step.
Drop sibling nodes
Generate variants of a tree by dropping each node once.
Again, the result is always an incomplete tree, and the lft
and rgt
values are not adjusted to indicate that there is a missing node.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.drop_nodes(nested)
# [
# [[1, 8, 0, 'a']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b']],
# [[1, 8, 0, 'a']]
# ]
Hints: Create subtrees with extract_subtrees
and trunc_leaves
, and run drop_nodes
on these subtrees. If you want to drop N nodes/leaves of a tree, then call the function twice, e.g. drop_nodes(drop_nodes(...))
.
Placeholder attribute field
The replace_attr
removes the data attribute of a node with a generic placeholder.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.replace_attr(nested, placeholder='[MASK]')
# [
# [[1, 8, 0, '[MASK]'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, '[MASK]'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, '[MASK]'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, '[MASK]']]
# ]
Demo Notebooks about Shingling for MinHash
We recommend using the mmh3
hash function, and 32 permutations in datasketch.MinHash
.
- Create subtrees as shingle sets
- Jaccard Similarity between Dependency Trees
- Shingle Dependency Trees for datasketch's Minhash
Start jupyter to run the demo notebook
source .venv/bin/activate
jupyter lab
Appendix
Installation
The treesimi
git repo is available as PyPi package
pip install treesimi
pip install git+ssh://git@github.com/ulf1/treesimi.git
Commands
Install a virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
(If your git repo is stored in a folder with whitespaces, then don't use the subfolder .venv
. Use an absolute path without whitespaces.)
Python commands
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
- Run Unit Tests:
pytest
Publish
python setup.py sdist
twine upload -r pypi dist/*
Clean up
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Support
Please open an issue for support.
Contributing
Please contribute using Github Flow. Create a branch, add commits, and open a pull request.
Acknowledgements
The "Evidence" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 433249742 (GU 798/27-1; GE 1119/11-1).
Maintenance
- till 31.Aug.2023 (v0.2.0) the code repository was maintained within the DFG project 433249742
- since 01.Sep.2023 (v0.3.0) the code repository is maintained by Ulf Hamster.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.