Compute similarity between netsted set based trees.
Project description
Compute similarity between trees, e.g. dependency trees
Convert an Adjacency List into a Nested Set Table
For example, CoNLL-U’s ['id', 'head'] fields form an adjacency list of a dependency tree. Traversing an adjacency list is slower than reading a nested set. Thus, converting a adjacency list to a nested set table once, makes sense if we need to read the three several times lateron.
import treesimi as ts
adjac = [(1, 0), (2, 1), (3, 1), (4, 2)]
nested = ts.adjac_to_nested(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
Demo: Query a nested set table
To extract a subtree we just need to iterate through the list ($O(n)$)
_, lft0, rgt0, _ = nested[1]
subtree = [(i, l, r, d) for i, l, r, d in nested if (l >= lft0) and (r <= rgt0)]
# [[2, 2, 5, 1], [4, 3, 4, 2]]
or ts.get_subtree(nested, sub_id=2)
Set node attributes
import treesimi as ts
nested = [[1, 1, 8, 0], [2, 2, 5, 1], [4, 3, 4, 2], [3, 6, 7, 1]]
attrs = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
nested = ts.set_attr(nested, attrs)
# columns: node id, left, right, depth, attributes
# [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
Convert Adjacency List with attributes
import treesimi as ts
adjac = [(1, 0, 'dat1'), (2, 1, 'dat2'), (3, 1, 'dat3'), (4, 2, 'dat3')]
nested = ts.adjac_to_nested_with_attr(adjac)
# columns: node id, left, right, depth
# [[1, 1, 8, 0, 'dat1'], [2, 2, 5, 1, 'dat2'], [4, 3, 4, 2, 'dat2'], [3, 6, 7, 1, 'dat4']]
Extract subtree patterns
We can extract the following patterns from one tree:
Depth dimension
Full subtrees
Truncate leaves
Sibling dimension
All siblings
Drop siblings (and their subtree)
Placeholder attribute field
Full subtrees
The function extract_subtrees returns all subtrees of a tree. The depth information is adjusted accordingly for each subtree.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.extract_subtrees(nested)
# [
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 4, 0, 'b'], [2, 3, 1, 'd']],
# [[1, 2, 0, 'd']],
# [[1, 2, 0, 'c']]
# ]
Truncate leaves
In the first step, the function trunc_leaves removes leaves of the largest depth level. The result is always an incomplete tree, and the lft and rgt values are not adjusted to indicate that there is a missing node. In the next steps, the depth level is further removed down to depth=1.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.trunc_leaves(nested)
# [
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [6, 7, 1, 'c']]
# ]
Hint: Run trunc_leaves for each subtree extracted by extract_subtrees. Call unique_trees after each step.
Drop sibling nodes
Generate variants of a tree by dropping each node once. Again, the result is always an incomplete tree, and the lft and rgt values are not adjusted to indicate that there is a missing node.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.drop_nodes(nested)
# [
# [[1, 8, 0, 'a']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b']],
# [[1, 8, 0, 'a']]
# ]
Hints: Create subtrees with extract_subtrees and trunc_leaves, and run drop_nodes on these subtrees. If you want to drop N nodes/leaves of a tree, then call the function twice, e.g. drop_nodes(drop_nodes(...)).
Placeholder attribute field
The replace_attr removes the data attribute of a node with a generic placeholder.
import treesimi as ts
nested = [[1, 1, 8, 0, 'a'], [2, 2, 5, 1, 'b'], [4, 3, 4, 2, 'd'], [3, 6, 7, 1, 'c']]
nested = ts.remove_node_ids(nested)
subtrees = ts.replace_attr(nested, placeholder='[MASK]')
# [
# [[1, 8, 0, '[MASK]'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, '[MASK]'], [3, 4, 2, 'd'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, '[MASK]'], [6, 7, 1, 'c']],
# [[1, 8, 0, 'a'], [2, 5, 1, 'b'], [3, 4, 2, 'd'], [6, 7, 1, '[MASK]']]
# ]
Demo Notebooks about Shingling for MinHash
Start jupyter to run the demo notebook
source .venv/bin/activate
jupyter lab
Appendix
Installation
The treesimi git repo is available as PyPi package
pip install treesimi
pip install git+ssh://git@github.com/ulf1/treesimi.git
Commands
Install a virtual environment
python3.6 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
(If your git repo is stored in a folder with whitespaces, then don’t use the subfolder .venv. Use an absolute path without whitespaces.)
Python commands
Check syntax: flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
Run Unit Tests: pytest
Upload to PyPi with twine: python setup.py sdist && twine upload -r pypi dist/*
Clean up
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Support
Please open an issue for support.
Contributing
Please contribute using Github Flow. Create a branch, add commits, and open a pull request.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.