Incremental hierarchical clustering tree over content embeddings
embed-tree
embed-tree turns content nodes into a browsable, labeled hierarchy.
The public model is intentionally small:
ContentNode(id=..., text=..., metadata={...})
BranchNode(id, label=None, children=[])
EmbedTree(embedder, config=None, state=None, labeler=None)
ContentNode.text is the string passed to the embedder. ContentNode.metadata is
opaque user data that is returned by queries and preserved in exported branches.
Install
pip install embed-tree
Optional integrations:
pip install "embed-tree[openai]"
pip install "embed-tree[local]"
pip install "embed-tree[sql]"
Quick Start
from embed_tree import ContentNode, EmbedTree, TagSetEmbedder, TreeConfig
nodes = [
    ContentNode(id="doc-1", text="import pipeline docs", metadata={"tags": ["docs", "ingest"]}),
    ContentNode(id="doc-2", text="retry handling for ingestion", metadata={"tags": ["ingest"]}),
    ContentNode(id="doc-3", text="summary generation latency", metadata={"tags": ["analysis"]}),
    ContentNode(id="doc-4", text="schema mapping examples", metadata={"tags": ["docs", "schemas"]}),
]
tree = EmbedTree(
    embedder=TagSetEmbedder(["docs", "ingest", "analysis", "schemas"]),
    config=TreeConfig(max_branches=4, leaf_capacity=2),
)
tree.add_nodes(nodes)
tree.organize() # rebalance the hierarchy, then label each branch
print(tree.show())
branch = tree.to_branch()
Use a real text embedder in production:
from embed_tree import ContentNode, EmbedTree, OpenAITextEmbedder
tree = EmbedTree(OpenAITextEmbedder(model="text-embedding-3-small", api_key="..."))
tree.add_node(
    ContentNode(
        id="doc-1",
        text="Some document summary",
        metadata={"source": "docs"},
    )
)
Core API
tree.add_node(ContentNode(...)) # -> id
tree.add_nodes([ContentNode(...)]) # -> list[id]
tree.add_branch(BranchNode(...)) # -> list[id], inserts all content leaves
tree.query("query text", k=10, exhaustive=False)
tree.remove(node_id)
tree.remove_batch([node_id])
tree.rebalance()
tree.label(labeler=None)
tree.organize(labeler=None) # rebalance + re-label
tree.to_branch(max_items=None)
tree.show(max_items=3)
len(tree)
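The exhaustive flag on tree.query is not described further here. One plausible reading, stated as an assumption, is that exhaustive=True scores every content node instead of descending only the most promising branches. The sketch below shows that brute-force baseline in plain Python; it does not use embed-tree, and the helper names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(query_vec, vectors, k=10):
    """Score every stored vector and return the k best (id, score) pairs."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-d embeddings keyed by node id.
vectors = {
    "doc-1": [1.0, 0.0],
    "doc-2": [0.8, 0.6],
    "doc-3": [0.0, 1.0],
}
top = brute_force_top_k([1.0, 0.0], vectors, k=2)
print([node_id for node_id, _ in top])
```

A tree-guided query would presumably visit fewer nodes than this, possibly at some cost in recall, which may be why the flag defaults to False.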
BranchNode is the public tree shape. It can represent an input branch from a
loader or the organized output from EmbedTree.to_branch().
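To make the tree shape concrete, here is a minimal sketch that uses stand-in dataclasses rather than the library's own classes. Content and Branch mirror the constructor signatures listed at the top of this page; the assumption that children may mix branches and content nodes, and the content_leaves helper, are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Content:
    """Stand-in for ContentNode(id=..., text=..., metadata={...})."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Branch:
    """Stand-in for BranchNode(id, label=None, children=[])."""
    id: str
    label: Optional[str] = None
    children: list = field(default_factory=list)


def content_leaves(node):
    """Yield every content leaf under a node, depth-first, left to right."""
    if isinstance(node, Content):
        yield node
        return
    for child in node.children:
        yield from content_leaves(child)


# A small input branch: one labeled sub-branch plus one direct leaf.
root = Branch(
    id="root",
    children=[
        Branch(id="guides", label="Guides", children=[
            Content(id="doc-1", text="import pipeline docs"),
            Content(id="doc-4", text="schema mapping examples"),
        ]),
        Content(id="doc-3", text="summary generation latency"),
    ],
)
leaf_ids = [leaf.id for leaf in content_leaves(root)]
print(leaf_ids)
```

Flattening a branch this way is roughly what "inserts all content leaves" suggests for tree.add_branch, though the library's actual traversal order is not documented here.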
For folder-based trees, FileSystemTreeLoader uses the MD5 of each file's content
as the node id. Its optional text_generator(path, raw_text) can derive the embed
text from the raw file text while preserving file identity. Its optional
additional_metadata_derivers is a list of callables that derive metadata, such
as new_file_name, from file content; the derived dictionaries are merged in
order, with later keys winning.
FolderTreePersister moves an existing file only when the node has a content MD5
as its id, or explicit MD5 metadata, and that MD5 exists under the current root.
If no current file matches, path metadata can point to a source file, which is
copied when its MD5 matches the same identity. If neither exists,
missing_node_file controls the result: "skip" (the default) warns and skips the
node, "create" writes a .txt snapshot containing text and metadata, and "raise"
raises MissingNodeFileError. new_file_name can rename moved or copied files and
snapshots.
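The persister's fallback order can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's implementation: plan_action, MissingFileError, the current_files mapping, and the metadata keys "md5" and "path" are all hypothetical names, and explicit MD5 metadata is assumed to take precedence over the node id.

```python
import hashlib
import tempfile
import warnings
from pathlib import Path


class MissingFileError(Exception):
    """Stand-in for the MissingNodeFileError named above."""


def file_md5(path):
    """MD5 hex digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def plan_action(node_id, metadata, current_files, missing_node_file="skip"):
    """Decide what to do for one node.

    current_files maps content MD5 -> path under the current root.
    The metadata keys "md5" and "path" are assumed names.
    """
    identity = metadata.get("md5", node_id)

    # 1. A file with this MD5 already exists under the root: move it.
    if identity in current_files:
        return ("move", current_files[identity])

    # 2. Otherwise copy from the metadata path, but only if its content matches.
    source = metadata.get("path")
    if source and Path(source).is_file() and file_md5(source) == identity:
        return ("copy", source)

    # 3. No usable file anywhere: fall back to the configured policy.
    if missing_node_file == "skip":
        warnings.warn(f"no file for node {node_id}; skipping")
        return ("skip", None)
    if missing_node_file == "create":
        return ("create", None)  # the caller would write a .txt snapshot
    if missing_node_file == "raise":
        raise MissingFileError(node_id)
    raise ValueError(f"unknown missing_node_file: {missing_node_file!r}")


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "notes.txt"
    source.write_text("retry handling for ingestion")
    digest = file_md5(source)

    moved = plan_action(digest, {}, {digest: "root/notes.txt"})
    copied = plan_action(digest, {"path": str(source)}, {})
    created = plan_action(digest, {}, {}, missing_node_file="create")

print(moved[0], copied[0], created[0])
```

Checking the MD5 before copying means a stale path in metadata cannot attach the wrong file to a node.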
EmbedTree also has internal runtime nodes and content records; these are not
part of the public API.
Persistence
Use a state loader that can save materialized state:
from embed_tree import EmbedTree, JsonTreeLoader
tree = EmbedTree(embedder, state=JsonTreeLoader("./tree.json"))
Development
uv sync --extra dev
uv run --extra dev pytest -q