Incremental hierarchical clustering tree over content embeddings
embed-tree
embed-tree turns content embeddings into a browsable, labeled hierarchy.
It is useful when you have documents, notes, tickets, search results, or any
other text-like records and want a small taxonomy that a person can inspect.
The library stays model-agnostic: you provide an embedder, and embed-tree
handles clustering, labeling, querying, deletion, and persistence.
Install
pip install embed-tree
Optional adapters:
pip install "embed-tree[openai]" # OpenAI embeddings
pip install "embed-tree[local]" # sentence-transformers embeddings
pip install "embed-tree[sql]" # SQLAlchemy loaders/persisters
Quick Start
from embed_tree import EmbedTree, FakeEmbeddingProvider, TreeConfig
embedder = FakeEmbeddingProvider(dim=32) # deterministic demo embedder
tree = EmbedTree(
    embedder=embedder,
    config=TreeConfig(max_branches=5, leaf_capacity=10),
)
tree.add_batch(
    [
        "Write import pipeline documentation",
        "Fix login session refresh",
        "Reduce report query latency",
        "Add retry handling to data ingestion",
    ]
)
tree.organize() # rebuild a clean hierarchy and label every node
print(tree.show()) # human-readable outline
Use a real embedding provider in production:
from embed_tree import EmbedTree, OpenAIEmbeddingProvider
embedder = OpenAIEmbeddingProvider(
    model="text-embedding-3-small",
    api_key="...",
)
tree = EmbedTree(embedder=embedder)
tree.add("Some document text", payload={"source": "docs"})
Core API
EmbedTree is the main entry point.
tree.add(content, item_id=None, payload=None, text=None)
tree.add_batch(contents, item_ids=None, payloads=None, texts=None)
tree.add_node(content_node)
tree.add_nodes(content_nodes)
tree.add_partial_tree(partial_tree)
tree.organize(tagger=None)
tree.rebalance()
tree.label(tagger=None)
tree.query(content, k=10, exhaustive=False)
tree.remove(item_id)
tree.remove_batch(item_ids)
tree.show(max_items=3)
tree.to_dict(max_items=5)
tree.get_tree()
len(tree)
content is what gets embedded. text is the human-readable string used in
labels and browse output; it defaults to content when content is a string.
payload is returned in query results and exported browse data.
Configuration
Configuration is explicit and code-driven through TreeConfig. It does not
read environment variables.
from embed_tree import LLMConfig, RebalanceConfig, TreeConfig
config = TreeConfig(
    max_branches=5,
    leaf_capacity=10,
    rebalance=RebalanceConfig(enabled=True, every_n_inserts=10_000),
    llm=LLMConfig(provider="none"),  # default keyword labels, no network
)
Defaults are tuned for readable taxonomies: small fan-out and small leaves.
Raise max_branches and leaf_capacity when using the tree primarily as a
retrieval index.
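As a rough illustration of how these two settings trade depth for breadth, the sketch below estimates the minimum number of branching levels needed to hold a given number of items. It assumes a perfectly balanced tree in which every internal node has exactly max_branches children and every leaf holds exactly leaf_capacity items; real trees will be less tidy, and min_depth is a hypothetical helper, not part of embed-tree.

```python
def min_depth(n_items: int, max_branches: int, leaf_capacity: int) -> int:
    """Fewest branching levels whose leaves can hold n_items (idealized tree)."""
    if max_branches < 2 or leaf_capacity < 1:
        raise ValueError("need max_branches >= 2 and leaf_capacity >= 1")
    depth = 0
    capacity = leaf_capacity  # a single leaf with no branching above it
    while capacity < n_items:
        capacity *= max_branches  # one more level multiplies the leaf count
        depth += 1
    return depth


# Small fan-out gives a deep, browsable outline; large fan-out gives a shallow index.
browse_depth = min_depth(10_000, max_branches=5, leaf_capacity=10)
index_depth = min_depth(10_000, max_branches=50, leaf_capacity=200)
print(browse_depth, index_depth)
```

Under these idealized assumptions, 10,000 items need five levels with the browsing-oriented settings and one level with the larger ones, which is the kind of difference that matters when the tree is used mainly for lookup.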
Querying
hits = tree.query("related content", k=5)
# [(item_id, distance, payload), ...]
exact_hits = tree.query("related content", k=5, exhaustive=True)
Default queries route to one leaf and rank items there, which is fast but
approximate. exhaustive=True scans every item for exact nearest neighbors.
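The difference between the two modes can be seen in a small standalone sketch. It is not embed-tree's implementation: it assumes a flat set of two leaves, Euclidean distance, and routing by nearest leaf centroid, and the names dist, query_routed, and query_exhaustive are made up for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical leaves; each centroid is the mean of its item vectors.
leaves = [
    {"centroid": (1.0, 0.0), "items": [("a", (0.0, 0.0)), ("b", (2.0, 0.0))]},
    {"centroid": (4.0, 0.0), "items": [("c", (2.4, 0.0)), ("d", (5.6, 0.0))]},
]


def query_routed(leaves, q, k):
    """Pick the leaf with the nearest centroid, then rank only its items."""
    leaf = min(leaves, key=lambda candidate: dist(candidate["centroid"], q))
    ranked = sorted(leaf["items"], key=lambda item: dist(item[1], q))
    return [(item_id, dist(vec, q)) for item_id, vec in ranked[:k]]


def query_exhaustive(leaves, q, k):
    """Rank every item in every leaf: exact, but cost grows with item count."""
    everything = [item for leaf in leaves for item in leaf["items"]]
    ranked = sorted(everything, key=lambda item: dist(item[1], q))
    return [(item_id, dist(vec, q)) for item_id, vec in ranked[:k]]


q = (2.3, 0.0)
routed_ids = [item_id for item_id, _ in query_routed(leaves, q, k=1)]
exact_ids = [item_id for item_id, _ in query_exhaustive(leaves, q, k=1)]
print(routed_ids, exact_ids)
```

Here the query sits closer to the first leaf's centroid, so the routed search returns "b", while the true nearest item "c" lives just across the boundary in the second leaf and is only found by the exhaustive scan.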
Persistence
from embed_tree import EmbedTree, FileTreeStore
tree = EmbedTree(
    embedder=embedder,
    store=FileTreeStore("./tree.json"),
)
FileTreeStore saves an atomic JSON snapshot after writes and reloads it when
the tree is constructed again.
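One common way to make a JSON snapshot atomic is to write to a temporary file in the same directory and then rename it over the target. The sketch below shows that pattern; whether FileTreeStore works this way internally is an assumption, and save_snapshot and load_snapshot are hypothetical helpers rather than library functions.

```python
import json
import os
import tempfile
from typing import Optional


def save_snapshot(path: str, state: dict) -> None:
    """Write state as JSON so a reader never observes a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # swaps the new file in as a single step
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_snapshot(path: str) -> Optional[dict]:
    """Return the saved state, or None when no snapshot exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With this pattern a crash mid-write leaves the previous snapshot untouched, since the target path only ever changes through the final rename.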
Labeling
Without extra configuration, node labels are generated locally from keywords. For LLM labels:
from embed_tree import LLMConfig, TreeConfig
config = TreeConfig(
    llm=LLMConfig(provider="openai", model="gpt-4o-mini", api_key="...")
)
You can also pass a custom tagger, a Callable[[list[str]], str], to
EmbedTree(..., tagger=...), tree.label(tagger=...), or
tree.organize(tagger=...).
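A custom tagger only needs to match that signature. The sketch below is one minimal possibility, assuming the list it receives holds the texts of the items under a node; keyword_tagger, its stopword list, and the "misc" fallback are illustrative choices, not library behavior.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real tagger would likely use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "on", "with"}


def keyword_tagger(texts: list[str]) -> str:
    """Label a group of texts with its two most frequent non-stopword tokens."""
    counts = Counter(
        token
        for text in texts
        for token in re.findall(r"[a-z0-9]+", text.lower())
        if token not in STOPWORDS
    )
    top = [word for word, _ in counts.most_common(2)]
    return " / ".join(top) if top else "misc"


label = keyword_tagger(
    ["Fix login session refresh", "Login token refresh bug", "Login page"]
)
print(label)
```

A function like this could then be supplied wherever a tagger is accepted, for example as tree.label(tagger=keyword_tagger).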
More Documentation
See docs/API.md for the fuller API reference, provider details, loader/persister abstractions, PCA options, and extension points.
Development
uv sync --extra dev
uv run pytest -q