Incremental hierarchical clustering tree over content embeddings

embed-tree

embed-tree turns content nodes into a browsable, labeled hierarchy.

The public model is intentionally small:

ContentNode(id=..., text=..., metadata={...})
BranchNode(id, label=None, children=[])
EmbedTree(embedder, config=None, state=None, labeler=None)

ContentNode.text is the string passed to the embedder. ContentNode.metadata is opaque user data; it is returned by queries and preserved in exported branches.

Install

pip install embed-tree

Optional integrations:

pip install "embed-tree[openai]"
pip install "embed-tree[local]"
pip install "embed-tree[sql]"

Quick Start

from embed_tree import ContentNode, EmbedTree, TagSetEmbedder, TreeConfig

nodes = [
    ContentNode(id="doc-1", text="import pipeline docs", metadata={"tags": ["docs", "ingest"]}),
    ContentNode(id="doc-2", text="retry handling for ingestion", metadata={"tags": ["ingest"]}),
    ContentNode(id="doc-3", text="summary generation latency", metadata={"tags": ["analysis"]}),
    ContentNode(id="doc-4", text="schema mapping examples", metadata={"tags": ["docs", "schemas"]}),
]

tree = EmbedTree(
    embedder=TagSetEmbedder(["docs", "ingest", "analysis", "schemas"]),
    config=TreeConfig(max_branches=4, leaf_capacity=2),
)

tree.add_nodes(nodes)
tree.organize()  # rebalance the hierarchy, then label each branch

print(tree.show())
branch = tree.to_branch()

Use a real text embedder in production:

from embed_tree import ContentNode, EmbedTree, OpenAITextEmbedder

tree = EmbedTree(OpenAITextEmbedder(model="text-embedding-3-small", api_key="..."))
tree.add_node(
    ContentNode(
        id="doc-1",
        text="Some document summary",
        metadata={"source": "docs"},
    )
)

Core API

tree.add_node(ContentNode(...))      # -> id
tree.add_nodes([ContentNode(...)])   # -> list[id]
tree.add_branch(BranchNode(...))     # -> list[id], inserts all content leaves

tree.query("query text", k=10, exhaustive=False)
tree.remove(node_id)
tree.remove_batch([node_id])

tree.rebalance()
tree.label(labeler=None)
tree.organize(labeler=None) # rebalance + re-label

tree.to_branch(max_items=None)
tree.show(max_items=3)
len(tree)

BranchNode is the public tree shape. It can represent an input branch from a loader or the organized output from EmbedTree.to_branch().
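To make the tree shape concrete, here is a minimal sketch of how a nested branch flattens to its content leaves, which is roughly what add_branch is described as inserting. The two dataclasses are stand-ins that mirror the documented constructor fields; they are not the library's classes, and iter_leaves is a hypothetical helper.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Iterator


# Stand-in types mirroring the documented fields; NOT imported from embed_tree.
@dataclass
class ContentNode:
    id: str
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class BranchNode:
    id: str
    label: str | None = None
    children: list[BranchNode | ContentNode] = field(default_factory=list)


def iter_leaves(node: BranchNode | ContentNode) -> Iterator[ContentNode]:
    """Yield every content leaf under `node`, depth-first, in child order."""
    if isinstance(node, ContentNode):
        yield node
        return
    for child in node.children:
        yield from iter_leaves(child)


# A small hand-built branch: one labeled sub-branch plus one direct leaf.
root = BranchNode(
    id="root",
    children=[
        BranchNode(
            id="b1",
            label="ingest",
            children=[
                ContentNode(id="doc-1", text="import pipeline docs"),
                ContentNode(id="doc-2", text="retry handling for ingestion"),
            ],
        ),
        ContentNode(id="doc-3", text="summary generation latency"),
    ],
)

leaf_ids = [leaf.id for leaf in iter_leaves(root)]
print(leaf_ids)
```

With the real package, the equivalent branch would be passed to tree.add_branch(...); whether the library traverses in this exact order is an assumption.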

For folder-based trees, FileSystemTreeLoader uses the MD5 of the file content as id. Its optional text_generator(path, raw_text) can derive the embed text from the raw file text while preserving file identity. Its optional additional_metadata_derivers is a list of callables that derive metadata, such as new_file_name, from file content; the derived dictionaries are merged in order, with later keys winning.

FolderTreePersister moves an existing file only when the node has a content MD5 as its id (or explicit MD5 metadata) and that MD5 exists under the current root. If no current file matches, path metadata can point to a source file, which is copied when its MD5 matches the same identity. If neither exists, missing_node_file controls the result: "skip" (the default) warns and skips the node, "create" writes a .txt snapshot containing text and metadata, and "raise" raises MissingNodeFileError. new_file_name can rename moved or copied files and snapshots.
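The two loader rules above (content MD5 as identity, and derived metadata merged in order with later keys winning) can be sketched with the standard library alone. This is an illustration, not the loader's implementation: content_md5, merge_derived_metadata, and both derivers are hypothetical names, and the deriver signature (raw text in, dict out) is an assumption.

```python
import hashlib
from typing import Any, Callable, Dict, List

# Assumed deriver shape: takes the raw file text, returns a metadata dict.
Deriver = Callable[[str], Dict[str, Any]]


def content_md5(raw: bytes) -> str:
    """Hex MD5 of the file bytes, used here as a stable content identity."""
    return hashlib.md5(raw).hexdigest()


def merge_derived_metadata(raw_text: str, derivers: List[Deriver]) -> Dict[str, Any]:
    """Run derivers in order; a later deriver overwrites keys set by an earlier one."""
    merged: Dict[str, Any] = {}
    for derive in derivers:
        merged.update(derive(raw_text))
    return merged


def first_line(raw_text: str) -> str:
    lines = raw_text.splitlines()
    return lines[0].strip() if lines else ""


def title_deriver(raw_text: str) -> Dict[str, Any]:
    # Sets a title and a placeholder file name.
    return {"title": first_line(raw_text), "new_file_name": "untitled.txt"}


def rename_deriver(raw_text: str) -> Dict[str, Any]:
    # Runs later, so its new_file_name replaces the placeholder above.
    slug = first_line(raw_text).lower().replace(" ", "-")
    return {"new_file_name": f"{slug}.txt"}


raw = b"Retry handling\nBack off and retry failed ingests.\n"
node_id = content_md5(raw)
metadata = merge_derived_metadata(raw.decode("utf-8"), [title_deriver, rename_deriver])

print(node_id)
print(metadata)
```

Because the id depends only on the bytes, renaming or moving a file leaves its identity unchanged, which is what lets a persister match nodes to files by MD5.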

EmbedTree also keeps internal runtime nodes and content records; these are not part of the public API.

Persistence

Use a state loader that can save materialized state:

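from embed_tree import EmbedTree, JsonTreeLoader, TagSetEmbedder

# Any embedder works here; TagSetEmbedder is reused from the Quick Start.
embedder = TagSetEmbedder(["docs", "ingest", "analysis", "schemas"])
tree = EmbedTree(embedder, state=JsonTreeLoader("./tree.json"))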

Development

uv sync --extra dev
uv run --extra dev pytest -q
