Incremental hierarchical clustering tree over content embeddings

embed-tree

embed-tree turns content nodes into a browsable, labeled hierarchy.

The public model is intentionally small:

ContentNode(id=..., text=..., metadata={...})
BranchNode(id, label=None, children=[])
EmbedTree(embedder, config=None, state=None, labeler=None)

ContentNode.text is the string passed to the embedder. ContentNode.metadata is opaque user data; it is returned by queries and preserved in exported branches.
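To make these shapes concrete, here is a minimal sketch using plain dataclasses. The classes below are illustrative stand-ins written for this example, not the library's own classes. The field names follow the signatures listed above; the iter_content helper, and the assumption that a branch's children may be either branches or content leaves, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional, Union


@dataclass
class ContentNode:
    """Stand-in leaf: `text` is what would be embedded, `metadata` is opaque."""
    id: str
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class BranchNode:
    """Stand-in branch: an optional label plus child branches or leaves."""
    id: str
    label: Optional[str] = None
    children: list[Union[BranchNode, ContentNode]] = field(default_factory=list)


def iter_content(node):
    """Yield every content leaf under `node`, depth first (hypothetical helper)."""
    if isinstance(node, ContentNode):
        yield node
        return
    for child in node.children:
        yield from iter_content(child)


root = BranchNode(
    id="root",
    children=[
        BranchNode(id="b1", label="ingest", children=[
            ContentNode(id="doc-1", text="import pipeline docs",
                        metadata={"tags": ["docs", "ingest"]}),
            ContentNode(id="doc-2", text="retry handling for ingestion"),
        ]),
        ContentNode(id="doc-3", text="summary generation latency"),
    ],
)

leaf_ids = [leaf.id for leaf in iter_content(root)]
print(leaf_ids)
```

Walking a branch this way is roughly what "inserts all content leaves" implies for an input branch: only the leaves carry text to embed, while branch labels describe groupings.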

Install

pip install embed-tree

Optional integrations:

pip install "embed-tree[openai]"
pip install "embed-tree[local]"
pip install "embed-tree[sql]"

Quick Start

from embed_tree import ContentNode, EmbedTree, TagSetEmbedder, TreeConfig

nodes = [
    ContentNode(id="doc-1", text="import pipeline docs", metadata={"tags": ["docs", "ingest"]}),
    ContentNode(id="doc-2", text="retry handling for ingestion", metadata={"tags": ["ingest"]}),
    ContentNode(id="doc-3", text="summary generation latency", metadata={"tags": ["analysis"]}),
    ContentNode(id="doc-4", text="schema mapping examples", metadata={"tags": ["docs", "schemas"]}),
]

tree = EmbedTree(
    embedder=TagSetEmbedder(["docs", "ingest", "analysis", "schemas"]),
    config=TreeConfig(max_branches=4, leaf_capacity=2),
)

tree.add_nodes(nodes)
tree.organize()  # rebalance the hierarchy, then label each branch

print(tree.show())
branch = tree.to_branch()
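
The Quick Start passes a fixed tag vocabulary to TagSetEmbedder. How the library encodes tags is not documented here; one plausible reading is a multi-hot vector over that vocabulary. The sketch below illustrates that idea only. The tag_vector and overlap functions are hypothetical and are not the library's implementation.

```python
def tag_vector(vocabulary, tags):
    """Return a multi-hot vector: 1.0 where a vocabulary tag is present, else 0.0."""
    present = set(tags)
    return [1.0 if tag in present else 0.0 for tag in vocabulary]


def overlap(a, b):
    """Dot product of two multi-hot vectors, i.e. the number of shared tags."""
    return sum(x * y for x, y in zip(a, b))


vocabulary = ["docs", "ingest", "analysis", "schemas"]
doc_1 = tag_vector(vocabulary, ["docs", "ingest"])
doc_2 = tag_vector(vocabulary, ["ingest"])
doc_3 = tag_vector(vocabulary, ["analysis"])

print(doc_1)
print(overlap(doc_1, doc_2), overlap(doc_1, doc_3))
```

Under this assumed encoding, nodes that share tags have a larger overlap, which is the kind of signal a clustering tree can use to group doc-1 with doc-2 rather than with doc-3.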

Use a real text embedder in production:

from embed_tree import ContentNode, EmbedTree, OpenAITextEmbedder

tree = EmbedTree(OpenAITextEmbedder(model="text-embedding-3-small", api_key="..."))
tree.add_node(
    ContentNode(
        id="doc-1",
        text="Some document summary",
        metadata={"source": "docs"},
    )
)

Core API

tree.add_node(ContentNode(...))      # -> id
tree.add_nodes([ContentNode(...)])   # -> list[id]
tree.add_branch(BranchNode(...))     # -> list[id], inserts all content leaves

tree.query("query text", k=10, exhaustive=False)
tree.remove(node_id)
tree.remove_batch([node_id])

tree.rebalance()
tree.label(labeler=None)
tree.organize(labeler=None)          # rebalance + re-label

tree.to_branch(max_items=None)
tree.show(max_items=3)
len(tree)
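
The k and exhaustive parameters of query suggest a choice between scanning every node and a faster tree-guided search; the library's actual search strategy is not described here. As a rough illustration of what an exhaustive top-k lookup means, the following sketch ranks every stored vector by cosine similarity. The exhaustive_top_k function and the toy vectors are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exhaustive_top_k(query_vec, vectors, k=10):
    """Score every (id, vector) pair and return the k best as (id, score), best first."""
    scored = ((node_id, cosine(query_vec, vec)) for node_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings keyed by node id (made up for this example).
vectors = {
    "doc-1": [1.0, 1.0, 0.0],
    "doc-2": [0.0, 1.0, 0.0],
    "doc-3": [0.0, 0.0, 1.0],
}
hits = exhaustive_top_k([0.0, 1.0, 0.0], vectors, k=2)
print(hits)
```

A full scan like this costs one similarity computation per stored node, which is the usual reason a hierarchical index offers a non-exhaustive mode as the default.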

BranchNode is the public tree shape. It can represent an input branch from a loader or the organized output from EmbedTree.to_branch().

For folder-based trees, FileSystemTreeLoader uses the MD5 of each file's content as the node id. Its optional text_generator(path, raw_text) can derive the embed text from the raw file text while preserving file identity.

FolderTreePersister moves an existing file only when the node's id is a content MD5, or the node carries explicit MD5 metadata, and a file with that MD5 exists under the current root. If no file under the current root matches, path metadata can point to a source file, which is copied when its MD5 matches the same identity. If neither exists, missing_node_file controls the result: "skip" (the default) warns and skips the node, "create" writes a .txt snapshot containing text and metadata, and "raise" raises MissingNodeFileError. new_file_name can rename moved or copied files and snapshots.
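
The content-MD5 identity and the three missing_node_file outcomes can be sketched with the standard library alone. This is an illustration under stated assumptions, not the loader's or persister's code: it assumes the id is the MD5 of the file's raw bytes, and content_md5, first_line_text, resolve_missing, MissingFile, and the snapshot file name are all hypothetical. Only the text_generator(path, raw_text) signature and the policy names come from the description above.

```python
import hashlib
import tempfile
import warnings
from pathlib import Path


def content_md5(path):
    """Hex MD5 of the file's raw bytes (assumption: the id hashes bytes, not decoded text)."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def first_line_text(path, raw_text):
    """Example text_generator(path, raw_text): embed only the first non-empty line."""
    for line in raw_text.splitlines():
        if line.strip():
            return line.strip()
    return Path(path).name  # fall back to the file name for empty files


class MissingFile(Exception):
    """Local stand-in for the library's MissingNodeFileError."""


def resolve_missing(node_id, policy="skip"):
    """Mimic the three documented missing_node_file outcomes for one node."""
    if policy == "skip":
        warnings.warn(f"no file found for node {node_id}; skipping")
        return None
    if policy == "create":
        return f"{node_id}.txt"  # hypothetical snapshot file name
    if policy == "raise":
        raise MissingFile(node_id)
    raise ValueError(f"unknown policy: {policy!r}")


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "notes.md"
    original.write_text("\nRetry handling\nDetails follow.\n", encoding="utf-8")
    renamed = Path(tmp) / "renamed.md"
    renamed.write_bytes(original.read_bytes())

    # Same bytes under a different name hash to the same identity.
    same_identity = content_md5(original) == content_md5(renamed)
    embed_text = first_line_text(original, original.read_text(encoding="utf-8"))

print(same_identity, embed_text)
```

Because the identity depends only on file content, a file can be renamed or moved between folders without changing its node id, while the text generator is free to change what gets embedded.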

EmbedTree also keeps internal runtime nodes and content records; these are not part of the public API.

Persistence

To persist a tree, pass a state loader that can save materialized state:

from embed_tree import EmbedTree, JsonTreeLoader

tree = EmbedTree(embedder, state=JsonTreeLoader("./tree.json"))

Development

uv sync --extra dev
uv run --extra dev pytest -q
