Incremental hierarchical clustering tree over content embeddings

embed-tree

embed-tree turns content embeddings into a browsable, labeled hierarchy. It is useful when you have documents, notes, tickets, search results, or any other text-like records and want a small taxonomy that a person can inspect.

The library stays model-agnostic: you provide an embedder, and embed-tree handles clustering, labeling, querying, deletion, and persistence.

Install

pip install embed-tree

Optional adapters:

pip install "embed-tree[openai]"  # OpenAI embeddings
pip install "embed-tree[local]"   # sentence-transformers embeddings
pip install "embed-tree[sql]"     # SQLAlchemy loaders/persisters

Quick Start

from embed_tree import EmbedTree, FakeEmbeddingProvider, TreeConfig

embedder = FakeEmbeddingProvider(dim=32)  # deterministic demo embedder
tree = EmbedTree(
    embedder=embedder,
    config=TreeConfig(max_branches=5, leaf_capacity=10),
)

tree.add_batch(
    [
        "Write import pipeline documentation",
        "Fix login session refresh",
        "Reduce report query latency",
        "Add retry handling to data ingestion",
    ]
)

tree.organize()       # rebuild a clean hierarchy and label every node
print(tree.show())    # human-readable outline

Use a real embedding provider in production (the OpenAI provider needs the embed-tree[openai] extra):

from embed_tree import EmbedTree, OpenAIEmbeddingProvider

embedder = OpenAIEmbeddingProvider(
    model="text-embedding-3-small",
    api_key="...",
)

tree = EmbedTree(embedder=embedder)
tree.add("Some document text", payload={"source": "docs"})

Core API

EmbedTree is the main entry point.

tree.add(content, item_id=None, payload=None, text=None)
tree.add_batch(contents, item_ids=None, payloads=None, texts=None)
tree.add_node(content_node)
tree.add_nodes(content_nodes)
tree.add_partial_tree(partial_tree)

tree.organize(tagger=None)
tree.rebalance()
tree.label(tagger=None)

tree.query(content, k=10, exhaustive=False)
tree.remove(item_id)
tree.remove_batch(item_ids)

tree.show(max_items=3)
tree.to_dict(max_items=5)
tree.get_tree()
len(tree)

content is what gets embedded. text is the human-readable string used in labels and browse output; it defaults to content when content is a string. payload is returned in query results and exported browse data.

Configuration

Configuration is explicit and code-driven through TreeConfig. It does not read environment variables.

from embed_tree import LLMConfig, RebalanceConfig, TreeConfig

config = TreeConfig(
    max_branches=5,
    leaf_capacity=10,
    rebalance=RebalanceConfig(enabled=True, every_n_inserts=10_000),
    llm=LLMConfig(provider="none"),  # default keyword labels, no network
)

Defaults are tuned for readable taxonomies: small fan-out and small leaves. Raise max_branches and leaf_capacity when using the tree primarily as a retrieval index.
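
To get a feel for this trade-off, the back-of-the-envelope sketch below estimates how many levels of branching a tree needs for a given number of items. It assumes an idealized, perfectly balanced tree in which every leaf is full, which a real clustering tree will rarely achieve. min_levels is a hypothetical helper and not part of embed-tree, and the "wide" values (50 and 200) are illustrative only.

```python
import math


def min_levels(n_items: int, max_branches: int, leaf_capacity: int) -> int:
    """Lower bound on the levels of branching above the leaves.

    Assumption: a perfectly balanced tree with completely full leaves.
    Real trees are usually deeper and uneven.
    """
    if n_items <= 0:
        return 0
    leaves_needed = math.ceil(n_items / leaf_capacity)
    levels = 0
    reachable = 1  # leaves reachable through `levels` levels of branching
    while reachable < leaves_needed:
        reachable *= max_branches
        levels += 1
    return levels


# Small fan-out and small leaves, as in the Quick Start.
print(min_levels(10_000, max_branches=5, leaf_capacity=10))    # 5
# Hypothetical wide settings for a retrieval-oriented tree.
print(min_levels(10_000, max_branches=50, leaf_capacity=200))  # 1
```

Under these assumptions, small settings give a deep outline with short lists at each step, which is easier to browse, while wide settings give a flatter tree with larger leaves.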

Querying

hits = tree.query("related content", k=5)
# [(item_id, distance, payload), ...]

exact_hits = tree.query("related content", k=5, exhaustive=True)

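Default queries route to one leaf and rank only the items in that leaf. This is fast but approximate, since a closer item stored in a neighboring leaf can be missed. exhaustive=True scans every item and returns exact nearest neighbors, at a cost that grows with the size of the tree.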
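
The difference between the two modes can be illustrated with a toy model. The sketch below does not use embed-tree and is not its implementation: it assumes routing picks the leaf with the nearest centroid, and the leaf names, 2-D vectors, and helper functions are all hypothetical.

```python
import math

# Toy 2-D "embeddings" grouped into two leaves (hypothetical data).
LEAVES = {
    "leaf-1": {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (5.0, 0.0)},
    "leaf-2": {"d": (6.0, 0.0), "e": (11.0, 0.0), "f": (13.0, 0.0)},
}


def centroid(points):
    """Mean of a collection of equal-length vectors."""
    points = list(points)
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def rank(query, items, k):
    """Return the k (item_id, distance) pairs closest to the query."""
    scored = [(item_id, math.dist(query, vec)) for item_id, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


def query_routed(query, k):
    """Approximate: pick the leaf with the nearest centroid, rank only there."""
    best_leaf = min(
        LEAVES.values(),
        key=lambda items: math.dist(query, centroid(items.values())),
    )
    return rank(query, best_leaf, k)


def query_exhaustive(query, k):
    """Exact: rank every item in every leaf."""
    all_items = {
        item_id: vec for items in LEAVES.values() for item_id, vec in items.items()
    }
    return rank(query, all_items, k)


q = (5.6, 0.0)
routed_ids = [item_id for item_id, _ in query_routed(q, k=2)]
exact_ids = [item_id for item_id, _ in query_exhaustive(q, k=2)]
print(routed_ids)  # ['c', 'b']
print(exact_ids)   # ['d', 'c']
```

The query sits near the boundary between the two leaves, so the routed search never sees item d even though it is the true nearest neighbor. Queries that land well inside a cluster are much less likely to be affected.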

Persistence

from embed_tree import EmbedTree, FileTreeStore

tree = EmbedTree(
    embedder=embedder,
    store=FileTreeStore("./tree.json"),
)

FileTreeStore saves an atomic JSON snapshot after writes and reloads it when the tree is constructed again.
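
For readers unfamiliar with the term, an atomic snapshot is commonly written by saving to a temporary file and then renaming it over the target. The sketch below shows that general pattern with the standard library only. It is an illustration under that assumption, not FileTreeStore's actual code, and write_json_atomically and read_json are hypothetical names.

```python
import json
import os
import tempfile


def write_json_atomically(path: str, data: dict) -> None:
    """Write `data` to `path` so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX when on the same filesystem
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_json(path: str) -> dict:
    """Load a snapshot written by write_json_atomically."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    snapshot_path = os.path.join(workdir, "tree.json")
    write_json_atomically(snapshot_path, {"items": ["a", "b"]})
    restored = read_json(snapshot_path)
    print(restored)  # {'items': ['a', 'b']}
```

With this pattern, a crash in the middle of a write leaves the previous snapshot untouched, so the next load sees either the old state or the new one and never a truncated file.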

Labeling

Without extra configuration, node labels are generated locally from keywords. For LLM labels:

from embed_tree import LLMConfig, TreeConfig

config = TreeConfig(
    llm=LLMConfig(provider="openai", model="gpt-4o-mini", api_key="...")
)

You can also pass a custom tagger, a Callable[[list[str]], str], to EmbedTree(..., tagger=...), tree.label(tagger=...), or tree.organize(tagger=...).
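
A tagger with that signature might look like the sketch below. It assumes the list passed in holds the text of the items grouped under a node, which this README does not spell out. keyword_tagger and its stopword list are hypothetical and unrelated to the built-in keyword labeler.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "on", "with"}


def keyword_tagger(texts: list[str]) -> str:
    """Label a group of texts with its two most frequent non-stopword words."""
    words = [
        word
        for text in texts
        for word in re.findall(r"[a-z]+", text.lower())
        if word not in STOPWORDS
    ]
    if not words:
        return "misc"  # fallback label when nothing usable remains
    top = [word for word, _ in Counter(words).most_common(2)]
    return " / ".join(top)


label = keyword_tagger(
    [
        "Fix login session refresh",
        "Login page times out",
        "Reset login session cookie",
    ]
)
print(label)  # login / session
```

A function like this matches the Callable[[list[str]], str] shape, so it could be supplied wherever a tagger is accepted, for example tree.label(tagger=keyword_tagger).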

More Documentation

See docs/API.md for the fuller API reference, provider details, loader/persister abstractions, PCA options, and extension points.

Development

uv sync --extra dev
uv run pytest -q


Download files

Source Distribution

embed_tree-0.0.6.tar.gz (42.1 kB)

Uploaded Source

Built Distribution

embed_tree-0.0.6-py3-none-any.whl (47.0 kB)

Uploaded Python 3

File details

Details for the file embed_tree-0.0.6.tar.gz.

File metadata

  • Download URL: embed_tree-0.0.6.tar.gz
  • Upload date:
  • Size: 42.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for embed_tree-0.0.6.tar.gz
Algorithm Hash digest
SHA256 b7895e823832b56e3668078938769943ceed881ab4ae70ca9e85e17fa8e85048
MD5 6ad311d88c30b855c6ee5214f57b1a54
BLAKE2b-256 60fc5cbbf933e3bac0419253dbd2b29a5036f88c028c0a4258761c38c317b397

Provenance

The following attestation bundles were made for embed_tree-0.0.6.tar.gz:

Publisher: publish.yml on Arnoldosmium/embed-tree

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file embed_tree-0.0.6-py3-none-any.whl.

File metadata

  • Download URL: embed_tree-0.0.6-py3-none-any.whl
  • Upload date:
  • Size: 47.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for embed_tree-0.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 cc640a5160e189f05529f477fa8630c2d55d63d9380c20573ff413ae5a3aa741
MD5 47204bb603c1c9e588d383e4ab772b5a
BLAKE2b-256 c4da771a77dc78a19411504676663e1a7f7fb276933b5e538edff152dd1830bd

Provenance

The following attestation bundles were made for embed_tree-0.0.6-py3-none-any.whl:

Publisher: publish.yml on Arnoldosmium/embed-tree
