Skip to main content

Structure-aware document retrieval without embeddings.

Project description

๐ŸŒEnglish | ๐Ÿ‡จ๐Ÿ‡ณไธญๆ–‡


TreeSearch: Structure-Aware Document Retrieval

PyPI version Downloads License Apache 2.0 python_version GitHub issues Wechat Group

TreeSearch is a structure-aware document retrieval library. No vector embeddings. No chunk splitting. BM25 + LLM reasoning over document tree structures.

Installation

pip install -U pytreesearch

Quick Start

import asyncio
from treesearch import build_index, load_index, Document, search

async def main():
    # Build indexes for markdown files
    await build_index(paths=["docs/*.md"], output_dir="./indexes")

    # Load indexed documents
    import os
    documents = []
    for fp in sorted(os.listdir("./indexes")):
        if not fp.endswith(".json"):
            continue
        data = load_index(os.path.join("./indexes", fp))
        documents.append(Document(
            doc_id=fp, doc_name=data["doc_name"],
            structure=data["structure"],
            doc_description=data.get("doc_description", ""),
        ))

    # Search with Best-First strategy (BM25 + LLM)
    result = await search(query="How does auth work?", documents=documents)
    for doc_result in result.documents:
        for node in doc_result["nodes"]:
            print(f"[{node['score']:.2f}] {node['title']}")

asyncio.run(main())

Set up API key first:

export OPENAI_API_KEY="sk-..."
# Optional: custom endpoint
export OPENAI_BASE_URL="https://your-endpoint/v1"

Why TreeSearch?

Traditional RAG systems split documents into fixed-size chunks and retrieve by vector similarity. This destroys document structure, loses heading hierarchy, and misses reasoning-dependent queries.

TreeSearch takes a fundamentally different approach โ€” parse documents into tree structures based on their natural heading hierarchy, then use BM25 + LLM reasoning to navigate the tree and find the most relevant sections.

Traditional RAG TreeSearch
Preprocessing Chunk splitting + embedding Parse headings โ†’ build tree
Retrieval Vector similarity search BM25 pre-scoring + LLM tree search
Multi-doc Needs vector DB for routing LLM routes by document descriptions
Structure Lost after chunking Fully preserved as tree hierarchy
Dependencies Vector DB + embedding model LLM only (no embedding, no vector DB)
Zero-cost baseline N/A BM25-only search (no LLM needed)

Key Advantages

  • No vector embeddings โ€” No embedding model to train, deploy, or pay for
  • No chunk splitting โ€” Documents retain their natural heading structure
  • No vector DB โ€” No Pinecone, Milvus, or Chroma to manage
  • Tree-aware retrieval โ€” Heading hierarchy guides search, not arbitrary chunk boundaries
  • BM25 zero-cost baseline โ€” Instant keyword search with no API calls, useful as standalone or pre-filter
  • Budget-controlled LLM calls โ€” Set max LLM calls per query, with early stopping when confidence is high

Features

  • Three-layer search โ€” BM25 pre-scoring โ†’ Best-First tree search โ†’ LLM relevance evaluation
  • Tree-structured indexing โ€” Markdown and plain text documents are parsed into hierarchical trees
  • BM25 node-level index โ€” Structure-aware scoring with hierarchical field weighting (title > summary > body) and ancestor propagation
  • Best-First search (default) โ€” Priority queue driven, deterministic, with early stopping and budget control
  • MCTS search โ€” Monte Carlo Tree Search with LLM as value function
  • LLM single-pass โ€” One LLM call per document for minimal cost
  • Multi-document search โ€” Route queries across document collections via LLM reasoning
  • Chinese + English โ€” Built-in jieba tokenization for Chinese and regex tokenization for English
  • Batch indexing โ€” build_index() supports glob patterns for concurrent multi-file processing
  • Evaluation metrics โ€” Built-in Precision@K, Recall@K, MRR, NDCG@K, Hit@K, F1@K
  • Async-first โ€” All core functions are async with sync wrappers available
  • CLI included โ€” treesearch index and treesearch search commands

BM25 Standalone (No LLM Needed)

from treesearch import NodeBM25Index, Document, load_index

data = load_index("indexes/my_doc.json")
doc = Document(doc_id="doc1", doc_name=data["doc_name"], structure=data["structure"])

index = NodeBM25Index([doc])
results = index.search("authentication config", top_k=5)
for r in results:
    print(f"[{r['bm25_score']:.4f}] {r['title']}")

CLI

# Build indexes from glob pattern
treesearch index --paths "docs/*.md" --add-description

# Search with Best-First (default, BM25 + LLM)
treesearch search --index_dir ./indexes/ --query "How does auth work?"

# Search with MCTS strategy
treesearch search --index_dir ./indexes/ --query "deployment" --strategy mcts

# Control LLM budget
treesearch search --index_dir ./indexes/ --query "auth" --max-llm-calls 10

How It Works

Input Documents (MD/TXT)
        โ”‚
        โ–ผ
   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚  Indexer  โ”‚  Parse headings โ†’ build tree โ†’ generate summaries
   โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜    (build_index supports glob for batch processing)
        โ”‚  JSON index files
        โ–ผ
   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚  search   โ”‚  BM25 pre-score โ†’ route to docs โ†’ tree search
   โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜
        โ”‚  SearchResult
        โ–ผ
  Ranked nodes with scores and text

Layer 1 โ€” BM25 Pre-Scoring: NodeBM25Index scores all tree nodes using structure-aware BM25 with hierarchical field weighting (title > summary > body) and ancestor score propagation. Instant, no LLM needed.

Layer 2 โ€” Best-First Tree Search: BestFirstTreeSearch uses a priority queue to expand the most promising nodes. LLM evaluates each node's relevance (title + summary only). Early stopping when top score drops below threshold.

Layer 3 โ€” Results: Budget-controlled LLM calls with subtree caching for reuse across similar queries.

Search Strategies

Strategy Description LLM Calls Best For
best_first (default) BM25 pre-scoring + priority queue + LLM evaluation Moderate (budget-controlled) General-purpose, best accuracy
mcts Monte Carlo Tree Search with LLM as value function High Complex reasoning queries
llm Single LLM call per document Minimal Low-cost, simple queries
BM25-only NodeBM25Index.search() standalone Zero Instant keyword search, no API key

Examples

Example Description
01_index_and_search.py Single document indexing + BestFirst search
02_text_indexing.py Plain text โ†’ tree index with auto heading detection
03_cli_workflow.py CLI workflow: build indexes + search with strategies
04_multi_doc_search.py Multi-doc BM25 + BestFirst + strategy comparison + Chinese
05_benchmark.py Benchmark: BM25 / BestFirst / MCTS / LLM with metrics

Project Structure

treesearch/
โ”œโ”€โ”€ llm.py            # Async LLM client with retry and JSON extraction
โ”œโ”€โ”€ tree.py           # Document dataclass, tree operations, persistence
โ”œโ”€โ”€ indexer.py        # Markdown / plain text โ†’ tree structure, batch build_index()
โ”œโ”€โ”€ search.py         # Best-First, MCTS, LLM search, document routing, unified search() API
โ”œโ”€โ”€ rank_bm25.py      # BM25Okapi, NodeBM25Index, Chinese/English tokenizer
โ”œโ”€โ”€ metrics.py        # Evaluation: Precision@K, Recall@K, MRR, NDCG@K, Hit@K, F1@K
โ””โ”€โ”€ cli.py            # CLI entry point (index / search)

Documentation

Community

  • GitHub Issues โ€” Submit an issue
  • WeChat Group โ€” Add WeChat ID xuming624, note "llm", to join the tech group

Citation

If you use TreeSearch in your research, please cite:

@software{xu2026treesearch,
  author = {Xu, Ming},
  title = {TreeSearch: Structure-Aware Document Retrieval Without Embeddings},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/shibing624/TreeSearch}
}

License

Apache License 2.0

Contributing

Contributions are welcome! Please submit a Pull Request.

Acknowledgements

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytreesearch-0.2.3.tar.gz (42.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pytreesearch-0.2.3-py3-none-any.whl (36.8 kB view details)

Uploaded Python 3

File details

Details for the file pytreesearch-0.2.3.tar.gz.

File metadata

  • Download URL: pytreesearch-0.2.3.tar.gz
  • Upload date:
  • Size: 42.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pytreesearch-0.2.3.tar.gz
Algorithm Hash digest
SHA256 2f270702358a98833f061007f6e04f6797b9c5ff37cbfa638b2205a47cbea621
MD5 2f63a8749bd2f8981e950d4346c3a8f2
BLAKE2b-256 1638b4e4bb52a5a4e8dd22f01e2d3f59c0abb0816a00f6c4fdb3a2df09ba7957

See more details on using hashes here.

File details

Details for the file pytreesearch-0.2.3-py3-none-any.whl.

File metadata

  • Download URL: pytreesearch-0.2.3-py3-none-any.whl
  • Upload date:
  • Size: 36.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pytreesearch-0.2.3-py3-none-any.whl
Algorithm Hash digest
SHA256 b45b8b536454fa14cb95f2ff2720806e804ae43a32d95b411380bc9453abc98c
MD5 631120ef7ed5ab8e9af2c6e38fca47a0
BLAKE2b-256 77b524f67358967ce8f4cee8763a062ff479f16d4c59dbd7b68216d62f5bf7bc

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page