Structure-aware document retrieval without embeddings.

These details have not been verified by PyPI

Project links

Project description

TreeSearch: Structure-Aware Document Retrieval

TreeSearch is a structure-aware document retrieval library. No vector embeddings. No chunk splitting. BM25 + LLM reasoning over document tree structures.

Installation

pip install -U pytreesearch

Quick Start

import asyncio
from treesearch import build_index, load_index, Document, search

async def main():
    # Build indexes for markdown files
    await build_index(paths=["docs/*.md"], output_dir="./indexes")

    # Load indexed documents
    import os
    documents = []
    for fp in sorted(os.listdir("./indexes")):
        if not fp.endswith(".json"):
            continue
        data = load_index(os.path.join("./indexes", fp))
        documents.append(Document(
            doc_id=fp, doc_name=data["doc_name"],
            structure=data["structure"],
            doc_description=data.get("doc_description", ""),
        ))

    # Search with Best-First strategy (BM25 + LLM)
    result = await search(query="How does auth work?", documents=documents)
    for doc_result in result.documents:
        for node in doc_result["nodes"]:
            print(f"[{node['score']:.2f}] {node['title']}")

asyncio.run(main())

Set up API key first:

export OPENAI_API_KEY="sk-..."
# Optional: custom endpoint
export OPENAI_BASE_URL="https://your-endpoint/v1"

Why TreeSearch?

Traditional RAG systems split documents into fixed-size chunks and retrieve by vector similarity. This destroys document structure, loses heading hierarchy, and misses reasoning-dependent queries.

TreeSearch takes a fundamentally different approach — parse documents into tree structures based on their natural heading hierarchy, then use BM25 + LLM reasoning to navigate the tree and find the most relevant sections.

	Traditional RAG	TreeSearch
Preprocessing	Chunk splitting + embedding	Parse headings → build tree
Retrieval	Vector similarity search	BM25 pre-scoring + LLM tree search
Multi-doc	Needs vector DB for routing	LLM routes by document descriptions
Structure	Lost after chunking	Fully preserved as tree hierarchy
Dependencies	Vector DB + embedding model	LLM only (no embedding, no vector DB)
Zero-cost baseline	N/A	BM25-only search (no LLM needed)

Key Advantages

No vector embeddings — No embedding model to train, deploy, or pay for
No chunk splitting — Documents retain their natural heading structure
No vector DB — No Pinecone, Milvus, or Chroma to manage
Tree-aware retrieval — Heading hierarchy guides search, not arbitrary chunk boundaries
BM25 zero-cost baseline — Instant keyword search with no API calls, useful as standalone or pre-filter
Budget-controlled LLM calls — Set max LLM calls per query, with early stopping when confidence is high

Features

Three-layer search — BM25 pre-scoring → Best-First tree search → LLM relevance evaluation
Tree-structured indexing — Markdown and plain text documents are parsed into hierarchical trees
BM25 node-level index — Structure-aware scoring with hierarchical field weighting (title > summary > body) and ancestor propagation
Best-First search (default) — Priority queue driven, deterministic, with early stopping and budget control
MCTS search — Monte Carlo Tree Search with LLM as value function
LLM single-pass — One LLM call per document for minimal cost
Multi-document search — Route queries across document collections via LLM reasoning
Chinese + English — Built-in jieba tokenization for Chinese and regex tokenization for English
Batch indexing — build_index() supports glob patterns for concurrent multi-file processing
Evaluation metrics — Built-in Precision@K, Recall@K, MRR, NDCG@K, Hit@K, F1@K
Async-first — All core functions are async with sync wrappers available
CLI included — treesearch index and treesearch search commands

BM25 Standalone (No LLM Needed)

from treesearch import NodeBM25Index, Document, load_index

data = load_index("indexes/my_doc.json")
doc = Document(doc_id="doc1", doc_name=data["doc_name"], structure=data["structure"])

index = NodeBM25Index([doc])
results = index.search("authentication config", top_k=5)
for r in results:
    print(f"[{r['bm25_score']:.4f}] {r['title']}")

CLI

# Build indexes from glob pattern
treesearch index --paths "docs/*.md" --add-description

# Search with Best-First (default, BM25 + LLM)
treesearch search --index_dir ./indexes/ --query "How does auth work?"

# Search with MCTS strategy
treesearch search --index_dir ./indexes/ --query "deployment" --strategy mcts

# Control LLM budget
treesearch search --index_dir ./indexes/ --query "auth" --max-llm-calls 10

How It Works

Input Documents (MD/TXT)
        │
        ▼
   ┌──────────┐
   │  Indexer  │  Parse headings → build tree → generate summaries
   └────┬─────┘    (build_index supports glob for batch processing)
        │  JSON index files
        ▼
   ┌──────────┐
   │  search   │  BM25 pre-score → route to docs → tree search
   └────┬─────┘
        │  SearchResult
        ▼
  Ranked nodes with scores and text

Layer 1 — BM25 Pre-Scoring: NodeBM25Index scores all tree nodes using structure-aware BM25 with hierarchical field weighting (title > summary > body) and ancestor score propagation. Instant, no LLM needed.

Layer 2 — Best-First Tree Search: BestFirstTreeSearch uses a priority queue to expand the most promising nodes. LLM evaluates each node's relevance (title + summary only). Early stopping when top score drops below threshold.

Layer 3 — Results: Budget-controlled LLM calls with subtree caching for reuse across similar queries.

Search Strategies

Strategy	Description	LLM Calls	Best For
`best_first` (default)	BM25 pre-scoring + priority queue + LLM evaluation	Moderate (budget-controlled)	General-purpose, best accuracy
`mcts`	Monte Carlo Tree Search with LLM as value function	High	Complex reasoning queries
`llm`	Single LLM call per document	Minimal	Low-cost, simple queries
BM25-only	`NodeBM25Index.search()` standalone	Zero	Instant keyword search, no API key

Examples

Example	Description
`01_index_and_search.py`	Single document indexing + BestFirst search
`02_text_indexing.py`	Plain text → tree index with auto heading detection
`03_cli_workflow.py`	CLI workflow: build indexes + search with strategies
`04_multi_doc_search.py`	Multi-doc BM25 + BestFirst + strategy comparison + Chinese
`05_benchmark.py`	Benchmark: BM25 / BestFirst / MCTS / LLM with metrics

Project Structure

treesearch/
├── llm.py            # Async LLM client with retry and JSON extraction
├── tree.py           # Document dataclass, tree operations, persistence
├── indexer.py        # Markdown / plain text → tree structure, batch build_index()
├── search.py         # Best-First, MCTS, LLM search, document routing, unified search() API
├── rank_bm25.py      # BM25Okapi, NodeBM25Index, Chinese/English tokenizer
├── metrics.py        # Evaluation: Precision@K, Recall@K, MRR, NDCG@K, Hit@K, F1@K
└── cli.py            # CLI entry point (index / search)

Documentation

Architecture — Design principles and three-layer architecture
API Reference — Complete API documentation

Community

GitHub Issues — Submit an issue
WeChat Group — Add WeChat ID xuming624, note "llm", to join the tech group

Citation

If you use TreeSearch in your research, please cite:

@software{xu2026treesearch,
  author = {Xu, Ming},
  title = {TreeSearch: Structure-Aware Document Retrieval Without Embeddings},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/shibing624/TreeSearch}
}

License

Apache License 2.0

Contributing

Contributions are welcome! Please submit a Pull Request.

Acknowledgements

BM25 (Okapi BM25) — The classic probabilistic ranking function
jieba — Chinese text segmentation
OpenAI API — LLM reasoning backbone

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

1.1.1

Apr 29, 2026

1.1.0

Apr 22, 2026

1.0.8

Apr 17, 2026

1.0.7

Apr 16, 2026

1.0.6

Apr 16, 2026

1.0.5

Apr 13, 2026

1.0.4

Apr 2, 2026

1.0.3

Apr 1, 2026

1.0.2

Mar 24, 2026

1.0.1

Mar 20, 2026

1.0.0

Mar 20, 2026

0.6.4

Mar 17, 2026

0.6.3

Mar 16, 2026

0.6.2

Mar 14, 2026

0.6.1

Mar 12, 2026

0.6.0

Mar 11, 2026

0.5.7

Mar 11, 2026

0.5.6

Mar 11, 2026

0.5.5

Mar 10, 2026

0.5.4

Mar 10, 2026

0.5.3

Mar 10, 2026

0.5.2

Mar 10, 2026

0.5.1

Mar 10, 2026

0.5.0

Mar 10, 2026

0.2.3

Feb 25, 2026

This version

0.2.1

Feb 25, 2026

0.1.0

Feb 25, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pytreesearch-0.2.1.tar.gz (42.1 kB view details)

Uploaded Feb 25, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pytreesearch-0.2.1-py3-none-any.whl (36.8 kB view details)

Uploaded Feb 25, 2026 Python 3

File details

Details for the file pytreesearch-0.2.1.tar.gz.

File metadata

Download URL: pytreesearch-0.2.1.tar.gz
Upload date: Feb 25, 2026
Size: 42.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pytreesearch-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`0c879d0d81c29c82ec88a25afd2066f6a2d9cb10b7ed71e2f3830296cdbaf073`
MD5	`52f6850c81e39eb5c6a85e0b57f91338`
BLAKE2b-256	`3d0b512f33ae862dc7b22db2eb81a531195672dfe9461c1d9f0a6ac1ad7385b7`

See more details on using hashes here.

File details

Details for the file pytreesearch-0.2.1-py3-none-any.whl.

File metadata

Download URL: pytreesearch-0.2.1-py3-none-any.whl
Upload date: Feb 25, 2026
Size: 36.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pytreesearch-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a864c59a38e9f056d5001feae2e2738987990fb69b9ff8ee6ff8417172eef3c2`
MD5	`5a722cdd363854fc49cecdffa3a80bb1`
BLAKE2b-256	`89a3bfac3f6eb4849a81695fd18b89d1b3484cef2e9f37d6f60368bf935236e6`

See more details on using hashes here.

pytreesearch 0.2.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

TreeSearch: Structure-Aware Document Retrieval

Installation

Quick Start

Why TreeSearch?

Key Advantages

Features

BM25 Standalone (No LLM Needed)

CLI

How It Works

Search Strategies

Examples

Project Structure

Documentation

Community

Citation

License

Contributing

Acknowledgements

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes