Structure-aware document retrieval without embeddings.
Project description
TreeSearch: Structure-Aware Document Retrieval
TreeSearch is a structure-aware document retrieval library. No vector embeddings. No chunk splitting. BM25 + LLM reasoning over document tree structures.
Installation
pip install -U pytreesearch
Quick Start
import asyncio
from treesearch import build_index, load_index, Document, search
async def main():
# Build indexes for markdown files
await build_index(paths=["docs/*.md"], output_dir="./indexes")
# Load indexed documents
import os
documents = []
for fp in sorted(os.listdir("./indexes")):
if not fp.endswith(".json"):
continue
data = load_index(os.path.join("./indexes", fp))
documents.append(Document(
doc_id=fp, doc_name=data["doc_name"],
structure=data["structure"],
doc_description=data.get("doc_description", ""),
))
# Search with Best-First strategy (BM25 + LLM)
result = await search(query="How does auth work?", documents=documents)
for doc_result in result.documents:
for node in doc_result["nodes"]:
print(f"[{node['score']:.2f}] {node['title']}")
asyncio.run(main())
Set up API key first:
export OPENAI_API_KEY="sk-..."
# Optional: custom endpoint
export OPENAI_BASE_URL="https://your-endpoint/v1"
Why TreeSearch?
Traditional RAG systems split documents into fixed-size chunks and retrieve by vector similarity. This destroys document structure, loses heading hierarchy, and misses reasoning-dependent queries.
TreeSearch takes a fundamentally different approach โ parse documents into tree structures based on their natural heading hierarchy, then use BM25 + LLM reasoning to navigate the tree and find the most relevant sections.
| Traditional RAG | TreeSearch | |
|---|---|---|
| Preprocessing | Chunk splitting + embedding | Parse headings โ build tree |
| Retrieval | Vector similarity search | BM25 pre-scoring + LLM tree search |
| Multi-doc | Needs vector DB for routing | LLM routes by document descriptions |
| Structure | Lost after chunking | Fully preserved as tree hierarchy |
| Dependencies | Vector DB + embedding model | LLM only (no embedding, no vector DB) |
| Zero-cost baseline | N/A | BM25-only search (no LLM needed) |
Key Advantages
- No vector embeddings โ No embedding model to train, deploy, or pay for
- No chunk splitting โ Documents retain their natural heading structure
- No vector DB โ No Pinecone, Milvus, or Chroma to manage
- Tree-aware retrieval โ Heading hierarchy guides search, not arbitrary chunk boundaries
- BM25 zero-cost baseline โ Instant keyword search with no API calls, useful as standalone or pre-filter
- Budget-controlled LLM calls โ Set max LLM calls per query, with early stopping when confidence is high
Features
- Three-layer search โ BM25 pre-scoring โ Best-First tree search โ LLM relevance evaluation
- Tree-structured indexing โ Markdown and plain text documents are parsed into hierarchical trees
- BM25 node-level index โ Structure-aware scoring with hierarchical field weighting (title > summary > body) and ancestor propagation
- Best-First search (default) โ Priority queue driven, deterministic, with early stopping and budget control
- MCTS search โ Monte Carlo Tree Search with LLM as value function
- LLM single-pass โ One LLM call per document for minimal cost
- Multi-document search โ Route queries across document collections via LLM reasoning
- Chinese + English โ Built-in jieba tokenization for Chinese and regex tokenization for English
- Batch indexing โ
build_index()supports glob patterns for concurrent multi-file processing - Evaluation metrics โ Built-in Precision@K, Recall@K, MRR, NDCG@K, Hit@K, F1@K
- Async-first โ All core functions are async with sync wrappers available
- CLI included โ
treesearch indexandtreesearch searchcommands
BM25 Standalone (No LLM Needed)
from treesearch import NodeBM25Index, Document, load_index
data = load_index("indexes/my_doc.json")
doc = Document(doc_id="doc1", doc_name=data["doc_name"], structure=data["structure"])
index = NodeBM25Index([doc])
results = index.search("authentication config", top_k=5)
for r in results:
print(f"[{r['bm25_score']:.4f}] {r['title']}")
CLI
# Build indexes from glob pattern
treesearch index --paths "docs/*.md" --add-description
# Search with Best-First (default, BM25 + LLM)
treesearch search --index_dir ./indexes/ --query "How does auth work?"
# Search with MCTS strategy
treesearch search --index_dir ./indexes/ --query "deployment" --strategy mcts
# Control LLM budget
treesearch search --index_dir ./indexes/ --query "auth" --max-llm-calls 10
How It Works
Input Documents (MD/TXT)
โ
โผ
โโโโโโโโโโโโ
โ Indexer โ Parse headings โ build tree โ generate summaries
โโโโโโฌโโโโโโ (build_index supports glob for batch processing)
โ JSON index files
โผ
โโโโโโโโโโโโ
โ search โ BM25 pre-score โ route to docs โ tree search
โโโโโโฌโโโโโโ
โ SearchResult
โผ
Ranked nodes with scores and text
Layer 1 โ BM25 Pre-Scoring: NodeBM25Index scores all tree nodes using structure-aware BM25 with hierarchical field weighting (title > summary > body) and ancestor score propagation. Instant, no LLM needed.
Layer 2 โ Best-First Tree Search: BestFirstTreeSearch uses a priority queue to expand the most promising nodes. LLM evaluates each node's relevance (title + summary only). Early stopping when top score drops below threshold.
Layer 3 โ Results: Budget-controlled LLM calls with subtree caching for reuse across similar queries.
Search Strategies
| Strategy | Description | LLM Calls | Best For |
|---|---|---|---|
best_first (default) |
BM25 pre-scoring + priority queue + LLM evaluation | Moderate (budget-controlled) | General-purpose, best accuracy |
mcts |
Monte Carlo Tree Search with LLM as value function | High | Complex reasoning queries |
llm |
Single LLM call per document | Minimal | Low-cost, simple queries |
| BM25-only | NodeBM25Index.search() standalone |
Zero | Instant keyword search, no API key |
Examples
| Example | Description |
|---|---|
01_index_and_search.py |
Single document indexing + BestFirst search |
02_text_indexing.py |
Plain text โ tree index with auto heading detection |
03_cli_workflow.py |
CLI workflow: build indexes + search with strategies |
04_multi_doc_search.py |
Multi-doc BM25 + BestFirst + strategy comparison + Chinese |
05_benchmark.py |
Benchmark: BM25 / BestFirst / MCTS / LLM with metrics |
Project Structure
treesearch/
โโโ llm.py # Async LLM client with retry and JSON extraction
โโโ tree.py # Document dataclass, tree operations, persistence
โโโ indexer.py # Markdown / plain text โ tree structure, batch build_index()
โโโ search.py # Best-First, MCTS, LLM search, document routing, unified search() API
โโโ rank_bm25.py # BM25Okapi, NodeBM25Index, Chinese/English tokenizer
โโโ metrics.py # Evaluation: Precision@K, Recall@K, MRR, NDCG@K, Hit@K, F1@K
โโโ cli.py # CLI entry point (index / search)
Documentation
- Architecture โ Design principles and three-layer architecture
- API Reference โ Complete API documentation
Community
- GitHub Issues โ Submit an issue
- WeChat Group โ Add WeChat ID
xuming624, note "llm", to join the tech group
Citation
If you use TreeSearch in your research, please cite:
@software{xu2026treesearch,
author = {Xu, Ming},
title = {TreeSearch: Structure-Aware Document Retrieval Without Embeddings},
year = {2026},
publisher = {GitHub},
url = {https://github.com/shibing624/TreeSearch}
}
License
Contributing
Contributions are welcome! Please submit a Pull Request.
Acknowledgements
- BM25 (Okapi BM25) โ The classic probabilistic ranking function
- jieba โ Chinese text segmentation
- OpenAI API โ LLM reasoning backbone
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pytreesearch-0.2.1.tar.gz.
File metadata
- Download URL: pytreesearch-0.2.1.tar.gz
- Upload date:
- Size: 42.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0c879d0d81c29c82ec88a25afd2066f6a2d9cb10b7ed71e2f3830296cdbaf073
|
|
| MD5 |
52f6850c81e39eb5c6a85e0b57f91338
|
|
| BLAKE2b-256 |
3d0b512f33ae862dc7b22db2eb81a531195672dfe9461c1d9f0a6ac1ad7385b7
|
File details
Details for the file pytreesearch-0.2.1-py3-none-any.whl.
File metadata
- Download URL: pytreesearch-0.2.1-py3-none-any.whl
- Upload date:
- Size: 36.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a864c59a38e9f056d5001feae2e2738987990fb69b9ff8ee6ff8417172eef3c2
|
|
| MD5 |
5a722cdd363854fc49cecdffa3a80bb1
|
|
| BLAKE2b-256 |
89a3bfac3f6eb4849a81695fd18b89d1b3484cef2e9f37d6f60368bf935236e6
|