Structure-aware document retrieval. FTS5/BM25 keyword matching over document trees.
TreeSearch: Structure-Aware Document Retrieval
TreeSearch is a structure-aware document retrieval library. No vector embeddings. No chunk splitting. SQLite FTS5 keyword matching over document tree structures. Supports Markdown, plain text, code files (Python AST + regex, Java/Go/JS/C++ etc.), HTML, XML, JSON, CSV, PDF, and DOCX.
Millisecond-latency search over tens of thousands of documents and large codebases, with structure preservation.
Installation
pip install -U pytreesearch
Quick Start
from treesearch import TreeSearch

# Lazy indexing: the index is built automatically on first search
ts = TreeSearch("docs/*.md", "src/*.py")
results = ts.search("How does auth work?")
for doc in results["documents"]:
    for node in doc["nodes"]:
        print(f"[{node['score']:.2f}] {node['title']}")
        print(f" {node['text'][:200]}")
Why TreeSearch?
Traditional RAG systems split documents into fixed-size chunks and retrieve by vector similarity. This destroys document structure, loses heading hierarchy, and misses reasoning-dependent queries.
TreeSearch takes a fundamentally different approach: parse documents into tree structures based on their natural heading hierarchy, then search with FTS5 keyword matching (zero-cost, no API key needed).
| | Traditional RAG | TreeSearch |
|---|---|---|
| Preprocessing | Chunk splitting + embedding | Parse headings → build tree |
| Retrieval | Vector similarity search | FTS5 keyword matching (no LLM needed) |
| Multi-doc | Needs vector DB for routing | FTS5 cross-doc scoring |
| Structure | Lost after chunking | Fully preserved as tree hierarchy |
| Dependencies | Vector DB + embedding model | SQLite only (no embedding, no vector DB) |
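To make the "parse headings, build tree" step concrete, here is a minimal sketch of nesting Markdown sections by heading level. It is an illustration under simplifying assumptions (ATX `#` headings only, no handling of fenced code), and `Node`, `build_tree`, and `outline` are hypothetical names, not TreeSearch's actual parser.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Node:
    """One section: a heading, its own text, and its subsections."""
    title: str
    level: int
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def build_tree(markdown: str) -> Node:
    """Nest sections by heading level using a stack of open ancestors."""
    root = Node(title="(root)", level=0)
    stack = [root]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            node = Node(title=match.group(2).strip(), level=level)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body text belongs to the innermost open section.
            stack[-1].text += line + "\n"
    return root


def outline(node: Node, depth: int = 0) -> list[str]:
    """Flatten the tree into indented titles for display."""
    lines = [] if node.level == 0 else ["  " * depth + node.title]
    next_depth = depth if node.level == 0 else depth + 1
    for child in node.children:
        lines.extend(outline(child, next_depth))
    return lines


SAMPLE = "# Auth\nOverview.\n## Login\nUse tokens.\n## Logout\n# Storage\n"
tree = build_tree(SAMPLE)
print("\n".join(outline(tree)))
```

Because each node keeps its own text and children, a search hit can be returned as a whole section together with its heading, rather than as an arbitrary chunk.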
Key Advantages
- No vector embeddings: No embedding model to train, deploy, or pay for
- No chunk splitting: Documents retain their natural heading structure
- No vector DB: No Pinecone, Milvus, or Chroma to manage
- Tree-aware retrieval: Heading hierarchy guides search, not arbitrary chunk boundaries
- SQLite FTS5 engine: Persistent inverted index with WAL mode, incremental updates, CJK support, and SQL aggregation
Features
- FTS5 search: Zero LLM calls, millisecond-level FTS5 keyword matching, no API key needed
- SQLite FTS5 engine: Persistent inverted index, WAL mode, incremental updates, MD structure-aware columns (title/summary/body/code/front_matter), column weighting, CJK tokenization
- Tree-structured indexing: Markdown, plain text, code files (Python AST + regex, Java/Go/JS/C++/PHP), HTML, XML, JSON, CSV, PDF, and DOCX are parsed into hierarchical trees
- Parser registry: Extensible ParserRegistry with built-in parsers auto-registered; custom parsers via ParserRegistry.register()
- Python AST parsing: The ast module extracts classes/functions with full signatures (parameters, return types); regex fallback for syntax errors
- PDF/DOCX/HTML parsers: Optional parsers via pageindex, python-docx, beautifulsoup4 (install with pip install pytreesearch[all])
- GrepFilter: Exact literal/regex matching for precise symbol and keyword search across tree nodes
- Source-type routing: Automatic pre-filter selection based on file type (e.g., code files use GrepFilter + FTS5)
- Chinese + English: Built-in jieba tokenization for Chinese and regex tokenization for English
- Batch indexing: build_index() supports glob patterns for concurrent multi-file processing
- Async-first: All core functions are async with sync wrappers available
- Config-driven defaults: search() and build_index() read defaults from get_config(), overridable per-call
- CLI included: treesearch index and treesearch search commands
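The "AST first, regex fallback" idea from the Python parsing feature can be sketched with the standard library alone. This is a simplified illustration, not TreeSearch's parser: `extract_symbols` and `DEF_RE` are hypothetical names, and it assumes Python 3.9+ for `ast.unparse`.

```python
import ast
import re

# Fallback pattern: recovers only the names of defs and classes.
DEF_RE = re.compile(r"^\s*(?:async\s+def|def|class)\s+(\w+)", re.MULTILINE)


def extract_symbols(source: str) -> list[str]:
    """Return class names and function signatures found in Python source."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # The file does not parse, so fall back to a best-effort regex scan.
        return DEF_RE.findall(source)

    symbols = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            symbols.append(f"class {node.name}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            symbols.append(f"def {node.name}({ast.unparse(node.args)}){returns}")
    return symbols


GOOD = "class A:\n    def f(self, x: int) -> str:\n        return str(x)\n"
BROKEN = "def broken(:\n    pass\nclass Later:\n"

print(extract_symbols(GOOD))
print(extract_symbols(BROKEN))
```

The AST path yields full signatures, while the fallback yields names only; that trade-off keeps files with syntax errors searchable instead of dropping them.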
FTS5 Standalone
from treesearch import FTS5Index, Document, load_index

data = load_index("indexes/my_doc.json")
doc = Document(doc_id="doc1", doc_name=data["doc_name"], structure=data["structure"])
fts = FTS5Index(db_path="indexes/fts.db")  # persistent, or omit for in-memory
fts.index_documents([doc])

# Simple keyword search
results = fts.search("authentication config", top_k=5)
for r in results:
    print(f"[{r['fts_score']:.4f}] {r['title']}")

# Advanced FTS5 query syntax
results = fts.search("auth", fts_expression='title:auth AND body:config', top_k=5)

# Per-document aggregation
agg = fts.search_with_aggregation("authentication", group_by_doc=True)
for doc_agg in agg:
    print(f"{doc_agg['doc_name']}: {doc_agg['hit_count']} hits, best={doc_agg['best_score']:.4f}")
CLI
# Build indexes from glob pattern
treesearch index --paths "docs/*.md" --add-description
# Search with FTS5
treesearch search --index_dir ./indexes/ --query "How does auth work?" --fts
# Search with persistent FTS5 database
treesearch search --index_dir ./indexes/ --query "auth" --fts --fts-db ./indexes/fts.db
How It Works
Input Documents (MD/TXT/Code/JSON/CSV/HTML/XML/PDF/DOCX)
       │
       ▼
  ┌──────────┐
  │ Indexer  │  ParserRegistry dispatch → parse structure → build tree → generate summaries
  └────┬─────┘  (build_index supports glob for batch processing)
       │ JSON index files
       ▼
  ┌──────────┐
  │  search  │  FTS5/Grep pre-filter → cross-doc scoring → ranked results
  └────┬─────┘
       │ dict result
       ▼
  Ranked nodes with scores and text
FTS5 Pre-Scoring: FTS5Index uses SQLite FTS5 inverted index with MD structure-aware columns (title/summary/body/code/front_matter) and column weighting for fast scoring. Instant results, no LLM needed.
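Column weighting in SQLite FTS5 can be illustrated directly with Python's built-in sqlite3 module. The sketch below is not TreeSearch's schema: the table name `nodes`, the sample rows, and the weights are assumptions chosen for illustration, and it requires a Python build whose SQLite includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names mirror the ones listed above; the schema itself is an assumption.
conn.execute(
    "CREATE VIRTUAL TABLE nodes USING fts5(title, summary, body, code, front_matter)"
)
rows = [
    ("Authentication", "How login works", "Tokens are issued after login.", "", ""),
    ("Storage", "Where data lives", "Authentication settings are stored in config.", "", ""),
    ("Deployment", "Shipping releases", "Build images and roll them out.", "", ""),
    ("Logging", "Log formats", "Structured logs are written as JSON.", "", ""),
    ("Caching", "Cache layers", "Hot keys are kept in memory.", "", ""),
]
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?, ?)", rows)

# bm25() takes one weight per column, in declaration order. It returns lower
# (more negative) values for better matches, so ascending order ranks best first.
query = """
    SELECT title, bm25(nodes, 10.0, 5.0, 1.0, 1.0, 1.0) AS score
    FROM nodes
    WHERE nodes MATCH ?
    ORDER BY score
"""
results = conn.execute(query, ("authentication",)).fetchall()
for title, score in results:
    print(title, score)
```

With the title column weighted ten times higher than the body, the section titled "Authentication" ranks ahead of the section that only mentions the term in its body text.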
Source-Type Routing: For code files, GrepFilter + FTS5 are combined automatically for precise symbol matching. The pre-filter is selected based on file type via PREFILTER_ROUTING.
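Routing by file type amounts to a lookup from file suffix to the pre-filters to run. The sketch below shows the general shape only: `ROUTING_SKETCH`, `DEFAULT_PREFILTERS`, and `prefilters_for` are hypothetical names, and the entries are illustrative rather than the actual contents of PREFILTER_ROUTING.

```python
from pathlib import Path

# Hypothetical routing table: file suffix -> ordered pre-filters to apply.
ROUTING_SKETCH = {
    ".py": ["grep", "fts5"],
    ".java": ["grep", "fts5"],
    ".md": ["fts5"],
}
DEFAULT_PREFILTERS = ["fts5"]


def prefilters_for(path: str) -> list[str]:
    """Pick pre-filters from the file suffix, falling back to FTS5 only."""
    return ROUTING_SKETCH.get(Path(path).suffix.lower(), DEFAULT_PREFILTERS)


for name in ("src/auth.py", "docs/README.MD", "notes.txt"):
    print(name, prefilters_for(name))
```

Keeping the mapping in one table means a new file type needs a single new entry rather than changes to the search path.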
Use Cases
Use Case 1: Technical Documentation QA (Best Scenario)
Problem: Your company has 100+ technical docs (API docs, design docs, RFCs), and traditional search can't find the right answers.
import asyncio
from treesearch import build_index, search

async def main():
    # 1. Build index (run once)
    docs = await build_index(
        paths=["docs/*.md", "specs/*.txt"],
        output_dir="./indexes",
    )
    # 2. Search: millisecond response
    result = await search(
        query="How to configure Redis cluster?",
        documents=docs,
    )
    # 3. Results: complete sections, not fragments
    for doc in result["documents"]:
        print(f"Doc: {doc['doc_name']}")
        for node in doc["nodes"]:
            print(f" Section: {node['title']}")
            print(f" Content: {node['text'][:200]}...")

# build_index and search are coroutines, so run them inside an event loop
asyncio.run(main())
Why better than traditional RAG?
- Finds complete sections, not fragments
- Includes section titles as context anchors
- Supports hierarchical navigation (parent/child sections)
Use Case 2: Codebase Search
Problem: You want to find "login-related classes and methods" in a large codebase, but grep only returns matching lines without any structure.
import asyncio
from treesearch import build_index, search

async def main():
    # Index codebase
    docs = await build_index(
        paths=["src/**/*.py", "lib/**/*.java"],
        output_dir="./code_indexes",
    )
    # Search: auto-detects code files, uses AST parsing + GrepFilter
    result = await search(
        query="user login authentication",
        documents=docs,
    )
    return result

asyncio.run(main())
# Results example:
# Doc: auth_service.py
#   class UserAuthenticator
#     def login(username, password)
#     def verify_token(token)
Why better than grep/IDE search?
- Semantic understanding: Not just keyword matching, understands "login" = "authentication"
- Structure-aware: Finds complete classes/methods with docstrings
- Precise location: Points directly to the matching code line numbers
Use Case 3: Long Document QA (Papers/Books)
Problem: You have a 50-page paper and want to ask "What experimental methods are mentioned in Chapter 3?"
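import asyncio
from treesearch import build_index, search

async def main():
    docs = await build_index(paths=["paper.pdf"])
    return await search(
        query="experimental methodology",
        documents=docs,
    )

result = asyncio.run(main())
# Automatically finds "3.2 Experimental Design" section content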
Why better than Ctrl+F?
- Semantic matching: Finds synonymous paragraphs for "experimental methods"
- Section location: Tells you which chapter and section
- Scalable to multi-doc: Search 10 papers simultaneously
Real Case Comparison
Case: Find "How to request GPU machines" in company docs
Traditional way (Ctrl+F):
Search "GPU" → Found 47 matches → Manual review → 10 minutes
TreeSearch way:
result = await search("How to request GPU machines", docs)
# Directly returns "Resource Guide > GPU Request Process" section
# Time: < 100ms
Efficiency gain: 100x
Comparison with Other Solutions
| Solution | Pros | Cons | Best For |
|---|---|---|---|
| Ctrl+F | Simple | No semantic understanding, fragmented results | Known keywords |
| Traditional RAG | Good semantic understanding | Chunking destroys context, slow response | Plain text QA |
| Vector DB | Similarity search | Requires embedding preprocessing, high cost | Large-scale semantic retrieval |
| TreeSearch | Preserves structure + Fast + Zero cost | Requires structured documents | Tech docs/Codebase |
Benchmark
Document Retrieval (QASPER)
Evaluated on the QASPER dataset (50 QA samples, 18 academic papers):
| Metric | Embedding (text-embedding-3-small) | TreeSearch FTS5 |
|---|---|---|
| MRR | 0.5403 | 0.4596 |
| Precision@1 | 0.3830 | 0.1915 |
| Recall@5 | 0.5139 | 0.6613 |
| Index Time | 118.7s | 0.0s |
| Query Time | 573ms | 0.7ms |
Key Findings:
- Embedding MRR is about 18% higher (0.5403 vs 0.4596): Better semantic understanding
- TreeSearch Recall@5 is about 29% higher (0.6613 vs 0.5139): Structure preservation helps recall more relevant content
- TreeSearch queries are roughly 780x faster: 0.7 ms vs 573 ms per query
- TreeSearch indexing is near-instant: No embedding API calls needed
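For readers unfamiliar with the metrics, the sketch below shows the standard definitions of MRR and Recall@k and how the relative differences quoted above follow from the table. The function names and the two toy queries are illustrative assumptions; the benchmark scripts may compute these differently in detail.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def relative_gain(better: float, worse: float) -> float:
    """Relative difference, e.g. 0.18 means 18% higher."""
    return (better - worse) / worse


# Toy data: (ranked results, relevant items) for two queries.
queries = [
    (["n3", "n1", "n7"], {"n1"}),        # first relevant hit at rank 2
    (["n2", "n5", "n9"], {"n2", "n9"}),  # first relevant hit at rank 1
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
recall2 = sum(recall_at_k(r, rel, 2) for r, rel in queries) / len(queries)
print(f"MRR={mrr:.2f} Recall@2={recall2:.2f}")  # MRR=0.75 Recall@2=0.75

# Relative differences from the QASPER table.
print(f"{relative_gain(0.5403, 0.4596):.1%}")  # 17.6% (MRR, embedding over TreeSearch)
print(f"{relative_gain(0.6613, 0.5139):.1%}")  # 28.7% (Recall@5, TreeSearch over embedding)
```

MRR rewards putting one relevant result at the very top, while Recall@5 rewards covering more of the relevant sections within the first five results, which is why the two systems can each lead on a different metric.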
Code Retrieval (CodeSearchNet)
Evaluated on the CodeSearchNet dataset (50 queries, 500-item Python corpus):
| Metric | Embedding (text-embedding-3-small) | TreeSearch FTS5 |
|---|---|---|
| MRR | 0.9567 | 0.8469 |
| Hit@1 | 0.9200 | 0.8000 |
| Recall@5 | 1.0000 | 0.9200 |
| Index Time | 73.7s | 3.3s |
| Query Time | 620ms | 0.8ms |
Key Findings:
- Embedding MRR is about 13% higher (0.9567 vs 0.8469): Better code semantic understanding
- TreeSearch MRR of 0.8469: Strong performance for keyword-based code search
- TreeSearch queries are roughly 800x faster: 0.8 ms vs 620 ms per query
- TreeSearch indexing is about 22x faster (3.3 s vs 73.7 s): No embedding API calls needed
Summary
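TreeSearch is not meant to replace embedding-based retrieval, but to provide a zero-cost, ultra-fast alternative. In the benchmarks above, embeddings ranked the first relevant result higher (MRR, Precision@1, Hit@1), while TreeSearch answered queries in under a millisecond and achieved the higher Recall@5 on QASPER. For scenarios that prioritize speed and cost over top-rank precision, TreeSearch is the better choice.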
Run the benchmarks yourself:
# Document retrieval (QASPER)
python examples/benchmark/qasper_benchmark.py --max-samples 50 --max-papers 20 --with-embedding
# Code retrieval (CodeSearchNet)
python examples/benchmark/codesearchnet_benchmark.py --max-samples 50 --max-corpus 500 --with-embedding
Documentation
- Architecture: Design principles and architecture
- API Reference: Complete API documentation
Community
- GitHub Issues: Submit an issue
- WeChat Group: Add WeChat ID xuming624, note "llm", to join the tech group
Citation
If you use TreeSearch in your research, please cite:
@software{xu2026treesearch,
author = {Xu, Ming},
title = {TreeSearch: Structure-Aware Document Retrieval Without Embeddings},
year = {2026},
publisher = {GitHub},
url = {https://github.com/shibing624/TreeSearch}
}
License
Contributing
Contributions are welcome! Please submit a Pull Request.
Acknowledgements
- SQLite FTS5: The full-text search engine powering TreeSearch
- VectifyAI/PageIndex: Inspiration for structure-aware indexing and retrieval