
Structure-aware document retrieval. FTS5/BM25 keyword matching over document trees.

Project description

๐ŸŒEnglish | ๐Ÿ‡จ๐Ÿ‡ณไธญๆ–‡


TreeSearch: Structure-Aware Document Retrieval


TreeSearch is a structure-aware document retrieval library. No vector embeddings. No chunk splitting. SQLite FTS5 keyword matching over document tree structures. Supports Markdown, plain text, code files (Python AST + regex, Java/Go/JS/C++ etc.), HTML, XML, JSON, CSV, PDF, and DOCX.

Millisecond-latency search over tens of thousands of documents and large codebases, with structure preservation.

Installation

pip install -U pytreesearch

Quick Start

from treesearch import TreeSearch

# Just pass directories: auto-discovers all supported files
ts = TreeSearch("project_root/", "docs/")
results = ts.search("How does auth work?")
for doc in results["documents"]:
    for node in doc["nodes"]:
        print(f"[{node['score']:.2f}] {node['title']}")
        print(f"  {node['text'][:200]}")

Directories are walked recursively with smart defaults:

  • Auto-discovers .py, .md, .json, .jsonl, .java, .go, .ts, .pdf, .docx, etc.
  • Skips .git, node_modules, __pycache__, .venv, dist, build, etc.
  • Respects .gitignore when pathspec is installed (pip install pathspec)
  • Safety cap of 10,000 files per directory (configurable via max_files)
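The walk-and-skip behavior above can be sketched in plain Python. The skip set, extension list, and default cap below mirror the bullets but are illustrative assumptions, not TreeSearch's actual constants:

```python
import os

# Illustrative defaults mirroring the list above; TreeSearch's real
# constants may differ.
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "dist", "build"}
EXTENSIONS = {".py", ".md", ".json", ".jsonl", ".java", ".go", ".ts", ".pdf", ".docx"}

def discover(root, max_files=10_000):
    """Recursively collect supported files, pruning skip dirs and capping the count."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if os.path.splitext(name)[1] in EXTENSIONS:
                found.append(os.path.join(dirpath, name))
                if len(found) >= max_files:
                    return found
    return found
```

The in-place `dirnames[:]` assignment is what makes the skip cheap: pruned directories are never even opened.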

You can also mix directories, files, and glob patterns freely:

# All three input types work together
ts = TreeSearch("src/", "docs/*.md", "README.md")
results = ts.search("authentication")

In-Memory Mode

For quick searches, scripts, or ephemeral use cases, set db_path=None to skip writing any .db file to disk:

# In-memory mode: no index.db file, all indexes kept in memory
ts = TreeSearch("docs/", db_path=None)
results = ts.search("voice calls")

Performance is excellent even with thousands of documents (5,000 docs < 10ms). The trade-off is that indexes are lost when the process exits. For persistent, incremental indexing, use the default db_path or set it to a file path.

Tree Mode (Best for Papers & Documents)

For academic papers, long documents, and technical docs with deep heading hierarchy, use tree mode to get structure-aware best-first search:

from treesearch import TreeSearch

# Tree mode: anchor retrieval → tree walk → path aggregation
ts = TreeSearch("papers/", "docs/")
results = ts.search("experimental methodology", search_mode="tree")

# Tree mode returns ranked nodes (same as flat mode)
for doc in results["documents"]:
    for node in doc["nodes"]:
        print(f"[{node['score']:.2f}] {node['title']}")

# Plus: tree traversal paths showing how results connect
for path in results["paths"]:
    chain = " > ".join(p["title"] for p in path["path"])
    print(f"[{path['score']:.2f}] {chain}")
    print(f"  {path['snippet'][:200]}")

When to use which mode?

| Mode | Best For | MRR Advantage |
| --- | --- | --- |
| "auto" (default) | Auto-selects based on document type | Smart default |
| "tree" | Academic papers, technical docs with heading hierarchy | Best on QASPER (+18%) |
| "flat" | Code search, keyword-heavy queries | Best on CodeSearchNet (0.84) |

Auto Mode (search_mode="auto", default): Intelligently selects tree vs flat using a three-layer strategy:

  1. Type mapping: each source_type has an explicit tree-benefit flag (_TREE_BENEFIT)
  2. Depth verification: only docs with actual tree depth ≥ 2 count as hierarchical
  3. Proportion threshold: if ≥ 30% of docs truly benefit from tree, use tree mode; otherwise flat

This avoids the old "1 markdown among 50 code files → tree for everything" problem.
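The three-layer strategy can be sketched in a few lines. This is an illustrative reimplementation of the logic described above, not TreeSearch's internals; names like TREE_BENEFIT and the (source_type, depth) input shape are assumptions:

```python
# Layer 1: per-type tree-benefit flags (assumed values mirroring the table below)
TREE_BENEFIT = {"markdown": True, "json": True, "code": False, "pdf": False}

def pick_mode(docs, threshold=0.30):
    """docs: list of (source_type, tree_depth) tuples; returns 'tree' or 'flat'."""
    beneficial = sum(
        1 for source_type, depth in docs
        # Layer 1 (type flag) and Layer 2 (depth >= 2) must both pass
        if TREE_BENEFIT.get(source_type, False) and depth >= 2
    )
    # Layer 3: proportion threshold over the whole corpus
    return "tree" if docs and beneficial / len(docs) >= threshold else "flat"
```

With one deep markdown file among fifty code files, `beneficial / len(docs)` is about 2%, so the corpus resolves to flat, which is exactly the failure mode the threshold prevents.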

| Document Type | Tree Benefit? | Depth Check | Auto Mode |
| --- | --- | --- | --- |
| Markdown (.md) | ✅ Yes | Must have headings (depth ≥ 2) | tree if deep |
| JSON (.json) | ✅ Yes | Must have nesting (depth ≥ 2) | tree if nested |
| Code (.py/.js/.go...) | ❌ No | n/a | flat |
| PDF (.pdf) | ❌ No | n/a | flat |
| DOCX (.docx) | ❌ No | n/a | flat |
| CSV (.csv) | ❌ No | n/a | flat |
| Text (.txt) | ❌ No | n/a | flat |
| JSONL (.jsonl) | ❌ No | n/a | flat |
| Unknown types | ❌ No (safe default) | n/a | flat |

Why TreeSearch?

Traditional RAG systems split documents into fixed-size chunks and retrieve by vector similarity. This destroys document structure, loses heading hierarchy, and misses reasoning-dependent queries.

TreeSearch takes a fundamentally different approach: parse documents into tree structures based on their natural heading hierarchy, then search with FTS5 keyword matching (zero-cost, no API key needed).

| | Traditional RAG | TreeSearch |
| --- | --- | --- |
| Preprocessing | Chunk splitting + embedding | Parse headings → build tree |
| Retrieval | Vector similarity search | FTS5 keyword matching (no LLM needed) |
| Multi-doc | Needs vector DB for routing | FTS5 cross-doc scoring |
| Structure | Lost after chunking | Fully preserved as tree hierarchy |
| Dependencies | Vector DB + embedding model | SQLite only (no embedding, no vector DB) |

Key Advantages

  • No vector embeddings: no embedding model to train, deploy, or pay for
  • No chunk splitting: documents retain their natural heading structure
  • No vector DB: no Pinecone, Milvus, or Chroma to manage
  • Tree-aware retrieval: heading hierarchy guides search, not arbitrary chunk boundaries
  • SQLite FTS5 engine: persistent inverted index with WAL mode, incremental updates, CJK support, and SQL aggregation
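Because the engine is stock SQLite, the core mechanism can be demonstrated with the standard library alone. This is a minimal FTS5/bm25 sketch, not TreeSearch's actual schema or column weights (it does assume your Python's bundled SQLite has FTS5 enabled, which is the default in CPython builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE nodes USING fts5(title, body)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [
        ("Auth Overview", "How authentication and login work"),
        ("Deployment", "Deploy the service with Docker"),
        ("Token Auth", "Configure token based authentication"),
    ],
)
# bm25() returns lower-is-better scores; negate for a familiar higher-is-better ranking.
rows = conn.execute(
    "SELECT title, -bm25(nodes) AS score FROM nodes "
    "WHERE nodes MATCH 'authentication' ORDER BY score DESC"
).fetchall()
for title, score in rows:
    print(f"[{score:.2f}] {title}")
```

Column weighting like the title/summary/body/code split described above maps onto the same primitive: `bm25(nodes, 2.0, 1.0)` weights the title column twice as heavily as the body.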

Features

  • Smart directory discovery: ts.index("src/") recursively discovers all supported files; skips .git/node_modules/__pycache__; respects .gitignore
  • FTS5 search: zero LLM calls, millisecond-level FTS5 keyword matching, no API key needed
  • SQLite FTS5 engine: persistent inverted index, WAL mode, incremental updates, MD structure-aware columns (title/summary/body/code/front_matter), column weighting, CJK tokenization
  • Tree-structured indexing: Markdown, plain text, code files (Python AST + regex, Java/Go/JS/C++/PHP), HTML, XML, JSON, CSV, PDF, and DOCX are parsed into hierarchical trees
  • Ripgrep-accelerated GrepFilter: auto-uses system rg for fast line-level matching with transparent native Python fallback; hit-count-based scoring ranks multi-match nodes higher
  • Parser registry: extensible ParserRegistry with built-in parsers auto-registered; custom parsers via ParserRegistry.register()
  • Python AST parsing: the ast module extracts classes/functions with full signatures (parameters, return types); regex fallback for syntax errors
  • PDF/DOCX/HTML parsers: optional parsers via PyMuPDF, python-docx, beautifulsoup4 (install with pip install pytreesearch[all])
  • GrepFilter: exact literal/regex matching for precise symbol and keyword search across tree nodes
  • Source-type routing: automatic pre-filter selection based on file type (e.g., code files use GrepFilter + FTS5)
  • Chinese + English: built-in jieba tokenization for Chinese and regex tokenization for English
  • Batch indexing: build_index() supports glob patterns, files, and directories for concurrent multi-file processing
  • Async-first: all core functions are async, with sync wrappers available
  • Config-driven defaults: search() and build_index() read defaults from get_config(), overridable per-call
  • CLI included: treesearch "query" path/ for instant search; treesearch index and treesearch search for advanced workflows
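The parser-registry pattern behind extension-based dispatch can be sketched in a few lines. ParserRegistry.register() is named in the feature list above, but the signature, dispatch logic, and parser output shape here are assumptions for illustration, not TreeSearch's actual API:

```python
# Minimal sketch of a registry mapping file extensions to parser callables.
class ParserRegistry:
    def __init__(self):
        self._parsers = {}

    def register(self, extension, parser):
        """Associate a file extension (e.g. '.md') with a parser callable."""
        self._parsers[extension] = parser

    def parse(self, path, text):
        """Dispatch on the file extension and return the parser's tree."""
        ext = "." + path.rsplit(".", 1)[-1]
        parser = self._parsers.get(ext)
        if parser is None:
            raise ValueError(f"no parser registered for {ext}")
        return parser(text)

registry = ParserRegistry()
# A toy markdown "parser" standing in for a real tree builder
registry.register(".md", lambda text: {"type": "markdown", "lines": text.splitlines()})
```

Built-in parsers would be pre-registered this way at import time, and a custom format only needs one register() call.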

FTS5 Standalone

from treesearch import FTS5Index, Document, load_index, save_index, md_to_tree
import asyncio

# Option 1: Build from Markdown and save to DB
result = asyncio.run(md_to_tree(md_path="docs/voice-call.md", if_add_node_summary=True))
save_index(result, "indexes/voice-call.db")

# Option 2: Load a previously saved document from DB
doc = load_index("indexes/voice-call.db")  # returns a Document object

# Create FTS5 index and search
fts = FTS5Index(db_path="indexes/fts.db")  # persistent, or omit for in-memory
fts.index_documents([doc])

# Simple keyword search
results = fts.search("authentication config", top_k=5)
for r in results:
    print(f"[{r['fts_score']:.4f}] {r['title']}")

# Advanced FTS5 query syntax
results = fts.search("auth", fts_expression='title:auth AND body:config', top_k=5)

# Per-document aggregation
agg = fts.search_with_aggregation("authentication", group_by_doc=True)
for doc_agg in agg:
    print(f"{doc_agg['doc_name']}: {doc_agg['hit_count']} hits, best={doc_agg['best_score']:.4f}")

CLI

# Default mode: one command does everything (lazy index + search)
treesearch "How does auth work?" src/ docs/
treesearch "configure Redis" project/

# With options
treesearch "auth" src/ --max-nodes 10 --db ./my_index.db

# Advanced: build index separately (for large codebases)
treesearch index --paths src/ docs/ --add-description
treesearch index --paths "docs/*.md" "src/**/*.py" --add-description

# Advanced: search a pre-built index
treesearch search --index_dir ./indexes/ --query "How does auth work?"

How It Works

Input Documents (MD/TXT/Code/JSON/CSV/HTML/XML/PDF/DOCX)
         │
         ▼
   ┌───────────┐
   │  Indexer  │  ParserRegistry dispatch → parse structure → build tree → generate summaries
   └─────┬─────┘    (build_index supports glob for batch processing)
         │  SQLite DB (FTS5)
         ▼
   ┌───────────┐
   │  search   │  FTS5/Grep pre-filter → cross-doc scoring → ranked results
   └─────┬─────┘
         │  dict result
         ▼
   Ranked nodes with scores and text

Flat Mode (default): FTS5Index uses SQLite FTS5 inverted index with structure-aware columns (title/summary/body/code/front_matter) and column weighting for fast scoring. Instant results, no LLM needed.

Tree Mode: Best-first search over document trees. FTS5 finds anchor nodes, then walks the tree (parent/child/sibling) with heuristic scoring (title match, term overlap, IDF weighting, generic section demotion) to find optimal paths through the document hierarchy.

Source-Type Routing: For code files, GrepFilter + FTS5 are combined automatically for precise symbol matching. The pre-filter is selected based on file type via PREFILTER_ROUTING.
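The anchor-then-walk loop of tree mode can be sketched as a priority-queue best-first traversal. The scoring function and link structure here are placeholders for the heuristics listed above (title match, term overlap, IDF weighting), not TreeSearch's implementation:

```python
import heapq

def best_first(anchors, neighbors, score, budget=20):
    """Expand highest-scoring nodes first across parent/child/sibling links.

    anchors: seed node ids (e.g. top FTS5 hits)
    neighbors: dict mapping node id -> linked node ids
    score: callable node id -> float (stand-in for the real heuristics)
    Returns visited nodes in best-first order, up to budget.
    """
    heap = [(-score(n), n) for n in anchors]  # max-heap via negation
    heapq.heapify(heap)
    seen = set(anchors)
    visited = []
    while heap and len(visited) < budget:
        _, node = heapq.heappop(heap)
        visited.append(node)
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return visited
```

The budget keeps the walk bounded, so query latency stays near the FTS5 anchor lookup even on deep trees.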

Use Cases

Use Case 1: Technical Documentation QA (Best Scenario)

Problem: Your company has 100+ technical docs (API docs, design docs, RFCs), and traditional search can't find the right answers.

from treesearch import build_index, search

# 1. Build index: just pass directories (run once)
docs = await build_index(
    paths=["docs/", "specs/"],
    output_dir="./indexes"
)

# 2. Search: millisecond response
result = await search(
    query="How to configure Redis cluster?",
    documents=docs,
)

# 3. Results: complete sections, not fragments
for doc in result["documents"]:
    print(f"Doc: {doc['doc_name']}")
    for node in doc["nodes"]:
        print(f"  Section: {node['title']}")
        print(f"  Content: {node['text'][:200]}...")

Why better than traditional RAG?

  • Finds complete sections, not fragments
  • Includes section titles as context anchors
  • Supports hierarchical navigation (parent/child sections)

Use Case 2: Codebase Search

Problem: Want to search for "login-related classes and methods" in a large codebase, but grep only finds lines without structure.

# Index entire directories: auto-discovers .py, .java, .go, etc.
docs = await build_index(
    paths=["src/", "lib/"],
    output_dir="./code_indexes"
)

# Search: auto-detects code files, uses AST parsing + GrepFilter (ripgrep-accelerated)
result = await search(
    query="user login authentication",
    documents=docs,
)

# Results example:
# Doc: auth_service.py
#   class UserAuthenticator
#     def login(username, password)
#     def verify_token(token)

Why better than grep/IDE search?

  • Semantic understanding: Not just keyword matching, understands "login" = "authentication"
  • Structure-aware: Finds complete classes/methods with docstrings
  • Precise location: points directly to the relevant code line numbers

Use Case 3: Long Document QA (Papers/Books)

Problem: Have a 50-page paper, want to ask "What experimental methods are mentioned in Chapter 3?"

docs = await build_index(paths=["paper.pdf"])

result = await search(
    query="experimental methodology",
    documents=docs,
)

# Automatically finds "3.2 Experimental Design" section content

Why better than Ctrl+F?

  • Semantic matching: Finds synonymous paragraphs for "experimental methods"
  • Section location: Tells you which chapter and section
  • Scalable to multi-doc: Search 10 papers simultaneously

Real Case Comparison

Case: Find "How to request GPU machines" in company docs

Traditional way (Ctrl+F):

Search "GPU" → 47 matches → manual review → 10 minutes

TreeSearch way:

result = await search("How to request GPU machines", docs)
# Directly returns "Resource Guide > GPU Request Process" section
# Time: < 100ms

Efficiency gain: 100x

Comparison with Other Solutions

| Solution | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Ctrl+F | Simple | No semantic understanding, fragmented results | Known keywords |
| Vector DB | Similarity search | Requires embedding preprocessing, high cost | Large-scale semantic retrieval |
| TreeSearch | Preserves structure + fast + zero cost | Requires structured documents | Tech docs / codebases |

Benchmark

Document Retrieval (QASPER)

Evaluated on QASPER dataset (47 queries, 18 academic papers):

| Metric | Embedding (zhipu-embedding-3) | TreeSearch FTS5 | TreeSearch Tree |
| --- | --- | --- | --- |
| MRR | 0.4235 | 0.4033 | 0.4763 |
| Precision@1 | 0.2553 | 0.2128 | 0.2979 |
| Recall@5 | 0.4259 | 0.3387 | 0.4344 |
| Hit@5 | 0.6383 | 0.7021 | 0.7660 |
| NDCG@3 | 0.4245 | 0.2929 | 0.3702 |
| Index Time | 22.8s | 0.1s | 0.1s |
| Avg Query Time | 151.8ms | 0.8ms | 1.2ms |

Key Findings:

  • ๐Ÿ† Tree mode wins MRR (0.48 vs 0.42 Embedding vs 0.40 FTS5) โ€” Structure-aware tree walk boosts ranking quality
  • Tree mode MRR +18.1% over FTS5 on academic papers
  • Tree mode Recall@5 +35% over Embedding โ€” Hierarchical traversal finds more relevant content
  • Tree mode Hit@5 0.77 vs Embedding 0.64 โ€” Significantly better coverage
  • TreeSearch 126x faster queries โ€” Sub-millisecond vs hundreds of milliseconds

Financial Document Retrieval (FinanceBench)

Evaluated on FinanceBench dataset (50 queries, SEC filings):

| Metric | Embedding (zhipu-embedding-3) | TreeSearch FTS5 | TreeSearch Tree |
| --- | --- | --- | --- |
| MRR | 0.2206 | 0.2420 | 0.2386 |
| Precision@1 | 0.1000 | 0.1000 | 0.1200 |
| Recall@5 | 0.2782 | 0.2067 | 0.2076 |
| Hit@5 | 0.3600 | 0.5200 | 0.5400 |
| NDCG@10 | 0.2852 | 0.3680 | 0.3287 |
| Index Time | 406.0s | 0.24s | 0.24s |
| Avg Query Time | 154.3ms | 16.5ms | 47.6ms |

Key Findings:

  • ๐Ÿ† FTS5 and Tree nearly tied on MRR (0.24 vs 0.24) โ€” Both close toๆŒๅนณ
  • Tree mode Hit@5 slightly higher (0.54 vs 0.52) โ€” Better recall on SEC filings
  • FTS5 Precision@1 = 0.30 โ€” 3x better than Embedding (0.10) on exact term matching
  • TreeSearch 1692x faster indexing โ€” 0.24s vs 406s (no embedding API calls for large documents)
  • TreeSearch 9x faster queries โ€” Milliseconds vs hundreds of milliseconds

Code Retrieval (CodeSearchNet)

Evaluated on CodeSearchNet dataset (50 queries, 500 Python corpus):

| Metric | Embedding (zhipu-embedding-3) | TreeSearch FTS5 | TreeSearch Tree |
| --- | --- | --- | --- |
| MRR | 0.8483 | 0.8400 | 0.0029 |
| Precision@1 | 0.7800 | 0.8200 | 0.0000 |
| Recall@5 | 0.9400 | 0.8600 | 0.0000 |
| Hit@1 | 0.7800 | 0.8200 | 0.0000 |
| Index Time | 33.8s | 2.8s | 2.8s |
| Avg Query Time | 166.0ms | 1.7ms | 1382.4ms |

Key Findings:

  • TreeSearch FTS5 MRR nearly matches Embedding (0.84 vs 0.85): BM25 excels on code with high lexical overlap
  • Tree mode completely fails on code search (MRR = 0.003). Do NOT use Tree mode for code; search_mode="auto" auto-resolves to flat for code-only corpora
  • TreeSearch Precision@1 wins (0.82 vs 0.78): exact keyword matching is strong for code search
  • TreeSearch 98x faster queries: milliseconds vs hundreds of milliseconds
  • TreeSearch 12x faster indexing: no embedding API calls needed

Summary

TreeSearch provides zero-cost, ultra-fast retrieval that outperforms embeddings on structured documents. Tree mode excels on academic papers (MRR +12% over Embedding), FTS5 mode leads on financial documents (MRR +10% over Embedding), and FTS5 keeps pace with embeddings on code search, all at roughly 100x faster query speed.

| Benchmark | Best Mode | MRR | vs Embedding | Query Speed |
| --- | --- | --- | --- | --- |
| QASPER (Academic Papers) | Tree | 0.4763 | +12% | 126x faster |
| FinanceBench (SEC Filings) | FTS5 | 0.2420 | +10% | 9x faster |
| CodeSearchNet (Python) | FTS5 | 0.8400 | −1% | 98x faster |

Note: FinanceBench Tree vs FTS5 gap narrowed to 1.4% after algorithm improvements. CodeSearchNet Tree mode is not recommended (MRR~0); use search_mode="auto" or "flat" for code.

Run the benchmarks yourself:

# Document retrieval (QASPER)
python examples/benchmark/qasper_benchmark.py --max-samples 50 --max-papers 20 --with-embedding

# Financial document retrieval (FinanceBench)
python examples/benchmark/financebench_benchmark.py --max-samples 50 --with-embedding

# Code retrieval (CodeSearchNet)
python examples/benchmark/codesearchnet_benchmark.py --max-samples 50 --max-corpus 500 --with-embedding

Documentation

Community

  • GitHub Issues: submit an issue
  • WeChat Group: add WeChat ID xuming624 with the note "nlp" to join the tech group

Citation

If you use TreeSearch in your research, please cite:

@software{xu2026treesearch,
  author = {Xu, Ming},
  title = {TreeSearch: Structure-Aware Document Retrieval Without Embeddings},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/shibing624/TreeSearch}
}

License

Apache License 2.0

Contributing

Contributions are welcome! Please submit a Pull Request.

Acknowledgements

  • SQLite FTS5: the full-text search engine powering TreeSearch
  • VectifyAI/PageIndex: inspiration for structure-aware indexing and retrieval



Download files

Download the file for your platform.

Source Distribution

pytreesearch-1.0.3.tar.gz (112.9 kB)

Uploaded Source

Built Distribution


pytreesearch-1.0.3-py3-none-any.whl (101.0 kB)

Uploaded Python 3

File details

Details for the file pytreesearch-1.0.3.tar.gz.

File metadata

  • Download URL: pytreesearch-1.0.3.tar.gz
  • Upload date:
  • Size: 112.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pytreesearch-1.0.3.tar.gz:

  • SHA256: 44dbcf913152e797d29e6d261b3a9eab21c9254e2b0ecbf682c7e767362b360d
  • MD5: 21107189d72a7aae195ef8a8015eb73f
  • BLAKE2b-256: 13eec1d3a4a997d5bf6ad38604dedd5ec282f92b546aae9d26269b72747bdf5d


File details

Details for the file pytreesearch-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: pytreesearch-1.0.3-py3-none-any.whl
  • Upload date:
  • Size: 101.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pytreesearch-1.0.3-py3-none-any.whl:

  • SHA256: 0adcdc71333b552d9fa1c551d48190ed94c0993a0a3800d6fb618a2420ca90c0
  • MD5: 12393905e7aff0f557259768f35d516f
  • BLAKE2b-256: ab109f344fde48df633525c10d5d3219fe20d08da93dccb3015fca4173470330

