Structure-aware document retrieval. FTS5/BM25 keyword matching over document trees.

TreeSearch: Structure-Aware Document Retrieval

TreeSearch is a structure-aware document retrieval library. No vector embeddings. No chunk splitting. SQLite FTS5 keyword matching over document tree structures. Supports Markdown, plain text, code files (Python AST + regex, Java/Go/JS/C++ etc.), HTML, XML, JSON, CSV, PDF, and DOCX.

Millisecond-latency search over tens of thousands of documents and large codebases, with structure preservation.

Installation

pip install -U pytreesearch

Quick Start

from treesearch import TreeSearch

# Lazy indexing — auto-builds index on first search
ts = TreeSearch("docs/*.md", "src/*.py")
results = ts.search("How does auth work?")
for doc in results["documents"]:
    for node in doc["nodes"]:
        print(f"[{node['score']:.2f}] {node['title']}")
        print(f"  {node['text'][:200]}")

Why TreeSearch?

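Traditional RAG systems split documents into fixed-size chunks and retrieve by vector similarity. Chunking discards document structure and heading hierarchy, so a retrieved fragment often arrives without the section context needed to interpret it, and queries whose answer spans a whole section can be missed.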

TreeSearch takes a different approach: it parses documents into tree structures that follow their natural heading hierarchy, then searches those trees with FTS5 keyword matching (no per-query cost, no API key needed).

Aspect	Traditional RAG	TreeSearch
Preprocessing	Chunk splitting + embedding	Parse headings → build tree
Retrieval	Vector similarity search	FTS5 keyword matching (no LLM needed)
Multi-doc	Needs vector DB for routing	FTS5 cross-doc scoring
Structure	Lost after chunking	Fully preserved as tree hierarchy
Dependencies	Vector DB + embedding model	SQLite only (no embedding, no vector DB)
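The "parse headings → build tree" step in the table above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not TreeSearch's actual parser: `Node` and `build_tree` are hypothetical names, and the sketch ignores complications such as `#` lines inside code blocks.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: one heading plus the text directly under it."""
    title: str
    level: int
    text: str = ""
    children: list["Node"] = field(default_factory=list)


HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def build_tree(markdown: str) -> Node:
    """Nest Markdown headings by level; body lines attach to the nearest heading."""
    root = Node(title="(root)", level=0)
    stack = [root]  # path from the root to the heading currently being filled
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            node = Node(title=match.group(2), level=level)
            # Pop back up to the closest ancestor with a smaller level.
            while stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].text += line + "\n"
    return root


sample = """# Guide
Intro text.
## Install
Run pip.
## Auth
### Tokens
Use bearer tokens.
"""

tree = build_tree(sample)
guide = tree.children[0]
print([child.title for child in guide.children])
```

Because each section keeps its parent and children, a hit on "Tokens" can be returned together with its enclosing "Auth" section instead of as an isolated fragment.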

Key Advantages

No vector embeddings — No embedding model to train, deploy, or pay for
No chunk splitting — Documents retain their natural heading structure
No vector DB — No Pinecone, Milvus, or Chroma to manage
Tree-aware retrieval — Heading hierarchy guides search, not arbitrary chunk boundaries
SQLite FTS5 engine — Persistent inverted index with WAL mode, incremental updates, CJK support, and SQL aggregation
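As a rough sketch of what a persistent FTS5 index with WAL mode and incremental updates looks like at the SQLite level, the example below uses only Python's standard `sqlite3` module. The table layout and the `upsert_doc` helper are assumptions made for illustration; they are not TreeSearch's schema or API, and the sketch assumes your SQLite build includes FTS5.

```python
import os
import sqlite3
import tempfile

# A file-backed database so that WAL mode can be enabled (hypothetical path).
db_path = os.path.join(tempfile.mkdtemp(), "fts_demo.db")
conn = sqlite3.connect(db_path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# One row per tree node; doc_id is stored but not tokenized.
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS nodes "
    "USING fts5(doc_id UNINDEXED, title, body)"
)


def upsert_doc(doc_id, sections):
    """Incremental update: replace only the rows that belong to one document."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM nodes WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO nodes (doc_id, title, body) VALUES (?, ?, ?)",
            [(doc_id, title, body) for title, body in sections],
        )


upsert_doc("guide.md", [
    ("Auth", "Configure tokens for authentication"),
    ("Install", "Run pip install"),
])
# Re-indexing the same document replaces its old rows instead of duplicating them.
upsert_doc("guide.md", [("Auth", "Configure OAuth for authentication")])

rows = conn.execute(
    "SELECT title, body FROM nodes WHERE nodes MATCH ? ORDER BY rank",
    ("authentication",),
).fetchall()
print(mode, rows)
```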

Features

FTS5 search — Zero LLM calls, millisecond-level FTS5 keyword matching, no API key needed
SQLite FTS5 engine — Persistent inverted index, WAL mode, incremental updates, MD structure-aware columns (title/summary/body/code/front_matter), column weighting, CJK tokenization
Tree-structured indexing — Markdown, plain text, code files (Python AST + regex, Java/Go/JS/C++/PHP), HTML, XML, JSON, CSV, PDF, and DOCX are parsed into hierarchical trees
Parser registry — Extensible ParserRegistry with built-in parsers auto-registered; custom parsers via ParserRegistry.register()
Python AST parsing — ast module extracts classes/functions with full signatures (parameters, return types); regex fallback for syntax errors
PDF/DOCX/HTML parsers — Optional parsers via pageindex, python-docx, beautifulsoup4 (install with pip install pytreesearch[all])
GrepFilter — Exact literal/regex matching for precise symbol and keyword search across tree nodes
Source-type routing — Automatic pre-filter selection based on file type (e.g., code files use GrepFilter + FTS5)
Chinese + English — Built-in jieba tokenization for Chinese and regex tokenization for English
Batch indexing — build_index() supports glob patterns for concurrent multi-file processing
Async-first — All core functions are async with sync wrappers available
Config-driven defaults — search() and build_index() read defaults from get_config(), overridable per-call
CLI included — treesearch index and treesearch search commands
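The parser registry idea from the feature list can be illustrated with a small extension-to-parser mapping. The source names `ParserRegistry.register()` but does not show its signature, so everything below (`SketchRegistry`, the decorator form, the node dictionaries) is a hypothetical stand-in rather than the library's API.

```python
from pathlib import Path
from typing import Callable

Parser = Callable[[str], list[dict]]


class SketchRegistry:
    """Hypothetical registry that maps file extensions to parser functions."""

    def __init__(self) -> None:
        self._parsers: dict[str, Parser] = {}

    def register(self, *extensions: str):
        """Decorator that registers one parser for several extensions."""
        def decorator(fn: Parser) -> Parser:
            for ext in extensions:
                self._parsers[ext.lower()] = fn
            return fn
        return decorator

    def parse(self, filename: str, content: str) -> list[dict]:
        """Dispatch on the file extension and return a list of node dicts."""
        ext = Path(filename).suffix.lower()
        parser = self._parsers.get(ext)
        if parser is None:
            raise ValueError(f"no parser registered for {ext!r}")
        return parser(content)


registry = SketchRegistry()


@registry.register(".txt")
def parse_text(content: str) -> list[dict]:
    # One node per blank-line-separated paragraph; the first line is the title.
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    return [{"title": p.splitlines()[0][:40], "text": p} for p in paragraphs]


@registry.register(".todo")
def parse_todo(content: str) -> list[dict]:
    # A custom format: every "- item" line becomes its own node.
    return [
        {"title": line[2:].strip(), "text": line}
        for line in content.splitlines()
        if line.startswith("- ")
    ]


text_nodes = registry.parse("notes.txt", "First para\nmore\n\nSecond para")
todo_nodes = registry.parse("work.todo", "- ship docs\nnote\n- fix bug")
print(text_nodes, todo_nodes)
```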

FTS5 Standalone

from treesearch import FTS5Index, Document, load_index

data = load_index("indexes/my_doc.json")
doc = Document(doc_id="doc1", doc_name=data["doc_name"], structure=data["structure"])

fts = FTS5Index(db_path="indexes/fts.db")  # persistent, or omit for in-memory
fts.index_documents([doc])

# Simple keyword search
results = fts.search("authentication config", top_k=5)
for r in results:
    print(f"[{r['fts_score']:.4f}] {r['title']}")

# Advanced FTS5 query syntax
results = fts.search("auth", fts_expression='title:auth AND body:config', top_k=5)

# Per-document aggregation
agg = fts.search_with_aggregation("authentication", group_by_doc=True)
for doc_agg in agg:
    print(f"{doc_agg['doc_name']}: {doc_agg['hit_count']} hits, best={doc_agg['best_score']:.4f}")
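To make the per-document aggregation step more concrete, here is a rough equivalent written directly against SQLite with the standard `sqlite3` module. It is a sketch under assumptions, not the implementation of `search_with_aggregation`: the table layout is invented, and the grouping is done in Python for simplicity. Note that FTS5's `bm25()` returns lower (more negative) values for better matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE nodes USING fts5(doc_name UNINDEXED, title, body)")
conn.executemany(
    "INSERT INTO nodes (doc_name, title, body) VALUES (?, ?, ?)",
    [
        ("auth.md", "Login", "authentication with passwords"),
        ("auth.md", "Tokens", "token based authentication"),
        ("setup.md", "Install", "run the installer"),
        ("setup.md", "Config", "enable authentication in config"),
    ],
)

# One (document, score) pair per matching node.
hits = conn.execute(
    "SELECT doc_name, bm25(nodes) FROM nodes WHERE nodes MATCH ?",
    ("authentication",),
).fetchall()

# Group hits by document: count them and keep the best (lowest) score.
agg = {}
for doc_name, score in hits:
    entry = agg.setdefault(
        doc_name, {"doc_name": doc_name, "hit_count": 0, "best_score": score}
    )
    entry["hit_count"] += 1
    entry["best_score"] = min(entry["best_score"], score)

# Documents with more hits first; ties broken by the better score.
ranked = sorted(agg.values(), key=lambda e: (-e["hit_count"], e["best_score"]))
for entry in ranked:
    print(f"{entry['doc_name']}: {entry['hit_count']} hits")
```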

CLI

# Build indexes from glob pattern
treesearch index --paths "docs/*.md" --add-description

# Search with FTS5
treesearch search --index_dir ./indexes/ --query "How does auth work?" --fts

# Search with persistent FTS5 database
treesearch search --index_dir ./indexes/ --query "auth" --fts --fts-db ./indexes/fts.db

How It Works

Input Documents (MD/TXT/Code/JSON/CSV/HTML/XML/PDF/DOCX)
        │
        ▼
   ┌──────────┐
   │ Indexer  │    ParserRegistry dispatch → parse structure → build tree → generate summaries
   └────┬─────┘    (build_index supports glob for batch processing)
        │  JSON index files
        ▼
   ┌──────────┐
   │  search  │    FTS5/Grep pre-filter → cross-doc scoring → ranked results
   └────┬─────┘
        │  dict result
        ▼
  Ranked nodes with scores and text

FTS5 Pre-Scoring: FTS5Index uses SQLite FTS5 inverted index with MD structure-aware columns (title/summary/body/code/front_matter) and column weighting for fast scoring. Instant results, no LLM needed.
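Column weighting of this kind can be expressed with FTS5's built-in `bm25()` function, which accepts one weight per column. The sketch below assumes a simplified three-column table and arbitrary weights (10, 5, 1); the columns and weights TreeSearch actually uses are not stated in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE nodes USING fts5(title, summary, body)")
conn.executemany(
    "INSERT INTO nodes (title, summary, body) VALUES (?, ?, ?)",
    [
        ("Redis cluster setup", "How to configure nodes", "Edit the config file and restart."),
        ("Changelog", "Release notes", "Fixed a bug in the redis cluster client."),
        ("Logging", "Log levels", "Set the level to debug."),
        ("Backups", "Nightly jobs", "Snapshots are taken every night."),
        ("Metrics", "Dashboards", "Latency graphs live on the dashboard."),
        ("Support", "Contacts", "Open a ticket with the platform team."),
    ],
)

# bm25(table, w_title, w_summary, w_body): a title hit counts 10x a body hit.
# Lower (more negative) scores are better, so ascending order ranks best first.
rows = conn.execute(
    "SELECT title, bm25(nodes, 10.0, 5.0, 1.0) AS score "
    "FROM nodes WHERE nodes MATCH ? ORDER BY score",
    ("redis cluster",),
).fetchall()

for title, score in rows:
    print(f"{score:.4f}  {title}")
```

Both matching rows contain "redis cluster", but the section whose title matches ranks above the changelog entry that only mentions the terms in its body.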

Source-Type Routing: For code files, GrepFilter + FTS5 are combined automatically for precise symbol matching. The pre-filter is selected based on file type via PREFILTER_ROUTING.
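One plausible shape for such routing is a plain mapping from file extension to an ordered list of pre-filters. The document names `PREFILTER_ROUTING` but does not show its contents, so the `SKETCH_ROUTING` table and `select_prefilters` function below are hypothetical and only illustrate the idea.

```python
from pathlib import Path

# Hypothetical routing table: code files get grep + FTS5, prose gets FTS5 only.
SKETCH_ROUTING = {
    ".py": ("grep", "fts5"),
    ".java": ("grep", "fts5"),
    ".go": ("grep", "fts5"),
    ".md": ("fts5",),
    ".txt": ("fts5",),
}
DEFAULT_ROUTE = ("fts5",)


def select_prefilters(filename: str) -> tuple[str, ...]:
    """Pick pre-filters from the file extension, falling back to FTS5 alone."""
    return SKETCH_ROUTING.get(Path(filename).suffix.lower(), DEFAULT_ROUTE)


for name in ("src/auth_service.py", "docs/README.md", "data/export.parquet"):
    print(name, "->", select_prefilters(name))
```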

Use Cases

Use Case 1: Technical Documentation QA (Best Scenario)

Problem: Your company has 100+ technical docs (API docs, design docs, RFCs), and traditional search can't find the right answers.

import asyncio

from treesearch import build_index, search


async def main():
    # 1. Build index (run once)
    docs = await build_index(
        paths=["docs/*.md", "specs/*.txt"],
        output_dir="./indexes",
    )

    # 2. Search (millisecond response)
    result = await search(
        query="How to configure Redis cluster?",
        documents=docs,
    )

    # 3. Results: complete sections, not fragments
    for doc in result["documents"]:
        print(f"Doc: {doc['doc_name']}")
        for node in doc["nodes"]:
            print(f"  Section: {node['title']}")
            print(f"  Content: {node['text'][:200]}...")


# build_index() and search() are coroutines, so run them inside an event loop
asyncio.run(main())

Why better than traditional RAG?

Finds complete sections, not fragments
Includes section titles as context anchors
Supports hierarchical navigation (parent/child sections)
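Hierarchical navigation can be pictured as walking parent links from a matched node up to the document root. The node table and `breadcrumb` helper below are hypothetical and exist only to show how a heading path could be assembled; TreeSearch's own node format may differ.

```python
# Hypothetical flat node table: each node records its parent's id.
nodes = {
    "n1": {"title": "Resource Guide", "parent": None},
    "n2": {"title": "GPU Request Process", "parent": "n1"},
    "n3": {"title": "Approval Steps", "parent": "n2"},
}


def breadcrumb(node_id: str) -> str:
    """Join the titles from the root down to the given node."""
    parts = []
    current = node_id
    while current is not None:
        node = nodes[current]
        parts.append(node["title"])
        current = node["parent"]
    return " > ".join(reversed(parts))


def children_of(node_id: str) -> list[str]:
    """Titles of the direct child sections of a node."""
    return [n["title"] for n in nodes.values() if n["parent"] == node_id]


print(breadcrumb("n3"))
print(children_of("n1"))
```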

Use Case 2: Codebase Search

Problem: You want to find "login-related classes and methods" in a large codebase, but grep returns only matching lines, without any structure.

import asyncio

from treesearch import build_index, search


async def main():
    # Index codebase
    docs = await build_index(
        paths=["src/**/*.py", "lib/**/*.java"],
        output_dir="./code_indexes",
    )

    # Search: auto-detects code files, uses AST parsing + GrepFilter
    result = await search(
        query="user login authentication",
        documents=docs,
    )
    return result


result = asyncio.run(main())

# Results example:
# Doc: auth_service.py
#   class UserAuthenticator
#     def login(username, password)
#     def verify_token(token)
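The AST-based extraction with a regex fallback can be approximated with Python's standard `ast` module, as in the sketch below. `extract_symbols` is a hypothetical helper written for this illustration (it needs Python 3.9+ for `ast.unparse`) and is not TreeSearch's parser.

```python
import ast
import re

# Fallback pattern: lines that look like class or function definitions.
DEF_LINE = re.compile(r"^[ \t]*(?:class|def)[ \t]+\w+.*$", re.MULTILINE)


def extract_symbols(source: str) -> list[str]:
    """Return class/function signatures in source order."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # The file does not parse: fall back to a line-based regex scan.
        return [m.group(0).strip() for m in DEF_LINE.finditer(source)]

    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            found.append((node.lineno, f"class {node.name}"))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ast.unparse(node.args)  # parameters with annotations
            returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            found.append((node.lineno, f"def {node.name}({args}){returns}"))
    return [signature for _, signature in sorted(found)]


sample = '''
class UserAuthenticator:
    def login(self, username: str, password: str) -> bool:
        return True

    def verify_token(self, token):
        return token is not None
'''

broken = "def broken(:\n    pass\nclass Ok:\n    pass\n"

symbols = extract_symbols(sample)
fallback = extract_symbols(broken)
print(symbols)
print(fallback)
```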

Why better than grep/IDE search?

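Ranked matching: BM25 scoring over whole classes and methods instead of a flat list of matching lines. Matching is still keyword-based, so include related terms in the query (for example, both "login" and "authentication")
Structure-aware: Finds complete classes/methods with docstrings
Precise location: Points directly to code line numbers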
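A literal/regex filter over tree nodes, in the spirit of the GrepFilter mentioned earlier, might look like the following. The node dictionaries, the `start_line` field, and the `grep_nodes` function are assumptions made for this sketch and do not reflect the library's actual interface.

```python
import re


def grep_nodes(nodes, pattern, *, regex=False, ignore_case=False):
    """Return one hit per matching line, with its node title and line number."""
    flags = re.IGNORECASE if ignore_case else 0
    # Literal patterns are escaped so characters like "(" match themselves.
    compiled = re.compile(pattern if regex else re.escape(pattern), flags)
    hits = []
    for node in nodes:
        for offset, line in enumerate(node["text"].splitlines()):
            if compiled.search(line):
                hits.append({
                    "title": node["title"],
                    "line": node["start_line"] + offset,
                    "match": line.strip(),
                })
    return hits


nodes = [
    {
        "title": "class UserAuthenticator",
        "start_line": 10,
        "text": (
            "class UserAuthenticator:\n"
            "    def login(self, username, password):\n"
            "        return check(username, password)"
        ),
    },
    {
        "title": "def verify_token",
        "start_line": 20,
        "text": "def verify_token(token):\n    return token in ACTIVE_TOKENS",
    },
]

literal_hits = grep_nodes(nodes, "verify_token(")
regex_hits = grep_nodes(nodes, r"def \w+\(", regex=True)
print([h["line"] for h in literal_hits], [h["line"] for h in regex_hits])
```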

Use Case 3: Long Document QA (Papers/Books)

Problem: You have a 50-page paper and want to ask "What experimental methods are mentioned in Chapter 3?"

import asyncio

from treesearch import build_index, search


async def main():
    docs = await build_index(paths=["paper.pdf"])
    return await search(
        query="experimental methodology",
        documents=docs,
    )


result = asyncio.run(main())

# Finds the "3.2 Experimental Design" section content

Why better than Ctrl+F?

Ranked matching: Returns the best-scoring sections for the query terms rather than every occurrence; include synonyms in the query, since matching is keyword-based
Section location: Tells you which chapter and section
Scalable to multi-doc: Search 10 papers simultaneously

Real Case Comparison

Case: Find "How to request GPU machines" in company docs

Traditional way (Ctrl+F):

Search "GPU" → Found 47 matches → Manual review → 10 minutes

TreeSearch way:

# Inside an async function, with docs returned by build_index()
result = await search(query="How to request GPU machines", documents=docs)
# Directly returns "Resource Guide > GPU Request Process" section
# Time: < 100ms

Efficiency gain: roughly 100x (an illustrative estimate, not a measured result)

Comparison with Other Solutions

Solution	Pros	Cons	Best For
Ctrl+F	Simple	No semantic understanding, fragmented results	Known keywords
Traditional RAG	Good semantic understanding	Chunking destroys context, slow response	Plain text QA
Vector DB	Similarity search	Requires embedding preprocessing, high cost	Large-scale semantic retrieval
TreeSearch	Preserves structure + Fast + Zero cost	Requires structured documents	Tech docs/Codebase

Benchmark

Document Retrieval (QASPER)

Evaluated on the QASPER dataset (50 QA samples, 18 academic papers):

Metric	Embedding (text-embedding-3-small)	TreeSearch FTS5
MRR	0.5403	0.4596
Precision@1	0.3830	0.1915
Recall@5	0.5139	0.6613
Index Time	118.7s	0.0s
Query Time	573ms	0.7ms

Key Findings:

Embedding MRR +18% (0.5403 vs 0.4596) and twice the Precision@1 (0.3830 vs 0.1915): better semantic understanding
TreeSearch Recall@5 +29% (0.6613 vs 0.5139): structure preservation helps recall more relevant content
TreeSearch roughly 800x faster queries: 0.7 ms vs 573 ms
TreeSearch near-instant indexing: no embedding API calls needed
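For readers unfamiliar with the metrics in the table, the sketch below shows the standard definitions of MRR, Recall@k, and Precision@1 on toy data. The document does not spell out how the benchmark scripts compute them, so treat these functions as common textbook definitions; the rankings are made up and are not benchmark data.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_1(ranked, relevant):
    """1.0 if the top result is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


# Toy queries: (ranked section ids returned, set of relevant section ids).
runs = [
    (["s3", "s1", "s7"], {"s1"}),
    (["s2", "s5", "s9"], {"s2", "s9"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
recall_2 = sum(recall_at_k(r, rel, 2) for r, rel in runs) / len(runs)
p_at_1 = sum(precision_at_1(r, rel) for r, rel in runs) / len(runs)
print(f"MRR={mrr:.2f} Recall@2={recall_2:.2f} P@1={p_at_1:.2f}")
```

A system can have high Recall@5 but lower MRR when the relevant sections are retrieved but not ranked first, which is the pattern the QASPER table shows for TreeSearch.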

Code Retrieval (CodeSearchNet)

Evaluated on the CodeSearchNet dataset (50 queries, 500-entry Python corpus):

Metric	Embedding (text-embedding-3-small)	TreeSearch FTS5
MRR	0.9567	0.8469
Hit@1	0.9200	0.8000
Recall@5	1.0000	0.9200
Index Time	73.7s	3.3s
Query Time	620ms	0.8ms

Key Findings:

Embedding MRR +13% (0.9567 vs 0.8469): better code semantic understanding
TreeSearch MRR 0.8469: strong performance for keyword-based code search
TreeSearch roughly 800x faster queries: 0.8 ms vs 620 ms
TreeSearch 22x faster indexing (3.3 s vs 73.7 s): no embedding API calls needed

Summary

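TreeSearch is not meant to replace embedding-based retrieval, but to provide a zero-cost, very fast alternative. In the benchmarks above it trades ranking quality (MRR, Precision@1, Hit@1) for sub-millisecond queries and much cheaper indexing, and it reaches higher Recall@5 on QASPER but not on CodeSearchNet. It fits best where speed and cost matter more than top-rank precision.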

Run the benchmarks yourself:

# Document retrieval (QASPER)
python examples/benchmark/qasper_benchmark.py --max-samples 50 --max-papers 20 --with-embedding

# Code retrieval (CodeSearchNet)
python examples/benchmark/codesearchnet_benchmark.py --max-samples 50 --max-corpus 500 --with-embedding

Documentation

Architecture — Design principles and architecture
API Reference — Complete API documentation

Community

GitHub Issues — Submit an issue
WeChat Group — Add WeChat ID xuming624, note "llm", to join the tech group

Citation

If you use TreeSearch in your research, please cite:

@software{xu2026treesearch,
  author = {Xu, Ming},
  title = {TreeSearch: Structure-Aware Document Retrieval Without Embeddings},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/shibing624/TreeSearch}
}

License

Apache License 2.0

Contributing

Contributions are welcome! Please submit a Pull Request.

Acknowledgements

SQLite FTS5 — The full-text search engine powering TreeSearch
VectifyAI/PageIndex — Inspiration for structure-aware indexing and retrieval

Release history

Version	Released
1.1.1	Apr 29, 2026
1.1.0	Apr 22, 2026
1.0.8	Apr 17, 2026
1.0.7	Apr 16, 2026
1.0.6	Apr 16, 2026
1.0.5	Apr 13, 2026
1.0.4	Apr 2, 2026
1.0.3	Apr 1, 2026
1.0.2	Mar 24, 2026
1.0.1	Mar 20, 2026
1.0.0	Mar 20, 2026
0.6.4	Mar 17, 2026
0.6.3	Mar 16, 2026
0.6.2	Mar 14, 2026
0.6.1	Mar 12, 2026
0.6.0 (this version)	Mar 11, 2026
0.5.7	Mar 11, 2026
0.5.6	Mar 11, 2026
0.5.5	Mar 10, 2026
0.5.4	Mar 10, 2026
0.5.3	Mar 10, 2026
0.5.2	Mar 10, 2026
0.5.1	Mar 10, 2026
0.5.0	Mar 10, 2026
0.2.3	Feb 25, 2026
0.2.1	Feb 25, 2026
0.1.0	Feb 25, 2026

Download files

Source Distribution

pytreesearch-0.6.0.tar.gz (65.8 kB)

File metadata

Upload date: Mar 11, 2026
Size: 65.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for pytreesearch-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`b6a16b1104c3ba05356926f997ec12463da2a2a66cecd041e2e681b074d74973`
MD5	`585beacbb0a70ee7f9ed55cfea038d2f`
BLAKE2b-256	`306faf3bd1460a9c30051211da492a1ed07831ead3e587e8bf84cb7cbeb1096a`

