Structure-aware document retrieval. FTS5/BM25 keyword matching over document trees.

doclens

Structure-aware document retrieval — FTS5/BM25 keyword search over document trees, with an interactive TUI and a PWA Web UI.

doclens parses documents into tree structures (headings, classes, functions…) and searches them with FTS5/BM25 keyword matching — no embeddings, no chunking, no vector DB required. Works entirely offline.
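To make the idea concrete, here is a minimal sketch of BM25-ranked keyword search over heading-anchored sections, using the FTS5 extension that ships with most Python `sqlite3` builds. The table name, column names, and sample sections are assumptions for illustration; this is not doclens's actual schema or API.

```python
import sqlite3

# Hypothetical sections: (heading path, body text). Not doclens's real schema.
SECTIONS = [
    ("Guide > Authentication", "Login uses a signed token issued by the auth service."),
    ("Guide > Storage", "Files are stored on local disk and indexed on change."),
    ("API > Tokens", "A token expires after one hour. Refresh the token to continue."),
]


def build_index(sections):
    """Create an in-memory FTS5 table with one row per document section."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE sections USING fts5(path, body)")
    conn.executemany("INSERT INTO sections (path, body) VALUES (?, ?)", sections)
    return conn


def search(conn, query, limit=5):
    """Return (path, score) pairs, best match first.

    SQLite's bm25() gives better matches a lower value, so ascending
    order puts the best hit at the top.
    """
    rows = conn.execute(
        "SELECT path, bm25(sections) FROM sections "
        "WHERE sections MATCH ? ORDER BY bm25(sections) LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


conn = build_index(SECTIONS)
hits = search(conn, "token")
for path, score in hits:
    print(f"{score:.6f}  {path}")
```

Each hit carries its heading path, so a result points at a section of a document rather than at a bare line of text.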


Features

  • Structure-aware search: Returns results anchored to document headings, code classes, or function definitions, not orphaned line fragments
  • Multi-format: Markdown, PDF, DOCX, PPTX, Excel, HTML, JSON, CSV, code (Python AST + tree-sitter)
  • Two UIs: Textual TUI (terminal) and Lit + Shoelace PWA (browser)
  • LLM-augmented QA: Send search results to Anthropic Claude for natural-language answers
  • Background watching: Auto-reindexes changed files via watchdog
  • Web search: Fetch + extract public web pages as markdown before searching
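As an illustration of what "anchored to code classes or function definitions" can mean for Python files, the sketch below walks a module with the standard `ast` module and records each class and function with its dotted name and start line. The `outline` helper and its output shape are hypothetical; doclens's own parsers may work differently.

```python
import ast

SOURCE = '''
class AuthService:
    def login(self, user):
        return user

def make_token():
    return "t"
'''


def outline(source):
    """Return (dotted name, kind, start line) for each class and function.

    Only definitions nested directly inside a module, class, or function
    body are followed; definitions inside `if` or `try` blocks are skipped
    to keep the sketch short.
    """
    found = []

    def visit(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                kind = "class"
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "function"
            else:
                continue
            name = f"{prefix}.{child.name}" if prefix else child.name
            found.append((name, kind, child.lineno))
            visit(child, name)

    visit(ast.parse(source), "")
    return found


for name, kind, line in outline(SOURCE):
    print(f"{line:>3}  {kind:<8}  {name}")
```

A search hit inside `login` could then be reported as `AuthService.login` instead of as an isolated line.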

Installation

pip install doclens

Requires Python ≥ 3.10.

Quick setup:

# Index your documents
doclens index --force

# Search from CLI
doclens search "authentication"

# Or launch the Web UI (opens browser automatically)
doclens gui

CLI Reference

doclens [--workdir DIR] <command>
Command                                Description
doclens search <query…>                Keyword search across indexed documents
doclens search_v2 '<json>'             Structured search: AND / OR / NOT / PHRASE operators
doclens ai <message…>                  Send a message to the Claude agent
doclens index [--force]                Build or update the document index
doclens status                         Show index statistics and system status
doclens gui [--port PORT]              Launch the Web UI (PWA)
doclens read_document --path <path>    Read a document with structure info
doclens web <query…>                   Search the live web
doclens webfetch <url>                 Extract a web page as markdown
doclens grep <pattern>                 Ripgrep-style regex search

Quick Start

1. Index your documents

# Index the current directory
doclens index --force

# Or specify a working directory
doclens --workdir /path/to/project index

doclens automatically discovers supported files (.md, .py, .pdf, .docx, .xlsx, …) and skips common ignore patterns (.git, node_modules, __pycache__, .venv).
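The discovery step described above can be sketched as a directory walk that prunes ignored folders and filters by extension. The two sets below contain only the examples named in this README; the full lists doclens uses are not shown here, and `discover` is a hypothetical helper.

```python
import os
import tempfile

# Assumed values: only the examples this README names, not the full lists.
SUPPORTED = {".md", ".py", ".pdf", ".docx", ".xlsx"}
IGNORED_DIRS = {".git", "node_modules", "__pycache__", ".venv"}


def discover(root):
    """Yield paths of supported files under root, skipping ignored directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored directories.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in SUPPORTED:
                yield os.path.join(dirpath, name)


# Small demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, ".git"))
    os.makedirs(os.path.join(root, "docs"))
    for parts in ((".git", "config.md"), ("docs", "guide.md"),
                  ("docs", "logo.png"), ("main.py",)):
        open(os.path.join(root, *parts), "w").close()
    found = sorted(os.path.relpath(p, root) for p in discover(root))

print(found)
```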

2. Search

doclens search "authentication flow"
doclens search "量子 计算"          # Chinese query ("quantum computing"), segmented via jieba

# Structured query
doclens search_v2 '{"type": "and", "terms": ["auth", "token"]}'
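When the query is built by a script, hand-quoting JSON for the shell is error-prone. The sketch below builds the same AND payload with `json.dumps` and quotes it with `shlex.quote`. Only the `{"type": "and", "terms": [...]}` shape shown above is used; the helper names are hypothetical, and the payload shapes for OR, NOT, and PHRASE are not documented here, so they are left out.

```python
import json
import shlex


def and_query(*terms):
    """Build the JSON payload for an AND query, in the shape shown above."""
    return json.dumps({"type": "and", "terms": list(terms)})


def search_v2_command(payload):
    """Return a shell-safe command line; shlex.quote adds the single quotes."""
    return "doclens search_v2 " + shlex.quote(payload)


payload = and_query("auth", "token")
command = search_v2_command(payload)
print(command)
```

From Python, passing `["doclens", "search_v2", payload]` to `subprocess.run` avoids shell quoting altogether.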

3. Interactive TUI

doclens

Opens the full terminal UI with live preview, command history, and keyboard navigation.

4. Web UI

doclens gui
# INFO: Uvicorn running on http://127.0.0.1:7860

The browser opens automatically. If port 7860 is already in use, the port may differ; check the startup log for the actual URL, or set one explicitly with --port.

5. Ask the AI

doclens ai "How does the authentication system work?"

doclens first retrieves relevant document sections, then sends them to Anthropic Claude as context for a grounded answer.
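The retrieve-then-ask flow can be pictured as assembling the retrieved sections into a context block that precedes the question. The sketch below shows only that assembly step. The prompt wording, the `build_prompt` helper, and the `(heading path, text)` section shape are assumptions, not doclens's actual prompt or data model, and no API call is made.

```python
def build_prompt(question, sections):
    """Join retrieved sections into a context block placed ahead of the question.

    `sections` is a list of (heading path, text) pairs, a hypothetical shape
    for search results.
    """
    context = "\n\n".join(f"[{path}]\n{text}" for path, text in sections)
    return (
        "Answer using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    ("Guide > Authentication", "Login uses a signed token issued by the auth service."),
    ("API > Tokens", "A token expires after one hour."),
]
prompt = build_prompt("How does the authentication system work?", retrieved)
print(prompt)
```

Labelling each section with its heading path lets the model's answer point back to where a claim came from.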


Configuration

doclens reads .env in the project root. Copy and customize:

cp doclens/.env.example .env

Key variables:

Variable             Default               Description
CORTEX_SEARCH_PATH   .                     Root directory to index and search
CORTEX_DB_PATH       .cortex/sessions.db   SQLite database path
ANTHROPIC_API_KEY    (unset)               Required for ai and web commands
ANTHROPIC_BASE_URL   (unset)               Custom API endpoint (optional)
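To show how the defaults in the table combine with a .env file, here is a minimal sketch that parses `KEY=VALUE` lines and overlays them on the documented defaults. The parser is deliberately simple and is an assumption; doclens's real loader is not shown in this README and may handle quoting, exports, or environment overrides differently.

```python
# Defaults copied from the table above.
DEFAULTS = {
    "CORTEX_SEARCH_PATH": ".",
    "CORTEX_DB_PATH": ".cortex/sessions.db",
}


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(text):
    """Overlay values from .env text on top of the documented defaults."""
    return {**DEFAULTS, **parse_env(text)}


example_env = """
# Index only the docs folder
CORTEX_SEARCH_PATH=./docs
ANTHROPIC_API_KEY="sk-example"
"""
settings = load_settings(example_env)
print(settings)
```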

Architecture

┌─────────────────────────────────────────────┐
│                  TUI (Textual)              │
│  ┌───────────────────────────────────────┐  │
│  │  HeaderBar │ ContentArea │ InputBox   │  │
│  └───────────────────────────────────────┘  │
└────────────────────┬────────────────────────┘
                     │
┌────────────────────▼────────────────────────┐
│           Web UI (Lit + Shoelace PWA)       │
│         FastAPI + SSE streaming             │
└────────────────────┬────────────────────────┘
                     │
┌────────────────────▼────────────────────────┐
│         IndexManager + Scoring              │
│    TreeSearch (FTS5 + BM25)                 │
└────────────────────┬────────────────────────┘
                     │
┌────────────────────▼────────────────────────┐
│    treesearch/  -  parsers, indexer, FTS5   │
│    planify/     -  AI agent runner          │
└─────────────────────────────────────────────┘
  • treesearch: Powers the indexing and retrieval engine (FTS5/BM25 over document trees)
  • planify: Drives the AI agent, session management, and tool execution
  • doclens: Ties them together — CLI, TUI, Web UI, event bus, and file watcher

License

Apache License 2.0 — see LICENSE.
