LLMDoc
MCP server with RAG (BM25) for llms.txt documentation. Provides keyword-based search across documentation sources with automatic background refresh.
Features
- llms.txt support - Automatically parses and indexes documentation from llms.txt files
- BM25 search - Fast, keyword-based retrieval with relevance scoring and stopword filtering
- Named sources - Configure sources with names like fast_mcp:https://... for easy filtering
- Source filtering - Search across all sources or filter by specific source name
- Persistent storage - DuckDB-based index that survives restarts
- Background refresh - Configurable auto-refresh interval (default: 6 hours)
- Source attribution - Every search result includes source name and URL
Quick Start
- Add to Claude Code (~/.claude/claude_code_config.json):
{
"mcpServers": {
"llmdoc": {
"command": "uvx",
"args": ["llmdoc"],
"env": {
"LLMDOC_SOURCES": "fast_mcp:https://gofastmcp.com/llms.txt"
}
}
}
}
- Restart Claude Code - the server will automatically fetch and index documentation.
- Ask Claude questions like "How do I create a tool in FastMCP?" and it will search the indexed docs.
What is llms.txt?
llms.txt is a specification for providing LLM-friendly documentation. Websites add a /llms.txt markdown file to their root directory containing curated, concise content optimized for AI consumption. LLMDoc indexes these files and their linked documents to make their content searchable.
Installation
# Run directly with uvx (no install needed)
uvx llmdoc
# Or install with uv
uv tool install llmdoc
# Or install with pip
pip install llmdoc
# Or install with pipx
pipx install llmdoc
Configuration
Source Format
Sources can be specified in two formats:
- Named: name:url - e.g., fast_mcp:https://gofastmcp.com/llms.txt
- Unnamed: Just the URL - name is auto-generated from domain
Named sources allow you to filter search results by source name.
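The two source formats can be told apart by checking whether the string starts with a URL scheme. The sketch below is illustrative only: `parse_source` is a hypothetical helper, and the rule used to derive a name from the domain (hostname with dots and hyphens replaced by underscores) is an assumption, since the exact rule LLMDoc applies is not documented here.

```python
from urllib.parse import urlparse


def parse_source(spec: str) -> tuple[str, str]:
    """Split a source spec into (name, url). Hypothetical helper."""
    spec = spec.strip()
    # An unnamed spec begins directly with the URL scheme.
    if spec.startswith(("http://", "https://")):
        host = urlparse(spec).hostname or "unknown"
        # Assumed naming rule, not taken from LLMDoc itself.
        return host.replace(".", "_").replace("-", "_"), spec
    # Otherwise everything before the first colon is the name.
    name, _, url = spec.partition(":")
    return name, url


print(parse_source("fast_mcp:https://gofastmcp.com/llms.txt"))
print(parse_source("https://ai.pydantic.dev/llms.txt"))
```

Checking for the scheme first matters because an unnamed URL also contains a colon, so splitting on the first colon alone would misread `https` as a source name.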
Environment Variables
# Comma-separated list of sources (named or unnamed)
export LLMDOC_SOURCES="fast_mcp:https://gofastmcp.com/llms.txt,pydantic_ai:https://ai.pydantic.dev/llms.txt"
# Optional: Custom database path (default: ~/.llmdoc/index.db)
export LLMDOC_DB_PATH="/path/to/index.db"
# Optional: Refresh interval in hours (default: 6)
export LLMDOC_REFRESH_INTERVAL="6"
# Optional: Max concurrent document fetches (default: 5)
export LLMDOC_MAX_CONCURRENT="5"
# Optional: Skip refresh on startup (default: false)
export LLMDOC_SKIP_STARTUP_REFRESH="true"
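As a rough illustration of how these variables map to settings, the sketch below reads them with the defaults listed above. The `Settings` class and `load_settings` function are hypothetical names, and the way the boolean is parsed (case-insensitive comparison with "true") is an assumption rather than LLMDoc's documented behavior.

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    sources: list[str]
    db_path: str
    refresh_interval_hours: float
    max_concurrent_fetches: int
    skip_startup_refresh: bool


def load_settings(env=os.environ) -> Settings:
    """Build settings from environment variables (hypothetical helper)."""
    raw_sources = env.get("LLMDOC_SOURCES", "")
    return Settings(
        # Comma-separated list; empty entries are dropped.
        sources=[s.strip() for s in raw_sources.split(",") if s.strip()],
        db_path=os.path.expanduser(env.get("LLMDOC_DB_PATH", "~/.llmdoc/index.db")),
        refresh_interval_hours=float(env.get("LLMDOC_REFRESH_INTERVAL", "6")),
        max_concurrent_fetches=int(env.get("LLMDOC_MAX_CONCURRENT", "5")),
        # Assumed parsing rule: only the string "true" enables the flag.
        skip_startup_refresh=env.get("LLMDOC_SKIP_STARTUP_REFRESH", "false").lower() == "true",
    )


print(load_settings({"LLMDOC_SOURCES": "fast_mcp:https://gofastmcp.com/llms.txt"}))
```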
Config File
Create llmdoc.json in the working directory:
{
"sources": [
"fast_mcp:https://gofastmcp.com/llms.txt",
"pydantic_ai:https://ai.pydantic.dev/llms.txt"
],
"db_path": "~/.llmdoc/index.db",
"refresh_interval_hours": 6,
"max_concurrent_fetches": 5,
"skip_startup_refresh": false
}
Or with explicit name/url objects:
{
"sources": [
{"name": "fast_mcp", "url": "https://gofastmcp.com/llms.txt"},
{"name": "pydantic_ai", "url": "https://ai.pydantic.dev/llms.txt"}
]
}
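Both config shapes carry the same information, so a loader can reduce them to one form. The following is a minimal sketch of that idea; `normalize_sources` is a hypothetical function, and leaving the name as `None` for unnamed URLs is an assumption made to keep the example short.

```python
import json


def normalize_sources(config_text: str) -> list[dict]:
    """Turn string and object entries into {"name", "url"} dicts (hypothetical)."""
    config = json.loads(config_text)
    normalized = []
    for entry in config.get("sources", []):
        if isinstance(entry, dict):
            # Explicit object form: {"name": ..., "url": ...}
            normalized.append({"name": entry["name"], "url": entry["url"]})
        elif entry.startswith(("http://", "https://")):
            # Unnamed URL: the name would be derived from the domain later.
            normalized.append({"name": None, "url": entry})
        else:
            # "name:url" string form: split on the first colon only.
            name, _, url = entry.partition(":")
            normalized.append({"name": name, "url": url})
    return normalized


example = """
{
  "sources": [
    "fast_mcp:https://gofastmcp.com/llms.txt",
    {"name": "pydantic_ai", "url": "https://ai.pydantic.dev/llms.txt"}
  ]
}
"""
print(normalize_sources(example))
```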
Running the Server
LLMDoc uses stdio transport and is designed to be launched by MCP clients. Configure it in your MCP client (see below), and the client will start the server automatically.
For manual testing:
# Using uvx
uvx llmdoc
# Or as module
python -m llmdoc
MCP Tools
- search_docs(query, limit, source) - Search documentation and return relevant passages with source URLs. Optional source parameter filters by source name (e.g., fast_mcp)
- get_doc(url, offset, limit) - Get document content with pagination support for large documents. Parameters: offset (default: 0) start position in bytes, limit (default: 50000, max: 100000) max bytes per call. Returns pagination metadata (has_more, total_length)
- get_doc_excerpt(url, query, max_chunks, context_chars) - Get relevant excerpts from a large document matching a query
- list_sources() - List all configured documentation sources with statistics
- refresh_sources() - Manually trigger a refresh of all documentation
MCP Resources
- doc://sources - Returns JSON with configured sources list and refresh interval
Adding to MCP Clients
Claude Code
Add to ~/.claude/claude_code_config.json:
{
"mcpServers": {
"llmdoc": {
"command": "uvx",
"args": ["llmdoc"],
"env": {
"LLMDOC_SOURCES": "fast_mcp:https://gofastmcp.com/llms.txt,pydantic_ai:https://ai.pydantic.dev/llms.txt"
}
}
}
}
Standard MCP Configuration
Add to your MCP client's configuration file:
{
"mcpServers": {
"llmdoc": {
"command": "uvx",
"args": ["llmdoc"],
"env": {
"LLMDOC_SOURCES": "fast_mcp:https://gofastmcp.com/llms.txt"
}
}
}
}
Example Usage
Once configured, the LLM can use these tools:
User: How do I create a tool in FastMCP?
LLM: [calls search_docs("create tool FastMCP")]
Result:
[
{
"title": "Tools",
"snippet": "Creating a tool is as simple as decorating a Python function with @mcp.tool...",
"url": "https://gofastmcp.com/servers/tools.md",
"source": "fast_mcp",
"source_url": "https://gofastmcp.com/llms.txt",
"score": 12.5
}
]
Filtering by Source
You can filter results to a specific documentation source:
User: How do I create an agent in PydanticAI?
LLM: [calls search_docs("create agent", source="pydantic_ai")]
Result:
[
{
"title": "Agents",
"snippet": "Agents are the primary interface for interacting with LLMs in PydanticAI...",
"url": "https://ai.pydantic.dev/agents.md",
"source": "pydantic_ai",
"source_url": "https://ai.pydantic.dev/llms.txt",
"score": 10.2
}
]
Getting Full Document Content
Use get_doc to retrieve document content (supports pagination for large documents):
LLM: [calls get_doc("https://ai.pydantic.dev/agents.md")]
Result:
{
"title": "Agents",
"content": "# Agents\n\nAgents are the primary interface for interacting with LLMs in PydanticAI...",
"url": "https://ai.pydantic.dev/agents.md",
"source": "pydantic_ai",
"source_url": "https://ai.pydantic.dev/llms.txt",
"offset": 0,
"length": 5432,
"total_length": 5432,
"has_more": false
}
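For documents larger than one page, a caller can keep requesting pages until has_more is false. The loop below sketches that pattern against a stand-in function, since the real tool is invoked through an MCP client. It assumes that length in the response is the size of the returned slice, and the stand-in counts characters rather than bytes for simplicity.

```python
def read_all(get_doc, url: str, page_size: int = 50_000) -> str:
    """Collect a whole document by following the has_more flag."""
    parts = []
    offset = 0
    while True:
        page = get_doc(url=url, offset=offset, limit=page_size)
        parts.append(page["content"])
        offset += page["length"]
        # Stop on the last page, or if a page is empty (avoids looping forever).
        if not page["has_more"] or page["length"] == 0:
            break
    return "".join(parts)


def fake_get_doc(url: str, offset: int = 0, limit: int = 50_000) -> dict:
    """Stand-in for the real tool; serves slices of a 120,000-character text."""
    text = "x" * 120_000
    chunk = text[offset:offset + limit]
    return {
        "content": chunk,
        "url": url,
        "offset": offset,
        "length": len(chunk),
        "total_length": len(text),
        "has_more": offset + len(chunk) < len(text),
    }


document = read_all(fake_get_doc, "https://ai.pydantic.dev/agents.md")
print(len(document))
```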
Architecture
+------------------+
| MCP Client |
| (Claude, Cursor) |
+--------+---------+
| stdio
v
+------------------+ +------------------+ +------------------+
| FastMCP Server |---->| Document Store |<----|Document Fetcher |
| | | (DuckDB) | | (async HTTP) |
| - search_docs | | | | |
| - get_doc | | - Persistence | | - llms.txt parse |
| - list_sources | | - Deduplication | | - HTML→Markdown |
| - refresh | | - Change detect | | - Concurrent |
+--------+---------+ +------------------+ +------------------+
|
v
+------------------+
| BM25 Index |
| (in-memory) |
| |
| - Chunking |
| - Tokenization |
| - Scoring |
+------------------+
LLMDoc fetches documentation from llms.txt sources, stores it in DuckDB, and provides fast BM25 search through the MCP protocol.
How It Works
Document Fetching
When configured with documentation sources, LLMDoc:
- Parses llms.txt files to discover all linked documents
- Fetches each document concurrently (with rate limiting)
- Converts HTML pages to Markdown automatically
- Extracts titles from the first H1 heading
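Two of these steps, link discovery and title extraction, can be approximated with regular expressions. This is a simplified sketch, not LLMDoc's actual parser: the function names are hypothetical, and the patterns only handle plain Markdown links and ATX-style `# ` headings.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Markdown inline links: [text](href)
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")
# A level-1 heading: a single "#" followed by spaces and the title text.
H1_RE = re.compile(r"^#[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_links(llms_txt: str, base_url: str) -> list[str]:
    """Return absolute URLs of linked documents, in order, without repeats."""
    urls = []
    for _text, href in LINK_RE.findall(llms_txt):
        url = urljoin(base_url, href)  # resolves relative links
        if url not in urls:
            urls.append(url)
    return urls


def extract_title(markdown: str) -> Optional[str]:
    """Return the text of the first H1 heading, if there is one."""
    match = H1_RE.search(markdown)
    return match.group(1) if match else None


sample = (
    "# FastMCP\n\n## Docs\n"
    "- [Tools](https://gofastmcp.com/servers/tools.md): how to define tools\n"
    "- [Intro](/intro.md)\n"
)
print(extract_links(sample, "https://gofastmcp.com/llms.txt"))
print(extract_title(sample))
```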
Indexing
Documents are processed for efficient search:
- Chunking: Large documents are split into ~500 character chunks at sentence boundaries
- Tokenization: Text is lowercased and stopwords are removed
- Indexing: BM25 algorithm indexes all chunks for relevance scoring
Search
When you search:
- Your query is tokenized the same way as documents
- BM25 scores each chunk against your query
- Results are deduplicated by document URL
- Top results are returned with relevance scores and snippets
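The deduplication step can be sketched as follows. Because scoring happens per chunk, one document may match several times; this example keeps the best-scoring chunk for each URL, which is an assumption about how ties between chunks are resolved. `top_documents` is a hypothetical name.

```python
def top_documents(chunk_hits: list[dict], limit: int = 5) -> list[dict]:
    """Keep the best-scoring chunk per document URL, then return the top results."""
    best_by_url = {}
    for hit in chunk_hits:
        current = best_by_url.get(hit["url"])
        if current is None or hit["score"] > current["score"]:
            best_by_url[hit["url"]] = hit
    ranked = sorted(best_by_url.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:limit]


hits = [
    {"url": "https://gofastmcp.com/servers/tools.md", "score": 12.5},
    {"url": "https://gofastmcp.com/servers/tools.md", "score": 9.1},
    {"url": "https://gofastmcp.com/servers/resources.md", "score": 10.0},
]
print(top_documents(hits, limit=2))
```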
Background Refresh
LLMDoc automatically keeps documentation up-to-date:
- Checks for staleness on startup
- Refreshes every 6 hours (configurable)
- Uses content hashing to skip unchanged documents
- Removes documents no longer in llms.txt
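The hash comparison and the removal of dropped documents can be expressed as a small planning step. The sketch below uses SHA256, which the database schema names for change detection; the function names and the dictionary-based inputs are illustrative assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA256 hex digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(stored: dict[str, str], fetched: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare stored hashes (url -> hash) with fresh content (url -> text).

    Returns (urls to write, urls to delete). Unchanged documents appear in neither.
    """
    to_write = [url for url, body in fetched.items() if stored.get(url) != content_hash(body)]
    to_delete = [url for url in stored if url not in fetched]
    return to_write, to_delete


stored = {"a.md": content_hash("A"), "b.md": content_hash("B"), "c.md": content_hash("C")}
fetched = {"a.md": "A", "b.md": "B, edited", "d.md": "D"}
print(plan_refresh(stored, fetched))
```

Here a.md is skipped because its hash matches, b.md and d.md are written because they are changed or new, and c.md is deleted because it no longer appears in the fetched set.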
Technical Details
BM25 Search Algorithm
LLMDoc uses the BM25Okapi algorithm from the rank_bm25 library. Key characteristics:
- Term frequency saturation: Diminishing returns for repeated terms
- Document length normalization: Shorter documents aren't unfairly penalized
- IDF weighting: Rare terms are weighted higher than common ones
The implementation is thread-safe using threading.RLock() for concurrent access.
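To make the three characteristics concrete, here is a textbook BM25 scorer in plain Python. It is not the rank_bm25 code: the IDF term uses the common non-negative variant log(1 + (N - n + 0.5) / (n + 0.5)), and k1 = 1.5 and b = 0.75 are simply widely used values chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], corpus: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document in corpus against the tokenized query."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        # Length normalization: long documents get a larger denominator.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF weighting: rare terms contribute more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturation: the ratio approaches (k1 + 1) as tf grows.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


corpus = [
    ["create", "tool", "fastmcp"],
    ["agents", "primary", "interface"],
    ["tool", "tool", "tool", "decorator"],
]
print(bm25_scores(["create", "tool"], corpus))
```

In this toy corpus the first document ranks highest because it matches both query terms, even though the third repeats "tool" three times; that is the saturation effect at work.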
Chunking Strategy
Documents are chunked using a multi-level approach:
- Paragraph splitting: First split on double newlines (\n\n)
- Sentence-boundary aware: Long paragraphs split at .!? followed by whitespace
- Overlap: 100 character overlap between chunks maintains context
Configuration:
- chunk_size: 500 characters (default)
- chunk_overlap: 100 characters (default)
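The multi-level approach can be sketched roughly as below. This is one plausible reading of the description rather than LLMDoc's implementation: `chunk_text` is a hypothetical function, the overlap is taken as the raw tail of the previous chunk (so it may start mid-word), and a single sentence longer than chunk_size is kept whole.

```python
import re

# Split after ., ! or ? when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into paragraph chunks, breaking long paragraphs at sentence ends."""
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= chunk_size:
            chunks.append(paragraph)
            continue
        current = ""
        for sentence in SENTENCE_END.split(paragraph):
            if current and len(current) + 1 + len(sentence) > chunk_size:
                chunks.append(current)
                # Carry the tail of the finished chunk into the next one.
                tail = current[-chunk_overlap:] if chunk_overlap else ""
                current = f"{tail} {sentence}" if tail else sentence
            else:
                current = f"{current} {sentence}" if current else sentence
        if current:
            chunks.append(current)
    return chunks


long_paragraph = " ".join(f"Sentence number {i} ends here." for i in range(40))
chunks = chunk_text("Short paragraph.\n\n" + long_paragraph)
print(len(chunks), [len(c) for c in chunks])
```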
Database Schema
DuckDB stores documents with this schema:
CREATE TABLE documents (
id INTEGER PRIMARY KEY,
source_name TEXT NOT NULL, -- e.g., 'fast_mcp'
source_url TEXT NOT NULL, -- llms.txt URL
doc_url TEXT NOT NULL UNIQUE, -- document URL
title TEXT,
content TEXT NOT NULL,
content_hash TEXT NOT NULL, -- SHA256 for change detection
updated_at TIMESTAMP NOT NULL
)
Indexes on source_url and source_name support efficient filtering.
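The UNIQUE constraint on doc_url makes an insert-or-update keyed on the document URL straightforward. In the sketch below the standard-library sqlite3 module stands in for DuckDB so that the example runs without extra dependencies; DuckDB's SQL dialect and Python API differ in places, and the `upsert` helper and index name are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not DuckDB
conn.execute("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        source_name TEXT NOT NULL,
        source_url TEXT NOT NULL,
        doc_url TEXT NOT NULL UNIQUE,
        title TEXT,
        content TEXT NOT NULL,
        content_hash TEXT NOT NULL,
        updated_at TIMESTAMP NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_documents_source_name ON documents (source_name)")


def upsert(doc: dict) -> None:
    """Insert a document, or update it in place if doc_url already exists."""
    row = dict(doc, updated_at=datetime.now(timezone.utc).isoformat())
    conn.execute(
        """
        INSERT INTO documents
            (source_name, source_url, doc_url, title, content, content_hash, updated_at)
        VALUES
            (:source_name, :source_url, :doc_url, :title, :content, :content_hash, :updated_at)
        ON CONFLICT (doc_url) DO UPDATE SET
            title = excluded.title,
            content = excluded.content,
            content_hash = excluded.content_hash,
            updated_at = excluded.updated_at
        """,
        row,
    )


base = {
    "source_name": "fast_mcp",
    "source_url": "https://gofastmcp.com/llms.txt",
    "doc_url": "https://gofastmcp.com/servers/tools.md",
    "title": "Tools",
    "content": "first version",
    "content_hash": "hash-1",
}
upsert(base)
upsert(dict(base, content="second version", content_hash="hash-2"))
print(conn.execute("SELECT doc_url, content FROM documents WHERE source_name = ?", ("fast_mcp",)).fetchall())
```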
Concurrency Model
LLMDoc supports multiple concurrent instances:
- Read operations: Multiple instances can search simultaneously (read-only DuckDB mode)
- Write operations: Single instance holds exclusive lock during refresh
- Graceful handling: If refresh is locked, operation skips with status message
Document fetching uses asyncio.Semaphore to limit concurrent HTTP requests (default: 5).
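The semaphore pattern looks roughly like this. The sketch uses a stand-in fetcher that sleeps instead of making HTTP requests and records the peak number of simultaneous calls; `fetch_all` and `fake_fetch` are hypothetical names.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent: int = 5):
    """Run fetch(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:  # waits here while max_concurrent fetches are active
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


active = 0
peak = 0


async def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request; tracks how many calls overlap."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return f"body of {url}"


urls = [f"https://example.com/{i}.md" for i in range(20)]
results = asyncio.run(fetch_all(urls, fake_fetch))
print(len(results), "documents fetched, peak concurrency", peak)
```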
Stopwords
213 English stopwords are filtered during tokenization, including:
- Articles: a, an, the
- Prepositions: in, on, at, by, for, with, about, etc.
- Pronouns: I, you, he, she, it, we, they, etc.
- Auxiliaries: is, are, was, were, be, been, being, etc.
- Common verbs: have, has, had, do, does, did, etc.
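A tokenizer along these lines lowercases the text, extracts word tokens, and drops stopwords. The set below is only the subset listed above, not the full 213-word list, and the token pattern (runs of letters, digits, and underscores) is an assumption for illustration.

```python
import re

# Illustrative subset of the stopword list described above.
STOPWORDS = {
    "a", "an", "the",
    "in", "on", "at", "by", "for", "with", "about",
    "i", "you", "he", "she", "it", "we", "they",
    "is", "are", "was", "were", "be", "been", "being",
    "have", "has", "had", "do", "does", "did",
}

TOKEN_RE = re.compile(r"[a-z0-9_]+")


def tokenize(text: str) -> list[str]:
    """Lowercase, split into word tokens, and remove stopwords."""
    return [token for token in TOKEN_RE.findall(text.lower()) if token not in STOPWORDS]


print(tokenize("How do I create a tool in FastMCP?"))
```

Applying the same function to both documents and queries keeps the two sides comparable, which is what the search step relies on.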
License
MIT License - see LICENSE file.