MCP Server for intelligent documentation scraping, vectorization, and semantic search with dynamic ontology extraction
MCP Doc Builder
Intelligent Documentation Scraping, Vectorization, and Semantic Search for AI Coding Assistants.
Overview
MCP Doc Builder is a Model Context Protocol (MCP) server that provides:
- Intelligent Web Scraping: LLM-guided crawler that decides which documentation pages to index
- Semantic Vectorization: Gemini text-embedding-004 for semantic search across documentation
- Dynamic Ontology: Automatically extracts concepts and relationships from documentation
- Knowledge Graph: Neo4j-based storage with full graph traversal capabilities
- Hybrid Search: Combined vector similarity and fulltext search for optimal results
Features
Intelligent Crawling
- LLM-powered link evaluation decides which pages to follow
- Respects rate limits to avoid overwhelming documentation servers
- Configurable depth (1-5 hops from root URL)
- Smart content extraction with trafilatura
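The depth and rate-limit behaviour listed above can be pictured as a breadth-first traversal. The following is a minimal sketch under stated assumptions, not the project's spider: `crawl`, `fetch_links`, and the in-memory `SITE` link map are hypothetical stand-ins for real HTTP fetching and LLM link evaluation.

```python
import time
from collections import deque
from typing import Callable, Iterable


def crawl(
    root: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    rate_limit: float = 1.0,
    max_pages: int = 500,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Breadth-first crawl bounded by max_depth hops and max_pages pages."""
    visited: list[str] = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if visited:
            sleep(rate_limit)  # pause between consecutive requests
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for a real documentation site.
SITE = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/1"],
    "/docs/a/1": ["/docs/a/1/x"],
}

pages = crawl("/docs", lambda u: SITE.get(u, []), max_depth=2, sleep=lambda s: None)
print(pages)
```

With `max_depth=2`, the page three hops from the root (`/docs/a/1/x`) is never visited. Passing `sleep` as a parameter keeps the sketch testable without real delays.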
Semantic Search
- Gemini text-embedding-004 for 768-dimensional vectors
- Neo4j Vector Index for fast similarity search
- Fulltext search with Lucene
- Hybrid search combining both methods
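This README does not state how the vector and fulltext result lists are merged. One common technique is reciprocal rank fusion (RRF), sketched below purely as an illustration; the function name, the chunk IDs, and the constant `k=60` are assumptions, not the project's implementation.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of IDs; items ranked high in many lists win."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


vector_hits = ["c1", "c2", "c3"]    # hypothetical vector-similarity ranking
fulltext_hits = ["c2", "c4", "c1"]  # hypothetical fulltext ranking
fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
print(fused)
```

RRF only needs ranks, so it avoids having to normalise cosine similarities against Lucene scores, which live on different scales.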
Dynamic Ontology
- Automatic concept extraction (APIs, patterns, entities)
- Relationship inference (uses, extends, requires, etc.)
- Chunk-to-concept linking
- Concept co-occurrence analysis
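Concept co-occurrence analysis can be illustrated by counting how often two concepts are mentioned in the same chunk. This is a minimal sketch under that assumption; the `concept_cooccurrence` helper and the chunk data are hypothetical and not taken from the project.

```python
from collections import Counter
from itertools import combinations


def concept_cooccurrence(chunks: dict[str, set[str]]) -> Counter:
    """Count unordered concept pairs that appear together in a chunk."""
    pairs: Counter = Counter()
    for concepts in chunks.values():
        # Sorting makes (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(concepts), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical chunk-to-concept links.
chunks = {
    "chunk-1": {"useState", "useEffect"},
    "chunk-2": {"useState", "useEffect", "useRef"},
    "chunk-3": {"useRef"},
}
counts = concept_cooccurrence(chunks)
print(counts.most_common(1))
```

Pairs with high counts are natural candidates for a relationship between two concepts.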
MCP Integration
- Six tools for complete documentation management
- Resources for graph exploration
- Workflow prompts for common tasks
Quick Start
1. Prerequisites
- Python 3.11+
- Docker (for Neo4j)
- LiteLLM Gateway or Gemini API key
2. Installation
You can install doc-builder-mcp globally using pipx (recommended) or in a local virtual environment.
Option 1: One-Line Install (Recommended)
# Install the package
pipx install doc-builder-mcp
# Run the interactive Setup Wizard
doc-mcp-setup
The wizard will:
- Check for Docker and Neo4j.
- Ask for your LiteLLM / Gemini Credentials.
- Configure the LLM Mode (LiteLLM vs Gemini Direct).
- Generate a secure .env file.
Don't have pipx? Install it first:
macOS:
brew install pipx
pipx ensurepath
Windows:
winget install pipx
pipx ensurepath
Linux (Debian/Ubuntu):
sudo apt install pipx
pipx ensurepath
Restart your terminal after installing pipx.
Alternative: Standard Pip
If you prefer not to use pipx:
pip install doc-builder-mcp
doc-mcp-setup
Option 2: Manual Development Setup
If you want to contribute or modify the code:
git clone https://github.com/Hexecu/mcp-doc-builder.git
cd mcp-doc-builder
make full-setup
3. Setup
Run the interactive setup wizard:
doc-mcp-setup
Or manually configure:
cp ../.env.example ../.env
# Edit .env with your configuration
4. Start Neo4j
Start the Neo4j database with Docker, either directly or through the provided Makefile target:
make neo4j-up
This uses docker-compose.yml to start the Neo4j instance.
5. Run the Server
# STDIO mode (for IDE integration)
make server-stdio
# HTTP mode (for API access)
make server
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| NEO4J_URI | Neo4j connection URI | bolt://localhost:7688 |
| NEO4J_USERNAME | Neo4j username | neo4j |
| NEO4J_PASSWORD | Neo4j password | - |
| LLM_MODE | litellm, gemini_direct, or both | litellm |
| LITELLM_BASE_URL | LiteLLM Gateway URL | - |
| LITELLM_API_KEY | LiteLLM API key | - |
| LITELLM_MODEL | Model name | gemini-2.5-flash |
| CRAWLER_MAX_DEPTH | Maximum crawl depth | 2 |
| CRAWLER_RATE_LIMIT | Seconds between requests | 1.0 |
| CRAWLER_MAX_PAGES | Max pages per source | 500 |
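To show how these variables and their defaults fit together, here is a minimal sketch that reads them into a typed settings object. It is an illustration only and not the project's config.py; the `Settings` class and `load_settings` function are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_LLM_MODES = {"litellm", "gemini_direct", "both"}


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: Optional[str]
    llm_mode: str
    litellm_model: str
    crawler_max_depth: int
    crawler_rate_limit: float
    crawler_max_pages: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from env, falling back to the documented defaults."""
    env = os.environ if env is None else env
    llm_mode = env.get("LLM_MODE", "litellm")
    if llm_mode not in VALID_LLM_MODES:
        raise ValueError(f"LLM_MODE must be one of {sorted(VALID_LLM_MODES)}")
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7688"),
        neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD"),  # no default
        llm_mode=llm_mode,
        litellm_model=env.get("LITELLM_MODEL", "gemini-2.5-flash"),
        crawler_max_depth=int(env.get("CRAWLER_MAX_DEPTH", "2")),
        crawler_rate_limit=float(env.get("CRAWLER_RATE_LIMIT", "1.0")),
        crawler_max_pages=int(env.get("CRAWLER_MAX_PAGES", "500")),
    )


# Override one value; everything else falls back to the defaults.
settings = load_settings({"CRAWLER_MAX_DEPTH": "3"})
print(settings)
```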
MCP Tools
doc_ingest
Ingest and index a documentation website.
{
  "url": "https://nextjs.org/docs",
  "name": "Next.js Docs",
  "max_depth": 2
}
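A client may want to check these arguments before sending them. The sketch below is a hypothetical helper and not part of the server: it assumes `max_depth` must fall within the 1-5 range mentioned under Features, defaults it to 2 to mirror CRAWLER_MAX_DEPTH, and falls back to the host name when `name` is missing.

```python
from urllib.parse import urlparse


def validate_ingest_args(args: dict) -> dict:
    """Return normalised doc_ingest arguments or raise ValueError."""
    url = args.get("url", "")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be an absolute http(s) URL, got {url!r}")
    max_depth = int(args.get("max_depth", 2))  # assumed default
    if not 1 <= max_depth <= 5:
        raise ValueError("max_depth must be between 1 and 5")
    name = args.get("name") or parsed.netloc  # assumed fallback
    return {"url": url, "name": name, "max_depth": max_depth}


checked = validate_ingest_args(
    {"url": "https://nextjs.org/docs", "name": "Next.js Docs", "max_depth": 2}
)
print(checked)
```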
doc_search
Search indexed documentation.
{
  "query": "how to use React hooks",
  "limit": 10,
  "search_mode": "hybrid"
}
doc_context
Get comprehensive context for a topic.
{
  "topic": "authentication in Next.js",
  "include_related": true
}
doc_sources
List all indexed documentation sources.
doc_refresh
Refresh/re-index a documentation source.
{
  "source_id": "source_abc123",
  "force": false
}
doc_ontology
Explore extracted concepts and relationships.
{
  "source_id": "source_abc123",
  "concept": "useState"
}
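MCP clients normally send these argument objects for you, wrapped in a JSON-RPC 2.0 `tools/call` request. For orientation, the sketch below shows what such a request could look like for doc_search; `build_tool_call` is a hypothetical helper, and the transport (STDIO or HTTP) is left out.

```python
import json


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> dict:
    """Wrap tool arguments in an MCP tools/call JSON-RPC request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    "doc_search",
    {"query": "how to use React hooks", "limit": 10, "search_mode": "hybrid"},
)
print(json.dumps(request, indent=2))
```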
IDE Integration
You can use this MCP server with any compatible AI coding assistant.
Antigravity (Google DeepMind)
- Open Antigravity Settings or locate your configuration file.
- Add the MCP server configuration:
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
Recommended Antigravity Custom Rule
To maximize the utility of doc-builder, add this specific rule to your Antigravity global profile or project guidelines (e.g., in .gemini/rules.md or global settings):
Documentation Strategy Rule: "Whenever you encounter a new, unknown, or recently updated library, framework, or API that is not fully covered by your base training, you MUST proactively use the doc_ingest MCP tool to scrape and index its official documentation. Before writing complex implementation code for unfamiliar tools, always query doc_search or doc_context to understand the best practices and latest syntax."
Cursor
Cursor supports MCP natively. To add the server:
- Open Cursor Settings (Cmd/Ctrl + Shift + J) > Features > MCP.
- Click + Add new MCP server.
- Set the Type to command.
- Set the Name to doc-builder.
- Set the Command to doc-mcp (assuming you installed via pipx).
- Add the necessary environment variables (NEO4J_PASSWORD, LITELLM_API_KEY, etc.) directly in the Cursor UI environment section.
VS Code (with Claude Dev / Roo Code)
If you use Claude Dev, Roo Code, or similar MCP clients in VS Code:
- Open the MCP configuration file (usually found at ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json on macOS).
- Add the server entry:
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
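If the configuration file already lists other MCP servers, the doc-builder entry should be merged in rather than pasted over them. The following is a minimal sketch of that merge on an in-memory dictionary; `add_doc_builder` and the `other-server` entry are hypothetical, and reading or writing the actual file is left out.

```python
import copy
import json


def add_doc_builder(config: dict, neo4j_password: str, litellm_api_key: str) -> dict:
    """Return a copy of config with the doc-builder server entry added."""
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    servers = merged.setdefault("mcpServers", {})
    servers["doc-builder"] = {
        "command": "doc-mcp",
        "args": [],
        "env": {
            "NEO4J_URI": "bolt://localhost:7688",
            "NEO4J_PASSWORD": neo4j_password,
            "LITELLM_API_KEY": litellm_api_key,
        },
    }
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "other", "args": []}}}
updated = add_doc_builder(existing, "your-password", "your-key")
print(json.dumps(updated, indent=2))
```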
Architecture
mcp-doc-builder/
├── docker-compose.yml          # Neo4j container
├── .env.example                # Configuration template
└── server/
    ├── pyproject.toml          # Python package
    └── src/doc_builder/
        ├── main.py             # MCP server entry
        ├── config.py           # Settings
        ├── cli/                # Setup wizard & status
        ├── crawler/            # Web scraping
        │   ├── spider.py       # Async crawler
        │   ├── parser.py       # HTML parsing
        │   └── agent.py        # LLM link evaluation
        ├── vector/             # Vectorization
        │   ├── embedder.py     # Gemini embeddings
        │   ├── chunker.py      # Smart chunking
        │   └── indexer.py      # Neo4j vector index
        ├── ontology/           # Knowledge extraction
        │   ├── extractor.py    # Concept extraction
        │   ├── metatag.py      # Metatag processing
        │   └── linker.py       # Relationship building
        ├── kg/                 # Neo4j graph
        │   ├── neo4j.py        # Async client
        │   ├── repo.py         # Query repository
        │   └── schema.cypher   # Database schema
        ├── llm/                # LLM integration
        │   ├── client.py       # LiteLLM wrapper
        │   └── prompts/        # Prompt templates
        ├── mcp/                # MCP protocol
        │   ├── tools.py        # Tool definitions
        │   ├── resources.py    # Resource handlers
        │   └── prompts.py      # Workflow prompts
        └── security/           # Auth & validation
Graph Schema
Nodes (Doc* prefixed for namespace separation)
- DocSource: Documentation root (URL, name, status)
- DocPage: Individual pages with metadata
- DocChunk: Vectorized content chunks with embeddings
- DocConcept: Extracted concepts (APIs, patterns, entities)
- DocMetatag: Page metatags (og:, twitter:, etc.)
- DocCrawlJob: Crawl job tracking
Relationships
- (DocSource)-[:CONTAINS]->(DocPage)
- (DocPage)-[:LINKS_TO]->(DocPage)
- (DocPage)-[:HAS_CHUNK]->(DocChunk)
- (DocChunk)-[:MENTIONS]->(DocConcept)
- (DocConcept)-[:RELATES_TO]->(DocConcept)
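These relationships chain into traversal paths, for example from a source down to the concepts its chunks mention. The sketch below encodes the relationships listed above as a lookup table and builds a Cypher MATCH pattern from a sequence of node labels. The `match_pattern` helper and the `n0`, `n1`, ... variable names are hypothetical; the project's actual queries in repo.py may look different.

```python
# Relationship types taken from the schema listed above.
ALLOWED = {
    ("DocSource", "DocPage"): "CONTAINS",
    ("DocPage", "DocPage"): "LINKS_TO",
    ("DocPage", "DocChunk"): "HAS_CHUNK",
    ("DocChunk", "DocConcept"): "MENTIONS",
    ("DocConcept", "DocConcept"): "RELATES_TO",
}


def match_pattern(labels: list[str]) -> str:
    """Build a Cypher MATCH pattern along a path of node labels."""
    if not labels:
        raise ValueError("at least one label is required")
    parts = [f"(n0:{labels[0]})"]
    for i, (src, dst) in enumerate(zip(labels, labels[1:]), start=1):
        rel = ALLOWED.get((src, dst))
        if rel is None:
            raise ValueError(f"no relationship from {src} to {dst}")
        parts.append(f"-[:{rel}]->(n{i}:{dst})")
    return "MATCH " + "".join(parts)


pattern = match_pattern(["DocSource", "DocPage", "DocChunk", "DocConcept"])
print(pattern)
```

Rejecting label pairs that the schema does not connect catches malformed paths before a query reaches Neo4j.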
CLI Commands
# Interactive setup
doc-mcp-setup
# Health check
doc-mcp-status --doctor
# Run server
doc-mcp
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Type checking
mypy src/
# Linting
ruff check src/
License
MIT
Related Projects
- MCP KG Memory: Knowledge graph memory for AI coding assistants
- Model Context Protocol: MCP specification