MCP Server for intelligent documentation scraping, vectorization, and semantic search with dynamic ontology extraction

MCP Doc Builder

Intelligent Documentation Scraping, Vectorization, and Semantic Search for AI Coding Assistants.

Overview

MCP Doc Builder is a Model Context Protocol (MCP) server that provides:

  • Intelligent Web Scraping: LLM-guided crawler that decides which documentation pages are worth indexing
  • Semantic Vectorization: Gemini text-embedding-004 for semantic search across documentation
  • Dynamic Ontology: Automatically extracts concepts and relationships from documentation
  • Knowledge Graph: Neo4j-based storage with full graph traversal capabilities
  • Hybrid Search: Combined vector similarity and fulltext search for optimal results

Features

Intelligent Crawling

  • LLM-powered link evaluation decides which pages to follow
  • Respects rate limits to avoid overwhelming documentation servers
  • Configurable depth (1-5 hops from root URL)
  • Smart content extraction with trafilatura
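
The crawl behaviour described above (bounded depth, a pause between requests, and a per-link follow decision) can be sketched as a breadth-first traversal. This is a minimal illustration, not the project's spider.py: the function name `crawl` and the `get_links` and `should_follow` callables are hypothetical stand-ins for page fetching and LLM link evaluation.

```python
import time
from collections import deque
from typing import Callable, Iterable


def crawl(
    root: str,
    get_links: Callable[[str], Iterable[str]],   # hypothetical: fetch a page, return its links
    should_follow: Callable[[str], bool],        # hypothetical: stand-in for LLM link evaluation
    max_depth: int = 2,
    max_pages: int = 500,
    rate_limit: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Breadth-first crawl bounded by depth and page count."""
    visited: list[str] = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if visited:
            sleep(rate_limit)  # pause between consecutive requests
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

The defaults mirror the documented CRAWLER_MAX_DEPTH, CRAWLER_MAX_PAGES, and CRAWLER_RATE_LIMIT values. Passing `sleep` as a parameter keeps the sketch testable without real delays.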

Semantic Search

  • Gemini text-embedding-004 for 768-dimensional vectors
  • Neo4j Vector Index for fast similarity search
  • Fulltext search with Lucene
  • Hybrid search combining both methods
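
The description above does not say how the vector and fulltext result lists are merged. One common approach is reciprocal rank fusion (RRF), sketched below as an assumption rather than the project's actual method. The function name, the constant `k = 60`, and the chunk IDs are illustrative.

```python
def reciprocal_rank_fusion(
    vector_hits: list[str],
    fulltext_hits: list[str],
    k: int = 60,
    limit: int = 10,
) -> list[str]:
    """Merge two ranked lists of chunk IDs; IDs ranked well in either list rise to the top."""
    scores: dict[str, float] = {}
    for hits in (vector_hits, fulltext_hits):
        for rank, chunk_id in enumerate(hits, start=1):
            # Each list contributes 1 / (k + rank) for every ID it contains.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda c: (-scores[c], c))[:limit]


print(reciprocal_rank_fusion(["a", "b", "c"], ["b", "d"]))
# → ['b', 'a', 'd', 'c']
```

RRF only needs ranks, so it sidesteps the problem that cosine similarity and Lucene scores live on different scales.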

Dynamic Ontology

  • Automatic concept extraction (APIs, patterns, entities)
  • Relationship inference (uses, extends, requires, etc.)
  • Chunk-to-concept linking
  • Concept co-occurrence analysis
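
As a rough illustration of co-occurrence analysis, the sketch below counts how often two concepts are mentioned in the same chunk. The input shape (a mapping from chunk ID to the set of concept names it mentions) and the function name are assumptions for the example, not the project's data model.

```python
from collections import Counter
from itertools import combinations


def concept_cooccurrence(chunks: dict[str, set[str]]) -> Counter:
    """Count unordered concept pairs that appear together in the same chunk."""
    counts: Counter = Counter()
    for concepts in chunks.values():
        # Sorting makes each pair canonical, so (a, b) and (b, a) are counted together.
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts


chunks = {
    "chunk_1": {"useState", "useEffect"},
    "chunk_2": {"useState", "useEffect", "useRef"},
    "chunk_3": {"useRef"},
}
print(concept_cooccurrence(chunks)[("useEffect", "useState")])
# → 2
```

Pairs with high counts are natural candidates for a RELATES_TO edge between concepts.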

MCP Integration

  • 6 tools for complete documentation management
  • Resources for graph exploration
  • Workflow prompts for common tasks

Quick Start

1. Prerequisites

  • Python 3.11+
  • Docker (for Neo4j)
  • LiteLLM Gateway or Gemini API key

2. Installation

You can install doc-builder-mcp globally using pipx (recommended) or in a local virtual environment.

Option 1: One-Line Install (Recommended)

# Install the package
pipx install doc-builder-mcp

# Run the interactive Setup Wizard
doc-mcp-setup

The wizard will:

  1. Check for Docker and Neo4j.
  2. Ask for your LiteLLM or Gemini credentials.
  3. Configure the LLM Mode (LiteLLM vs Gemini Direct).
  4. Generate a secure .env file.
Don't have pipx? Install it with your platform's package manager:

macOS:

brew install pipx
pipx ensurepath

Windows:

winget install pipx
pipx ensurepath

Linux (Debian/Ubuntu):

sudo apt install pipx
pipx ensurepath

Restart your terminal after installing pipx.

Alternative: Standard Pip

If you prefer not to use pipx:

pip install doc-builder-mcp
doc-mcp-setup

Option 2: Manual Development Setup

If you want to contribute or modify the code:

git clone https://github.com/Hexecu/mcp-doc-builder.git
cd mcp-doc-builder
make full-setup

3. Setup

If you did not already run the interactive setup wizard during installation, run it now:

doc-mcp-setup

Or manually configure:

# Run from the server/ directory; .env.example sits at the repository root
cp ../.env.example ../.env
# Edit ../.env with your configuration

4. Start Neo4j

Start the Neo4j database with the provided Makefile target:

make neo4j-up

This target uses docker-compose.yml to start the Neo4j container, so you can also start it with Docker Compose directly.

5. Run the Server

# STDIO mode (for IDE integration)
make server-stdio

# HTTP mode (for API access)
make server

Configuration

Environment Variables

Variable            Description                       Default
------------------  --------------------------------  ---------------------
NEO4J_URI           Neo4j connection URI              bolt://localhost:7688
NEO4J_USERNAME      Neo4j username                    neo4j
NEO4J_PASSWORD      Neo4j password                    -
LLM_MODE            litellm, gemini_direct, or both   litellm
LITELLM_BASE_URL    LiteLLM Gateway URL               -
LITELLM_API_KEY     LiteLLM API key                   -
LITELLM_MODEL       Model name                        gemini-2.5-flash
CRAWLER_MAX_DEPTH   Maximum crawl depth               2
CRAWLER_RATE_LIMIT  Seconds between requests          1.0
CRAWLER_MAX_PAGES   Max pages per source              500
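
To show how these variables and defaults fit together, here is a minimal loader sketch. It is not the project's config.py: the `Settings` dataclass and `load_settings` function are hypothetical, and treating a missing NEO4J_PASSWORD as an error is an assumption based on that variable having no default.

```python
import os
from dataclasses import dataclass
from typing import Mapping

VALID_LLM_MODES = {"litellm", "gemini_direct", "both"}


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    llm_mode: str
    litellm_model: str
    crawler_max_depth: int
    crawler_rate_limit: float
    crawler_max_pages: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    llm_mode = env.get("LLM_MODE", "litellm")
    if llm_mode not in VALID_LLM_MODES:
        raise ValueError(f"LLM_MODE must be one of {sorted(VALID_LLM_MODES)}")
    password = env.get("NEO4J_PASSWORD")
    if not password:
        # Assumption: no default is documented, so a value is required.
        raise ValueError("NEO4J_PASSWORD is required")
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7688"),
        neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
        neo4j_password=password,
        llm_mode=llm_mode,
        litellm_model=env.get("LITELLM_MODEL", "gemini-2.5-flash"),
        crawler_max_depth=int(env.get("CRAWLER_MAX_DEPTH", "2")),
        crawler_rate_limit=float(env.get("CRAWLER_RATE_LIMIT", "1.0")),
        crawler_max_pages=int(env.get("CRAWLER_MAX_PAGES", "500")),
    )
```

Calling `load_settings({"NEO4J_PASSWORD": "secret"})` yields the defaults from the table for every other field.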

MCP Tools

doc_ingest

Ingest and index a documentation website.

{
  "url": "https://nextjs.org/docs",
  "name": "Next.js Docs",
  "max_depth": 2
}

doc_search

Search indexed documentation.

{
  "query": "how to use React hooks",
  "limit": 10,
  "search_mode": "hybrid"
}

doc_context

Get comprehensive context for a topic.

{
  "topic": "authentication in Next.js",
  "include_related": true
}

doc_sources

List all indexed documentation sources.

doc_refresh

Refresh/re-index a documentation source.

{
  "source_id": "source_abc123",
  "force": false
}

doc_ontology

Explore extracted concepts and relationships.

{
  "source_id": "source_abc123",
  "concept": "useState"
}
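
MCP clients normally send these arguments for you, but it can help to see the shape of the underlying message. In the Model Context Protocol, a tool invocation is a JSON-RPC 2.0 request with method `tools/call`. The helper below is a hypothetical sketch that only builds the request text; it does not open a connection to the server.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry an ID so responses can be matched


def tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request for the given tool and arguments."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


message = tool_call(
    "doc_search",
    {"query": "how to use React hooks", "limit": 10, "search_mode": "hybrid"},
)
print(message)
```

The `arguments` object is exactly the JSON shown for each tool above.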

IDE Integration

You can use this MCP server with any compatible AI coding assistant.

Antigravity (Google DeepMind)

  1. Open Antigravity Settings or locate your configuration file.
  2. Add the MCP server configuration:
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}

Recommended Antigravity Custom Rule

To maximize the utility of doc-builder, add this specific rule to your Antigravity global profile or project guidelines (e.g., in .gemini/rules.md or global settings):

Documentation Strategy Rule: "Whenever you encounter a new, unknown, or recently updated library, framework, or API that is not fully covered by your base training, you MUST proactively use the doc_ingest MCP tool to scrape and index its official documentation. Before writing complex implementation code for unfamiliar tools, always query doc_search or doc_context to understand the best practices and latest syntax."

Cursor

Cursor supports MCP natively. To add the server:

  1. Open Cursor Settings (Cmd/Ctrl + Shift + J) > Features > MCP.
  2. Click + Add new MCP server.
  3. Set the Type to command.
  4. Set the Name to doc-builder.
  5. Set the Command to doc-mcp (assuming you installed via pipx).
  6. Add the necessary environment variables (NEO4J_PASSWORD, LITELLM_API_KEY, etc.) directly in the Cursor UI environment section.

VS Code (with Claude Dev / Roo Code)

If you use Claude Dev, Roo Code, or similar MCP clients in VS Code:

  1. Open the MCP configuration file (on macOS, usually ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json).
  2. Add the server entry:
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
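
If the settings file already lists other MCP servers, the new entry should be merged in rather than pasted over the whole file. The sketch below shows one way to do that on the JSON text. The function name `add_server` is hypothetical, and the sketch assumes the file uses the `mcpServers` layout shown above.

```python
import json


def add_server(config_text: str, name: str, entry: dict) -> str:
    """Return the config JSON with one server added or replaced, keeping the others."""
    config = json.loads(config_text) if config_text.strip() else {}
    config.setdefault("mcpServers", {})[name] = entry
    return json.dumps(config, indent=2) + "\n"


doc_builder = {
    "command": "doc-mcp",
    "args": [],
    "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key",
    },
}

existing = '{"mcpServers": {"other-server": {"command": "other"}}}'
print(add_server(existing, "doc-builder", doc_builder))
```

Read the settings file, pass its text through `add_server`, and write the result back. Keep a backup of the original file first.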

Architecture

mcp-doc-builder/
├── docker-compose.yml        # Neo4j container
├── .env.example              # Configuration template
└── server/
    ├── pyproject.toml        # Python package
    └── src/doc_builder/
        ├── main.py           # MCP server entry
        ├── config.py         # Settings
        ├── cli/              # Setup wizard & status
        ├── crawler/          # Web scraping
        │   ├── spider.py     # Async crawler
        │   ├── parser.py     # HTML parsing
        │   └── agent.py      # LLM link evaluation
        ├── vector/           # Vectorization
        │   ├── embedder.py   # Gemini embeddings
        │   ├── chunker.py    # Smart chunking
        │   └── indexer.py    # Neo4j vector index
        ├── ontology/         # Knowledge extraction
        │   ├── extractor.py  # Concept extraction
        │   ├── metatag.py    # Metatag processing
        │   └── linker.py     # Relationship building
        ├── kg/               # Neo4j graph
        │   ├── neo4j.py      # Async client
        │   ├── repo.py       # Query repository
        │   └── schema.cypher # Database schema
        ├── llm/              # LLM integration
        │   ├── client.py     # LiteLLM wrapper
        │   └── prompts/      # Prompt templates
        ├── mcp/              # MCP protocol
        │   ├── tools.py      # Tool definitions
        │   ├── resources.py  # Resource handlers
        │   └── prompts.py    # Workflow prompts
        └── security/         # Auth & validation

Graph Schema

Nodes (Doc* prefixed for namespace separation)

  • DocSource: Documentation root (URL, name, status)
  • DocPage: Individual pages with metadata
  • DocChunk: Vectorized content chunks with embeddings
  • DocConcept: Extracted concepts (APIs, patterns, entities)
  • DocMetatag: Page metatags (og:, twitter:, etc.)
  • DocCrawlJob: Crawl job tracking

Relationships

  • (DocSource)-[:CONTAINS]->(DocPage)
  • (DocPage)-[:LINKS_TO]->(DocPage)
  • (DocPage)-[:HAS_CHUNK]->(DocChunk)
  • (DocChunk)-[:MENTIONS]->(DocConcept)
  • (DocConcept)-[:RELATES_TO]->(DocConcept)
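
The relationships above can be followed in a single Cypher pattern. The sketch below builds a parameterized query that finds the pages whose chunks mention a given concept. The node labels and relationship types come from the schema above; the property names (`name`, `url`, `text`) and the function name are assumptions, since the schema's properties are not listed here.

```python
def pages_mentioning(concept: str, limit: int = 10) -> tuple[str, dict]:
    """Build a Cypher query and parameters for pages whose chunks mention a concept."""
    query = (
        "MATCH (s:DocSource)-[:CONTAINS]->(p:DocPage)"
        "-[:HAS_CHUNK]->(c:DocChunk)-[:MENTIONS]->(k:DocConcept) "
        "WHERE k.name = $concept "              # property `name` is assumed
        "RETURN s.name AS source, p.url AS page, c.text AS chunk "  # assumed properties
        "LIMIT $limit"
    )
    # Values are passed as parameters, never interpolated into the query text.
    return query, {"concept": concept, "limit": limit}


query, params = pages_mentioning("useState", limit=5)
print(query)
print(params)
```

The returned pair can be handed to the Neo4j Python driver, for example as `session.run(query, params)`.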

CLI Commands

# Interactive setup
doc-mcp-setup

# Health check
doc-mcp-status --doctor

# Run server
doc-mcp

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Type checking
mypy src/

# Linting
ruff check src/

License

MIT


