
mcp-neo4j-lexical-graph

MCP server for creating rich lexical graphs from PDF documents in Neo4j. Designed for Neo4j sales engineers to quickly build PDF-to-graph and GraphRAG agent chatbot POCs.

Supports four parsing strategies (PyMuPDF, Docling, page-image, VLM block ordering), pluggable chunking, document versioning, VLM-based description generation, and vector/fulltext search with Neo4j 2026.01 native VECTOR type and document-name prefiltering.

Graph Model

```mermaid
graph LR
    Doc[Document] -->|HAS_PAGE| Page
    Doc -->|HAS_ELEMENT| Img[Image]
    Doc -->|HAS_ELEMENT| Tbl[Table]
    Doc -->|HAS_SECTION| Sec[Section]
    Sec -->|HAS_SUBSECTION| Sec
    Chunk -->|PART_OF| Doc
    Chunk -->|NEXT_CHUNK| Chunk
    Chunk -->|HAS_ELEMENT| Img
    Chunk -->|HAS_ELEMENT| Tbl
    Page -->|NEXT_PAGE| Page
```

Node types depend on the parse mode used. See Parse Modes below.
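
Once a graph is loaded, the model above can be queried directly with Cypher. A minimal sketch of an ordered-chunk traversal, embedded in Python for use with the official neo4j driver — the property names (`d.name`, `c.text`) are illustrative assumptions, not confirmed by this README:

```python
# Illustrative Cypher for the graph model above: fetch one document's chunks
# in reading order by following PART_OF and the NEXT_CHUNK chain.
# Property names (name, text) are assumptions; check the actual schema.
CHUNKS_IN_ORDER = """
MATCH (c:Chunk)-[:PART_OF]->(d:Document {name: $doc_name})
WHERE NOT ()-[:NEXT_CHUNK]->(c)          // head of the chunk chain
MATCH path = (c)-[:NEXT_CHUNK*0..]->(last)
WHERE NOT (last)-[:NEXT_CHUNK]->()
RETURN [n IN nodes(path) | n.text] AS texts
"""

def fetch_chunks(session, doc_name: str) -> list[str]:
    """Run the query with a neo4j driver session; return chunk texts in order."""
    record = session.run(CHUNKS_IN_ORDER, doc_name=doc_name).single()
    return record["texts"] if record else []
```

With a real driver you would call `fetch_chunks(driver.session(database="neo4j"), "my-doc")`.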

Parse Modes

| Mode | Nodes created | Best for |
|------|---------------|----------|
| pymupdf | Document, Chunk, Image, Table | General-purpose text + visual extraction |
| docling | Document, Page, Element, Section (then Chunk via chunking tool) | Complex layouts, section-aware chunking |
| page_image | Document, Page | Slides/presentations for VLM-based extraction |
| vlm_blocks | Document, Page, Element, Section (then Chunk via chunking tool) | Experimental. Complex layouts without the docling dependency (uses a VLM API). Prefer docling for production use. |

Quick Start

```shell
cd mcp-neo4j-lexical-graph
uv sync
```

Cursor MCP Configuration

Add to your .cursor/mcp.json:

```json
{
  "mcpServers": {
    "neo4j-lexical-graph": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/mcp-neo4j-lexical-graph",
        "run",
        "mcp-neo4j-lexical-graph"
      ],
      "env": {
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USERNAME": "neo4j",
        "NEO4J_PASSWORD": "your-password",
        "NEO4J_DATABASE": "neo4j",
        "EMBEDDING_MODEL": "text-embedding-3-small",
        "EXTRACTION_MODEL": "gpt-5-mini",
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```

Tools

Tools must be called in a specific order — which tools to call depends on the parse mode and document type. See the workflow table below.

Workflow Order

| # | Tool | pymupdf | docling | page_image | vlm_blocks | Notes |
|---|------|---------|---------|------------|------------|-------|
| 1 | create_lexical_graph | ✓ | ✓ | ✓ | ✓ | Always first. Async — returns job_id. |
| 2 | check_processing_status | ✓ | ✓ | ✓ | ✓ | Poll until complete after any async op. |
| 3 | cancel_job | opt | opt | opt | opt | Only if aborting a running job. |
| 4 | chunk_lexical_graph | ✗ | ✓ | ✓ | ✓ | Required for docling/vlm_blocks/page_image. Integrated into create for pymupdf. |
| 5 | list_documents | ✓ | ✓ | ✓ | ✓ | Confirm ingestion, get document IDs. |
| 6 | verify_lexical_graph | opt | opt | ✗ never | opt | Single-doc spot-check only. Never for page_image (base64 flood). |
| 7 | assign_section_hierarchy | ✗ | opt | ✗ | opt | For structured docs with nested sections. Uses EXTRACTION_MODEL. |
| 8 | generate_chunk_descriptions | recommended¹ | recommended¹ | required | recommended¹ | VLM descriptions for Image/Table/Page nodes. Required before embed_chunks for page_image. |
| 9 | embed_chunks | ✓ | ✓ | ✓ | ✓ | Synchronous. Call with no parameters — auto-detects textDescription. |
| 10 | set_active_version | opt | opt | opt | opt | Only when re-ingesting a document. |
| 11 | clean_inactive | opt | opt | opt | opt | After set_active_version, to remove old versions. |
| 12 | delete_document | opt | opt | opt | opt | Destructive — removes document + all children. |

¹ Recommended when extract_images=True or extract_tables=True (pymupdf) or when the document contains images/tables (docling/vlm_blocks). Without descriptions, Image/Table nodes are invisible to semantic search.
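
The async pattern in steps 1–2 can be sketched as a generic poll loop. The `call_tool` callable and the `"status"` field values here are illustrative assumptions, not the server's exact MCP response schema:

```python
import time

def wait_for_job(call_tool, job_id: str, interval: float = 2.0,
                 timeout: float = 600.0) -> dict:
    """Poll check_processing_status until a job finishes or times out.

    `call_tool(name, args)` stands in for whatever MCP client you use;
    the terminal status values are assumptions based on this README.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = call_tool("check_processing_status", {"job_id": job_id})
        if status.get("status") in ("complete", "failed", "cancelled"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")
```

With a real client: call create_lexical_graph first, pass the returned job_id to `wait_for_job`, then proceed to chunking and embedding.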

Tool Reference

| Tool | Description |
|------|-------------|
| create_lexical_graph | Parse PDF(s) and create the graph (async, returns job_id). max_parallel=0 auto-detects worker count from RAM/CPU. |
| check_processing_status | Monitor background job progress. |
| cancel_job | Cancel a running background job (optional cleanup of partial data). |
| chunk_lexical_graph | Create Chunk nodes from Elements (4 strategies: token_window, structured, by_section, by_page). |
| list_documents | Inventory of documents with version and chunk-count info. |
| verify_lexical_graph | Structural checks + Markdown reconstruction (single-doc only). |
| assign_section_hierarchy | LLM-based section level assignment; rebuilds HAS_SUBSECTION and updates sectionContext on chunks. Omit document_id to run all active documents in parallel. |
| generate_chunk_descriptions | VLM descriptions for Image/Table/Page nodes, stored as textDescription. Omit document_id to run for all active documents. |
| embed_chunks | Vector embeddings + fulltext index. Auto-detects textDescription for unified Table/Image/text embedding. |
| set_active_version | Activate a specific document/chunk version. |
| clean_inactive | Delete inactive document versions and chunk sets. |
| delete_document | Remove a document version with cascade (pages, elements, sections, chunks). |
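
The token_window strategy named above is a standard sliding-window chunker. A rough sketch of the idea — whitespace tokenization and the parameter names are simplifications for illustration, not the server's implementation:

```python
def token_window_chunks(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of `window` tokens.

    Whitespace tokenization stands in for a real tokenizer here.
    Consecutive windows share `overlap` tokens of context.
    """
    tokens = text.split()
    if not tokens:
        return []
    step = max(window - overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

For example, `token_window_chunks("a b c d e f", window=4, overlap=2)` yields `["a b c d", "c d e f"]`.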

Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| NEO4J_URI | Yes | bolt://localhost:7687 | Neo4j connection URI |
| NEO4J_USERNAME | Yes | neo4j | Neo4j username |
| NEO4J_PASSWORD | Yes | - | Neo4j password |
| NEO4J_DATABASE | No | neo4j | Database name |
| EMBEDDING_MODEL | No | text-embedding-3-small | Default embedding model (any LiteLLM provider) |
| EXTRACTION_MODEL | No | gpt-5-mini | LLM/VLM for section hierarchy and description generation |
| OPENAI_API_KEY | Depends | - | Required when using OpenAI models for embedding or extraction. Other providers use their own key (e.g. ANTHROPIC_API_KEY, AZURE_API_KEY); see the LiteLLM docs. |
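
The defaults above can be mirrored in a small settings loader. A sketch — the variable names and defaults come from the table; the loader itself is not part of this package:

```python
import os

def load_settings(env=os.environ) -> dict:
    """Read the documented environment variables, applying the table's
    defaults. NEO4J_PASSWORD has no default and must be set."""
    settings = {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD"),
        "database": env.get("NEO4J_DATABASE", "neo4j"),
        "embedding_model": env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        "extraction_model": env.get("EXTRACTION_MODEL", "gpt-5-mini"),
    }
    if not settings["password"]:
        raise ValueError("NEO4J_PASSWORD is required")
    return settings
```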

Requirements

  • Neo4j 2026.01+ (native VECTOR type, vector search with filters)
  • Python 3.10+
  • API key for your embedding provider (OpenAI, Azure, Cohere, Voyage, Ollama, etc.)
  • API key for VLM if using vlm_blocks mode, generate_chunk_descriptions, or assign_section_hierarchy

Project details

Distribution files for version 0.2.0:

  • Source distribution: mcp_neo4j_lexical_graph-0.2.0.tar.gz (346.7 kB)
  • Built distribution: mcp_neo4j_lexical_graph-0.2.0-py3-none-any.whl (80.8 kB, py3-none-any)
