# mcp-neo4j-lexical-graph
MCP server for creating rich lexical graphs from PDF documents in Neo4j. Designed for Neo4j sales engineers to quickly build PDF-to-graph and GraphRAG agent chatbot POCs.
Supports four parsing strategies (PyMuPDF, Docling, page-image, VLM block ordering), pluggable chunking, document versioning, VLM-based description generation, and vector/fulltext search with Neo4j 2026.01 native VECTOR type and document-name prefiltering.
## Graph Model

```mermaid
graph LR
    Doc[Document] -->|HAS_PAGE| Page
    Doc -->|HAS_ELEMENT| Img[Image]
    Doc -->|HAS_ELEMENT| Tbl[Table]
    Doc -->|HAS_SECTION| Sec[Section]
    Sec -->|HAS_SUBSECTION| Sec
    Chunk -->|PART_OF| Doc
    Chunk -->|NEXT_CHUNK| Chunk
    Chunk -->|HAS_ELEMENT| Img
    Chunk -->|HAS_ELEMENT| Tbl
    Page -->|NEXT_PAGE| Page
```
Node types depend on the parse mode used. See Parse Modes below.
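Once a document is ingested, the model above can be queried directly with Cypher. The label and relationship names below come from the diagram; property names such as `name` and `text` are assumptions for illustration, not guaranteed by the server. Run the queries with the official `neo4j` driver, e.g. `driver.execute_query(CHUNK_SEQUENCE, doc="report.pdf")`.

```python
# Hedged Cypher sketches against the graph model above. Labels and
# relationship types are from this README; the `name` and `text`
# properties are assumptions.

# Walk the chunk chain of one document via PART_OF and NEXT_CHUNK.
CHUNK_SEQUENCE = """
MATCH (d:Document {name: $doc})<-[:PART_OF]-(c:Chunk)
OPTIONAL MATCH (c)-[:NEXT_CHUNK]->(nxt:Chunk)
RETURN c.text AS text, nxt.text AS next_text
"""

# Pull the section tree rooted at a document's top-level sections.
SECTION_TREE = """
MATCH (d:Document {name: $doc})-[:HAS_SECTION]->(s:Section)
OPTIONAL MATCH path = (s)-[:HAS_SUBSECTION*]->(sub:Section)
RETURN s, collect(path) AS subtree
"""
```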
## Parse Modes

| Mode | Nodes created | Best for |
|---|---|---|
| `pymupdf` | Document, Chunk, Image, Table | General-purpose text + visual extraction |
| `docling` | Document, Page, Element, Section (then Chunk via chunking tool) | Complex layouts, section-aware chunking |
| `page_image` | Document, Page | Slides/presentations for VLM-based extraction |
| `vlm_blocks` | Document, Page, Element, Section (then Chunk via chunking tool) | Experimental: complex layouts without the docling dependency (uses a VLM API). Prefer `docling` for production use. |
## Quick Start

```shell
cd mcp-neo4j-lexical-graph
uv sync
```
## Cursor MCP Configuration

Add to your `.cursor/mcp.json`:

```json
{
  "mcpServers": {
    "neo4j-lexical-graph": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/mcp-neo4j-lexical-graph",
        "run",
        "mcp-neo4j-lexical-graph"
      ],
      "env": {
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USERNAME": "neo4j",
        "NEO4J_PASSWORD": "your-password",
        "NEO4J_DATABASE": "neo4j",
        "EMBEDDING_MODEL": "text-embedding-3-small",
        "EXTRACTION_MODEL": "gpt-5-mini",
        "OPENAI_API_KEY": "sk-..."
      }
    }
  }
}
```
## Tools

Tools must be called in a specific order; which tools to call depends on the parse mode and document type. See the workflow table below.
### Workflow Order

| # | Tool | pymupdf | docling | page_image | vlm_blocks | Notes |
|---|---|---|---|---|---|---|
| 1 | `create_lexical_graph` | ✓ | ✓ | ✓ | ✓ | Always first. Async; returns `job_id`. |
| 2 | `check_processing_status` | ✓ | ✓ | ✓ | ✓ | Poll until complete after any async operation. |
| 3 | `cancel_job` | opt | opt | opt | opt | Only if aborting a running job. |
| 4 | `chunk_lexical_graph` | ✗ | ✓ | ✓ | ✓ | Required for docling/vlm_blocks/page_image; integrated into create for pymupdf. |
| 5 | `list_documents` | ✓ | ✓ | ✓ | ✓ | Confirm ingestion, get document IDs. |
| 6 | `verify_lexical_graph` | opt | opt | ✗ never | opt | Single-doc spot-check only. Never for page_image (base64 flood). |
| 7 | `assign_section_hierarchy` | ✗ | opt | ✗ | opt | For structured docs with nested sections. Uses `EXTRACTION_MODEL`. |
| 8 | `generate_chunk_descriptions` | recommended¹ | recommended¹ | required | recommended¹ | VLM descriptions for Image/Table/Page nodes. Required before `embed_chunks` for page_image. |
| 9 | `embed_chunks` | ✓ | ✓ | ✓ | ✓ | Synchronous. Call with no parameters; auto-detects `textDescription`. |
| 10 | `set_active_version` | opt | opt | opt | opt | Only when re-ingesting a document. |
| 11 | `clean_inactive` | opt | opt | opt | opt | After `set_active_version`, to remove old versions. |
| 12 | `delete_document` | opt | opt | opt | opt | Destructive; removes the document and all children. |

¹ Recommended when `extract_images=True` or `extract_tables=True` (pymupdf), or when the document contains images/tables (docling/vlm_blocks). Without descriptions, Image/Table nodes are invisible to semantic search.
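The required steps for a docling ingest can be sketched as a client-side loop. The tool names below are the real ones from this README; the `call_tool` callable, its parameter names (`path`, `parse_mode`, `strategy`, `job_id`), and the shape of the returned dicts are assumptions standing in for whatever MCP client you use.

```python
import time

# Sketch of the docling workflow order above against a hypothetical MCP
# client exposing call_tool(name, **args). Tool names are real; the client
# stub, argument names, and return shapes are assumptions.
def ingest(call_tool, pdf_path: str):
    # 1. Kick off async parsing; the job id key is assumed to be "job_id".
    job = call_tool("create_lexical_graph", path=pdf_path, parse_mode="docling")
    # 2. Poll until the background job reports completion.
    while call_tool("check_processing_status", job_id=job["job_id"])["status"] != "complete":
        time.sleep(2)
    # 4. docling requires an explicit chunking pass (pymupdf would not).
    call_tool("chunk_lexical_graph", strategy="by_section")
    # 5. Confirm ingestion and capture document ids.
    docs = call_tool("list_documents")
    # 8. Recommended when the document contains images/tables.
    call_tool("generate_chunk_descriptions")
    # 9. Synchronous; no parameters, textDescription is auto-detected.
    call_tool("embed_chunks")
    return docs
```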
### Tool Reference

| Tool | Description |
|---|---|
| `create_lexical_graph` | Parse PDF(s) and create the graph (async; returns `job_id`). `max_parallel=0` auto-detects worker count from RAM/CPU. |
| `check_processing_status` | Monitor background job progress. |
| `cancel_job` | Cancel a running background job (optional cleanup of partial data). |
| `chunk_lexical_graph` | Create Chunk nodes from Elements (4 strategies: `token_window`, `structured`, `by_section`, `by_page`). |
| `list_documents` | Inventory of documents with version and chunk-count info. |
| `verify_lexical_graph` | Structural checks + Markdown reconstruction (single-doc only). |
| `assign_section_hierarchy` | LLM-based section-level assignment; rebuilds `HAS_SUBSECTION` and updates `sectionContext` on chunks. Omit `document_id` to run all active documents in parallel. |
| `generate_chunk_descriptions` | VLM descriptions for Image/Table/Page nodes, stored as `textDescription`. `document_id` optional: omit to run for all active documents. |
| `embed_chunks` | Vector embeddings + fulltext index. Auto-detects `textDescription` for unified Table/Image/text embedding. |
| `set_active_version` | Activate a specific document/chunk version. |
| `clean_inactive` | Delete inactive document versions and chunk sets. |
| `delete_document` | Remove a document version with cascade (pages, elements, sections, chunks). |
## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `NEO4J_URI` | Yes | `bolt://localhost:7687` | Neo4j connection URI |
| `NEO4J_USERNAME` | Yes | `neo4j` | Neo4j username |
| `NEO4J_PASSWORD` | Yes | - | Neo4j password |
| `NEO4J_DATABASE` | No | `neo4j` | Database name |
| `EMBEDDING_MODEL` | No | `text-embedding-3-small` | Default embedding model (LiteLLM providers) |
| `EXTRACTION_MODEL` | No | `gpt-5-mini` | LLM/VLM for section hierarchy and description generation |
| `OPENAI_API_KEY` | Depends | - | Required when using OpenAI models for embedding or extraction. Other providers use their own key (e.g. `ANTHROPIC_API_KEY`, `AZURE_API_KEY`); see the LiteLLM docs. |
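The table's variable names and defaults translate directly into a config loader. Variable names and default values below are from this README; the loader function itself is a hypothetical helper, not part of the server's API.

```python
import os

# Mirrors the environment-variable table above: required variables raise
# KeyError when missing, optional ones fall back to the documented defaults.
# The function is illustrative, not part of the server's API.
def load_config(env=os.environ):
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env["NEO4J_PASSWORD"],  # required, no default
        "database": env.get("NEO4J_DATABASE", "neo4j"),
        "embedding_model": env.get("EMBEDDING_MODEL", "text-embedding-3-small"),
        "extraction_model": env.get("EXTRACTION_MODEL", "gpt-5-mini"),
    }
```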
## Requirements

- Neo4j 2026.01+ (native VECTOR type, vector search with filters)
- Python 3.10+
- API key for your embedding provider (OpenAI, Azure, Cohere, Voyage, Ollama, etc.)
- API key for a VLM if using `vlm_blocks` mode, `generate_chunk_descriptions`, or `assign_section_hierarchy`
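For searching the embedded chunks, a hedged sketch using the long-standing `db.index.vector.queryNodes` procedure with a `WHERE` filter on document name is shown below. Note the caveats: the server advertises Neo4j 2026.01 native VECTOR-type search with prefiltering, whose exact syntax is not reproduced here, and the index name and property names (`documentName`, `text`) are assumptions.

```python
# Hedged sketch: vector search over Chunk embeddings, filtered to one
# document. Uses the classic queryNodes procedure (a post-filter on the
# top-k results), not the 2026.01 native prefiltered form the server
# describes. Index and property names are assumptions.
VECTOR_SEARCH = """
CALL db.index.vector.queryNodes($index_name, $k, $query_embedding)
YIELD node, score
WHERE node.documentName = $doc
RETURN node.text AS text, score
ORDER BY score DESC
"""
```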