
MCP Neo4j Entity Graph Server

Python 3.10+ · License: MIT

MCP server for extracting entities and relationships from graph nodes using LLM structured output, creating entity graphs directly in Neo4j.

Supports 100+ LLM providers via LiteLLM (OpenAI, Anthropic, Google, Azure, Bedrock, Ollama, etc.)

Features

  • Dual extraction pipeline: Text-only (LLM) and visual (VLM) extraction auto-routed per chunk
  • Grammar-enforced structured output: Pydantic models used as response_format — the LLM cannot violate the schema
  • Async background processing: Long extractions run in background with job tracking
  • Multi-provider LLM support: Use any LLM via LiteLLM (OpenAI, Claude, Gemini, etc.)
  • Schema-driven: Define entity types and relationships to extract
  • Provenance tracking: EXTRACTED_FROM relationships link entities to source chunks
  • High parallelism: Configurable concurrency (text: up to 50, VLM: up to 50)
  • Batched writes: Optimized Neo4j writes (configurable batch size)
  • Incremental: Only processes nodes without prior extraction (unless force=true)
  • Multi-pass ready: Architecture supports entity-only, relationship-only, and corrective passes (v2)

Tools

convert_schema

Converts data model output from the Data Modeling MCP to a Pydantic extraction schema.

Parameters:

Parameter        Required  Description
modeling_output  Yes       JSON output from the Data Modeling MCP server
output_path      Yes       Path to save the Pydantic .py file (e.g. /path/to/schema.py)

Output:

  • {output_path} — Strongly-typed Pydantic models used as response_format for LLM structured extraction

The .py file can be customized before running extraction:

  • Add Literal types to constrain categorical fields (phase, status, therapeutic area...)
  • Add @field_validator for normalization (strip legal suffixes, resolve aliases...)

extract_entities

Extracts entities and relationships from graph nodes using LLM. Returns immediately with a job ID.

The tool auto-detects chunk types and routes accordingly:

  • Text chunks (type="text"): sent to LLM with text only
  • Image/Table chunks (with imageBase64): sent to VLM with text + image
  • Page nodes (:Page label with imageBase64): sent to VLM with text + page image
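The routing rule above can be sketched in a few lines. This is illustrative only, not the server's implementation; the property names (`type`, `imageBase64`) follow the bullets above:

```python
def route_chunk(node: dict) -> str:
    """Return which extraction pipeline a node would be sent to."""
    if node.get("imageBase64"):
        # Image/table chunks and :Page nodes carry an image payload,
        # so they go to the vision model with text + image.
        return "vlm"
    # Plain text chunks (and anything without an image) use the text-only LLM.
    return "text"

print(route_chunk({"type": "text", "text": "Some paragraph"}))       # text
print(route_chunk({"type": "table", "imageBase64": "iVBORw0..."}))   # vlm
```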

Parameters:

Parameter      Default   Description
schema         required  Path to the Pydantic .py file generated by convert_schema
source_label   "Chunk"   Label of source nodes (Chunk or Page)
force          false     Re-extract all nodes (ignore existing EXTRACTED_FROM)
text_parallel  20        Max concurrent text extractions
vlm_parallel   5         Max concurrent VLM extractions
batch_size     10        Chunks to batch before writing to Neo4j
model          env var   LLM model (defaults to EXTRACTION_MODEL)
pass_type      "full"    One of full, entities_only, relationships_only, corrective
pass_number    1         Pass number for multi-pass extraction

check_extraction_status

Monitor background extraction jobs.

Parameter  Default  Description
job_id     None     Specific job to check. If omitted, returns all jobs.

cancel_extraction

Cancel a running extraction job.

Parameter  Required  Description
job_id     Yes       Job ID to cancel

Quick Start

# 1. Convert schema from Data Modeling MCP
convert_schema(
    modeling_output='{"nodes": [...], "relationships": [...]}',
    output_path="data_models/my_schema.py"
)
# Creates: my_schema.py (Pydantic models, ready for customization)

# 2. (Optional) Open my_schema.py and add Literal constraints / field_validators

# 3. Extract entities (runs in background)
extract_entities(
    schema="data_models/my_schema.py",
)
# Returns: {"job_id": "abc123", "status": "started", ...}

# 4. Check progress
check_extraction_status(job_id="abc123")
# Returns: {"status": "extracting", "chunks_completed": 45, ...}

# 5. Re-run extraction (incremental — only unprocessed nodes)
extract_entities(schema="data_models/my_schema.py")

# 6. Force full re-extraction (e.g. after schema changes)
extract_entities(schema="data_models/my_schema.py", force=True)

Generated Pydantic Models

convert_schema generates a .py file with strongly-typed Pydantic models. The ExtractionOutput class is sent to the LLM as response_format, meaning the LLM output is grammar-constrained — it literally cannot produce values outside the schema.

from typing import ClassVar, Optional
from pydantic import BaseModel, Field, field_validator

class DrugEntity(BaseModel):
    _node_label: ClassVar[str] = "Drug"
    _key_property: ClassVar[str] = "name"

    name: str = Field(..., description="Drug name")
    dose: Optional[str] = Field(default=None, description="Dosage")

    @field_validator("name", mode="before")
    @classmethod
    def _normalize_name(cls, v):
        if isinstance(v, str):
            return v.strip()
        return v

class TreatsRel(BaseModel):
    _relationship_type: ClassVar[str] = "TREATS"
    drug_name: str = Field(..., description="Drug name")
    disease_name: str = Field(..., description="Disease name")

class ExtractionOutput(BaseModel):
    drugs: list[DrugEntity] = Field(default_factory=list)
    treats: list[TreatsRel] = Field(default_factory=list)
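As a sketch of what the extractor gets back: because ExtractionOutput is the response_format, the raw completion is guaranteed to parse into these lists. The payload below is hypothetical (drug and disease names are invented) and is handled here with the standard library only:

```python
import json

# Hypothetical LLM output; it matches ExtractionOutput because the schema
# was enforced as response_format. All names are invented for illustration.
raw = '''{
  "drugs": [{"name": "Examplixumab", "dose": "10 mg"}],
  "treats": [{"drug_name": "Examplixumab", "disease_name": "Examplitis"}]
}'''

payload = json.loads(raw)

# Entities become MERGE targets keyed on _key_property ("name");
# relationships connect the two key values.
for drug in payload["drugs"]:
    print("MERGE (:Drug {name: %r})" % drug["name"])
for rel in payload["treats"]:
    print("(%s)-[:TREATS]->(%s)" % (rel["drug_name"], rel["disease_name"]))
```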

Customizing the Schema

After running convert_schema, open the .py file and add constraints before extraction:

Literal constraints (normalize categorical fields)

from typing import Literal, Optional
from pydantic import BaseModel, Field

class ClinicalProgramEntity(BaseModel):
    # Forces the LLM to pick from these exact values — no more "Phase III" vs "Phase 3"
    phase: Optional[Literal["Phase 1", "Phase 2", "Phase 3", "Registration", "Approved"]] = Field(
        default=None,
        description="Clinical phase — map Phase I→Phase 1, Phase II→Phase 2, Phase III→Phase 3"
    )
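Because the Literal becomes part of the response_format grammar, the allowed values form a closed set. A quick stdlib check shows what the model can and cannot emit:

```python
from typing import Literal, get_args

# Same closed set of values as in the Literal annotation above.
Phase = Literal["Phase 1", "Phase 2", "Phase 3", "Registration", "Approved"]

allowed = set(get_args(Phase))
print("Phase 3" in allowed)    # True: a legal value
print("Phase III" in allowed)  # False: only reachable if the prompt maps it first
```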

Field validators (normalize entity keys to avoid duplicates)

import re

from pydantic import BaseModel, Field, field_validator

_LEGAL_SUFFIX_RE = re.compile(r",?\s*(Inc\.?|Ltd\.?|AG|SE|GmbH|Pharmaceuticals?)\s*$", re.IGNORECASE)

class CompanyEntity(BaseModel):
    name: str = Field(..., description="Company name")

    @field_validator("name", mode="before")
    @classmethod
    def _normalize(cls, v):
        if isinstance(v, str):
            v = v.strip()
            while True:
                cleaned = _LEGAL_SUFFIX_RE.sub("", v).strip()
                if cleaned == v:
                    break
                v = cleaned
        return v

Important: Apply the same normalization to the corresponding relationship field (e.g. company_name in DevelopsRel) so that Neo4j MERGE keys always match.
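A stdlib-only sketch of that symmetry check (company names are invented): the entity key and the relationship-side value must normalize to the same string, otherwise MERGE creates a second, disconnected node:

```python
import re

# Same suffix-stripping rule as the validator above.
_LEGAL_SUFFIX_RE = re.compile(
    r",?\s*(Inc\.?|Ltd\.?|AG|SE|GmbH|Pharmaceuticals?)\s*$", re.IGNORECASE
)

def normalize_company(v: str) -> str:
    """Strip legal suffixes repeatedly until the name is stable."""
    v = v.strip()
    while True:
        cleaned = _LEGAL_SUFFIX_RE.sub("", v).strip()
        if cleaned == v:
            return v
        v = cleaned

# Entity key and relationship-side key normalize to the same MERGE key:
entity_key = normalize_company("Acme Pharmaceuticals, Inc.")
rel_key = normalize_company("Acme Pharmaceuticals Inc")
print(entity_key, rel_key)  # Acme Acme
```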

Environment Variables

Variable          Default                Description
NEO4J_URI         bolt://localhost:7687  Neo4j connection URI
NEO4J_USERNAME    neo4j                  Neo4j username
NEO4J_PASSWORD    (required)             Neo4j password
NEO4J_DATABASE    neo4j                  Neo4j database name
EXTRACTION_MODEL  gpt-5-mini             Default LLM model for extraction
OPENAI_API_KEY    -                      Required for OpenAI models

Usage with Cursor

Add to your ~/.cursor/mcp.json:

{
  "mcpServers": {
    "neo4j-entity-graph": {
      "command": "uv",
      "args": [
        "--directory", "/path/to/mcp-neo4j-entity-graph",
        "run", "mcp-neo4j-entity-graph"
      ],
      "env": {
        "NEO4J_URI": "neo4j://127.0.0.1:7687",
        "NEO4J_USERNAME": "neo4j",
        "NEO4J_PASSWORD": "your-password",
        "OPENAI_API_KEY": "your-api-key",
        "EXTRACTION_MODEL": "gpt-5-mini"
      }
    }
  }
}

Performance

Tested on pharma pipeline PDFs with gpt-5-mini:

Mode               Concurrency  Time  Entities  Relationships
Text-only          50           107s  1,584     1,257
VLM (page images)  50           114s  1,597     1,378

Architecture

server.py           - MCP tools (convert_schema, extract_entities, check/cancel)
job_manager.py      - Async job tracking, progress, cancellation
base_extractor.py   - Shared: prompts, parsing, Pydantic model loading
text_extractor.py   - Text-only LLM extraction (high parallelism)
vlm_extractor.py    - Vision+text VLM extraction (configurable parallelism)
schema_generator.py - Pydantic model code generation from data model
models.py           - Internal types (ExtractionSchema, ClassifiedChunk, etc.)

Graph Schema

After extraction, your Neo4j database will contain:

(:Entity)-[:EXTRACTED_FROM]->(:Chunk)
(:Entity)-[relationship]->(:Entity)

Example query:

MATCH (e)-[:EXTRACTED_FROM]->(c:Chunk)-[:PART_OF]->(d:Document {name: "my-doc"})
RETURN labels(e)[0] AS type, count(e) AS count
ORDER BY count DESC
