


MCP Neo4j Entity Graph Server

Python 3.10+ · MIT License

MCP server for extracting entities and relationships from graph nodes using LLM structured output, creating entity graphs directly in Neo4j.

Supports 100+ LLM providers via LiteLLM (OpenAI, Anthropic, Google, Azure, Bedrock, Ollama, etc.)

Features

  • Dual extraction pipeline: Text-only (LLM) and visual (VLM) extraction auto-routed per chunk
  • Grammar-enforced structured output: Pydantic models used as response_format — the LLM cannot violate the schema
  • Async background processing: Long extractions run in background with job tracking
  • Multi-provider LLM support: Use any LLM via LiteLLM (OpenAI, Claude, Gemini, etc.)
  • Schema-driven: Define entity types and relationships to extract
  • Provenance tracking: EXTRACTED_FROM relationships link entities to source chunks
  • High parallelism: Configurable concurrency (text: up to 50, VLM: up to 50)
  • Batched writes: Optimized Neo4j writes (configurable batch size)
  • Incremental: Only processes nodes without prior extraction (unless force=true)
  • Multi-pass ready: Architecture supports entity-only, relationship-only, and corrective passes (v2)

Tools

convert_schema

Converts data model output from the Data Modeling MCP to a Pydantic extraction schema.

Parameters:

Parameter        Required  Description
modeling_output  Yes       JSON output from the Data Modeling MCP server
output_path      Yes       Path to save the Pydantic .py file (e.g. /path/to/schema.py)

Output:

  • {output_path} — Strongly-typed Pydantic models used as response_format for LLM structured extraction

The .py file can be customized before running extraction:

  • Add Literal types to constrain categorical fields (phase, status, therapeutic area...)
  • Add @field_validator for normalization (strip legal suffixes, resolve aliases...)

extract_entities

Extracts entities and relationships from graph nodes using LLM. Returns immediately with a job ID.

The tool auto-detects chunk types and routes accordingly:

  • Text chunks (type="text"): sent to LLM with text only
  • Image/Table chunks (with imageBase64): sent to VLM with text + image
  • Page nodes (:Page label with imageBase64): sent to VLM with text + page image
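The routing rule above can be sketched as a small classifier (a hypothetical illustration, not the server's actual code; `node` is assumed to be a dict of a graph node's labels and properties):

```python
def route_chunk(node: dict) -> str:
    """Pick the extraction pipeline for a node, per the rules above."""
    if node.get("imageBase64"):
        # Page nodes and image/table chunks both carry an image payload
        return "vlm"   # text + image -> VLM pipeline
    return "text"      # plain text chunk -> text-only LLM pipeline

print(route_chunk({"labels": ["Chunk"], "type": "text"}))        # text
print(route_chunk({"labels": ["Page"], "imageBase64": "..."}))   # vlm
```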

Parameters:

Parameter      Default     Description
schema         (required)  Path to the Pydantic .py file generated by convert_schema
source_label   "Chunk"     Label of source nodes (Chunk or Page)
force          false       Re-extract all nodes (ignore existing EXTRACTED_FROM)
text_parallel  20          Max concurrent text extractions
vlm_parallel   5           Max concurrent VLM extractions
batch_size     10          Chunks to batch before writing to Neo4j
model          (env var)   LLM model (defaults to EXTRACTION_MODEL)
pass_type      "full"      One of: full, entities_only, relationships_only, corrective
pass_number    1           Pass number for multi-pass extraction

check_extraction_status

Monitor background extraction jobs.

Parameter  Default  Description
job_id     None     Specific job to check. If omitted, returns all jobs.

cancel_extraction

Cancel a running extraction job.

Parameter  Required  Description
job_id     Yes       Job ID to cancel

Quick Start

# 1. Convert schema from Data Modeling MCP
convert_schema(
    modeling_output='{"nodes": [...], "relationships": [...]}',
    output_path="data_models/my_schema.py"
)
# Creates: my_schema.py (Pydantic models, ready for customization)

# 2. (Optional) Open my_schema.py and add Literal constraints / field_validators

# 3. Extract entities (runs in background)
extract_entities(
    schema="data_models/my_schema.py",
)
# Returns: {"job_id": "abc123", "status": "started", ...}

# 4. Check progress
check_extraction_status(job_id="abc123")
# Returns: {"status": "extracting", "chunks_completed": 45, ...}

# 5. Re-run extraction (incremental — only unprocessed nodes)
extract_entities(schema="data_models/my_schema.py")

# 6. Force full re-extraction (e.g. after schema changes)
extract_entities(schema="data_models/my_schema.py", force=True)
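The check-progress step above is typically wrapped in a small polling loop. A generic sketch follows; `check_status` is a stand-in for whatever client call invokes check_extraction_status, and the status names match the examples above:

```python
import time

def wait_for_job(check_status, job_id, interval=2.0, timeout=600.0):
    """Poll a status callable until the job leaves its running states."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_status(job_id=job_id)
        if status["status"] not in ("started", "extracting"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {timeout}s")

# Canned responses standing in for real check_extraction_status calls
responses = iter([
    {"status": "extracting"},
    {"status": "completed", "chunks_completed": 120},
])
result = wait_for_job(lambda job_id: next(responses), "abc123", interval=0.01)
print(result["status"])  # completed
```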

Generated Pydantic Models

convert_schema generates a .py file with strongly-typed Pydantic models. The ExtractionOutput class is sent to the LLM as response_format, meaning the LLM output is grammar-constrained — it literally cannot produce values outside the schema.

from typing import ClassVar, Optional
from pydantic import BaseModel, Field, field_validator

class DrugEntity(BaseModel):
    _node_label: ClassVar[str] = "Drug"
    _key_property: ClassVar[str] = "name"

    name: str = Field(..., description="Drug name")
    dose: Optional[str] = Field(default=None, description="Dosage")

    @field_validator("name", mode="before")
    @classmethod
    def _normalize_name(cls, v):
        if isinstance(v, str):
            return v.strip()
        return v

class TreatsRel(BaseModel):
    _relationship_type: ClassVar[str] = "TREATS"
    drug_name: str = Field(..., description="Drug name")
    disease_name: str = Field(..., description="Disease name")

class ExtractionOutput(BaseModel):
    drugs: list[DrugEntity] = Field(default_factory=list)
    treats: list[TreatsRel] = Field(default_factory=list)
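Because ExtractionOutput is a plain Pydantic model, a raw JSON response from the LLM validates straight into typed objects with field validators already applied. A self-contained sketch (assuming pydantic v2; this is an illustration, not the server's actual parsing code):

```python
from typing import ClassVar, Optional
from pydantic import BaseModel, Field, field_validator

class DrugEntity(BaseModel):
    _node_label: ClassVar[str] = "Drug"
    name: str = Field(..., description="Drug name")
    dose: Optional[str] = Field(default=None, description="Dosage")

    @field_validator("name", mode="before")
    @classmethod
    def _normalize_name(cls, v):
        return v.strip() if isinstance(v, str) else v

class ExtractionOutput(BaseModel):
    drugs: list[DrugEntity] = Field(default_factory=list)

# A raw structured-output payload, as the LLM might return it
raw = '{"drugs": [{"name": "  Aspirin ", "dose": "100 mg"}]}'
out = ExtractionOutput.model_validate_json(raw)
print(out.drugs[0].name)  # validator has stripped whitespace: "Aspirin"
```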

Customizing the Schema

After running convert_schema, open the .py file and add constraints before extraction:

Literal constraints (normalize categorical fields)

from typing import Literal, Optional
from pydantic import BaseModel, Field

class ClinicalProgramEntity(BaseModel):
    # Forces the LLM to pick from these exact values — no more "Phase III" vs "Phase 3"
    phase: Optional[Literal["Phase 1", "Phase 2", "Phase 3", "Registration", "Approved"]] = Field(
        default=None,
        description="Clinical phase — map Phase I→Phase 1, Phase II→Phase 2, Phase III→Phase 3"
    )
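With the Literal in place, off-list values fail validation instead of reaching the graph; in a grammar-constrained call the LLM cannot emit them in the first place. A quick check (assuming pydantic v2):

```python
from typing import Literal, Optional
from pydantic import BaseModel, ValidationError

class ClinicalProgramEntity(BaseModel):
    phase: Optional[Literal["Phase 1", "Phase 2", "Phase 3", "Registration", "Approved"]] = None

print(ClinicalProgramEntity(phase="Phase 2").phase)  # accepted
try:
    ClinicalProgramEntity(phase="Phase III")         # not in the Literal
except ValidationError:
    print("rejected")
```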

Field validators (normalize entity keys to avoid duplicates)

import re
from pydantic import BaseModel, field_validator

_LEGAL_SUFFIX_RE = re.compile(r",?\s*(Inc\.?|Ltd\.?|AG|SE|GmbH|Pharmaceuticals?)\s*$", re.IGNORECASE)

class CompanyEntity(BaseModel):
    name: str

    @field_validator("name", mode="before")
    @classmethod
    def _normalize(cls, v):
        if isinstance(v, str):
            v = v.strip()
            while True:
                cleaned = _LEGAL_SUFFIX_RE.sub("", v).strip()
                if cleaned == v:
                    break
                v = cleaned
        return v

Important: Apply the same normalization to the corresponding relationship field (e.g. company_name in DevelopsRel) so that Neo4j MERGE keys always match.
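One way to keep the two sides in sync is a single normalizer function shared by the entity and relationship models (a sketch assuming pydantic v2; the field names follow the DevelopsRel example above):

```python
import re
from pydantic import BaseModel, field_validator

_LEGAL_SUFFIX_RE = re.compile(r",?\s*(Inc\.?|Ltd\.?|AG|SE|GmbH|Pharmaceuticals?)\s*$", re.IGNORECASE)

def normalize_company(v):
    """Strip legal suffixes repeatedly so MERGE keys always match."""
    if isinstance(v, str):
        v = v.strip()
        while True:
            cleaned = _LEGAL_SUFFIX_RE.sub("", v).strip()
            if cleaned == v:
                break
            v = cleaned
    return v

class CompanyEntity(BaseModel):
    name: str

    @field_validator("name", mode="before")
    @classmethod
    def _norm(cls, v):
        return normalize_company(v)

class DevelopsRel(BaseModel):
    company_name: str
    drug_name: str

    @field_validator("company_name", mode="before")
    @classmethod
    def _norm(cls, v):
        return normalize_company(v)

# Both sides normalize to the same MERGE key
print(CompanyEntity(name="Acme Pharmaceuticals, Inc.").name)                               # Acme
print(DevelopsRel(company_name="Acme Pharmaceuticals, Inc.", drug_name="X").company_name)  # Acme
```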

Environment Variables

Variable          Default                Description
NEO4J_URI         bolt://localhost:7687  Neo4j connection URI
NEO4J_USERNAME    neo4j                  Neo4j username
NEO4J_PASSWORD    (required)             Neo4j password
NEO4J_DATABASE    neo4j                  Neo4j database name
EXTRACTION_MODEL  gpt-5-mini             Default LLM model for extraction
OPENAI_API_KEY    -                      Required for OpenAI models

Usage with Cursor

Add to your ~/.cursor/mcp.json:

{
  "mcpServers": {
    "neo4j-entity-graph": {
      "command": "uv",
      "args": [
        "--directory", "/path/to/mcp-neo4j-entity-graph",
        "run", "mcp-neo4j-entity-graph"
      ],
      "env": {
        "NEO4J_URI": "neo4j://127.0.0.1:7687",
        "NEO4J_USERNAME": "neo4j",
        "NEO4J_PASSWORD": "your-password",
        "OPENAI_API_KEY": "your-api-key",
        "EXTRACTION_MODEL": "gpt-5-mini"
      }
    }
  }
}

Performance

Tested on pharma pipeline PDFs with gpt-5-mini:

Mode               Concurrency  Time  Entities  Relationships
Text-only          50           107s  1,584     1,257
VLM (page images)  50           114s  1,597     1,378

Architecture

server.py           - MCP tools (convert_schema, extract_entities, check/cancel)
job_manager.py      - Async job tracking, progress, cancellation
base_extractor.py   - Shared: prompts, parsing, Pydantic model loading
text_extractor.py   - Text-only LLM extraction (high parallelism)
vlm_extractor.py    - Vision+text VLM extraction (configurable parallelism)
schema_generator.py - Pydantic model code generation from data model
models.py           - Internal types (ExtractionSchema, ClassifiedChunk, etc.)

Graph Schema

After extraction, your Neo4j database will contain:

(:Entity)-[:EXTRACTED_FROM]->(:Chunk)
(:Entity)-[relationship]->(:Entity)

Example query:

MATCH (e)-[:EXTRACTED_FROM]->(c:Chunk)-[:PART_OF]->(d:Document {name: "my-doc"})
RETURN labels(e)[0] AS type, count(e) AS count
ORDER BY count DESC
