Local-first Knowledge GraphRAG MCP server
Knowledge GraphRAG MCP Server
This module hosts the local-first Knowledge GraphRAG MCP server defined in ../docs/knowledge-graph-rag-mcp.md. The spec outlines the document watcher/normalizer pipeline, entity & relation extraction, EmbeddingGemma vectorization, SQLite (sqlite-vec, bfsvtab) storage, and the MCP tool surface for retrieval and maintenance.
Next Steps
- Break down the spec into implementation milestones (schema, normalizers, extraction/linking, retrieval).
- Scaffold the codebase (src/, tests/, configuration) aligned with the deliverables checklist.
- Review the spec for architecture diagrams, pseudo-code, and validation guidance.
Packaging
The project is published as a Python package with a bundled copy of the bfsvtab
SQLite extension. Local builds automatically compile the shared library when
pip installs the project. To produce distribution artifacts:
python -m build
The build step generates dist/*.whl and dist/*.tar.gz artifacts that embed
bfsvtab alongside the Python modules, so downstream users do not need to
compile extensions manually.
MCP Server — Knowledge GraphRAG (Local-First)
Goal: A single binary MCP server that watches a knowledge directory, auto‑ingests documents, builds a hybrid GraphRAG index (entities + relations + vectors), and exposes minimal, stable MCP tools for retrieval and maintenance. Fully local: SQLite (+ sqlite-vec for vectors, bfsvtab for k‑hop traversal), EmbeddingGemma for embeddings, and small NER + pattern extractors. Designed to run multiple instances (one per project) with near‑zero ops.
1) Scope
In-scope
- Directory watcher → normalize docs (md/html/pdf/docx/txt) → chunk → entity/mention extraction → entity linking → relation extraction → vectorization → SQLite write
- Hybrid retrieval: vector prefilter + graph expansion + re‑ranking
- MCP tools for ingest, refresh, search, explain, status
- Incremental updates, WAL, single-writer queue
Out-of-scope (for v1)
- Heavy LLM extraction/validation loops
- Graph algorithms beyond BFS (migrate to Memgraph later if needed)
2) Architecture
flowchart LR
W[File Watcher] -->|paths| Q[Job Queue]
Q -->|batch| P[Parser + Normalizer]
P --> C[Chunker]
C --> E1[Entity & Mention Extractor]
E1 --> L[Entity Linker]
L --> R[Relation Extractor]
R --> V["Embedder (EmbeddingGemma)"]
V --> DB[(SQLite + sqlite-vec + bfsvtab)]
subgraph MCP["MCP Server"]
T1[ingest_docs]
T2[extract_and_link]
T3[hybrid_query]
T4[entity_lookup]
T5[explain_entity]
T6[status]
end
DB <-->|read/write| MCP
Concurrency model: Single writer connection (queued), many readers. SQLite WAL mode.
3) Directory Watching & Update Strategy
- Watchers: watchdog (Python) / chokidar (Node). Cross‑platform.
- Debounce: 200–500 ms per path; coalesce bursts.
- Events: create/modify → enqueue reindex(path); rename → unlink+add; delete → mark file deleted and purge rows.
- Transactions: Per file (or small batch): BEGIN IMMEDIATE → write → COMMIT.
- WAL/Timeouts: PRAGMA journal_mode=WAL; PRAGMA synchronous=NORMAL; PRAGMA busy_timeout=3000;
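The debounce rule above can be sketched as a small per-path coalescer. This is a minimal illustration, not the project's implementation: the `Debouncer` class, its `add`/`due` methods, and the 0.3 s quiet window are assumptions chosen to match the 200–500 ms range, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
import time
from typing import Dict, List, Optional, Tuple


class Debouncer:
    """Coalesce bursts of events per path; release a path once it has been quiet."""

    def __init__(self, quiet_s: float = 0.3) -> None:
        self.quiet_s = quiet_s
        # path -> (latest event kind, time of the latest event)
        self._pending: Dict[str, Tuple[str, float]] = {}

    def add(self, path: str, kind: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        # A newer event for the same path replaces the older one and restarts the timer.
        self._pending[path] = (kind, now)

    def due(self, now: Optional[float] = None) -> List[Tuple[str, str]]:
        """Return and forget every (path, kind) that has been quiet long enough."""
        now = time.monotonic() if now is None else now
        ready = [(p, k) for p, (k, ts) in self._pending.items()
                 if now - ts >= self.quiet_s]
        for p, _ in ready:
            del self._pending[p]
        return ready


d = Debouncer(quiet_s=0.3)
d.add("/knowledge/a.md", "create", now=0.00)
d.add("/knowledge/a.md", "modify", now=0.10)  # burst on the same path, coalesced
d.add("/knowledge/b.md", "modify", now=0.25)
first = d.due(now=0.45)    # a.md has been quiet long enough, b.md has not
second = d.due(now=0.60)   # b.md is now past the quiet window
```

In a real writer loop, `due()` would be polled between queue reads so that only settled paths reach the parser.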
4) Document Normalization
- Types: .md, .txt, .html, .pdf, .docx (pluggable).
- HTML→MD, PDF→text (keep headings, lists, code blocks where possible).
- Boilerplate removal: drop nav/TOC/footers by CSS selectors / heuristics.
- De‑dup: MinHash/SimHash on paragraph hashes; skip near-duplicates.
- Metadata: source, path, mtime_ms, title, lang, breadcrumbs, tags, security, hash.
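To make the de-dup step concrete, here is a small sketch that skips exact and near-duplicate paragraphs. It computes exact Jaccard similarity over word shingles rather than the MinHash/SimHash approximations named above, which is simpler but slower on large corpora. The function names, the 3-word shingle size, and the 0.8 threshold are assumptions for illustration.

```python
import hashlib
import re
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Lower-cased word n-grams; punctuation is ignored."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(paragraphs: List[str], threshold: float = 0.8) -> List[str]:
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    seen_hashes: Set[str] = set()
    for p in paragraphs:
        # Exact duplicates (after case/whitespace normalization) are caught by hash.
        normalized = " ".join(p.lower().split())
        h = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if h in seen_hashes:
            continue
        # Near-duplicates are caught by shingle overlap with anything already kept.
        s = shingles(p)
        if any(jaccard(s, k) >= threshold for k in kept_shingles):
            continue
        seen_hashes.add(h)
        kept.append(p)
        kept_shingles.append(s)
    return kept


paras = [
    "Install the SDK with pip before running the quick start guide.",
    "Install the SDK with pip before running the quick start guide!",
    "install the  SDK with pip before running the quick start guide.",
    "Retention policies are configured per bucket.",
]
unique = dedupe(paras)
```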
5) Chunking
- General docs: 500–1,000 tokens; sliding window 10–20% overlap.
- FAQs/definitions: 150–400 tokens.
- Procedures: 1,000–2,000 tokens (keep step lists intact).
- Prelude: prefix each chunk with path • section • last 2 headings for disambiguation.
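A sliding-window chunker with a prelude might look like the sketch below. It assumes tokens are already produced by some tokenizer (plain strings here), and the helper names `chunk_tokens` and `make_prelude` are hypothetical.

```python
from typing import List


def make_prelude(path: str, section: str, headings: List[str]) -> str:
    """Build the 'path • section • last 2 headings' prefix."""
    return " • ".join([path, section] + headings[-2:])


def chunk_tokens(tokens: List[str], size: int = 800, overlap: float = 0.15) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` of its length with the previous one."""
    step = max(1, size - round(size * overlap))
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
        start += step
    return chunks


tokens = [str(i) for i in range(25)]
windows = chunk_tokens(tokens, size=10, overlap=0.2)
prelude = make_prelude("docs/setup.md", "Install", ["Guide", "Setup", "Install"])
```

With `size=10` and `overlap=0.2`, consecutive windows share two tokens, so a sentence that straddles a boundary still appears whole in at least one chunk most of the time.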
6) Entity & Mention Extraction (label‑free tolerant)
Hybrid approach:
- Small NER (spaCy en_core_web_sm / distilBERT‑NER) for PERSON/ORG/GPE/PRODUCT/DATE.
- Auto‑gazetteers (no labels):
  - Mine n‑grams (1–4) from titles/headings/bold/code spans across the corpus; keep top‑K per project.
  - Mine CamelCase/SNAKE_CASE/API tokens; semantic version strings.
- Regex/patterns: IDs, standards, versions (e.g., v?\d+\.\d+(\.\d+)?(-[a-z0-9]+)?), “Step n:”, “Prerequisites”.
- Coref‑lite: pronouns/demonstratives resolved to nearest compatible entity in the same section.
Mention record: surface, type, span, model_score, features{isGazetteerHit, inTitle, codeFont, editDistance, ...}, contextSnippet.
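As an illustration of the pattern-based part of this step, the sketch below mines version strings and CamelCase/SNAKE_CASE tokens and emits simplified mention records. The type labels (`VERSION`, `API_TOKEN`, `CONSTANT`) and the exact CamelCase/SNAKE_CASE regexes are assumptions; only the version pattern comes from the text above, with word boundaries added.

```python
import re
from typing import Dict, List

PATTERNS = {
    "VERSION": re.compile(r"\bv?\d+\.\d+(?:\.\d+)?(?:-[a-z0-9]+)?\b"),
    "API_TOKEN": re.compile(r"\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b"),   # CamelCase
    "CONSTANT": re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b"),        # SNAKE_CASE
}


def pattern_mentions(text: str) -> List[Dict]:
    """Return mention records (surface, type, span) sorted by position."""
    found = []
    for mention_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append({"surface": m.group(0), "type": mention_type, "span": m.span()})
    return sorted(found, key=lambda rec: rec["span"][0])


text = "FooClient 2.4.1 reads MAX_RETRIES; upgrade to v3.0-beta1 soon."
mentions = pattern_mentions(text)
surfaces = [m["surface"] for m in mentions]
```

A fuller record would also carry the model score, feature flags, and a context snippet as listed in the mention record above.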
7) Entity Linking (canonicalization)
- Blocking: normalize surface → norm = lowercase(alnum_only(surface)).
  - Candidates: entities sharing norm (edit distance ≤1), token Jaccard ≥0.6, or vector sim ≥ τ₁ via entity_vec.
- Scoring:
score(E) = 0.45*cosine( embed(surface + context), E.embedding )
         + 0.25*string_sim(surface, E.name/aliases)
         + 0.15*type_prior
         + 0.10*context_overlap(headings/tags)
         + 0.05*popularity(E)
- If score ≥ τ₂ (e.g., 0.72) → link; else create new entity with aliases=[surface].
- Store entity embedding from name + first definition sentence.
- same_as edges for later merges; soft‑merge aliases.
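The scoring rule can be written out directly. The sketch below assumes the five feature values are already computed and scaled to [0, 1]; the `Features` dataclass and function names are hypothetical, while the weights and the 0.72 threshold are the ones given above.

```python
import re
from dataclasses import dataclass

TAU2 = 0.72  # link threshold from the text


@dataclass
class Features:
    cosine: float           # cosine(embed(surface + context), E.embedding)
    string_sim: float       # similarity of surface to E.name/aliases
    type_prior: float
    context_overlap: float  # headings/tags overlap
    popularity: float


def norm(surface: str) -> str:
    """Blocking key: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", surface.lower())


def link_score(f: Features) -> float:
    return (0.45 * f.cosine + 0.25 * f.string_sim + 0.15 * f.type_prior
            + 0.10 * f.context_overlap + 0.05 * f.popularity)


def decide(f: Features) -> str:
    return "link" if link_score(f) >= TAU2 else "new_entity"


strong = Features(cosine=0.9, string_sim=1.0, type_prior=1.0, context_overlap=0.5, popularity=0.2)
weak = Features(cosine=0.5, string_sim=0.4, type_prior=1.0, context_overlap=0.0, popularity=0.0)
decisions = (decide(strong), decide(weak))
```

Because the weights sum to 1.0, the score stays in [0, 1] whenever the features do, which keeps τ₂ interpretable across projects.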
8) Relation Extraction (pattern‑first)
- SVO patterns: “X uses Y / depends on Y / part of Y / integrates with Y / configured via Y / owned by Y”.
  - Implement via dependency parse or regex templates over sentences.
- Structural:
  - Section Dependencies/Requirements → depends_on
  - Numbered lists → precedes between sequential steps
  - “See also/References” → cites/related_to
- Heuristic: chunk title “Getting Started with Foo SDK” + mention “Bar Cloud” → FooSDK -uses-> BarCloud.
- Confidence: pattern weight × proximity × link score.
Relation vocabulary (keep small & consistent):
defines, refers_to, part_of, uses, depends_on, precedes, owned_by, located_in, cites, same_as.
9) Embeddings (EmbeddingGemma)
- Dimensionality: default 512‑d (balance quality/size). Allow 768/384/256/128 via MRL.
- Quantization: 8‑bit in sqlite-vec to reduce DB size (~4× smaller).
- What to embed: chunks (content+prelude), entities (name+definition), queries.
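The two size levers above, MRL truncation and 8‑bit quantization, can be sketched in plain Python. This is an illustrative approximation: it assumes MRL-style truncation means keeping the leading components and re-normalizing, and it uses simple symmetric scalar quantization, which is not necessarily the scheme sqlite-vec applies internally.

```python
import math
from typing import List


def truncate_and_normalize(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    head = vec[:dim]
    length = math.sqrt(sum(x * x for x in head))
    return head if length == 0 else [x / length for x in head]


def quantize_int8(unit_vec: List[float]) -> List[int]:
    """Map components in [-1, 1] to integers in [-127, 127]."""
    return [max(-127, min(127, round(x * 127))) for x in unit_vec]


full = [3.0, 4.0, 12.0, 0.5]              # stand-in for a full-size embedding
small = truncate_and_normalize(full, 2)   # stand-in for truncating to 512-d
codes = quantize_int8(small)

# float32 needs 4 bytes per component, int8 needs 1: the 4x reduction quoted above.
bytes_float32 = 512 * 4
bytes_int8 = 512 * 1
```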
10) SQLite Data Model (DDL)
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
PRAGMA foreign_keys=ON;
PRAGMA busy_timeout=3000;
-- Documents
CREATE TABLE IF NOT EXISTS docs (
id INTEGER PRIMARY KEY,
path TEXT UNIQUE,
source TEXT,
mtime_ms INTEGER,
meta JSON
);
CREATE INDEX IF NOT EXISTS idx_docs_path ON docs(path);
-- Chunks
CREATE TABLE IF NOT EXISTS chunks (
id INTEGER PRIMARY KEY,
doc_id INTEGER REFERENCES docs(id) ON DELETE CASCADE,
content TEXT,
meta JSON -- {lang, title, section, breadcrumbs[], tags[], hash, prelude}
);
CREATE INDEX IF NOT EXISTS idx_chunks_doc ON chunks(doc_id);
-- Entities (canonical)
CREATE TABLE IF NOT EXISTS entities (
id INTEGER PRIMARY KEY,
type TEXT,
name TEXT,
norm TEXT,
meta JSON, -- {aliases[], description, popularity, created_at}
status TEXT DEFAULT 'active'
);
CREATE INDEX IF NOT EXISTS idx_entities_norm ON entities(norm);
CREATE INDEX IF NOT EXISTS idx_entities_type_name ON entities(type, name);
-- Mentions (surface spans in chunks)
CREATE TABLE IF NOT EXISTS mentions (
id INTEGER PRIMARY KEY,
entity_id INTEGER NULL REFERENCES entities(id) ON DELETE SET NULL,
doc_id INTEGER REFERENCES docs(id) ON DELETE CASCADE,
chunk_id INTEGER REFERENCES chunks(id) ON DELETE CASCADE,
span_start INTEGER, span_end INTEGER,
surface TEXT,
type TEXT,
meta JSON
);
CREATE INDEX IF NOT EXISTS idx_mentions_doc ON mentions(doc_id);
CREATE INDEX IF NOT EXISTS idx_mentions_entity ON mentions(entity_id);
-- Relations (graph over canonical entities)
CREATE TABLE IF NOT EXISTS relations (
src_id INTEGER REFERENCES entities(id) ON DELETE CASCADE,
dst_id INTEGER REFERENCES entities(id) ON DELETE CASCADE,
rel TEXT,
meta JSON,
PRIMARY KEY (src_id, dst_id, rel)
);
CREATE INDEX IF NOT EXISTS idx_rel_src_rel ON relations(src_id, rel);
CREATE INDEX IF NOT EXISTS idx_rel_dst_rel ON relations(dst_id, rel);
-- Vector indices (sqlite-vec)
CREATE VIRTUAL TABLE IF NOT EXISTS chunk_vec USING vec0(vector float[512]);
CREATE TABLE IF NOT EXISTS chunk_vec_map (
chunk_id INTEGER PRIMARY KEY REFERENCES chunks(id) ON DELETE CASCADE,
rowid INTEGER UNIQUE
);
CREATE VIRTUAL TABLE IF NOT EXISTS entity_vec USING vec0(vector float[512]);
CREATE TABLE IF NOT EXISTS entity_vec_map (
entity_id INTEGER PRIMARY KEY REFERENCES entities(id) ON DELETE CASCADE,
rowid INTEGER UNIQUE
);
-- BFS view for bfsvtab
CREATE VIEW IF NOT EXISTS graph_edges AS
SELECT src_id AS src, dst_id AS dst FROM relations;
bfsvtab usage: build virtual table at runtime, e.g. CREATE VIRTUAL TABLE bfs USING bfsvtab(graph_edges); then query SELECT * FROM bfs WHERE start = ? AND max_depth = 2;
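For environments where the bfsvtab extension is not loaded (tests, for example), the same k‑hop expansion over graph_edges can be approximated with a recursive CTE in stock SQLite. This is a stand-in technique, not the bfsvtab API; the sample rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE relations (
  src_id INTEGER, dst_id INTEGER, rel TEXT,
  PRIMARY KEY (src_id, dst_id, rel)
);
CREATE VIEW graph_edges AS SELECT src_id AS src, dst_id AS dst FROM relations;
INSERT INTO relations VALUES
  (1, 2, 'uses'), (2, 3, 'depends_on'), (3, 4, 'cites'), (1, 5, 'part_of');
""")

# Breadth-limited reachability: every node within max_depth hops of the start,
# reported with the smallest depth at which it was reached.
K_HOP = """
WITH RECURSIVE reach(node, depth) AS (
    SELECT :start, 0
    UNION
    SELECT e.dst, r.depth + 1
    FROM reach r JOIN graph_edges e ON e.src = r.node
    WHERE r.depth < :max_depth
)
SELECT node, MIN(depth) AS depth FROM reach GROUP BY node ORDER BY depth, node;
"""

rows = con.execute(K_HOP, {"start": 1, "max_depth": 2}).fetchall()
```

The depth bound guarantees termination on cyclic graphs, and UNION (rather than UNION ALL) drops repeated (node, depth) pairs.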
11) MCP Tooling (API)
Design for small, predictable JSON I/O. Tool names & example schemas:
11.1 ingest_docs
Args:
{
"paths": ["/knowledge/**/*.md"],
"tags": ["docs", "kb"],
"skip_if_seen": true
}
Returns: { "ingested": 123, "skipped": 45, "errors": [] }
11.2 extract_and_link
Runs extraction/linking for specified docs (or pending queue).
{ "doc_ids": [1,2,3] }
Returns: { "mentions": 420, "entities_new": 18, "relations": 95 }
11.3 entity_lookup
{ "q": "ISO 27001", "type": "Regulation" }
Returns: { "entities": [{"id": 7, "name": "ISO 27001", "type":"Regulation", "aliases": ["ISO27001"], "score": 0.93}] }
11.4 hybrid_query
{
"q": "data retention policy for S3 lifecycle",
"k": 40,
"hops": 2,
"rels": ["defines", "depends_on", "uses", "cites"]
}
Returns:
{
"chunks": [{"id": 101, "doc_id": 9, "snippet": "...", "path": "/knowledge/policies/..."}],
"entities": [{"id": 33, "name": "S3 Lifecycle", "type": "API"}],
"edges": [{"src": 33, "dst": 12, "rel": "depends_on"}],
"explanations": ["Selected by semantic match + 1-hop depends_on"]
}
11.5 explain_entity
{ "entity_id": 33, "hops": 2 }
Returns: definition, aliases, top relations, key sources with confidence.
11.6 status
{}
Returns: queue depth, last file, counts per table, last error.
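To keep tool inputs small and predictable, each handler can validate its arguments before touching the database. The sketch below does this for hybrid_query (11.4). The dataclass, the function name, and the choice to reject relations outside the section 8 vocabulary are assumptions; the defaults mirror the example arguments.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

RELATION_VOCAB = frozenset({
    "defines", "refers_to", "part_of", "uses", "depends_on",
    "precedes", "owned_by", "located_in", "cites", "same_as",
})
DEFAULT_RELS = ("defines", "depends_on", "uses", "cites")


@dataclass(frozen=True)
class HybridQueryArgs:
    q: str
    k: int = 40
    hops: int = 2
    rels: Tuple[str, ...] = DEFAULT_RELS


def parse_hybrid_query(payload: Dict[str, Any]) -> HybridQueryArgs:
    q = payload.get("q")
    if not isinstance(q, str) or not q.strip():
        raise ValueError("q must be a non-empty string")
    k = int(payload.get("k", 40))
    hops = int(payload.get("hops", 2))
    if k < 1 or hops < 0:
        raise ValueError("k must be >= 1 and hops must be >= 0")
    rels = tuple(payload.get("rels", DEFAULT_RELS))
    unknown = sorted(set(rels) - RELATION_VOCAB)
    if unknown:
        raise ValueError(f"unknown relations: {unknown}")
    return HybridQueryArgs(q=q.strip(), k=k, hops=hops, rels=rels)


args = parse_hybrid_query({"q": " data retention policy for S3 lifecycle ", "hops": 1})
```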
12) Implementation Snippets
12.1 Watcher & Writer Queue (Python)
import threading
import time
from queue import Queue

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

from db import Writer  # project-local module; see 12.2

jobs = Queue(maxsize=1000)

def enqueue(path, kind):
    jobs.put({"path": path, "kind": kind, "ts": time.time()})

class Handler(FileSystemEventHandler):
    def on_modified(self, e):
        if not e.is_directory:
            enqueue(e.src_path, "modify")

    def on_created(self, e):
        if not e.is_directory:
            enqueue(e.src_path, "create")

    def on_deleted(self, e):
        if not e.is_directory:
            enqueue(e.src_path, "delete")

    def on_moved(self, e):
        # rename -> unlink + add (see section 3)
        if not e.is_directory:
            enqueue(e.src_path, "delete")
            enqueue(e.dest_path, "create")

observer = Observer()
observer.schedule(Handler(), path="/knowledge", recursive=True)
observer.start()

# Single writer thread
writer = Writer(db_path="/data/graphrag.sqlite")

def worker():
    while True:
        job = jobs.get()
        try:
            writer.process(job)  # handles debounce, hashing, parse->extract->link->embed->commit
        except Exception as ex:
            print("error", ex)
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
12.2 SQLite Helper (WAL, single writer)
import sqlite3

def open_db(path):
    # isolation_level=None: autocommit mode, so transactions are managed explicitly
    con = sqlite3.connect(path, isolation_level=None, check_same_thread=False)
    con.execute("PRAGMA journal_mode=WAL;")
    con.execute("PRAGMA synchronous=NORMAL;")
    con.execute("PRAGMA foreign_keys=ON;")
    con.execute("PRAGMA busy_timeout=3000;")
    return con

class Writer:
    def __init__(self, db_path):
        self.con = open_db(db_path)

    def begin(self):
        self.con.execute("BEGIN IMMEDIATE;")

    def commit(self):
        self.con.execute("COMMIT;")

    def process(self, job):
        path = job['path']
        # 1) stat + hash; 2) if unchanged -> return
        # 3) parse/normalize -> chunks
        # 4) extract mentions -> link entities -> relations
        # 5) embed chunks/entities (outside tx), then write inside tx
        self.begin()
        try:
            # upsert docs/chunks/entities/mentions/relations and vec maps
            # delete stale, insert new
            self.commit()
        except BaseException:
            self.con.execute("ROLLBACK;")
            raise
12.3 Embeddings (EmbeddingGemma wrapper)
# Pseudocode; implement with HF transformers or a local runtime
from embeddings import embed_many # returns List[ndarray]
chunk_vecs = embed_many([chunk_text1, chunk_text2], model="embedding-gemma-512")
# insert into sqlite-vec: INSERT INTO chunk_vec(rowid, vector) VALUES (?, ?)
12.4 sqlite-vec inserts
-- After creating chunk_vec(vec0), map chunk_id -> rowid
INSERT INTO chunk_vec(rowid, vector) VALUES (?, ?);
INSERT OR REPLACE INTO chunk_vec_map(chunk_id, rowid) VALUES (?, last_insert_rowid());
12.5 Vector Search + BFS expansion (hybrid)
-- 1) vector prefilter (sqlite-vec KNN form; check the sqlite-vec docs for your version)
SELECT m.chunk_id, v.distance
FROM chunk_vec v
JOIN chunk_vec_map m ON m.rowid = v.rowid
WHERE v.vector MATCH ? AND k = 50
ORDER BY v.distance;
-- 2) graph expansion (bfsvtab)
CREATE VIRTUAL TABLE IF NOT EXISTS bfs USING bfsvtab(graph_edges);
SELECT * FROM bfs WHERE start = :entity_id AND max_depth = 2;
12.6 Simple Relation Patterns (regex example)
import re
USES = re.compile(r"\b(uses|integrates with|built on|powered by)\b", re.I)
DEPENDS = re.compile(r"\b(depends on|requires|needs)\b", re.I)
PARTOF = re.compile(r"\b(part of|component of|belongs to)\b", re.I)
# For each sentence, if it contains two linked entities A,B:
# if USES.search(sent): add edge A -uses-> B with confidence
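The comment above can be fleshed out as a small sentence-level extractor that also applies the confidence rule from section 8 (pattern weight × proximity × link score). The pattern weights, the proximity definition (inverse of the token gap between the two mentions), and the use of the lower of the two link scores are assumptions made for this sketch.

```python
import re
from typing import List, NamedTuple, Tuple

PATTERNS = [
    ("uses", re.compile(r"\b(uses|integrates with|built on|powered by)\b", re.I), 0.9),
    ("depends_on", re.compile(r"\b(depends on|requires|needs)\b", re.I), 0.8),
    ("part_of", re.compile(r"\b(part of|component of|belongs to)\b", re.I), 0.8),
]


class Mention(NamedTuple):
    entity: str
    start: int
    end: int
    link_score: float


def mention(sent: str, surface: str, link_score: float) -> Mention:
    start = sent.index(surface)
    return Mention(surface, start, start + len(surface), link_score)


def extract_edges(sent: str, mentions: List[Mention]) -> List[Tuple[str, str, str, float]]:
    """Emit (src, rel, dst, confidence) for adjacent mentions joined by a known pattern."""
    edges = []
    ordered = sorted(mentions, key=lambda m: m.start)
    for a, b in zip(ordered, ordered[1:]):
        between = sent[a.end:b.start]
        for rel, pattern, weight in PATTERNS:
            if pattern.search(between):
                proximity = 1.0 / max(1, len(between.split()))
                confidence = weight * proximity * min(a.link_score, b.link_score)
                edges.append((a.entity, rel, b.entity, round(confidence, 3)))
    return edges


sent = "Foo SDK depends on Bar Cloud for storage."
edges = extract_edges(sent, [mention(sent, "Foo SDK", 0.9), mention(sent, "Bar Cloud", 0.8)])
```

Restricting the pattern search to the text between the two mentions keeps direction unambiguous: the earlier mention is always the source.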
13) Configuration (YAML)
project: "knowledge-graphrag"
watch:
  dir: "/knowledge"
  debounce_ms: 300
extract:
  ner: "spacy:en_core_web_sm"
  gazetteer:
    mine_topk: 2000
    min_freq: 3
  regex:
    # single quotes: backslashes are not valid escapes in YAML double-quoted strings
    version: 'v?\d+\.\d+(\.\d+)?(-[a-z0-9]+)?'
embed:
  model: "embedding-gemma-512"
  dim: 512
  quantize: 8
sqlite:
  path: "/data/graphrag.sqlite"
  wal: true
retrieval:
  k: 40
  hops: 2
  rels: ["defines","depends_on","uses","cites"]
  weights:
    semantic: 0.7
    graph: 0.3
14) Retrieval Scoring
final_score = 0.7 * semantic_sim(query, chunk)
+ 0.2 * hop_score(0:1.0, 1:0.7, 2:0.4)
+ 0.1 * rel_weight(uses:0.9, depends_on:0.8, defines:1.0, cites:0.5)
Return ranked context with provenance (doc path, section, snippet) and confidence.
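The formula translates into a few lines of code. In this sketch the lookup tables come straight from the formula; treating unknown hop counts or relations as contributing 0 is an assumption, as are the sample candidates.

```python
from typing import Optional

HOP_SCORE = {0: 1.0, 1: 0.7, 2: 0.4}
REL_WEIGHT = {"uses": 0.9, "depends_on": 0.8, "defines": 1.0, "cites": 0.5}


def final_score(semantic_sim: float, hops: int, rel: Optional[str]) -> float:
    return (0.7 * semantic_sim
            + 0.2 * HOP_SCORE.get(hops, 0.0)
            + 0.1 * REL_WEIGHT.get(rel, 0.0))


# (chunk_id, semantic similarity, hops from a seed entity, relation followed)
candidates = [
    (102, 0.85, 2, "cites"),
    (103, 0.60, 0, "defines"),
    (101, 0.80, 1, "depends_on"),
]
ranked = sorted(candidates, key=lambda c: final_score(c[1], c[2], c[3]), reverse=True)
ranked_ids = [c[0] for c in ranked]
```

Note that chunk 102 has the highest semantic similarity but ranks second: the two-hop distance and the weak cites relation pull it below chunk 101.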
15) Status & Monitoring
- status() tool returns: queue depth, last processed file, table counts, last error
- Periodic metrics: ingest rate, avg tx duration, vector count, orphaned entities
- Optional review_links() tool to surface low‑confidence links/edges
16) Security & Privacy
- Default deny network fetches (ingest only local files); if the agent pulls web docs, store provenance URL in docs.meta.
- Redact secrets via regex before storage; maintain an allowlist of paths.
- Namespaces: add project_id column to all tables if running multi‑tenant in one DB.
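A minimal sketch of the redaction and allowlist checks follows. The two secret patterns (AWS-style access key IDs and key/value assignments such as api_key = ...) are examples only, not a complete rule set, and the function names are hypothetical. The path check does not resolve symlinks or `..` segments, which real code should do first.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable

ACCESS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")
ASSIGNMENT = re.compile(r"(?i)\b(password|api[_-]?key|token)\b(\s*[:=]\s*)(\S+)")


def redact(text: str) -> str:
    """Replace matched secrets with a placeholder, keeping the key name for context."""
    text = ACCESS_KEY.sub("[REDACTED]", text)
    return ASSIGNMENT.sub(lambda m: m.group(1) + m.group(2) + "[REDACTED]", text)


def is_allowed(path: str, roots: Iterable[str]) -> bool:
    """True if the path sits under one of the allowlisted roots."""
    candidate = PurePosixPath(path)
    return any(candidate.is_relative_to(root) for root in roots)


clean = redact("api_key = abc123 and id AKIAABCDEFGHIJKLMNOP")
```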
17) Testing & Quality
- Unit tests: parser adapters, regex patterns, linker scoring
- Golden set: 20–50 pages hand‑annotated for quick F1 checks on entities/relations
- Smoke: end‑to‑end ingest of a small sample; deterministic hashes ensure idempotency
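The golden-set check reduces to precision, recall, and F1 over sets of annotations. The sketch below assumes each entity annotation is a (doc, span start, span end, type) tuple and that only exact matches count; both are simplifying assumptions, and the sample data is invented.

```python
from typing import Set, Tuple

Annotation = Tuple[str, int, int, str]  # (doc, span_start, span_end, type)


def prf1(gold: Set[Annotation], predicted: Set[Annotation]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {
    ("a.md", 0, 9, "PRODUCT"), ("a.md", 22, 33, "PRODUCT"),
    ("b.md", 5, 14, "ORG"), ("b.md", 40, 45, "VERSION"),
}
predicted = {
    ("a.md", 0, 9, "PRODUCT"), ("b.md", 5, 14, "ORG"),
    ("b.md", 40, 45, "DATE"),  # right span, wrong type: counts as an error
}
precision, recall, f1 = prf1(gold, predicted)
```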
18) Performance Notes
- Compute embeddings outside the transaction; write vectors + maps in one short tx
- Indexes: entities(norm), entities(type,name), relations(src_id,rel), relations(dst_id,rel)
- Run PRAGMA incremental_vacuum occasionally if churn is high (this only has an effect when the database uses PRAGMA auto_vacuum=INCREMENTAL)
- Quantize embeddings to 8‑bit for 4× space reduction
19) Migration Path (to Memgraph)
- Keep the same MCP tool contract and data model (entities/relations)
- Export entities/relations to CSV; import into Memgraph; redirect graph ops to Cypher
- Keep sqlite-vec for vectors or switch to an external vector DB
20) Minimal MCP Server Skeleton (TypeScript, pseudo)
import { createServer, Tool } from "@anthropic-ai/mcp"; // conceptually
import { hybridQuery, ingestDocs, extractAndLink, entityLookup, explainEntity, getStatus } from "./handlers";
const tools: Tool[] = [
{ name: "ingest_docs", schema: {/*...*/}, handler: ingestDocs },
{ name: "extract_and_link", schema: {/*...*/}, handler: extractAndLink },
{ name: "entity_lookup", schema: {/*...*/}, handler: entityLookup },
{ name: "hybrid_query", schema: {/*...*/}, handler: hybridQuery },
{ name: "explain_entity", schema: {/*...*/}, handler: explainEntity },
{ name: "status", schema: {/*...*/}, handler: getStatus },
];
createServer({ tools, port: process.env.PORT || 8765 });
21) Deliverables Checklist
- SQLite schema & migrations
- Watcher + single-writer queue
- Normalizers (md/html/pdf/docx)
- Chunker with preludes
- NER + gazetteer miner + regex patterns
- Linker + entity vectors
- Relation extractor + confidence
- Embedder (EmbeddingGemma) + sqlite-vec glue
- BFS (bfsvtab) setup + hybrid retrieval
- MCP tools + JSON schemas
- Config YAML + CLI flags
- Tests + sample corpus + smoke script
Notes
- Keep the tool surfaces tiny and stable; resist feature creep.
- Prefer correctness & debuggability (provenance everywhere) over recall in v1.
- Add optional, low‑frequency LLM validation passes only when specific relations routinely misfire.
File details
Details for the file knowledge_graph_rag_mcp-0.1.1.tar.gz.
File metadata
- Download URL: knowledge_graph_rag_mcp-0.1.1.tar.gz
- Upload date:
- Size: 193.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89677caefd8b8f65c650362efc651f9ef9750bf87d84e29e389f2d237ad38e42 |
| MD5 | 5a4f37aaabd515fb550a0e135e6987ef |
| BLAKE2b-256 | 26934c023538b03ae921d514514b78970aa886a845c4710ff88d52ae66863e33 |
File details
Details for the file knowledge_graph_rag_mcp-0.1.1-py3-none-any.whl.
File metadata
- Download URL: knowledge_graph_rag_mcp-0.1.1-py3-none-any.whl
- Upload date:
- Size: 53.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 36c121f48734a3829a64dcd1c584749df15889ca1c54ffc2087275e4528b7a0c |
| MD5 | 74b1a835ce8ee5c4e76d4bd9b9b108d0 |
| BLAKE2b-256 | 88b0c5fea995a4106e1ecb098a11a5e9d7e657fa9ea6b8ca9b96b5e6d2f444a5 |