A tool to build a semantically searchable knowledge graph from markdown and text documents
DocKG — A Hybrid Knowledge Graph for Document Corpora with Semantic Indexing and Source-Grounded Passage Packing
Author: Eric G. Suchanek, PhD
Flux-Frontiers, Liberty TWP, OH
Overview
DocKG constructs a deterministic, explainable knowledge graph from a corpus of Markdown, plain-text, and PDF documents. It semantically chunks text, discovers structural and semantic relationships between sections and chunks, stores them in SQLite, and augments retrieval with vector embeddings via LanceDB.
Structure is treated as ground truth; semantic search is strictly an acceleration layer. The result is a searchable, auditable representation of a document corpus — an ideal retrieval engine for LLMs and a practical foundation for Knowledge-Graph RAG (KRAG).
DocKG uses the same architecture as CodeKG but targets natural-language documents rather than Python source code.
Features
- Multi-format ingestion: .md, .txt, .rst, and .pdf (native, no inference)
- Semantic chunking: Heading-structure and paragraph-aware segmentation
- Deterministic knowledge graph: SQLite-backed canonical store with typed nodes and provenance-tracked edges
- Relation extraction: Topics, named entities, and keywords per chunk; co-occurrence and similarity edges built automatically
- Hybrid query model: Semantic seeding (LanceDB embeddings) + structural expansion (graph traversal)
- Passage packing: Context-rich text passages grounded to source documents with headings
- Corpus health analysis: Per-document metrics, hot chunks, orphan detection, coverage report
- Temporal snapshots: Save and diff graph metrics over time
- MCP server: Four tools for AI agent integration (graph_stats, query_docs, pack_docs, get_node)
- Streamlit web app: Interactive graph browser, hybrid query UI, and passage pack explorer
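The hybrid query model listed above (semantic seeding followed by structural expansion) can be illustrated with a small, self-contained sketch. This is not DocKG's implementation: the toy vectors in `EMBEDDINGS`, the adjacency map `EDGES`, and the function `hybrid_query` are hypothetical stand-ins for a vector index lookup and a graph traversal. The sketch picks the top-k chunks by cosine similarity, then walks the edges breadth-first for up to `hop` steps.

```python
import math
from collections import deque

# Toy chunk embeddings (hypothetical; a real system would read these from a vector index).
EMBEDDINGS = {
    "chunk:auth": [0.9, 0.1, 0.0],
    "chunk:tokens": [0.7, 0.3, 0.1],
    "chunk:deploy": [0.0, 0.2, 0.9],
    "chunk:logging": [0.1, 0.9, 0.2],
}

# Toy undirected structural edges (hypothetical stand-in for edges stored in a graph).
EDGES = {
    "chunk:auth": ["chunk:tokens"],
    "chunk:tokens": ["chunk:auth", "chunk:logging"],
    "chunk:logging": ["chunk:tokens"],
    "chunk:deploy": [],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_query(query_vec, k=1, hop=1):
    """Return (seeds, distances): top-k seeds plus every node within `hop` edges of a seed."""
    # Semantic seeding: rank all chunks by similarity to the query vector.
    ranked = sorted(EMBEDDINGS, key=lambda n: cosine(query_vec, EMBEDDINGS[n]), reverse=True)
    seeds = ranked[:k]

    # Structural expansion: breadth-first search, recording each node's hop distance.
    distances = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distances[node] == hop:
            continue  # do not expand beyond the hop limit
        for neighbor in EDGES.get(node, []):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return seeds, distances


seeds, distances = hybrid_query([1.0, 0.0, 0.0], k=1, hop=1)
print(seeds, distances)
```

Keeping the two phases separate means the expansion step only ever follows edges that exist in the graph, so the vector search decides where to start but never adds relationships of its own.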
Quick Start
# Index a document corpus (SQLite + LanceDB in one step)
dockg build docs/
# Natural-language query — returns ranked document chunks
dockg query "authentication flow"
# Source-grounded passage pack — paste straight into an LLM prompt
dockg pack "configuration reference" --format md --out context.md
Installation
Requirements: Python ≥ 3.12, < 3.14
# pip
pip install doc-kg
# With Streamlit web visualizer
pip install 'doc-kg[viz]'
# Poetry
poetry add doc-kg
For advanced deployment options (Streamlit Cloud, Fly.io, offline model cache, git hooks) see docs/deployment.md.
Usage
Build and query
dockg build docs/ # full pipeline
dockg build docs/ --update # incremental (keep existing)
dockg build docs/ --exclude-dir archive # skip directories
dockg query "deployment configuration" # hybrid search
dockg pack "error handling" --format md --out ctx.md # passage pack
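To make the idea of a source-grounded passage pack concrete, here is a minimal sketch of how ranked chunks might be rendered as Markdown with their source path and heading attached. The `Chunk` dataclass, the `pack_passages` function, and the `max_chars` budget are all hypothetical names chosen for illustration; the actual output format of `dockg pack` may differ.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str   # path of the originating document
    heading: str  # heading of the enclosing section
    text: str     # passage text
    score: float  # retrieval score, higher is better


def pack_passages(chunks, max_chars=2000):
    """Render the best-scoring chunks as Markdown, each labeled with its source."""
    parts = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        block = f"### {chunk.heading}\nSource: {chunk.source}\n\n{chunk.text}\n"
        if used + len(block) > max_chars:
            break  # stop at the first passage that would exceed the budget
        parts.append(block)
        used += len(block)
    return "\n".join(parts)


chunks = [
    Chunk("docs/errors.md", "Retries", "Failed requests are retried three times.", 0.61),
    Chunk("docs/auth.md", "Login", "Users sign in with a token.", 0.87),
]
print(pack_passages(chunks))
```

Because every passage carries its own source line, a reader or an LLM can trace each statement in the pack back to the document it came from.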
Analyze corpus health
dockg analyze docs/ # full report + JSON snapshot
dockg analyze docs/ --quiet # CI mode — exits 1 on issues
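One of the checks a corpus health report might run is orphan detection: finding nodes that no edge touches. The sketch below shows that idea against an in-memory SQLite database. The `nodes` and `edges` tables and their columns are assumptions made for this example, not DocKG's actual schema.

```python
import sqlite3


def find_orphans(conn):
    """Return the IDs of nodes that appear in no edge, as either source or target."""
    rows = conn.execute(
        """
        SELECT n.id
        FROM nodes AS n
        WHERE NOT EXISTS (
            SELECT 1 FROM edges AS e WHERE e.src = n.id OR e.dst = n.id
        )
        ORDER BY n.id
        """
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema and toy data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE edges (src TEXT NOT NULL, dst TEXT NOT NULL, rel TEXT NOT NULL);
    INSERT INTO nodes VALUES
        ('doc:a', 'document'), ('chunk:a1', 'chunk'), ('chunk:stray', 'chunk');
    INSERT INTO edges VALUES ('doc:a', 'chunk:a1', 'CONTAINS');
    """
)
print(find_orphans(conn))
```

A CI wrapper could turn a non-empty result into a non-zero exit code, which is the behavior the `--quiet` comment above describes.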
Snapshots
dockg snapshot save 0.12.0 # capture current metrics
dockg snapshot diff 0.11.0 0.12.0 # compare two versions
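Comparing two snapshots amounts to a key-by-key difference over their metrics. The sketch below assumes each snapshot is a flat JSON object of numeric metrics. The metric names (`documents`, `chunks`, `orphans`) and the `diff_snapshots` function are hypothetical, and the real snapshot files may be structured differently.

```python
import json

# Two toy snapshots, as they might be loaded from JSON files (hypothetical contents).
OLD = json.loads('{"documents": 12, "chunks": 340, "orphans": 5}')
NEW = json.loads('{"documents": 12, "chunks": 360, "orphans": 3}')


def diff_snapshots(old, new):
    """Return only the metrics that changed, with before, after, and delta values."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key, 0), new.get(key, 0)  # a missing metric counts as 0
        if before != after:
            changes[key] = {"before": before, "after": after, "delta": after - before}
    return changes


print(json.dumps(diff_snapshots(OLD, NEW), indent=2))
```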
Full flag reference for every command: docs/CLI.md
Query patterns and MCP examples: docs/CHEATSHEET.md
MCP Integration
Start the MCP server, then wire it into your AI agent:
dockg mcp --repo docs/
Claude Code / Kilo Code — add to .mcp.json:
{
  "mcpServers": {
    "dockg": { "command": "dockg-mcp", "args": ["--repo", "."] }
  }
}
GitHub Copilot — add to .vscode/mcp.json:
{
  "servers": {
    "dockg": { "type": "stdio", "command": "dockg-mcp", "args": ["--repo", "."] }
  }
}
| Tool | Description |
|---|---|
| graph_stats() | Node and edge counts by kind |
| query_docs(q, k, hop) | Hybrid semantic + structural search |
| pack_docs(q, k, hop) | Source-grounded passages as Markdown |
| get_node(node_id) | Fetch a single node by ID |
Full provider setup (Claude Desktop, Cline, SSE transport): docs/MCP.md
Python API
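from doc_kg import DocKG

# Index the corpus under docs/
kg = DocKG(corpus_root="docs/")
kg.build(wipe=True)

# Hybrid search with k=8 and one hop of expansion
result = kg.query("deployment configuration", k=8, hop=1)
for node in result.nodes:
    print(node["id"], node["name"])

# Write a source-grounded passage pack to a file
pack = kg.pack("authentication flow")
pack.save("context.md")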
Knowledge Graph Schema
Node kinds
| Kind | Description |
|---|---|
| document | A source .md, .txt, or .pdf file |
| section | A heading-delimited region within a document |
| chunk | A semantically coherent text passage |
| topic | A topic extracted from chunk text |
| entity | A named entity (person, place, org, concept) |
| keyword | A keyword or key phrase |
Edge types
| Type | Description |
|---|---|
| CONTAINS | Parent → child (document→section, section→chunk) |
| NEXT | Sequential ordering between same-level nodes |
| REFERENCES | Chunk references another document or section |
| SIMILAR_TO | Semantic similarity between chunks (LanceDB-derived) |
| HAS_TOPIC | Chunk → topic |
| MENTIONS_ENTITY | Chunk → named entity |
| HAS_KEYWORD | Chunk → keyword |
| CO_OCCURS_WITH | Co-occurrence between topics/entities within a chunk |
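As an illustration of how CO_OCCURS_WITH edges could be derived from the per-chunk topic and entity links, the sketch below pairs up every two distinct terms attached to the same chunk and counts how often each pair appears. The `CHUNK_TERMS` data, the `co_occurrence_edges` function, and the use of a count as the edge weight are assumptions for this example; the source does not specify how DocKG weights or stores these edges.

```python
from collections import Counter
from itertools import combinations

# chunk ID -> topics and entities linked to it (toy data, hypothetical IDs)
CHUNK_TERMS = {
    "chunk:1": ["topic:auth", "entity:OAuth", "topic:tokens"],
    "chunk:2": ["topic:auth", "topic:tokens"],
    "chunk:3": ["topic:deploy"],
}


def co_occurrence_edges(chunk_terms):
    """Build (term_a, 'CO_OCCURS_WITH', term_b, count) tuples from shared chunks."""
    counts = Counter()
    for terms in chunk_terms.values():
        # Sort and de-duplicate so each unordered pair has one canonical form.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return [(a, "CO_OCCURS_WITH", b, n) for (a, b), n in sorted(counts.items())]


for edge in co_occurrence_edges(CHUNK_TERMS):
    print(edge)
```

Sorting the terms before pairing them keeps the result deterministic: the same corpus always yields the same edge list in the same order.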
Storage Layout
.dockg/
  graph.sqlite       # SQLite knowledge graph (nodes + edges)
  lancedb/           # LanceDB vector index
  snapshots/         # Temporal metric snapshots (JSON)
    manifest.json
    <commit>.json
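Since the canonical store is an ordinary SQLite file, it can be inspected with Python's standard library. The sketch below lists the tables in a database opened read-only. It makes no assumption about DocKG's table names; the `list_tables` helper is a hypothetical name, and the path is taken from the layout above.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro prevents SQLite from creating or modifying the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    db = Path(".dockg") / "graph.sqlite"
    if db.exists():
        print(list_tables(db))
    else:
        print(f"{db} not found; build an index first")
```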
Citation
If you use DocKG in research or a project, please cite it:
APA
Suchanek, E. G. (2026). DocKG: Hybrid Knowledge Graph for Document Corpora (Version 0.12.1) [Software]. Flux-Frontiers. https://doi.org/10.5281/zenodo.19770973
BibTeX
@software{suchanek_doc_kg,
author = {Suchanek, Eric G.},
title = {{DocKG}: Hybrid Knowledge Graph for Document Corpora},
version = {0.12.1},
year = {2026},
publisher = {Flux-Frontiers},
url = {https://github.com/Flux-Frontiers/doc_kg},
doi = {10.5281/zenodo.19770973},
}
License
Elastic License 2.0 — free for non-commercial and internal use; commercial redistribution requires a license from Flux-Frontiers.