
A tool to build a semantically searchable knowledge graph from markdown and text documents


DocKG — A Hybrid Knowledge Graph for Document Corpora with Semantic Indexing and Source-Grounded Passage Packing

Author: Eric G. Suchanek, PhD, Flux-Frontiers, Liberty TWP, OH


Overview

DocKG constructs a deterministic, explainable knowledge graph from a corpus of Markdown, reStructuredText, plain-text, and PDF documents. It semantically chunks text, discovers structural and semantic relationships between sections and chunks, stores them in SQLite, and augments retrieval with vector embeddings via LanceDB.

Structure is treated as ground truth; semantic search is strictly an acceleration layer. The result is a searchable, auditable representation of a document corpus — an ideal retrieval engine for LLMs and a practical foundation for Knowledge-Graph RAG (KRAG).

DocKG uses the same architecture as CodeKG but targets natural-language documents rather than Python source code.
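The hybrid query model described above (semantic seeding, then structural expansion) can be sketched in miniature. This is an illustrative toy, not DocKG's implementation: the embeddings, edge list, and chunk IDs are all invented for the example, and the real system stores vectors in LanceDB and edges in SQLite.

```python
from math import sqrt

# Toy corpus: chunk id -> embedding vector (in DocKG these live in LanceDB).
embeddings = {
    "chunk:auth-1": [0.9, 0.1],
    "chunk:auth-2": [0.8, 0.2],
    "chunk:deploy-1": [0.1, 0.9],
}

# Toy graph edges (in DocKG these live in SQLite, with typed edges).
edges = [
    ("chunk:auth-1", "chunk:auth-2"),    # e.g. NEXT
    ("chunk:auth-1", "chunk:deploy-1"),  # e.g. REFERENCES
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def hybrid_query(query_vec, k=1, hop=1):
    # 1. Semantic seeding: top-k nearest chunks by cosine similarity.
    seeds = sorted(embeddings, key=lambda c: cosine(query_vec, embeddings[c]),
                   reverse=True)[:k]
    # 2. Structural expansion: follow graph edges `hop` steps out from the seeds.
    frontier, result = set(seeds), set(seeds)
    for _ in range(hop):
        frontier = ({d for s, d in edges if s in frontier}
                    | {s for s, d in edges if d in frontier})
        result |= frontier
    return result

print(sorted(hybrid_query([1.0, 0.0], k=1, hop=1)))
```

Note how the seed set stays small and deterministic expansion does the rest; this is the sense in which structure is ground truth and semantic search is only an acceleration layer.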


Features

  • Multi-format ingestion — .md, .txt, .rst, and .pdf (native — no inference)
  • Semantic chunking — Heading-structure and paragraph-aware segmentation
  • Deterministic knowledge graph — SQLite-backed canonical store with typed nodes and provenance-tracked edges
  • Relation extraction — Topics, named entities, and keywords per chunk; co-occurrence and similarity edges built automatically
  • Hybrid query model — Semantic seeding (LanceDB embeddings) + structural expansion (graph traversal)
  • Passage packing — Context-rich text passages grounded to source documents with headings
  • Corpus health analysis — Per-document metrics, hot chunks, orphan detection, coverage report
  • Temporal snapshots — Save and diff graph metrics over time
  • MCP server — Four tools for AI agent integration (graph_stats, query_docs, pack_docs, get_node)
  • Streamlit web app — Interactive graph browser, hybrid query UI, and passage pack explorer

Quick Start

# Index a document corpus (SQLite + LanceDB in one step)
dockg build docs/

# Natural-language query — returns ranked document chunks
dockg query "authentication flow"

# Source-grounded passage pack — paste straight into an LLM prompt
dockg pack "configuration reference" --format md --out context.md

Installation

Requirements: Python ≥ 3.12, < 3.14

# pip
pip install doc-kg

# With Streamlit web visualizer
pip install 'doc-kg[viz]'

# Poetry
poetry add doc-kg

For advanced deployment options (Streamlit Cloud, Fly.io, offline model cache, git hooks) see docs/deployment.md.


Usage

Build and query

dockg build docs/                                    # full pipeline
dockg build docs/ --update                           # incremental (keep existing)
dockg build docs/ --exclude-dir archive              # skip directories
dockg query "deployment configuration"               # hybrid search
dockg pack "error handling" --format md --out ctx.md # passage pack

Analyze corpus health

dockg analyze docs/             # full report + JSON snapshot
dockg analyze docs/ --quiet     # CI mode — exits 1 on issues

Snapshots

dockg snapshot save 0.12.0      # capture current metrics
dockg snapshot diff 0.11.0 0.12.0  # compare two versions

Full flag reference for every command: docs/CLI.md. Query patterns and MCP examples: docs/CHEATSHEET.md.


MCP Integration

Start the MCP server, then wire it into your AI agent:

dockg mcp --repo docs/

Claude Code / Kilo Code — add to .mcp.json:

{
  "mcpServers": {
    "dockg": { "command": "dockg-mcp", "args": ["--repo", "."] }
  }
}

GitHub Copilot — add to .vscode/mcp.json:

{
  "servers": {
    "dockg": { "type": "stdio", "command": "dockg-mcp", "args": ["--repo", "."] }
  }
}

| Tool | Description |
|------|-------------|
| graph_stats() | Node and edge counts by kind |
| query_docs(q, k, hop) | Hybrid semantic + structural search |
| pack_docs(q, k, hop) | Source-grounded passages as Markdown |
| get_node(node_id) | Fetch a single node by ID |
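An MCP client invokes these tools with JSON-RPC 2.0 `tools/call` requests over the server's stdio transport. The sketch below builds such a request for query_docs; the argument names mirror the tool table above, but treat the exact shape as illustrative rather than a wire-format guarantee.

```python
import json

def make_tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request as an MCP client would send it."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Ask the dockg server for a hybrid search over the corpus.
msg = make_tool_call("query_docs", {"q": "authentication flow", "k": 8, "hop": 1})
print(msg)
```

In practice your agent framework (Claude Code, Copilot, etc.) constructs these messages for you from the `.mcp.json` entry; the sketch is only to show what travels over stdio.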

Full provider setup (Claude Desktop, Cline, SSE transport): docs/MCP.md


Python API

from doc_kg import DocKG

kg = DocKG(corpus_root="docs/")
kg.build(wipe=True)

result = kg.query("deployment configuration", k=8, hop=1)
for node in result.nodes:
    print(node["id"], node["name"])

pack = kg.pack("authentication flow")
pack.save("context.md")

Knowledge Graph Schema

Node kinds

| Kind | Description |
|------|-------------|
| document | A source .md, .txt, or .pdf file |
| section | A heading-delimited region within a document |
| chunk | A semantically coherent text passage |
| topic | A topic extracted from chunk text |
| entity | A named entity (person, place, org, concept) |
| keyword | A keyword or key phrase |
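Because the canonical store is plain SQLite, node kinds can be inspected with nothing but the standard library. The minimal schema below (a `nodes` table with `id`, `kind`, `name` columns) is an assumption for illustration; check graph.sqlite with `.schema` for the real layout.

```python
import sqlite3

# Assumed minimal schema for illustration; the real graph.sqlite may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [
        ("doc:readme", "document", "README.md"),
        ("sec:install", "section", "Installation"),
        ("chunk:install-1", "chunk", "pip install ..."),
        ("kw:poetry", "keyword", "poetry"),
    ],
)

# The kind of query a graph_stats-style report would run: node counts by kind.
counts = dict(conn.execute("SELECT kind, COUNT(*) FROM nodes GROUP BY kind"))
print(counts)
```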

Edge types

| Type | Description |
|------|-------------|
| CONTAINS | Parent → child (document→section, section→chunk) |
| NEXT | Sequential ordering between same-level nodes |
| REFERENCES | Chunk references another document or section |
| SIMILAR_TO | Semantic similarity between chunks (LanceDB-derived) |
| HAS_TOPIC | Chunk → topic |
| MENTIONS_ENTITY | Chunk → named entity |
| HAS_KEYWORD | Chunk → keyword |
| CO_OCCURS_WITH | Co-occurrence between topics/entities within a chunk |
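These typed edges are what structural expansion walks during a query. A hedged sketch of a one-hop traversal, assuming an `edges(src, type, dst)` table (the real schema may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, type TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("doc:guide", "CONTAINS", "sec:auth"),
    ("sec:auth", "CONTAINS", "chunk:auth-1"),
    ("chunk:auth-1", "NEXT", "chunk:auth-2"),
    ("chunk:auth-1", "HAS_TOPIC", "topic:oauth"),
])

def neighbors(node_id: str, types: tuple[str, ...]) -> set[str]:
    """One hop out from node_id, restricted to the given edge types (both directions)."""
    placeholders = ",".join("?" * len(types))
    rows = conn.execute(
        f"SELECT src, dst FROM edges WHERE type IN ({placeholders}) "
        "AND (src = ? OR dst = ?)",
        (*types, node_id, node_id),
    )
    return {dst if src == node_id else src for src, dst in rows}

print(neighbors("chunk:auth-1", ("NEXT", "HAS_TOPIC")))
```

Traversing in both directions is what lets a chunk hit pull in its parent section and sibling chunks during the `hop` phase of a query.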

Storage Layout

.dockg/
  graph.sqlite      # SQLite knowledge graph (nodes + edges)
  lancedb/          # LanceDB vector index
  snapshots/        # Temporal metric snapshots (JSON)
    manifest.json
    <commit>.json
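Snapshots are plain JSON, so diffing two of them (what `dockg snapshot diff` does at a high level) is a small exercise. The metric names below are invented for the example; inspect a real file under .dockg/snapshots/ for the actual keys.

```python
import json

# Two hypothetical snapshot payloads, as `dockg snapshot save` might record them.
old = json.loads('{"version": "0.11.0", "metrics": {"nodes": 1200, "edges": 4100, "orphans": 7}}')
new = json.loads('{"version": "0.12.0", "metrics": {"nodes": 1350, "edges": 4600, "orphans": 3}}')

def diff_metrics(a: dict, b: dict) -> dict:
    """Per-metric delta between two snapshots (b minus a)."""
    keys = a["metrics"].keys() | b["metrics"].keys()
    return {k: b["metrics"].get(k, 0) - a["metrics"].get(k, 0) for k in sorted(keys)}

print(diff_metrics(old, new))
```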

Citation

If you use DocKG in research or a project, please cite it:


APA

Suchanek, E. G. (2026). DocKG: Hybrid Knowledge Graph for Document Corpora (Version 0.12.1) [Software]. Flux-Frontiers. https://doi.org/10.5281/zenodo.19770973

BibTeX

@software{suchanek_doc_kg,
  author    = {Suchanek, Eric G.},
  title     = {{DocKG}: Hybrid Knowledge Graph for Document Corpora},
  version   = {0.12.1},
  year      = {2026},
  publisher = {Flux-Frontiers},
  url       = {https://github.com/Flux-Frontiers/doc_kg},
  doi       = {10.5281/zenodo.19770973},
}

License

Elastic License 2.0 — free for non-commercial and internal use; commercial redistribution requires a license from Flux-Frontiers.
