
DocKG — A Hybrid Knowledge Graph for Document Corpora with Semantic Indexing and Source-Grounded Passage Packing

Author: Eric G. Suchanek, PhD (Flux-Frontiers, Liberty Twp., OH)


Overview

DocKG constructs a deterministic, explainable knowledge graph from a corpus of Markdown, plain-text, and PDF documents. It semantically chunks text, discovers structural and semantic relationships between sections and chunks, stores them in SQLite, and augments retrieval with vector embeddings via LanceDB.

Structure is treated as ground truth; semantic search is strictly an acceleration layer. The result is a searchable, auditable representation of a document corpus — an ideal retrieval engine for LLMs and a practical foundation for Knowledge-Graph RAG (KRAG).

DocKG uses the same architecture as CodeKG but targets natural-language documents rather than Python source code.


Features

  • Multi-format ingestion — .md, .txt, .rst, and .pdf (native parsing, no inference)
  • Semantic chunking — Heading-structure and paragraph-aware segmentation
  • Deterministic knowledge graph — SQLite-backed canonical store with typed nodes and provenance-tracked edges
  • Relation extraction — Topics, named entities, and keywords per chunk; co-occurrence and similarity edges built automatically
  • Hybrid query model — Semantic seeding (LanceDB embeddings) + structural expansion (graph traversal)
  • Passage packing — Context-rich text passages grounded to source documents with headings
  • Corpus health analysis — Per-document metrics, hot chunks, orphan detection, coverage report
  • Temporal snapshots — Save and diff graph metrics over time
  • MCP server — Four tools for AI agent integration (graph_stats, query_docs, pack_docs, get_node)
  • Streamlit web app — Interactive graph browser, hybrid query UI, and passage pack explorer
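
The hybrid query model above can be illustrated with a toy sketch. All names and data here are illustrative stand-ins, not DocKG's internal API: semantic seeding picks the top-k chunks by embedding similarity (LanceDB's role), then structural expansion walks the graph out from each seed (SQLite's role).

```python
import math
from collections import defaultdict

# Toy corpus: chunk id -> embedding vector (in DocKG these come from LanceDB).
embeddings = {
    "chunk:1": [1.0, 0.0],
    "chunk:2": [0.9, 0.1],
    "chunk:3": [0.0, 1.0],
}
# Toy graph edges (src, type, dst) -- in DocKG these live in SQLite.
edges = [
    ("doc:a", "CONTAINS", "chunk:1"),
    ("chunk:1", "NEXT", "chunk:2"),
    ("chunk:2", "SIMILAR_TO", "chunk:3"),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def hybrid_query(query_vec, k=2, hop=1):
    # 1) Semantic seeding: top-k chunks by cosine similarity.
    seeds = sorted(embeddings, key=lambda c: -cosine(query_vec, embeddings[c]))[:k]
    # 2) Structural expansion: follow edges `hop` steps in either direction.
    neighbors = defaultdict(set)
    for src, _, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    result, frontier = set(seeds), set(seeds)
    for _ in range(hop):
        frontier = {n for node in frontier for n in neighbors[node]} - result
        result |= frontier
    return seeds, result

seeds, nodes = hybrid_query([1.0, 0.0], k=2, hop=1)
```

With `hop=1`, the two seed chunks pull in both their parent document and a semantically similar chunk — the "structural expansion" that a pure vector search would miss.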

Quick Start

# Index a document corpus (SQLite + LanceDB in one step)
dockg build docs/

# Natural-language query — returns ranked document chunks
dockg query "authentication flow"

# Source-grounded passage pack — paste straight into an LLM prompt
dockg pack "configuration reference" --format md --out context.md

Installation

Requirements: Python ≥ 3.12, < 3.14

# pip
pip install doc-kg

# With Streamlit web visualizer
pip install 'doc-kg[viz]'

# Poetry
poetry add doc-kg

For advanced deployment options (Streamlit Cloud, Fly.io, offline model cache, git hooks) see docs/deployment.md.


Usage

Build and query

dockg build docs/                                    # full pipeline
dockg build docs/ --update                           # incremental (keep existing)
dockg build docs/ --exclude-dir archive              # skip directories
dockg query "deployment configuration"               # hybrid search
dockg pack "error handling" --format md --out ctx.md # passage pack

Analyze corpus health

dockg analyze docs/             # full report + JSON snapshot
dockg analyze docs/ --quiet     # CI mode — exits 1 on issues

Snapshots

dockg snapshot save 0.12.0      # capture current metrics
dockg snapshot diff 0.11.0 0.12.0  # compare two versions
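
Since snapshots are JSON metric captures, diffing two versions amounts to comparing two flat metric dicts. A minimal sketch — the metric names here are invented for illustration, not DocKG's actual snapshot fields:

```python
def diff_snapshots(old, new):
    """Return per-metric deltas between two snapshot dicts (illustrative)."""
    keys = set(old) | set(new)
    return {k: new.get(k, 0) - old.get(k, 0) for k in sorted(keys)}

# Hypothetical metrics for two versions of a corpus graph.
v011 = {"nodes": 1200, "edges": 4800, "orphan_chunks": 7}
v012 = {"nodes": 1350, "edges": 5600, "orphan_chunks": 3}
delta = diff_snapshots(v011, v012)
```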

Full flag reference for every command: docs/CLI.md
Query patterns and MCP examples: docs/CHEATSHEET.md


MCP Integration

Start the MCP server, then wire it into your AI agent:

dockg mcp --repo docs/

Claude Code / Kilo Code — add to .mcp.json:

{
  "mcpServers": {
    "dockg": { "command": "dockg-mcp", "args": ["--repo", "."] }
  }
}

GitHub Copilot — add to .vscode/mcp.json:

{
  "servers": {
    "dockg": { "type": "stdio", "command": "dockg-mcp", "args": ["--repo", "."] }
  }
}

| Tool | Description |
| --- | --- |
| graph_stats() | Node and edge counts by kind |
| query_docs(q, k, hop) | Hybrid semantic + structural search |
| pack_docs(q, k, hop) | Source-grounded passages as Markdown |
| get_node(node_id) | Fetch a single node by ID |

Full provider setup (Claude Desktop, Cline, SSE transport): docs/MCP.md


Python API

from doc_kg import DocKG

kg = DocKG(corpus_root="docs/")
kg.build(wipe=True)

result = kg.query("deployment configuration", k=8, hop=1)
for node in result.nodes:
    print(node["id"], node["name"])

pack = kg.pack("authentication flow")
pack.save("context.md")
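
Since query results appear to be plain dicts (as the loop above suggests), downstream code can post-process them with ordinary Python. A sketch with stand-in data — the exact keys, including "kind", are assumed here for illustration:

```python
from collections import defaultdict

# Stand-in for result.nodes: a list of node dicts like those printed above.
nodes = [
    {"id": "chunk:12", "name": "Deploying with Fly.io", "kind": "chunk"},
    {"id": "section:3", "name": "Deployment", "kind": "section"},
    {"id": "chunk:14", "name": "Config file reference", "kind": "chunk"},
]

# Group hit names by node kind for display or filtering.
by_kind = defaultdict(list)
for node in nodes:
    by_kind[node["kind"]].append(node["name"])
```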

Knowledge Graph Schema

Node kinds

| Kind | Description |
| --- | --- |
| document | A source .md, .txt, or .pdf file |
| section | A heading-delimited region within a document |
| chunk | A semantically coherent text passage |
| topic | A topic extracted from chunk text |
| entity | A named entity (person, place, org, concept) |
| keyword | A keyword or key phrase |

Edge types

| Type | Description |
| --- | --- |
| CONTAINS | Parent → child (document→section, section→chunk) |
| NEXT | Sequential ordering between same-level nodes |
| REFERENCES | Chunk references another document or section |
| SIMILAR_TO | Semantic similarity between chunks (LanceDB-derived) |
| HAS_TOPIC | Chunk → topic |
| MENTIONS_ENTITY | Chunk → named entity |
| HAS_KEYWORD | Chunk → keyword |
| CO_OCCURS_WITH | Co-occurrence between topics/entities within a chunk |
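
Because the canonical store is ordinary SQLite, the graph can be inspected with plain SQL. The table and column names below are an assumption for illustration (the real schema in graph.sqlite may differ); the sketch shows a two-hop CONTAINS traversal from document to chunk:

```python
import sqlite3

# Illustrative schema -- the actual tables in .dockg/graph.sqlite may differ.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT, name TEXT);
CREATE TABLE edges (src TEXT, type TEXT, dst TEXT);
""")
con.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    ("doc:readme", "document", "README.md"),
    ("sec:install", "section", "Installation"),
    ("chunk:1", "chunk", "pip install doc-kg ..."),
])
con.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    ("doc:readme", "CONTAINS", "sec:install"),
    ("sec:install", "CONTAINS", "chunk:1"),
])

# All chunks reachable from a document via two CONTAINS hops.
rows = con.execute("""
    SELECT n.id, n.name FROM edges d
    JOIN edges s ON d.dst = s.src
    JOIN nodes n ON n.id = s.dst
    WHERE d.src = 'doc:readme' AND d.type = 'CONTAINS' AND s.type = 'CONTAINS'
""").fetchall()
```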

Storage Layout

.dockg/
  graph.sqlite      # SQLite knowledge graph (nodes + edges)
  lancedb/          # LanceDB vector index
  snapshots/        # Temporal metric snapshots (JSON)
    manifest.json
    <commit>.json

Citation

If you use DocKG in research or a project, please cite it:

DOI

APA

Suchanek, E. G. (2026). DocKG: Hybrid Knowledge Graph for Document Corpora (Version 0.12.1) [Software]. Flux-Frontiers. https://doi.org/10.5281/zenodo.19770973

BibTeX

@software{suchanek_doc_kg,
  author    = {Suchanek, Eric G.},
  title     = {{DocKG}: Hybrid Knowledge Graph for Document Corpora},
  version   = {0.12.1},
  year      = {2026},
  publisher = {Flux-Frontiers},
  url       = {https://github.com/Flux-Frontiers/doc_kg},
  doi       = {10.5281/zenodo.19770973},
}

License

Elastic License 2.0 — free for non-commercial and internal use; commercial redistribution requires a license from Flux-Frontiers.
