A knowledge graph builder and semantic search engine for diaries and journals

DiaryKG

DiaryKG — A deterministic knowledge graph for diaries and journals with semantic indexing and source-grounded snippet packing.

Author: Eric G. Suchanek, PhD — Flux-Frontiers, Liberty TWP, OH


Overview

DiaryKG ingests plain-text diary or journal files and produces a hybrid SQLite + LanceDB knowledge graph that supports natural-language querying, source-grounded snippet packs for LLM context, temporal analysis, and topic/context classification.

It was built around the Samuel Pepys diary (1660–1669; 3,355 parsed entries yielding 7,282 chunks) but is general-purpose: any structured plain-text diary or journal file is supported.

The system is organized as two cooperating Python packages:

  • diary_transformer — spaCy NLP enrichment, topic classification, sentence-group chunking, diversity sampling. Turns a raw diary text file into one Markdown chunk-file per entry, with full provenance metadata.
  • diary_kg — orchestrates the chunking pipeline, builds the DocKG-backed SQLite graph + LanceDB vector index over the chunked corpus, and exposes the query / pack / analyze / snapshot APIs and an MCP server.

Architecture

Plain-text diary
       │
       ▼
DiaryTransformer          spaCy NLP enrichment, topic classification,
  (diary_transformer)     sentence-group chunking, diversity sampling
       │
       ▼
Corpus (.md files)        one file per chunk, full provenance metadata
  .diarykg/corpus/
       │
       ├──▶ DocKG build   SQLite graph + LanceDB vector index
       │     (doc-kg)     BAAI/bge-small-en-v1.5 (384-d, normalized)
       │
       └──▶ DiaryKG APIs  query(), pack(), analyze(), snapshot_save()

Storage layout

.diarykg/
  config.json         build parameters
  corpus/             one .md chunk file per diary entry
  graph.sqlite        SQLite knowledge graph (DocKG)
  lancedb/            LanceDB vector index (384-d HNSW)
  snapshots/          point-in-time metrics snapshots
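
As a quick sanity check on the layout above, a small helper can report which of the listed artifacts are missing from a project. This is a minimal sketch, not part of DiaryKG: the function name missing_artifacts is hypothetical, and only the file and directory names come from the listing.

```python
from pathlib import Path

# Entry names are taken from the storage layout above; the check itself is
# illustrative and not part of the DiaryKG API.
EXPECTED = {
    "config.json": "file",
    "corpus": "dir",
    "graph.sqlite": "file",
    "lancedb": "dir",
    "snapshots": "dir",
}


def missing_artifacts(root):
    """Return the expected .diarykg/ entries that are absent under root."""
    base = Path(root) / ".diarykg"
    missing = []
    for name, kind in EXPECTED.items():
        path = base / name
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(name)
    return missing
```

An empty list means every listed artifact exists; it says nothing about whether the index contents are valid, which is what diarykg status is for.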

Quick Start

# Install
pip install diary-kg

# Build from a plain-text diary file (creates .diarykg/ in the current dir)
diarykg build --source path/to/diary.txt

# Query the corpus
diarykg query "office work and the navy board"

# Pack snippets for an LLM context window
diarykg pack "Pepys at the theatre" --output context.md

# Start the MCP server (stdio transport for Claude Code / Cline / etc.)
diarykg-mcp

Installation

From PyPI (recommended)

# Core runtime (CLI + MCP server + graph engine)
pip install diary-kg

# With Streamlit / Plotly visualizer extras
pip install "diary-kg[viz]"

# With 3D visualization extras (PyVista, PyQt5, etc. — heavy dependencies)
pip install "diary-kg[viz3d]"

# With KG integration deps (pycode-kg, doc-kg)
pip install "diary-kg[kgdeps]"

# Everything
pip install "diary-kg[all]"

Poetry project

poetry add diary-kg
poetry add "diary-kg[viz]"
poetry add "diary-kg[kgdeps]"

Local development

git clone https://github.com/Flux-Frontiers/diary_kg.git
cd diary_kg
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
poetry run pytest

CLI Reference

The diarykg console script is the primary entry point. The MCP server ships as a separate diarykg-mcp script.

Command                                      Purpose
diarykg build                                Full pipeline: ingest diary → chunk → index into SQLite + LanceDB
diarykg reindex                              Rebuild the LanceDB + SQLite index from the existing corpus (skips ingest)
diarykg query <QUERY>                        Hybrid semantic + graph search; returns ranked hits
diarykg pack <QUERY>                         Source-grounded Markdown snippet pack for LLM context
diarykg analyze                              Generate a Markdown analysis report for the corpus
diarykg status                               KG health check and build metadata, without loading the full DB
diarykg snapshot save                        Capture point-in-time corpus metrics
diarykg snapshot list / show / diff / prune  Inspect and prune snapshots
diarykg install-hooks                        Install the DiaryKG pre-commit git hook
diarykg-mcp                                  Run the MCP server (stdio / SSE transport)

Every command accepts a ROOT positional argument (default: current directory) pointing at the project that contains .diarykg/. Run diarykg <command> --help for the full option list.

Build

# First build — --source is required
diarykg build --source pepys/pepys_enriched_full.txt

# Incremental update (preserve existing corpus + DBs)
diarykg build --source pepys/pepys_enriched_full.txt --update

# Configure chunking
diarykg build --source diary.txt --chunking semantic --chunk-size 800 --max-chunks 5

# Capture a snapshot immediately after the build
diarykg build --source diary.txt --snapshot

Chunking strategies: sentence_group (default), semantic, hybrid. Custom topic catalogs can be supplied with --topics-file path/to/topics.yaml.
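
To make the sentence_group idea concrete, here is a minimal sketch of greedy sentence grouping. It is not the diary_transformer implementation: it assumes --chunk-size counts characters, uses a naive regex splitter instead of spaCy, and simply truncates to max_chunks rather than applying diversity sampling. The function name is hypothetical.

```python
import re


def sentence_group_chunks(text, chunk_size=800, max_chunks=5):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Assumptions (not confirmed by the DiaryKG docs): sizes are measured in
    characters, and extra chunks are dropped rather than sampled.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # A single sentence longer than chunk_size is kept whole as its own chunk.
    return chunks[:max_chunks]


print(sentence_group_chunks("Up betimes. To the office. Dined at home. So to bed.", chunk_size=30))
# ['Up betimes. To the office.', 'Dined at home. So to bed.']
```

Keeping sentences whole means chunk boundaries never split a thought mid-sentence, at the cost of chunks that vary in length.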

Query and pack

# Top-k semantic hits as a rich-formatted table
diarykg query "Navy affairs" -k 12

# Same query as JSON for downstream tooling
diarykg query "Navy affairs" --json

# Markdown snippet pack ready to paste into an LLM
diarykg pack "Pepys wife Elizabeth" --output context.md
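
For downstream tooling, the --json output can be consumed from Python by shelling out to the CLI. The sketch below is hedged: it assumes -k and --json can be combined, that stdout contains only the JSON document, and that the command is run from the project directory (the default ROOT). The helper names query_argv and run_query are hypothetical.

```python
import json
import subprocess


def query_argv(query, k=12):
    """Build the argv for `diarykg query`, using only flags shown above."""
    return ["diarykg", "query", query, "-k", str(k), "--json"]


def run_query(query, k=12, project_dir="."):
    """Run the CLI in project_dir and parse its stdout as JSON.

    Assumes stdout holds nothing but the JSON payload; the payload's schema
    is not documented here, so it is returned as-is.
    """
    proc = subprocess.run(
        query_argv(query, k),
        cwd=project_dir,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)
```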

Snapshots

Version is an option (-v / --version), not a positional argument; bare positionals are treated as ROOT.

# Capture a snapshot at the current corpus state
diarykg snapshot save -v 0.92.2

# With a label
diarykg snapshot save -v 0.92.2 -l "after backfilling 1667 entries"

# List, inspect, compare
diarykg snapshot list
diarykg snapshot show <key>
diarykg snapshot diff <key_a> <key_b>

# Prune snapshots that carry no new metric information
diarykg snapshot prune --dry-run

Snapshots are keyed by git tree hash and capture chunk/entry/node/edge counts, temporal span, topic/context distributions, and deltas vs. the previous and baseline snapshots.
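
The delta idea can be sketched in a few lines. This is an illustration only: the metric key names and the reference numbers below are invented, and the actual snapshot schema may differ.

```python
def metric_deltas(current, reference):
    """Per-metric difference between two snapshots of integer counts.

    Metrics missing from either side are treated as 0, so newly added
    metrics show up with their full value.
    """
    keys = sorted(set(current) | set(reference))
    return {key: current.get(key, 0) - reference.get(key, 0) for key in keys}


# Hypothetical metric names and values, for illustration only.
previous = {"chunks": 7000, "entries": 3300}
latest = {"chunks": 7282, "entries": 3355, "nodes": 10}
print(metric_deltas(latest, previous))
# {'chunks': 282, 'entries': 55, 'nodes': 10}
```

Computing the same function against the first snapshot instead of the previous one gives the baseline delta.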

Reindex

Run reindex after changing the embedding model or fixing an index bug, provided the corpus .md chunk files are already up to date.

diarykg reindex

MCP Server

DiaryKG ships an MCP server that exposes three tools to AI agents.

Tool               Returns   Description
query_diary(q, k)  JSON      Semantic search over the diary corpus; ranked hit list with node_id,
                             score, summary, source_file, timestamp, category, context.
pack_diary(q, k)   Markdown  Top-k diary snippets formatted as Markdown sections, ready to paste
                             into an LLM context window.
diary_stats()      JSON      Combined corpus metadata (info()) and KG stats (stats()): chunk/entry
                             counts, temporal span, topic/context distributions, node/edge counts.
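
A client that receives the query_diary payload might post-process it along these lines. This is a sketch under assumptions: it supposes the JSON is a list of objects carrying the field names listed above, and the sample payload is invented. The function name is hypothetical.

```python
import json
from collections import defaultdict


def group_by_category(payload):
    """Group hit node_ids by category, preserving the ranked order.

    Assumes payload is a JSON list of hit objects with at least a
    'node_id' key; hits without a 'category' fall under 'unknown'.
    """
    grouped = defaultdict(list)
    for hit in json.loads(payload):
        grouped[hit.get("category", "unknown")].append(hit["node_id"])
    return dict(grouped)


# Invented sample payload, for illustration only.
sample = json.dumps([
    {"node_id": "a", "category": "navy"},
    {"node_id": "b", "category": "theatre"},
    {"node_id": "c", "category": "navy"},
])
print(group_by_category(sample))
# {'navy': ['a', 'c'], 'theatre': ['b']}
```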

Run the server

# Stdio transport (default — for Claude Code / Cline / Claude Desktop / Kilo Code)
diarykg-mcp --repo /path/to/diary_project

# SSE transport
diarykg-mcp --repo /path/to/diary_project --transport sse

Wire it up in an MCP client

Most MCP clients use a JSON config file. Example .mcp.json for Claude Code or Kilo Code:

{
  "mcpServers": {
    "diarykg": {
      "command": "diarykg-mcp",
      "args": ["--repo", "/absolute/path/to/diary_project"]
    }
  }
}
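
If the config file already lists other servers, the entry can be merged in programmatically rather than pasted by hand. The following is a minimal sketch using only the standard library; the helper name add_diarykg_server is hypothetical, and the config shape is the one shown above.

```python
import json
from pathlib import Path


def add_diarykg_server(config_path, project_dir):
    """Add or overwrite the 'diarykg' entry in an MCP client config file.

    Other entries under 'mcpServers' are left untouched. The project path
    is resolved to an absolute path, as in the example config.
    """
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["diarykg"] = {
        "command": "diarykg-mcp",
        "args": ["--repo", str(Path(project_dir).resolve())],
    }
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, add_diarykg_server(".mcp.json", ".") registers the current directory as the diary project.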

For per-agent setup steps, run /setup-diarykg-mcp in Claude Code (the slash command at .claude/commands/setup-diarykg-mcp.md walks through the Claude Code, Cline, Claude Desktop, GitHub Copilot, and Kilo Code variants).


Python API

from diary_kg import DiaryKG

# First build
kg = DiaryKG("/path/to/project", source_file="pepys_diary.txt")
kg.build()

# Subsequent runs only need the project root
kg = DiaryKG("/path/to/project")

# Hybrid semantic + graph search
hits = kg.query("what did Pepys think of the theatre?", k=12)

# Source-grounded snippet pack (list of dicts with content, metadata)
snippets = kg.pack("Navy corruption", k=8)

# Corpus metadata + KG stats
info = kg.info()        # chunk_count, entry_count, temporal_span, topic/context distributions
stats = kg.stats()      # node_count, edge_count

# Markdown analysis report
report = kg.analyze()

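# Snapshots
kg.snapshot_save(version="0.92.2", label="release")
kg.snapshot_list()
# key, key_a and key_b are placeholders for snapshot keys,
# such as those reported by `diarykg snapshot list`
kg.snapshot_show(key)
kg.snapshot_diff(key_a, key_b)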

The package re-exports the primary types:

from diary_kg import DiaryKG, DEFAULT_MODEL, CrossHit, CrossSnippet, KGEntry, KGKind

Embedding Model

Use                    Model                   Dims  Notes
Knowledge graph build  BAAI/bge-small-en-v1.5  384   Fast, general-text, L2-normalized
Multipass pipeline     BAAI/bge-small-en-v1.5  384   Same model stack-wide; loaded via kg_utils.embedder.load_sentence_transformer()

Model loading is handled by kg_utils.embedder.load_sentence_transformer(), which enforces local_files_only=True when a cached copy exists — preventing spurious HuggingFace HEAD requests in offline or air-gapped environments.
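
One practical consequence of L2-normalized embeddings is that cosine similarity reduces to a plain dot product, which keeps similarity scoring cheap. The sketch below demonstrates this on toy 2-d vectors with the standard library only; it does not load the model or call any DiaryKG code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization the dot product alone equals the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```

The same identity holds for the 384-d vectors in the index, since it depends only on the vectors having unit length.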


Project Structure

diary_kg/
├── src/
│   ├── diary_kg/                 DiaryKG package
│   │   ├── kg.py                 DiaryKG class (build, query, pack, analyze, snapshots)
│   │   ├── cli.py                Click CLI — `diarykg` console script
│   │   ├── mcp_server.py         MCP server — `diarykg-mcp` console script
│   │   ├── primitives.py         CrossHit, CrossSnippet, KGEntry, KGKind
│   │   ├── snapshots.py          DiarySnapshotManager
│   │   └── module/               Pluggable KGModule interface
│   └── diary_transformer/        Chunking + NLP pipeline
│       ├── transformer.py        DiaryTransformer orchestrator
│       ├── chunker.py            sentence_group / semantic / hybrid chunkers
│       ├── classifier.py         Topic + context classification
│       ├── parser.py             Diary file parser
│       ├── topic_classifier.py   Hybrid keyword / K-means classifier
│       └── topics.yaml           Default topic catalog
├── pepys/                        Sample Pepys diary corpus
├── docs/                         Technical articles and disclosures
├── benchmarks/                   Embedding model benchmarks
├── analysis/                     Versioned analysis reports
├── tests/                        Pytest suite
└── scripts/                      Wiki generator, embedder benchmarks

Dependencies

  • doc-kg ≥ 0.12.0 — hybrid semantic + structural document knowledge graph
  • kgmodule-utils ≥ 0.2.3 — shared embedding, model cache, and snapshot utilities
  • spacy ≥ 3.8 with en_core_web_sm model
  • sentence-transformers ≥ 5.4
  • lancedb ≥ 0.29
  • transformers ≥ 4.57
  • mcp ≥ 1.0 — Model Context Protocol SDK
  • rich ≥ 14.3 — terminal output and progress bars

Optional extras (viz, viz3d, kgdeps, dev) are documented in pyproject.toml.


Development

# Install with dev tools
pip install -e ".[dev]"

# Run the test suite
pytest                          # uses pytest.ini (testpaths = tests/)
pytest -m "not slow"            # skip slow tests
pytest --cov=diary_kg           # with coverage

# Lint and format
ruff check src tests
ruff format src tests
mypy src/

# Pre-commit (runs ruff, mypy, pytest, detect-secrets, pylint)
pre-commit run --all-files

The repo ships an optional pre-commit git hook that rebuilds PyCodeKG and DocKG indices from staged content, captures metrics snapshots keyed by git tree hash, and stages .pycodekg/snapshots/ and .dockg/snapshots/ atomically before the standard pre-commit framework checks run. Install it with:

diarykg install-hooks --repo .
# Skip per-commit with: DIARYKG_SKIP_SNAPSHOT=1 git commit ...

Example Corpus: corpus_pepys

corpus_pepys is the reference implementation — the complete diary of Samuel Pepys (1660–1669) packaged as a self-contained Docker image with a KGRAG query API and a Pepys-specific Streamlit chat UI.

git clone https://github.com/Flux-Frontiers/corpus_pepys
cd corpus_pepys
make build-index   # build DiaryKG from the included corpus (~3 min)
make build-image   # bake index into Docker image
make run           # KGRAG API on http://localhost:8000
make chat          # Streamlit chat UI on http://localhost:8501

Query with curl:

curl -s -X POST http://localhost:8000/runsync \
  -H "Content-Type: application/json" \
  -d '{"input":{"query":"Great Fire of London","corpus":"pepys","k":5}}' | jq .
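
The same request can be issued from Python with the standard library. This is a sketch mirroring the curl call above: the endpoint, headers, and body shape are taken from it, while the helper names and the assumption that the response body is JSON are mine.

```python
import json
import urllib.request


def runsync_request(query, corpus="pepys", k=5, base_url="http://localhost:8000"):
    """Build the POST request for /runsync without sending it."""
    body = json.dumps({"input": {"query": query, "corpus": corpus, "k": k}})
    return urllib.request.Request(
        f"{base_url}/runsync",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def runsync(query, **kwargs):
    """Send the request and decode the response, assumed to be JSON."""
    request = runsync_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the container from make run listening on port 8000, runsync("Great Fire of London") is the equivalent of the curl command.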

The repo includes 3,355 parsed diary entries, 7,282 NLP-enriched chunks, topic configs, processing scripts, and the full technical write-ups from docs/. It is the canonical example of how to stand up a DiaryKG corpus as a portable, air-gapped Docker service.


Brand & Logo

Logo generation prompt and brand guidelines (color palette, style rules, family DNA) are in assets/brands.md.

DiaryKG accent color: Rose #FF6B8A.


License

Elastic License 2.0 — see LICENSE and the Elastic License page.

Citation

If you use DiaryKG in academic work, please cite via the metadata in CITATION.cff.
