diary-kg

A knowledge graph builder and semantic search engine for diaries and journals

DiaryKG

DiaryKG — A deterministic knowledge graph for diaries and journals with semantic indexing and source-grounded snippet packing.

Author: Eric G. Suchanek, PhD — Flux-Frontiers, Liberty TWP, OH

Overview

DiaryKG ingests plain-text diary or journal files and produces a hybrid SQLite + LanceDB knowledge graph that supports natural-language querying, source-grounded snippet packs for LLM context, temporal analysis, and topic/context classification.

It was built around the Samuel Pepys diary (1660–1669; 7,282 chunks from 3,355 parsed entries in the reference corpus) but is general-purpose: any structured plain-text diary or journal file is supported.

The system is organized as two cooperating Python packages:

diary_transformer: spaCy NLP enrichment, topic classification, sentence-group chunking, and diversity sampling. Turns a raw diary text file into Markdown chunk files (one file per chunk), each with full provenance metadata.
diary_kg: orchestrates the chunking pipeline, builds the DocKG-backed SQLite graph + LanceDB vector index over the chunked corpus, and exposes the query / pack / analyze / snapshot APIs and an MCP server.

Architecture

Plain-text diary
       │
       ▼
DiaryTransformer          spaCy NLP enrichment, topic classification,
  (diary_transformer)     sentence-group chunking, diversity sampling
       │
       ▼
Corpus (.md files)        one file per chunk, full provenance metadata
  .diarykg/corpus/
       │
       ├──▶ DocKG build   SQLite graph + LanceDB vector index
       │     (doc-kg)     BAAI/bge-small-en-v1.5 (384-d, normalized)
       │
       └──▶ DiaryKG APIs  query(), pack(), analyze(), snapshot_save()

Storage layout

.diarykg/
  config.json         build parameters
  corpus/             one .md file per chunk
  graph.sqlite        SQLite knowledge graph (DocKG)
  lancedb/            LanceDB vector index (384-d HNSW)
  snapshots/          point-in-time metrics snapshots
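
As a rough illustration of the layout above, the following stdlib-only sketch reports which of the listed entries are absent under a project root. The helper name missing_store_entries is hypothetical and not part of DiaryKG, and whether every entry exists right after a build is an assumption; for a real health check, diarykg status is the documented route.

```python
from pathlib import Path

# Entry names taken from the storage layout above; True marks a directory.
EXPECTED = {
    "config.json": False,
    "corpus": True,
    "graph.sqlite": False,
    "lancedb": True,
    "snapshots": True,
}


def missing_store_entries(root: str) -> list:
    """Return the expected .diarykg entries that are absent under root."""
    store = Path(root) / ".diarykg"
    missing = []
    for name, is_dir in EXPECTED.items():
        path = store / name
        present = path.is_dir() if is_dir else path.is_file()
        if not present:
            missing.append(name)
    return missing
```

An empty list means every entry from the layout is present; it says nothing about whether the databases inside are valid.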

Quick Start

# Install
pip install diary-kg

# Build from a plain-text diary file (creates .diarykg/ in the current dir)
diarykg build --source path/to/diary.txt

# Query the corpus
diarykg query "office work and the navy board"

# Pack snippets for an LLM context window
diarykg pack "Pepys at the theatre" --output context.md

# Start the MCP server (stdio transport for Claude Code / Cline / etc.)
diarykg-mcp

Installation

From PyPI (recommended)

# Core runtime (CLI + MCP server + graph engine)
pip install diary-kg

# With Streamlit / Plotly visualizer extras
pip install "diary-kg[viz]"

# With 3D visualization extras (PyVista, PyQt5, etc. — heavy dependencies)
pip install "diary-kg[viz3d]"

# With KG integration deps (pycode-kg, doc-kg)
pip install "diary-kg[kgdeps]"

# Everything
pip install "diary-kg[all]"

Poetry project

poetry add diary-kg
poetry add "diary-kg[viz]"
poetry add "diary-kg[kgdeps]"

Local development

git clone https://github.com/Flux-Frontiers/diary_kg.git
cd diary_kg
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest

CLI Reference

The diarykg console script is the primary entry point. The MCP server ships as a separate diarykg-mcp script.

Command	Purpose
`diarykg build`	Full pipeline: ingest diary → chunk → index into SQLite + LanceDB
`diarykg reindex`	Rebuild the LanceDB + SQLite index from the existing corpus (skips ingest)
`diarykg query <QUERY>`	Hybrid semantic + graph search; returns ranked hits
`diarykg pack <QUERY>`	Source-grounded Markdown snippet pack for LLM context
`diarykg analyze`	Generate a Markdown analysis report for the corpus
`diarykg status`	KG health check and build metadata, without loading the full DB
`diarykg snapshot save`	Capture point-in-time corpus metrics
`diarykg snapshot list / show / diff / prune`	Inspect and prune snapshots
`diarykg install-hooks`	Install the DiaryKG pre-commit git hook
`diarykg-mcp`	Run the MCP server (stdio / SSE transport)

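diarykg commands accept a ROOT positional argument (default: current directory) pointing at the project that contains .diarykg/. Note that the diarykg-mcp and diarykg install-hooks examples later in this document point at the project with --repo instead. Run diarykg <command> --help for the full option list.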

Build

# First build — --source is required
diarykg build --source pepys/pepys_enriched_full.txt

# Incremental update (preserve existing corpus + DBs)
diarykg build --source pepys/pepys_enriched_full.txt --update

# Configure chunking
diarykg build --source diary.txt --chunking semantic --chunk-size 800 --max-chunks 5

# Capture a snapshot immediately after the build
diarykg build --source diary.txt --snapshot

Chunking strategies: sentence_group (default), semantic, hybrid. Custom topic catalogs can be supplied with --topics-file path/to/topics.yaml.
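
To make the sentence_group idea concrete, here is a minimal sketch of greedy sentence grouping. It is not the diary_transformer implementation: the function name, the regex sentence splitter (the real pipeline uses spaCy), the reading of --chunk-size as a character budget, and the plain truncation used for --max-chunks (the real pipeline mentions diversity sampling) are all assumptions made for illustration.

```python
import re


def sentence_group_chunks(text: str, chunk_size: int = 800, max_chunks: int = 5) -> list:
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    Assumption: chunk_size is a character budget. A single sentence longer
    than the budget becomes its own oversized chunk rather than being split.
    """
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Keep at most max_chunks chunks per entry (simple truncation).
    return chunks[:max_chunks]
```

With a 30-character budget, the four short sentences "Up betimes. To the office. Dined at home. So to bed." group into two chunks of two sentences each.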

Query and pack

# Top-k semantic hits as a rich-formatted table
diarykg query "Navy affairs" -k 12

# Same query as JSON for downstream tooling
diarykg query "Navy affairs" --json

# Markdown snippet pack ready to paste into an LLM
diarykg pack "Pepys wife Elizabeth" --output context.md

Snapshots

For snapshot save, the version is passed as an option (-v / --version), not as a positional argument; bare positionals are treated as ROOT.

# Capture a snapshot at the current corpus state
diarykg snapshot save -v 0.92.2

# With a label
diarykg snapshot save -v 0.92.2 -l "after backfilling 1667 entries"

# List, inspect, compare
diarykg snapshot list
diarykg snapshot show <key>
diarykg snapshot diff <key_a> <key_b>

# Prune snapshots that carry no new metric information
diarykg snapshot prune --dry-run

Snapshots are keyed by git tree hash and capture chunk/entry/node/edge counts, temporal span, topic/context distributions, and deltas vs. the previous and baseline snapshots.
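
The delta idea can be sketched in a few lines. This is an illustration only: the function names and the flat dictionary of counts are assumptions, not the format DiaryKG stores on disk.

```python
def count_deltas(older: dict, newer: dict) -> dict:
    """Return newer - older for every count present in either snapshot."""
    keys = sorted(set(older) | set(newer))
    return {key: newer.get(key, 0) - older.get(key, 0) for key in keys}


def snapshot_deltas(baseline: dict, previous: dict, current: dict) -> dict:
    """Compare the current counts against the previous and baseline snapshots."""
    return {
        "vs_previous": count_deltas(previous, current),
        "vs_baseline": count_deltas(baseline, current),
    }
```

A count that is missing from one side is treated as zero, so a metric introduced in a later snapshot shows up as a positive delta instead of raising an error.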

Reindex

Use this after changing the embedding model or fixing an index bug, when the corpus .md chunk files are already up to date.

diarykg reindex

MCP Server

DiaryKG ships an MCP server that exposes three tools to AI agents.

Tool	Returns	Description
`query_diary(q, k)`	JSON	Semantic search over the diary corpus; ranked hit list with `node_id`, `score`, `summary`, `source_file`, `timestamp`, `category`, `context`.
`pack_diary(q, k)`	Markdown	Top-k diary snippets formatted as Markdown sections, ready to paste into an LLM context window.
`diary_stats()`	JSON	Combined corpus metadata (`info()`) and KG stats (`stats()`): chunk/entry counts, temporal span, topic/context distributions, node/edge counts.
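
A client that receives the query_diary result might post-filter it by score. The sketch below assumes the payload is a JSON array of objects carrying the fields listed in the table and that higher scores mean better matches; both points are assumptions, and top_hits is a hypothetical helper, not a DiaryKG function.

```python
import json


def top_hits(payload: str, min_score: float = 0.0) -> list:
    """Parse a JSON hit list and keep hits at or above min_score, best first."""
    hits = json.loads(payload)
    kept = [hit for hit in hits if hit["score"] >= min_score]
    return sorted(kept, key=lambda hit: hit["score"], reverse=True)
```

If the server wraps the hit list in an outer object, the json.loads result would need to be unwrapped before filtering.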

Run the server

# Stdio transport (default — for Claude Code / Cline / Claude Desktop / Kilo Code)
diarykg-mcp --repo /path/to/diary_project

# SSE transport
diarykg-mcp --repo /path/to/diary_project --transport sse

Wire it up in an MCP client

Most MCP clients use a JSON config file. Example .mcp.json for Claude Code or Kilo Code:

{
  "mcpServers": {
    "diarykg": {
      "command": "diarykg-mcp",
      "args": ["--repo", "/absolute/path/to/diary_project"]
    }
  }
}
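
If you prefer to generate that file rather than edit it by hand, a small script can do it. This is a sketch, not a shipped tool: the function names are hypothetical, and only the JSON structure, the diarykg-mcp command, and the --repo flag come from the example above.

```python
import json
from pathlib import Path


def diarykg_server_entry(project_root: str) -> dict:
    """Build the server entry shown above, using an absolute project path."""
    return {
        "command": "diarykg-mcp",
        "args": ["--repo", str(Path(project_root).resolve())],
    }


def write_mcp_config(project_root: str, config_path: str = ".mcp.json") -> None:
    """Add or update the diarykg entry, keeping any other configured servers."""
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    config.setdefault("mcpServers", {})["diarykg"] = diarykg_server_entry(project_root)
    path.write_text(json.dumps(config, indent=2) + "\n")
```

Merging into an existing file, instead of overwriting it, avoids dropping servers that other tools have already registered.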

For per-agent setup steps, run /setup-diarykg-mcp in Claude Code (the slash command at .claude/commands/setup-diarykg-mcp.md walks through the Claude Code, Cline, Claude Desktop, GitHub Copilot, and Kilo Code variants).

Python API

from diary_kg import DiaryKG

# First build
kg = DiaryKG("/path/to/project", source_file="pepys_diary.txt")
kg.build()

# Subsequent runs only need the project root
kg = DiaryKG("/path/to/project")

# Hybrid semantic + graph search
hits = kg.query("what did Pepys think of the theatre?", k=12)

# Source-grounded snippet pack (list of dicts with content, metadata)
snippets = kg.pack("Navy corruption", k=8)

# Corpus metadata + KG stats
info = kg.info()        # chunk_count, entry_count, temporal_span, topic/context distributions
stats = kg.stats()      # node_count, edge_count

# Markdown analysis report
report = kg.analyze()

# Snapshots (key, key_a, key_b are snapshot keys; snapshots are keyed by git tree hash)
kg.snapshot_save(version="0.92.2", label="release")
kg.snapshot_list()
kg.snapshot_show(key)
kg.snapshot_diff(key_a, key_b)

The package re-exports the primary types:

from diary_kg import DiaryKG, DEFAULT_MODEL, CrossHit, CrossSnippet, KGEntry, KGKind
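
The snippet list returned by pack() can be flattened into a single Markdown string for a prompt. The sketch below assumes each snippet is a dict with a "content" string and an optional "metadata" dict, as the comment in the example suggests; the "source_file" metadata key, the max_chars budget, and the render_context helper are hypothetical.

```python
def render_context(snippets: list, max_chars: int = 4000) -> str:
    """Join snippet contents into Markdown sections until max_chars is reached."""
    sections = []
    used = 0
    for index, snippet in enumerate(snippets, start=1):
        metadata = snippet.get("metadata", {})
        # "source_file" is an assumed metadata key; fall back if it is absent.
        source = metadata.get("source_file", "unknown source")
        section = f"## Snippet {index} ({source})\n\n{snippet['content'].strip()}\n"
        # Always keep the first snippet, then stop once the budget is exceeded.
        if sections and used + len(section) > max_chars:
            break
        sections.append(section)
        used += len(section)
    return "\n".join(sections)
```

A character budget is a crude stand-in for a token budget, but it keeps the sketch free of tokenizer dependencies.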

Embedding Model

Use	Model	Dims	Notes
Knowledge graph build	`BAAI/bge-small-en-v1.5`	384	Fast, general-text, L2-normalized
Multipass pipeline	`BAAI/bge-small-en-v1.5`	384	Same model stack-wide; loaded via `kg_utils.embedder.load_sentence_transformer()`

Model loading is handled by kg_utils.embedder.load_sentence_transformer(), which enforces local_files_only=True when a cached copy exists — preventing spurious HuggingFace HEAD requests in offline or air-gapped environments.
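
The table notes that embeddings are L2-normalized. One practical consequence is that cosine similarity between two normalized vectors reduces to a plain dot product. The toy sketch below shows this with 2-d vectors and the standard library only; it does not load the model, and the helper names are illustrative.

```python
import math


def l2_normalize(vector: list) -> list:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: list, b: list) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list, b: list) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization the dot product alone matches the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```

The same identity holds for the 384-dimensional vectors produced by the model, which is why normalized embeddings are convenient for similarity search.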

Project Structure

diary_kg/
├── src/
│   ├── diary_kg/                 DiaryKG package
│   │   ├── kg.py                 DiaryKG class (build, query, pack, analyze, snapshots)
│   │   ├── cli.py                Click CLI — `diarykg` console script
│   │   ├── mcp_server.py         MCP server — `diarykg-mcp` console script
│   │   ├── primitives.py         CrossHit, CrossSnippet, KGEntry, KGKind
│   │   ├── snapshots.py          DiarySnapshotManager
│   │   └── module/               Pluggable KGModule interface
│   └── diary_transformer/        Chunking + NLP pipeline
│       ├── transformer.py        DiaryTransformer orchestrator
│       ├── chunker.py            sentence_group / semantic / hybrid chunkers
│       ├── classifier.py         Topic + context classification
│       ├── parser.py             Diary file parser
│       ├── topic_classifier.py   Hybrid keyword / K-means classifier
│       └── topics.yaml           Default topic catalog
├── pepys/                        Sample Pepys diary corpus
├── docs/                         Technical articles and disclosures
├── benchmarks/                   Embedding model benchmarks
├── analysis/                     Versioned analysis reports
├── tests/                        Pytest suite
└── scripts/                      Wiki generator, embedder benchmarks

Dependencies

doc-kg ≥ 0.12.0 — hybrid semantic + structural document knowledge graph
kgmodule-utils ≥ 0.2.3 — shared embedding, model cache, and snapshot utilities
spacy ≥ 3.8 with en_core_web_sm model
sentence-transformers ≥ 5.4
lancedb ≥ 0.29
transformers ≥ 4.57
mcp ≥ 1.0 — Model Context Protocol SDK
rich ≥ 14.3 — terminal output and progress bars

Optional extras (viz, viz3d, kgdeps, dev) are documented in pyproject.toml.

Development

# Install with dev tools
pip install -e ".[dev]"

# Run the test suite
pytest                          # uses pytest.ini (testpaths = tests/)
pytest -m "not slow"            # skip slow tests
pytest --cov=diary_kg           # with coverage

# Lint and format
ruff check src tests
ruff format src tests
mypy src/

# Pre-commit (runs ruff, mypy, pytest, detect-secrets, pylint)
pre-commit run --all-files

The repo ships an optional pre-commit git hook. It rebuilds the PyCodeKG and DocKG indices from staged content, captures metrics snapshots keyed by git tree hash, and stages .pycodekg/snapshots/ and .dockg/snapshots/ atomically, all before the standard pre-commit framework checks run. Install it with:

diarykg install-hooks --repo .
# Skip per-commit with: DIARYKG_SKIP_SNAPSHOT=1 git commit ...

Example Corpus: corpus_pepys

corpus_pepys is the reference implementation — the complete diary of Samuel Pepys (1660–1669) packaged as a self-contained Docker image with a KGRAG query API and a Pepys-specific Streamlit chat UI.

git clone https://github.com/Flux-Frontiers/corpus_pepys
cd corpus_pepys
make build-index   # build DiaryKG from the included corpus (~3 min)
make build-image   # bake index into Docker image
make run           # KGRAG API on http://localhost:8000
make chat          # Streamlit chat UI on http://localhost:8501

Query with curl:

curl -s -X POST http://localhost:8000/runsync \
  -H "Content-Type: application/json" \
  -d '{"input":{"query":"Great Fire of London","corpus":"pepys","k":5}}' | jq .
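
The same request can be issued from Python with the standard library. This sketch mirrors the curl call above (endpoint, header, and JSON body come from that example); the function names are hypothetical, and treating the response body as JSON is an assumption based on the output being piped to jq.

```python
import json
import urllib.request


def build_runsync_request(query: str, corpus: str = "pepys", k: int = 5,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request that the curl example sends to /runsync."""
    body = json.dumps({"input": {"query": query, "corpus": corpus, "k": k}}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/runsync",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, **kwargs) -> dict:
    """Send the request and decode the JSON response (requires the API to be running)."""
    with urllib.request.urlopen(build_runsync_request(query, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling run_query("Great Fire of London") only works once make run has started the API on port 8000.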

The repo includes 3,355 parsed diary entries, 7,282 NLP-enriched chunks, topic configs, processing scripts, and the full technical write-ups from docs/. It is the canonical example of how to stand up a DiaryKG corpus as a portable, air-gapped Docker service.

Brand & Logo

Logo generation prompt and brand guidelines (color palette, style rules, family DNA) are in assets/brands.md.

DiaryKG accent color: Rose #FF6B8A.

License

Elastic License 2.0 — see LICENSE and the Elastic License page.

Citation

If you use DiaryKG in academic work, please cite via the metadata in CITATION.cff.

Release history

0.92.6 (this version): May 22, 2026
0.92.5: May 19, 2026
