AI-native knowledge engine across the DIKW pyramid (Data → Information → Knowledge → Wisdom)

dikw-core

AI-native knowledge engine that turns your documents into Data → Information → Knowledge → Wisdom.

Inspired by Karpathy's LLM Wiki pattern, extended end-to-end across the full DIKW pyramid. Where Karpathy's pattern stops at a compounding markdown knowledge base (the K layer), dikw-core adds a first-class Wisdom layer for human-authored principles, lessons, and patterns that apply beyond any single source.

Status: alpha. Under active construction; APIs, on-disk formats, database schema, and CLI will change.

What you get

  • A local-first knowledge base — the dikw base — where the on-disk layout is a plain markdown tree your editor (Obsidian, VS Code, …) can open directly.
  • Four explicit DIKW layers with their own operations:
    • Data — raw sources you curate.
    • Information — parsed, chunked, embedded, indexed (FTS5 + vectors).
    • Knowledge — LLM-authored knowledge pages with [[wikilinks]], index.md, and an append-only log.md.
    • Wisdom — hand-written markdown principles / lessons / patterns authored under wisdom/<author>/, indexed (chunked + embedded) so they surface in retrieve alongside K-layer pages.
  • Pluggable LLM providers (API-first): Anthropic + OpenAI-compatible (covers OpenAI, Azure, Ollama, DeepSeek, Gemini-compat).
  • Pluggable storage: SQLite+sqlite-vec (default), Postgres+pgvector (enterprise) — swap by config.
  • Client / server architecture. A long-lived dikw serve (FastAPI + NDJSON) hosts the engine; the dikw client … Typer CLI talks to it over HTTP, streams progress events for long ops, and supports cancel / resume.
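
As a rough illustration of the NDJSON streaming mentioned above, the sketch below decodes a newline-delimited JSON stream one event at a time. The event fields (`type`, `done`, `total`, `ok`) are hypothetical placeholders, not the documented wire contract; docs/server.md remains the authority on the real event shapes.

```python
import json
from collections.abc import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        yield json.loads(line)


# Hypothetical event shapes, for illustration only.
sample_stream = [
    '{"type": "progress", "done": 1, "total": 3}',
    "",
    '{"type": "progress", "done": 3, "total": 3}',
    '{"type": "result", "ok": true}',
]

events = list(iter_ndjson(sample_stream))
```

Because each line is a complete JSON document, a client can act on progress events as they arrive instead of waiting for the whole response body.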

Install & quick start

Requires Python 3.12+ and uv.

git clone https://github.com/OpenDIKW/dikw-core
cd dikw-core
uv sync

uv run dikw init my-base --description "my research base"
cd my-base
# Drop some markdown into sources/, then run any single command via
# `dikw client serve-and-run` — it spawns a local server, runs the
# inner command, and tears it down.
uv run dikw client serve-and-run -- ingest --no-embed
uv run dikw client serve-and-run -- retrieve "What does Karpathy mean by deterministic scoping?"

For interactive sessions or long iterations, run dikw serve once and keep using dikw client * against it:

uv run dikw serve --base .   # in one terminal
# in another:
uv run dikw client status
uv run dikw client synth               # K layer (needs ANTHROPIC_API_KEY or OpenAI-compat)
uv run dikw client retrieve "What does Karpathy mean by deterministic scoping?"

Every HTTP-bound command is spelled out as dikw client <verb>; there are no top-level short aliases. dikw-core no longer ships an in-engine answer-synthesis path: retrieve returns ranked chunks + page refs, and the agent (Claude Code, ChatGPT, your own script) feeds them into its own LLM. See GUIDE_FOR_AGENTS.md.
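
One way an agent might consume retrieve output is sketched below: it takes ranked chunks and assembles a grounded prompt for its own LLM. The `RetrievedChunk` fields and the `build_prompt` helper are assumptions made for illustration; the actual response shape is described in GUIDE_FOR_AGENTS.md and docs/server.md.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    # Hypothetical shape; the real retrieve payload may differ.
    page_ref: str
    text: str
    score: float


def build_prompt(question: str, chunks: list[RetrievedChunk], max_chunks: int = 5) -> str:
    """Assemble a grounded prompt from the top-ranked chunks."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:max_chunks]
    context = "\n\n".join(
        f"[{i}] ({c.page_ref})\n{c.text}" for i, c in enumerate(top, start=1)
    )
    return (
        "Answer using only the numbered context below and cite entries by number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


prompt = build_prompt(
    "What is X?",
    [
        RetrievedChunk("knowledge/b.md", "B text", 0.4),
        RetrievedChunk("knowledge/a.md", "A text", 0.9),
    ],
)
```

The string in `prompt` is what the agent would hand to whichever model it uses; dikw-core itself makes no LLM call on this path.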

Server deployment, security posture, and the wire contract live in docs/server.md. For container deployment, see examples/docker/ (Dockerfile + compose stack with pgvector/pgvector:0.8.2-pg18) and the long-form docs/deployment-docker.md.

End-to-end walkthrough: docs/getting-started.md. Architecture brief: docs/architecture.md. Approved design doc: docs/design.md.

Commands

Local-only commands run in this process:

| command | does |
| --- | --- |
| dikw version | print the package version |
| dikw init <path> | scaffold a dikw base (sources / knowledge / wisdom / .dikw/ + dikw.yml) |
| dikw serve --base <path> | start the FastAPI + NDJSON server bound to one base |

Everything else lives under dikw client * and talks to a running server. There are no top-level short aliases — spelling out the client prefix keeps the local-vs-HTTP boundary unambiguous for both agents and humans:

| command | does |
| --- | --- |
| dikw client status | counts across DIKW layers |
| dikw client info | raw GET /v1/info passthrough: version, storage backend, auth posture |
| dikw client health | server self-description (base, version, storage, providers); the first call an agent makes |
| dikw client check | ping the configured LLM + embedding endpoints to verify dikw.yml + keys |
| dikw client import <path> | pre-flight + import local md packages (md + referenced assets) into the server's sources/ |
| dikw client ingest [--no-embed] | parse + chunk + FTS-index + embed the server's sources/ tree |
| dikw client retrieve "<q>" | hybrid search returning ranked chunks + page refs (no LLM call); agent supplies its own synthesis |
| dikw client synth [--all] | LLM turns source docs into K-layer knowledge pages; maintains index.md + log.md |
| dikw client lint [propose\|proposals\|apply] | report broken wikilinks / orphan pages / duplicate titles; propose + apply structured fixes |
| dikw client pages {list,get,links,provenance} | enumerate pages / read a page body + chunk anchors / walk the K-layer link graph / walk the K↔D provenance edge |
| dikw client graph get | fetch the whole base graph (nodes + edges + unresolved wikilinks) in one read |
| dikw client assets get <id> --output <file> | download a content-addressed asset by sha256 id |
| dikw client eval [--dataset] | run retrieval-quality evaluation against packaged or custom datasets |
| dikw client tasks {list,status,events,wait,cancel} | inspect running / past async tasks on the server |
| dikw client serve-and-run -- <cmd> | one-shot server + inner command + teardown (no long-lived dikw serve needed) |

The dikw auth {login,import,status,list,logout} subgroup is local — it manages OAuth tokens in <base>/.dikw/auth.json without talking to a server (used by the openai_codex provider; see docs/providers.md).

Providers

Configured via dikw.yml:

provider:
  llm: anthropic_compat         # or: openai_compat
  llm_model: claude-sonnet-4-6
  llm_base_url: null            # set for any Anthropic-protocol-compatible endpoint
  embedding: openai_compat
  embedding_model: text-embedding-3-small
  embedding_base_url: https://api.openai.com/v1
  embedding_dim: 1536           # required: must match what the endpoint returns
  embedding_revision: ""        # bump to force re-embed when vendor refreshes weights silently
  embedding_normalize: true
  embedding_distance: cosine
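
To make the last few settings concrete, here is a minimal sketch of what `embedding_dim`, `embedding_normalize`, and `embedding_distance: cosine` imply for a single vector. The helper names are illustrative and this is not dikw-core's internal code; it only shows the arithmetic behind the options.

```python
import math


def check_dim(vector: list[float], expected_dim: int) -> None:
    """Fail loudly when the endpoint's output size disagrees with the config."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"endpoint returned {len(vector)} dims, config says {expected_dim}"
        )


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; for unit vectors this is 1 - dot(a, b)."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))
```

Vectors pointing the same way have distance 0 regardless of length, and orthogonal vectors have distance 1, which is why a dimension mismatch has to be caught before any comparison is attempted.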

llm names a wire protocol (which SDK to speak), not a vendor — the actual vendor is whatever llm_base_url points at.

  • anthropic_compat → uses the anthropic async SDK with cache_control on the system prompt, so repeated synth calls hit the prompt cache. Set llm_base_url to retarget the SDK at any Anthropic-protocol-compatible endpoint (e.g., MiniMax's https://api.minimaxi.com/anthropic); leave null for api.anthropic.com.
  • openai_compat → uses the openai async SDK against any base URL that speaks the OpenAI HTTP surface (Azure, Ollama, vLLM, DeepSeek, MiniMax, …).

Full vendor cookbook (MiniMax, GLM, Gemini, DeepSeek, Gitee AI, Ollama, …) and the production gotchas around batch size, embedding dimensions, and retry/caching live in docs/providers.md.

Using MiniMax LLM + Gitee AI embeddings

MiniMax has no embeddings endpoint, so pair its Anthropic-compatible LLM surface with an OpenAI-compatible embedding vendor. The example below uses Gitee AI with Qwen3-Embedding-0.6B (1024 dimensions native), the recommended default. Qwen3-Embedding-8B is an alternative (embedding_dim: 1024 via matryoshka, or 4096 native) that costs more for a marginal quality gain. Fill the URLs in by hand; dikw-core never auto-detects vendor endpoints:

provider:
  llm: anthropic_compat
  llm_model: <MiniMax Anthropic-compatible model name>
  llm_base_url: https://api.minimaxi.com/anthropic
  embedding: openai_compat
  embedding_model: Qwen3-Embedding-0.6B
  embedding_base_url: https://ai.gitee.com/v1
  embedding_dim: 1024               # 0.6B native; locked at first ingest
  embedding_revision: ""            # bump to force re-embed when Qwen weights drift silently
  embedding_normalize: true
  embedding_distance: cosine
  embedding_batch_size: 16          # required: Gitee rejects batches >25
  embedding_provider_label: gitee-ai  # optional; shows up in `dikw client check`
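
The effect of `embedding_batch_size` can be pictured with a small batching helper. This is a sketch of the general technique, not dikw-core's own batching code, and `batched` is a name chosen here for illustration.

```python
from collections.abc import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Split items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


chunks = [f"chunk-{i}" for i in range(40)]
sizes = [len(batch) for batch in batched(chunks, 16)]
```

With 40 chunks and a batch size of 16, every request stays under the vendor's limit and only the final batch is short.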

A working reference copy lives at tests/fixtures/live-minimax-gitee.dikw.yml — drop it into a fresh base and fill in your two keys.

Two keys for two vendors — the embedding leg reads DIKW_EMBEDDING_API_KEY exclusively (no OPENAI_API_KEY fallback), so misconfigurations fail loudly rather than cross-wiring credentials:

export ANTHROPIC_API_KEY=<your-MiniMax-key>
export DIKW_EMBEDDING_API_KEY=<your-Gitee-key>
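
The "fail loudly, no fallback" behaviour described above might look like the following in outline. `read_embedding_key` is a hypothetical helper written for this README, not a function exported by dikw-core.

```python
import os
from collections.abc import Mapping


def read_embedding_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the embedding key, ignoring OPENAI_API_KEY on purpose."""
    key = env.get("DIKW_EMBEDDING_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DIKW_EMBEDDING_API_KEY is not set; "
            "refusing to fall back to OPENAI_API_KEY"
        )
    return key
```

Refusing the fallback means a missing variable surfaces as a configuration error at startup instead of as an authentication failure against the wrong vendor.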

Verify connectivity before running ingest/synth. The two legs can be probed separately, which is useful when you set up one vendor first:

uv run dikw client check --llm-only     # just LLM — useful before Gitee is wired up
uv run dikw client check --embed-only   # just embedding
uv run dikw client check                # both

dikw client check pings each provider with one tiny request and prints a status table with endpoint, latency, and dim/tokens. Exit code is 0 on success, 1 on failure, 2 on flag misuse — scriptable in CI or a shell one-liner.
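
For scripting, the three exit codes can be wrapped as shown in this sketch. Only the command line and the 0 / 1 / 2 meanings come from this README; the helper names and message strings are illustrative.

```python
import subprocess

EXIT_MEANINGS = {
    0: "providers reachable",
    1: "provider check failed",
    2: "flag misuse",
}


def describe_exit(code: int) -> str:
    """Translate a `dikw client check` exit code into a short message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_check(extra_args: tuple[str, ...] = ()) -> int:
    """Run `uv run dikw client check` and return its exit code."""
    cmd = ["uv", "run", "dikw", "client", "check", *extra_args]
    return subprocess.run(cmd, check=False).returncode
```

A CI step could call `run_check(("--llm-only",))`, print `describe_exit(...)` for the result, and exit with the same code.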

Source formats

Markdown ships out of the box. A new format is one SourceBackend subclass + a register() call away — see domains/data/backends/markdown.py for the reference impl.
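
The general shape of a "subclass plus register()" plugin is sketched below. Everything in it is a stand-in written for this README: the `SourceBackend` base class, its `parse` method, the `suffixes` attribute, and the `register` signature are assumptions, so check domains/data/backends/markdown.py for the real interface before writing a backend.

```python
from abc import ABC, abstractmethod
from pathlib import Path

# Stand-ins for illustration; not dikw-core's real classes or signatures.
_REGISTRY: dict[str, "SourceBackend"] = {}


class SourceBackend(ABC):
    suffixes: tuple[str, ...] = ()

    @abstractmethod
    def parse(self, path: Path, raw: bytes) -> str:
        """Return plain text extracted from the raw source bytes."""


def register(backend: SourceBackend) -> None:
    """Make a backend discoverable by each file suffix it handles."""
    for suffix in backend.suffixes:
        _REGISTRY[suffix] = backend


class PlainTextBackend(SourceBackend):
    suffixes = (".txt",)

    def parse(self, path: Path, raw: bytes) -> str:
        return raw.decode("utf-8", errors="replace")


register(PlainTextBackend())


def parse_source(path: Path, raw: bytes) -> str:
    """Dispatch to whichever backend claimed this file's suffix."""
    backend = _REGISTRY.get(path.suffix.lower())
    if backend is None:
        raise LookupError(f"no backend registered for {path.suffix!r}")
    return backend.parse(path, raw)
```

The point of the pattern is that ingest code only ever calls the dispatcher, so a new format does not require touching the pipeline itself.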

Storage

Two backends ship, selected in dikw.yml:

storage:
  backend: sqlite          # sqlite | postgres

  # --- sqlite (default): single-user local ---
  path: .dikw/index.sqlite

  # --- postgres (enterprise): multi-user, pgvector + tsvector ---
  # backend: postgres
  # dsn: postgresql://user:pw@host:5432/dikw
  # schema: dikw
  # pool_size: 10

  • SQLite + sqlite-vec + FTS5: the default. No extras required.
  • Postgres + pgvector: install via uv pip install "dikw-core[postgres]" (quoted so that shells such as zsh do not glob the brackets). Requires the pg_trgm and vector extensions (standard on the pgvector/pgvector:0.8.2-pg18 Docker image). The adapter uses tsvector+GIN for FTS and vector(N) for embeddings; the vector dimension is set at first insert.

Engine code talks only to the Storage Protocol (storage/base.py); each adapter implements the same contract and is swappable by changing dikw.yml.
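
The swap-by-config idea rests on structural typing, which the sketch below illustrates with `typing.Protocol`. The two-method contract and the `InMemoryStorage` adapter are hypothetical and far smaller than the real Storage Protocol in storage/base.py.

```python
from typing import Protocol


class Storage(Protocol):
    # Hypothetical two-method contract; the real one lives in storage/base.py.
    def put_chunk(self, chunk_id: str, text: str) -> None: ...
    def search_text(self, query: str) -> list[str]: ...


class InMemoryStorage:
    """Toy adapter that satisfies the protocol without inheriting from it."""

    def __init__(self) -> None:
        self._chunks: dict[str, str] = {}

    def put_chunk(self, chunk_id: str, text: str) -> None:
        self._chunks[chunk_id] = text

    def search_text(self, query: str) -> list[str]:
        q = query.lower()
        return [cid for cid, text in self._chunks.items() if q in text.lower()]


def index_and_search(storage: Storage, docs: dict[str, str], query: str) -> list[str]:
    """Engine-side code: depends on the protocol, never on a concrete adapter."""
    for chunk_id, text in docs.items():
        storage.put_chunk(chunk_id, text)
    return storage.search_text(query)


hits = index_and_search(InMemoryStorage(), {"a": "Hello world", "b": "bye"}, "hello")
```

Any adapter exposing the same methods can be passed to `index_and_search` unchanged, which is the property that lets a config switch pick the backend.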

Releasing

Tagged pushes (vX.Y.Z) trigger .github/workflows/release.yml, which builds sdist + wheel, re-runs the full test gate, and publishes to PyPI via trusted publishing (no token in repo secrets). One-time setup on PyPI's side:

  1. Create the dikw-core project on PyPI.
  2. On the project's Publishing page, add a GitHub trusted publisher with:
    • owner: OpenDIKW
    • repository: dikw-core
    • workflow: release.yml
    • environment: pypi

After that, git tag vX.Y.Z && git push --tags is enough. The release workflow also opens a chore(docker): bump DIKW_VERSION to vX.Y.Z PR against main after a successful PyPI publish, keeping examples/docker/Dockerfile in lockstep with the latest published wheel; merge that chore PR to clear the post-release queue. The dockerfile-version-guard job in reusable-ci.yml enforces the invariant on every PR.

License

MIT — see LICENSE.
