AI-native knowledge engine across the DIKW pyramid (Data → Information → Knowledge → Wisdom)
dikw-core
AI-native knowledge engine that turns your documents into Data → Information → Knowledge → Wisdom.
Inspired by Karpathy's LLM Wiki pattern, extended end-to-end across the full DIKW pyramid. Where Karpathy's pattern stops at a compounding markdown knowledge base (the K layer), dikw-core adds a first-class Wisdom layer for human-authored principles, lessons, and patterns that apply beyond any single source.
Status: alpha. Under active construction; APIs, on-disk formats, database schema, and CLI will change.
What you get
- A local-first knowledge base — the dikw base — where the on-disk layout is a plain markdown tree your editor (Obsidian, VS Code, …) can open directly.
- Four explicit DIKW layers with their own operations:
  - Data: raw sources you curate.
  - Information: parsed, chunked, embedded, indexed (FTS5 + vectors).
  - Knowledge: LLM-authored knowledge pages with `[[wikilinks]]`, `index.md`, and an append-only `log.md`.
  - Wisdom: hand-written markdown principles / lessons / patterns authored under `wisdom/<author>/`, indexed (chunked + embedded) so they surface in `retrieve` alongside K-layer pages.
- Pluggable LLM providers (API-first): Anthropic + OpenAI-compatible (covers OpenAI, Azure, Ollama, DeepSeek, Gemini-compat).
- Pluggable storage: SQLite+sqlite-vec (default), Postgres+pgvector (enterprise) — swap by config.
- Client / server architecture. A long-lived `dikw serve` (FastAPI + NDJSON) hosts the engine; the `dikw client …` Typer CLI talks to it over HTTP, streams progress events for long ops, and supports cancel / resume.
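NDJSON carries one JSON object per line, which is what lets a client act on progress events as they arrive. The sketch below shows that decoding step using only the standard library. The event fields (`type`, `done`, `total`, `status`) are assumptions made for illustration, not the server's documented wire contract; see docs/server.md for the real one.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of an NDJSON stream."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        yield json.loads(line)


# Hypothetical event stream; the field names are illustrative only.
sample_stream = [
    '{"type": "progress", "done": 1, "total": 3}\n',
    "\n",
    '{"type": "progress", "done": 3, "total": 3}\n',
    '{"type": "final", "status": "ok"}\n',
]

events = list(iter_ndjson(sample_stream))
for event in events:
    print(event["type"], event)
```

Because each line is a complete document, a consumer can stop reading at any point without holding a half-parsed payload.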
Install & quick start
Requires Python 3.12+ and uv.
git clone https://github.com/OpenDIKW/dikw-core
cd dikw-core
uv sync
uv run dikw init my-base --description "my research base"
cd my-base
# Drop some markdown into sources/, then run any single command via
# `dikw client serve-and-run` — it spawns a local server, runs the
# inner command, and tears it down.
uv run dikw client serve-and-run -- ingest --no-embed
uv run dikw client serve-and-run -- retrieve "What does Karpathy mean by deterministic scoping?"
For interactive sessions or long iterations, run dikw serve once and
keep using dikw client * against it:
uv run dikw serve --base . # in one terminal
# in another:
uv run dikw client status
uv run dikw client synth # K layer (needs ANTHROPIC_API_KEY or OpenAI-compat)
uv run dikw client retrieve "What does Karpathy mean by deterministic scoping?"
Every HTTP-bound command is spelled out as `dikw client <verb>`; there are no top-level short aliases. `dikw-core` no longer ships an in-engine answer-synthesis path: `retrieve` returns ranked chunks + page refs and the agent (Claude Code, ChatGPT, your own script) feeds them into its own LLM. See `GUIDE_FOR_AGENTS.md`.
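Since synthesis is left to the agent, the agent needs a step that turns retrieved chunks into a prompt. The following is a minimal sketch of that step. The `RetrievedChunk` shape (`page_ref`, `score`, `text`) and the prompt wording are assumptions for illustration; the actual response schema is described in `GUIDE_FOR_AGENTS.md`.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical shape of one retrieve hit; the real schema may differ."""

    page_ref: str
    score: float
    text: str


def build_prompt(question: str, chunks: list[RetrievedChunk], max_chunks: int = 5) -> str:
    """Assemble an LLM prompt from the highest-scoring chunks."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:max_chunks]
    context = "\n\n".join(
        f"[{i}] ({c.page_ref})\n{c.text}" for i, c in enumerate(top, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


hits = [
    RetrievedChunk("knowledge/scoping.md", 0.62, "Second-best passage."),
    RetrievedChunk("knowledge/wiki-pattern.md", 0.91, "Best passage."),
]
prompt = build_prompt("What is deterministic scoping?", hits)
print(prompt)
```

Numbering the chunks and keeping the page ref next to each one gives the downstream LLM something concrete to cite.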
Server deployment, security posture, and the wire contract live in
docs/server.md. For container deployment, see
examples/docker/ (Dockerfile + compose stack
with pgvector/pgvector:0.8.2-pg18) and the long-form
docs/deployment-docker.md.
End-to-end walkthrough: docs/getting-started.md.
Architecture brief: docs/architecture.md.
Approved design doc: docs/design.md.
Commands
Local-only commands run in this process:
| command | does |
|---|---|
| `dikw version` | print the package version |
| `dikw init <path>` | scaffold a dikw base (sources / knowledge / wisdom / .dikw/ + dikw.yml) |
| `dikw serve --base <path>` | start the FastAPI + NDJSON server bound to one base |
Everything else lives under dikw client * and talks to a running server.
There are no top-level short aliases — spelling out the client prefix
keeps the local-vs-HTTP boundary unambiguous for both agents and humans:
| command | does |
|---|---|
| `dikw client status` | counts across DIKW layers |
| `dikw client info` | raw `GET /v1/info` passthrough: version, storage backend, auth posture |
| `dikw client health` | server self-description (base, version, storage, providers); the first call an agent makes |
| `dikw client check` | ping the configured LLM + embedding endpoints to verify dikw.yml + keys |
| `dikw client import <path>` | pre-flight + import local md packages (md + referenced assets) into the server's sources/ |
| `dikw client ingest [--no-embed]` | parse + chunk + FTS-index + embed the server's sources/ tree |
| `dikw client retrieve "<q>"` | hybrid search returning ranked chunks + page refs (no LLM call); agent supplies its own synthesis |
| `dikw client synth [--all]` | LLM turns source docs into K-layer knowledge pages; maintains `index.md` + `log.md` |
| `dikw client lint [propose\|proposals\|apply]` | report broken wikilinks / orphan pages / duplicate titles; propose + apply structured fixes |
| `dikw client pages {list,get,links,provenance}` | enumerate pages / read a page body + chunk anchors / walk the K-layer link graph / walk the K↔D provenance edge |
| `dikw client graph get` | fetch the whole base graph (nodes + edges + unresolved wikilinks) in one read |
| `dikw client assets get <id> --output <file>` | download a content-addressed asset by sha256 id |
| `dikw client eval [--dataset]` | run retrieval-quality evaluation against packaged or custom datasets |
| `dikw client tasks {list,status,events,wait,cancel}` | inspect running / past async tasks on the server |
| `dikw client serve-and-run -- <cmd>` | one-shot server + inner command + teardown (no long-lived `dikw serve` needed) |
The dikw auth {login,import,status,list,logout} subgroup is local —
it manages OAuth tokens in <base>/.dikw/auth.json without talking to a
server (used by the openai_codex provider; see docs/providers.md).
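The `assets get` command above addresses assets by sha256 id. As a rough illustration of what content addressing means, the sketch below derives an id from the bytes themselves using the standard library. It assumes the id is the plain hex digest of the raw file bytes; dikw-core's exact derivation is not stated here and may differ.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Identical bytes always map to the same id, regardless of file name.
first = content_address(b"diagram bytes")
second = content_address(b"diagram bytes")
print(first == second)
print(len(first))
```

The practical consequence is deduplication: two sources that reference the same image resolve to one stored asset.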
Providers
Configured via dikw.yml:
provider:
  llm: anthropic_compat # or: openai_compat
  llm_model: claude-sonnet-4-6
  llm_base_url: null # set for any Anthropic-protocol-compatible endpoint
  embedding: openai_compat
  embedding_model: text-embedding-3-small
  embedding_base_url: https://api.openai.com/v1
  embedding_dim: 1536 # required: must match what the endpoint returns
  embedding_revision: "" # bump to force re-embed when vendor refreshes weights silently
  embedding_normalize: true
  embedding_distance: cosine
llm names a wire protocol (which SDK to speak), not a vendor — the
actual vendor is whatever llm_base_url points at.
- `anthropic_compat` → uses the `anthropic` async SDK with `cache_control` on the system prompt, so repeated synth calls hit the prompt cache. Set `llm_base_url` to retarget the SDK at any Anthropic-protocol-compatible endpoint (e.g., MiniMax's `https://api.minimaxi.com/anthropic`); leave null for api.anthropic.com.
- `openai_compat` → uses the `openai` async SDK against any base URL that speaks the OpenAI HTTP surface (Azure, Ollama, vLLM, DeepSeek, MiniMax, …).
Full vendor cookbook (MiniMax, GLM, Gemini, DeepSeek, Gitee AI, Ollama, …)
and the production gotchas around batch size, embedding dimensions, and
retry/caching live in docs/providers.md.
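The `embedding_dim`, `embedding_normalize`, and `embedding_distance` settings interact: a vector of the wrong length cannot be stored, and cosine distance on L2-normalized vectors reduces to one minus a dot product. The sketch below illustrates both ideas in plain Python. The function names are hypothetical and this is not dikw-core's internal code.

```python
import math


def check_dim(vec: list[float], expected_dim: int) -> None:
    """Fail loudly when the endpoint returns a vector of unexpected length."""
    if len(vec) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vec)}")


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 minus the dot product of the unit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a_unit, b_unit))


query = [3.0, 4.0]
check_dim(query, expected_dim=2)
print(l2_normalize(query))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Checking the dimension at the boundary turns a silent index corruption into an immediate, readable error.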
Using MiniMax LLM + Gitee AI embeddings
MiniMax has no embeddings endpoint, so pair its Anthropic-compatible LLM surface
with an OpenAI-compatible embedding vendor. The example below uses
Gitee AI with Qwen3-Embedding-0.6B (1024 dimensions native), the recommended
default. Swap in Qwen3-Embedding-8B, with `embedding_dim: 1024` (Matryoshka) or
4096 (native), if a marginal quality gain is worth the higher cost.
Fill the URLs in by hand; dikw-core never auto-detects vendor endpoints:
provider:
  llm: anthropic_compat
  llm_model: <MiniMax Anthropic-compatible model name>
  llm_base_url: https://api.minimaxi.com/anthropic
  embedding: openai_compat
  embedding_model: Qwen3-Embedding-0.6B
  embedding_base_url: https://ai.gitee.com/v1
  embedding_dim: 1024 # 0.6B native; locked at first ingest
  embedding_revision: "" # bump to force re-embed when Qwen weights drift silently
  embedding_normalize: true
  embedding_distance: cosine
  embedding_batch_size: 16 # required: Gitee rejects batches >25
  embedding_provider_label: gitee-ai # optional; shows up in `dikw client check`
A working reference copy lives at
tests/fixtures/live-minimax-gitee.dikw.yml
— drop it into a fresh base and fill in your two keys.
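The `embedding_batch_size` setting in the config above exists because the vendor caps how many inputs one request may carry. A minimal sketch of the batching idea follows. The `batched` helper is hypothetical and only shows how a list of chunks would be split to stay under such a cap.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


texts = [f"chunk {i}" for i in range(40)]
batches = list(batched(texts, batch_size=16))
sizes = [len(batch) for batch in batches]
print(sizes)  # → [16, 16, 8]
```

With 40 chunks and a batch size of 16, every request stays below the limit of 25 and only the last batch is short.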
Two keys for two vendors — the embedding leg reads DIKW_EMBEDDING_API_KEY
exclusively (no OPENAI_API_KEY fallback), so misconfigurations fail loudly
rather than cross-wiring credentials:
export ANTHROPIC_API_KEY=<your-MiniMax-key>
export DIKW_EMBEDDING_API_KEY=<your-Gitee-key>
Verify connectivity before running ingest/synth. The two legs can be probed separately, which is useful when you set up one vendor first:
uv run dikw client check --llm-only # just LLM — useful before Gitee is wired up
uv run dikw client check --embed-only # just embedding
uv run dikw client check # both
dikw client check pings each provider with one tiny request and prints a
status table with endpoint, latency, and dim/tokens. Exit code is 0 on
success, 1 on failure, 2 on flag misuse — scriptable in CI or a shell
one-liner.
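Because the exit codes are stable, a CI step can branch on them. The sketch below wraps the command with `subprocess` and maps the three documented codes to messages. The helper names are hypothetical, and `run_check` assumes `uv` and a reachable server are available, so it is defined but not called here.

```python
import subprocess
from typing import Sequence

# Exit codes as documented above: 0 success, 1 failure, 2 flag misuse.
EXIT_MEANINGS = {0: "success", 1: "failure", 2: "flag misuse"}


def describe_exit(code: int) -> str:
    """Translate a `dikw client check` exit code into a short message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_check(extra_args: Sequence[str] = ()) -> int:
    """Run the check command and return its exit code without raising."""
    cmd = ["uv", "run", "dikw", "client", "check", *extra_args]
    return subprocess.run(cmd, check=False).returncode


print(describe_exit(0))
print(describe_exit(2))
```

A pipeline could call `run_check(["--llm-only"])` first and skip ingest entirely when the result is not 0.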
Source formats
Markdown ships out of the box. A new format is one SourceBackend
subclass + a register() call away — see
domains/data/backends/markdown.py
for the reference impl.
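To make the "subclass plus `register()`" idea concrete, here is a generic sketch of an extension-keyed backend registry. Everything in it is a stand-in written for illustration: the base class, the `extensions` attribute, the `parse` method, and the `register` signature are assumptions, not dikw-core's real interface. Consult the reference implementation named above for the actual contract.

```python
from abc import ABC, abstractmethod
from pathlib import PurePath


class SourceBackend(ABC):
    """Stand-in base class; NOT the real dikw-core interface."""

    extensions: tuple[str, ...] = ()

    @abstractmethod
    def parse(self, raw: str) -> str:
        """Return normalized text for one source file."""


_REGISTRY: dict[str, SourceBackend] = {}


def register(backend: SourceBackend) -> None:
    """Map each extension the backend claims to that backend instance."""
    for ext in backend.extensions:
        _REGISTRY[ext.lower()] = backend


def backend_for(filename: str) -> SourceBackend:
    """Look up the backend for a file by its lowercased extension."""
    ext = PurePath(filename).suffix.lower()
    try:
        return _REGISTRY[ext]
    except KeyError:
        raise ValueError(f"no backend registered for {ext!r}") from None


class PlainTextBackend(SourceBackend):
    """Hypothetical backend that treats .txt files as already-clean text."""

    extensions = (".txt",)

    def parse(self, raw: str) -> str:
        return raw.strip()


register(PlainTextBackend())
parsed = backend_for("notes.TXT").parse("  hello  \n")
print(parsed)
```

The registry keeps the ingest path free of format-specific branches: adding a format means adding a class, not editing a dispatcher.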
Storage
Two backends ship, selected in dikw.yml:
storage:
  backend: sqlite # sqlite | postgres
  # --- sqlite (default): single-user local ---
  path: .dikw/index.sqlite
  # --- postgres (enterprise): multi-user, pgvector + tsvector ---
  # backend: postgres
  # dsn: postgresql://user:pw@host:5432/dikw
  # schema: dikw
  # pool_size: 10
- SQLite + `sqlite-vec` + FTS5: the default. No extras required.
- Postgres + `pgvector`: install via `uv pip install dikw-core[postgres]`. Requires the `pg_trgm` and `vector` extensions (standard on the `pgvector/pgvector:0.8.2-pg18` Docker image). The adapter uses `tsvector` + GIN for FTS and `vector(N)` for embeddings; the vector dimension is set at first insert.
Engine code talks only to the Storage Protocol
(storage/base.py); each adapter
implements the same contract and is swappable by changing dikw.yml.
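The swap-by-config claim rests on structural typing: engine code depends on a protocol, and a factory picks the adapter from configuration. The sketch below shows that pattern with `typing.Protocol`. The method names (`upsert`, `search`), the in-memory adapter, and the `make_storage` factory are all hypothetical; the real contract lives in storage/base.py.

```python
from typing import Callable, Protocol


class Storage(Protocol):
    """Stand-in contract; the real Storage Protocol has its own methods."""

    def upsert(self, doc_id: str, text: str) -> None: ...

    def search(self, term: str) -> list[str]: ...


class InMemoryStorage:
    """Toy adapter that satisfies the protocol without inheriting from it."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def search(self, term: str) -> list[str]:
        needle = term.lower()
        return sorted(d for d, text in self._docs.items() if needle in text.lower())


# Hypothetical adapter table; a real one would list sqlite and postgres.
ADAPTERS: dict[str, Callable[[], Storage]] = {"memory": InMemoryStorage}


def make_storage(config: dict) -> Storage:
    """Pick an adapter by the config's backend key."""
    backend = config.get("backend", "memory")
    if backend not in ADAPTERS:
        raise ValueError(f"unknown storage backend: {backend!r}")
    return ADAPTERS[backend]()


store = make_storage({"backend": "memory"})
store.upsert("k1", "Deterministic scoping notes")
store.upsert("k2", "Unrelated page")
print(store.search("scoping"))
```

Callers only ever see `Storage`, so changing the backend key is the whole migration from the engine's point of view.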
Releasing
Tagged pushes (vX.Y.Z) trigger
.github/workflows/release.yml, which
builds sdist + wheel, re-runs the full test gate, and publishes to PyPI
via trusted publishing (no token in repo secrets). One-time setup on
PyPI's side:
- Create the `dikw-core` project on PyPI.
- On the project's Publishing page, add a GitHub trusted publisher with:
  - owner: `OpenDIKW`
  - repository: `dikw-core`
  - workflow: `release.yml`
  - environment: `pypi`
After that, git tag vX.Y.Z && git push --tags is enough. The release
workflow also opens a chore(docker): bump DIKW_VERSION to vX.Y.Z PR
against main after a successful PyPI publish, keeping
examples/docker/Dockerfile in lockstep with the latest published
wheel; merge that chore PR to clear the post-release queue. The
dockerfile-version-guard job in reusable-ci.yml enforces the
invariant on every PR.
License
MIT — see LICENSE.