# cocoindex-wiki

Incrementally convert documents into knowledge organized as a wiki.
ccwiki incrementally converts a collection of markdown documents into a structured, interconnected wiki using LLMs. It plans categories, extracts and deduplicates entities, produces curated knowledge entries, and keeps the wiki in sync with the source documents as they change — powered by CocoIndex for incremental processing and function memoization.
Use it from your coding agent via the included Claude Code skill, or from the CLI directly.
Inspired by Andrej Karpathy's note on wiki-style knowledge organization.
## What it does
Given a directory of markdown files, ccwiki:
- Plans categories for the wiki based on the source documents.
- Extracts canonical entities (people, systems, events, concepts, ...) from each document.
- Resolves duplicates across documents (e.g., "Alice Chen" and "Alice" → one canonical entry) using embedding similarity plus LLM confirmation.
- Writes per-entity wiki entries by combining knowledge from every source that mentioned the entity, with markdown cross-references and source footnotes.
- Keeps everything incremental — edit a source doc and only affected entries get rebuilt.
The output is a set of markdown files organized by category, suitable for browsing in any markdown renderer (Obsidian, VS Code, GitHub, etc.).
## Get Started

### Install

Using pipx:

```shell
pipx install cocoindex-wiki   # first install
pipx upgrade cocoindex-wiki   # upgrade
```

Using uv:

```shell
uv tool install --upgrade cocoindex-wiki
```
Requires Python 3.11+. After installation, the `ccwiki` command is available globally.
### Configure the LLM

Set your primary model and the matching provider API key:

```shell
export CCWIKI_LLM_MODEL="anthropic/claude-haiku-4-5-20251001"
export ANTHROPIC_API_KEY="..."

# Optional: a lighter/cheaper model for entity extraction and resolution
export CCWIKI_LLM_MODEL_LITE="anthropic/claude-haiku-4-5-20251001"
```
`CCWIKI_LLM_MODEL` accepts any LiteLLM-compatible model name (OpenAI, Anthropic, Google, and others). See `tests/e2e/run_test.sh` for working examples with Gemini, OpenAI, and Anthropic.
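For example, to point the pipeline at an OpenAI model instead (the specific model name below is illustrative; any LiteLLM-compatible `provider/model` string works):

```shell
export CCWIKI_LLM_MODEL="openai/gpt-4o-mini"
export OPENAI_API_KEY="..."
```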
## Coding Agent Integration

### Skill (Recommended)

Install the ccwiki skill so your coding agent can set up and build your wiki interactively — it handles installation, category design, config writing, and indexing on its own:

```shell
npx skills add cocoindex-io/cocoindex-wiki
```
The skill teaches the agent to:
- Install `cocoindex-wiki` and verify `CCWIKI_LLM_MODEL` + the matching provider API key are set.
- Read your source docs first and propose category sets with explicit granularity rules.
- Write `@WIKI.md` and each `@WIKI_CATEGORY.md` to disk, then iterate with you on boundaries and descriptions.
- Run `ccwiki index` to build the wiki, show the result, and re-run incrementally as you edit sources or categories.
Just ask your agent something like "help me set up a ccwiki for this folder of notes" or "add a new category for research papers", or type `/ccwiki` to invoke the skill directly. Works with Claude Code and other skill-compatible agents.
## Manual CLI Usage
You can also use the CLI directly — useful for scripted pipelines, CI jobs, or when you want full control without an agent in the loop.
### 1. Initialize a project

In the directory containing your markdown source files:

```shell
ccwiki init
```
This creates a `@WIKI.md` at the project root with default settings. Edit it to describe your project's purpose and adjust `include_patterns` / `exclude_patterns` / `output_dir` as needed.
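The full `@WIKI.md` format is documented in `skills/ccwiki/references/file_formats.md`. As a rough illustration only — the field names come from this README, but the exact syntax shown here is a guess — an edited config might look like:

```markdown
---
include_patterns: ["docs/**/*.md"]
exclude_patterns: ["wiki/**", "drafts/**"]
output_dir: wiki
---

Internal knowledge base for our team: design docs, meeting notes, and people pages.
```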
### 2. Plan the wiki categories

```shell
ccwiki plan
```
This reads your source documents and asks the LLM to propose a category schema, then writes a `@WIKI_CATEGORY.md` in each category subdirectory. For finer control over categories, use the skill instead — it keeps a human in the loop during design.
### 3. Build the wiki

```shell
ccwiki index
```

This runs the full pipeline: entity extraction → resolution → knowledge extraction → combining → file output. The resulting wiki files appear under `wiki/` (or your configured `output_dir`).
On subsequent runs, only affected entries are rebuilt (thanks to CocoIndex memoization) — edits to one source doc typically touch only the entries that mentioned the changed entities.
### Verbose logging

Add `-v` to see entity extraction and deduplication logs:

```shell
ccwiki -v index
```
## Project layout
A typical ccwiki project:
```text
my-project/
├── @WIKI.md                    # Project-level config
├── docs/                       # Source markdown files
│   ├── overview.md
│   ├── team.md
│   └── ...
└── wiki/                       # Generated wiki (the output)
    ├── People/
    │   ├── @WIKI_CATEGORY.md   # Category-level config
    │   ├── Alice Chen.md
    │   └── Bob Martinez.md
    ├── Products/
    │   ├── @WIKI_CATEGORY.md
    │   └── ...
    └── ...
```
`@WIKI.md` describes the project. Each `@WIKI_CATEGORY.md` defines what kinds of entries belong in that category and how to write them. See `skills/ccwiki/references/file_formats.md` for the full spec.
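As an illustration only — the real format is defined in the spec referenced above, and this sketch may not match it — a `People/@WIKI_CATEGORY.md` could describe its category in plain prose:

```markdown
# People

One entry per individual person mentioned in the source docs, titled with the
person's full canonical name (e.g. "Alice Chen.md"). Cover role, team, and
notable contributions; cross-link related entities and keep source footnotes.
```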
## How it works
ccwiki runs a multi-phase pipeline on CocoIndex v1:

- **Phase 1 — Entity extraction** (per source document, parallel). An LLM reads each raw doc and extracts canonical entities, classified by category. Entity names are sanitized for filesystem safety immediately after extraction.
- **Phase 2.1 — Entity resolution** (per category, parallel). For each category, entities are embedded with SentenceTransformer, similar ones are found via FAISS, and an LLM confirms or rejects matches. A stability rule prefers entities that already have an existing `.md` file, so canonical choices stay consistent across runs.
- **Phase 2.2 — Knowledge extraction and combining** (per canonical entity). For each entity, knowledge is extracted from each contributing source doc, then combined into a single entry via a stable K-ary combining tree. The tree structure is determined by stable file-path fingerprints, so when one source doc changes only `log(N)/log(K)` combining calls re-run. Final entries include markdown cross-links between related entities and numbered `[^src1]` footnotes pointing back to source documents.
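The resolution step in Phase 2.1 can be sketched in miniature. This is not ccwiki's code: it swaps the SentenceTransformer embeddings and FAISS index for toy vectors and a brute-force cosine scan, just to show how candidate duplicate pairs are surfaced before the LLM confirms or rejects them.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def candidate_pairs(entities, threshold=0.85):
    # entities: list of (name, embedding). Every pair above the similarity
    # threshold becomes a candidate for LLM confirmation; the real pipeline
    # uses a FAISS index instead of this O(n^2) scan.
    pairs = []
    for i, (name_i, vec_i) in enumerate(entities):
        for name_j, vec_j in entities[i + 1:]:
            if cosine(vec_i, vec_j) >= threshold:
                pairs.append((name_i, name_j))
    return pairs

# Toy embeddings: "Alice Chen" and "Alice" point in nearly the same direction.
entities = [("Alice Chen", (1.0, 0.0)), ("Alice", (0.99, 0.1)), ("Bob", (0.0, 1.0))]
print(candidate_pairs(entities))  # [('Alice Chen', 'Alice')]
```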
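The incremental claim in Phase 2.2 follows from the tree shape: with N leaves and fan-in K, a single changed leaf invalidates only the combine nodes on its root path, about log(N)/log(K) of them. A minimal sketch (hypothetical names; the real combine step is a memoized LLM call over stably-ordered inputs):

```python
def combine(chunks):
    # Stand-in for the memoized LLM call that merges up to K knowledge chunks.
    return "[" + " + ".join(chunks) + "]"

def combine_tree(leaves, k=3):
    # Reduce the leaves K at a time in a stable (sorted) order, so the same
    # inputs always yield the same tree shape and memoized calls are reused.
    level, depth = sorted(leaves), 0
    while len(level) > 1:
        level = [combine(level[i:i + k]) for i in range(0, len(level), k)]
        depth += 1
    return level[0], depth

entry, depth = combine_tree([f"doc{i}" for i in range(9)])
print(depth)  # 2 == log(9)/log(3): editing one doc re-runs only ~2 combines
```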
All LLM calls are memoized by CocoIndex, so re-running `ccwiki index` after small edits is fast and cheap. Wiki files are declared as CocoIndex target states, so entities that are no longer canonical (merged or removed) have their files automatically deleted.
## Testing

End-to-end test scenarios live under `tests/e2e/`:

- `tests/e2e/solar_system/` — small dataset (8 docs about the Solar System)
- `tests/e2e/helios_labs/` — larger dataset (24 docs about a fictional AI startup)
- `tests/e2e/arxiv_papers/` — 50 real arXiv papers
Run a scenario against one or all LLM providers:

```shell
source ~/.env.llm   # sets OPENAI_API_KEY, GEMINI_API_KEY, ANTHROPIC_API_KEY
bash tests/e2e/run_test.sh tests/e2e/solar_system             # all providers
bash tests/e2e/run_test.sh tests/e2e/solar_system anthropic   # single provider
```
Results are archived under `archived_results/` in each scenario directory.
## License
Apache 2.0 — see LICENSE.