
Incrementally convert documents into knowledge organized as a wiki


cocoindex-wiki

ccwiki incrementally converts a collection of markdown documents into a structured, interconnected wiki using LLMs. It plans categories, extracts and deduplicates entities, produces curated knowledge entries, and keeps the wiki in sync with the source documents as they change — powered by CocoIndex for incremental processing and function memoization.

Use it from your coding agent via the included Claude Code skill, or from the CLI directly.

Inspired by Andrej Karpathy's note on wiki-style knowledge organization.

What it does

Given a directory of markdown files, ccwiki:

  1. Plans categories for the wiki based on the source documents.
  2. Extracts canonical entities (people, systems, events, concepts, ...) from each document.
  3. Resolves duplicates across documents (e.g., "Alice Chen" and "Alice" → one canonical entry) using embedding similarity plus LLM confirmation.
  4. Writes per-entity wiki entries by combining knowledge from every source that mentioned the entity, with markdown cross-references and source footnotes.
  5. Keeps everything incremental — edit a source doc and only affected entries get rebuilt.

The output is a set of markdown files organized by category, suitable for browsing in any markdown renderer (Obsidian, VS Code, GitHub, etc.).

Get Started

Install

Using pipx:

```bash
pipx install cocoindex-wiki    # first install
pipx upgrade cocoindex-wiki    # upgrade
```

Using uv:

```bash
uv tool install --upgrade cocoindex-wiki
```

Requires Python 3.11+. After installation, the ccwiki command is available globally.

Configure the LLM

Set your primary model and the matching provider API key:

```bash
export CCWIKI_LLM_MODEL="anthropic/claude-haiku-4-5-20251001"
export ANTHROPIC_API_KEY="..."

# Optional: a lighter/cheaper model for entity extraction and resolution
export CCWIKI_LLM_MODEL_LITE="anthropic/claude-haiku-4-5-20251001"
```

CCWIKI_LLM_MODEL accepts any LiteLLM-compatible model name (OpenAI, Anthropic, Google, and others). See tests/e2e/run_test.sh for working examples with Gemini, OpenAI, and Anthropic.

Coding Agent Integration

Skill (Recommended)

Install the ccwiki skill so your coding agent can set up and build your wiki interactively — it handles installation, category design, config writing, and indexing on its own:

```bash
npx skills add cocoindex-io/cocoindex-wiki
```

The skill teaches the agent to:

  • Install cocoindex-wiki and verify CCWIKI_LLM_MODEL + the matching provider API key are set.
  • Read your source docs first and propose category sets with explicit granularity rules.
  • Write @WIKI.md and each @WIKI_CATEGORY.md to disk, then iterate with you on boundaries and descriptions.
  • Run ccwiki index to build the wiki, show the result, and re-run incrementally as you edit sources or categories.

Just ask your agent something like "help me set up a ccwiki for this folder of notes" or "add a new category for research papers", or type /ccwiki to invoke the skill directly. Works with Claude Code and other skill-compatible agents.

Manual CLI Usage

You can also use the CLI directly — useful for scripted pipelines, CI jobs, or when you want full control without an agent in the loop.

1. Initialize a project

In the directory containing your markdown source files:

```bash
ccwiki init
```

This creates a @WIKI.md at the project root with default settings. Edit it to describe your project's purpose and adjust include_patterns / exclude_patterns / output_dir as needed.
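
For illustration only — the real file format is specified in skills/ccwiki/references/file_formats.md, and the layout sketched here is a guess — an edited @WIKI.md might carry a short project description plus the settings named above:

```
# Acme Notes Wiki

A knowledge wiki built from the Acme team's meeting notes and design docs.

include_patterns: ["docs/**/*.md"]
exclude_patterns: ["docs/drafts/**"]
output_dir: wiki
```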

2. Plan the wiki categories

```bash
ccwiki plan
```

This reads your source documents and asks the LLM to propose a category schema, then writes a @WIKI_CATEGORY.md in each category subdirectory. For finer control over categories, use the skill instead — it keeps a human in the loop during design.

3. Build the wiki

```bash
ccwiki index
```

This runs the full pipeline: entity extraction → resolution → knowledge extraction → combining → file output. The resulting wiki files appear under wiki/ (or your configured output_dir).

On subsequent runs, only affected entries are rebuilt (thanks to CocoIndex memoization) — edits to one source doc typically touch only the entries that mentioned the changed entities.

Verbose logging

Add -v to see entity extraction and deduplication logs:

```bash
ccwiki -v index
```

Project layout

A typical ccwiki project:

```
my-project/
├── @WIKI.md                       # Project-level config
├── docs/                          # Source markdown files
│   ├── overview.md
│   ├── team.md
│   └── ...
└── wiki/                          # Generated wiki (the output)
    ├── People/
    │   ├── @WIKI_CATEGORY.md      # Category-level config
    │   ├── Alice Chen.md
    │   └── Bob Martinez.md
    ├── Products/
    │   ├── @WIKI_CATEGORY.md
    │   └── ...
    └── ...
```

@WIKI.md describes the project. Each @WIKI_CATEGORY.md defines what kinds of entries belong in that category and how to write them. See skills/ccwiki/references/file_formats.md for the full spec.

How it works

ccwiki runs a multi-phase pipeline on CocoIndex v1:

  1. Phase 1 — Entity extraction (per source document, parallel). An LLM reads each raw doc and extracts canonical entities, classified by category. Entity names are sanitized for filesystem safety immediately after extraction.

  2. Phase 2.1 — Entity resolution (per category, parallel). For each category, entities are embedded with SentenceTransformer, similar ones are found via FAISS, and an LLM confirms or rejects matches. A stability rule prefers entities that already have an existing .md file, so canonical choices stay consistent across runs.

  3. Phase 2.2 — Knowledge extraction and combining (per canonical entity). For each entity, knowledge is extracted from each contributing source doc, then combined into a single entry via a stable K-ary combining tree. The tree structure is determined by stable file-path fingerprints, so when one source doc changes only log(N)/log(K) combining calls re-run. Final entries include markdown cross-links between related entities and numbered [^src1] footnotes pointing back to source documents.
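
The Phase 2.1 matching loop can be sketched in plain Python. Hand-made toy vectors and a bare cosine function stand in for SentenceTransformer and FAISS, and the LLM confirmation step is elided; the names, vectors, and 0.98 threshold are illustrative, not ccwiki's actual values:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy vectors standing in for SentenceTransformer embeddings.
emb = {
    "Alice Chen": [0.90, 0.10, 0.00],
    "Alice":      [0.88, 0.12, 0.02],
    "Bob":        [0.00, 0.20, 0.95],
}
existing_files = {"Alice Chen"}  # entries that already have a .md file on disk

def resolve(names, threshold=0.98):
    """Map each extracted mention to one canonical entity name."""
    canonical = {}
    for name in names:
        match = next((c for c in set(canonical.values())
                      if cosine(emb[name], emb[c]) >= threshold), None)
        # (ccwiki would additionally ask an LLM to confirm the pair here.)
        if match is None:
            canonical[name] = name
        else:
            # Stability rule: prefer the variant that already has a wiki file.
            keep = name if name in existing_files else match
            canonical[name] = keep
            canonical[match] = keep
    return canonical

mapping = resolve(["Alice", "Alice Chen", "Bob"])
# "Alice" and "Alice Chen" collapse to the on-disk name "Alice Chen"
```

Because the on-disk name wins, a rerun after new documents mention "Alice" keeps pointing at the existing Alice Chen.md rather than churning the canonical choice.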

All LLM calls are memoized by CocoIndex, so re-running ccwiki index after small edits is fast and cheap. Wiki files are declared as CocoIndex target states, so entities that are no longer canonical (merged or removed) have their files automatically deleted.
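
The incremental combining behaviour can also be sketched in plain Python. A hash stands in for the LLM combining call and a dict stands in for CocoIndex's memoization cache; K = 4 and the doc names are illustrative:

```python
import hashlib

K = 4        # branching factor of the combining tree (illustrative)
_memo = {}   # stands in for CocoIndex's memoization cache
calls = 0    # counts combine() invocations that actually ran

def combine(parts):
    """Merge up to K knowledge fragments into one (stand-in for an LLM call)."""
    global calls
    key = tuple(parts)
    if key not in _memo:
        calls += 1
        _memo[key] = hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]
    return _memo[key]

def combine_tree(leaves):
    """Reduce leaves bottom-up in fixed K-sized groups. Grouping by position
    in a stably sorted list means editing one leaf only dirties the groups
    on its path to the root."""
    level = sorted(leaves)  # stable order, like ccwiki's file-path fingerprints
    while len(level) > 1:
        level = [combine(level[i:i + K]) for i in range(0, len(level), K)]
    return level[0]

docs = [f"doc{i:02d}" for i in range(64)]
combine_tree(docs)
first = calls              # 64 leaves at K=4: 16 + 4 + 1 = 21 combines
combine_tree(docs)
unchanged = calls - first  # fully memoized rerun: 0 new combines
docs[5] = "doc05-edited"
combine_tree(docs)
rerun = calls - first      # only the root path re-runs: 3 == log_4(64)
```

A flat single-shot combine would re-run one big call over all 64 sources for any edit; the tree bounds the rework to the handful of groups between the changed leaf and the root.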

Testing

End-to-end test scenarios live under tests/e2e/:

  • tests/e2e/solar_system/ — small dataset (8 docs about the Solar System)
  • tests/e2e/helios_labs/ — larger dataset (24 docs about a fictional AI startup)
  • tests/e2e/arxiv_papers/ — 50 real arXiv papers

Run a scenario against one or all LLM providers:

```bash
source ~/.env.llm  # sets OPENAI_API_KEY, GEMINI_API_KEY, ANTHROPIC_API_KEY
bash tests/e2e/run_test.sh tests/e2e/solar_system              # all providers
bash tests/e2e/run_test.sh tests/e2e/solar_system anthropic    # single provider
```

Results are archived under archived_results/ in each scenario directory.

License

Apache 2.0 — see LICENSE.
