
Turn any folder of docs into a queryable wiki + knowledge graph. Three-layer architecture: articles · concepts · graph.


lcwiki

English | 简体中文


A drop-in wiki-builder skill for AI coding assistants. Type /lcwiki in Claude Code or OpenClaw — it reads any folder of docs, compiles them into a structured wiki + knowledge graph, and lets your AI answer questions from that wiki at ~10% the token cost of vanilla RAG.

Fully multimodal. Drop in .docx, .pdf, .xlsx, .pptx, markdown, images, audio, or video — lcwiki converts everything to markdown, extracts per-doc structure, concepts with family aliases, and a vis-network knowledge graph in one shot. Then it lets your AI query the wiki with a three-layer token-first fallback: scan 100-token tldrs → fall back to article body → only touch raw content as a last resort.

Inspired by Andrej Karpathy's /raw folder idea — the one he talks about on podcasts, where he drops papers, screenshots, tweets, and whiteboard photos into a single directory and wants his AI to just understand it all. safishamsi/graphify turned that folder into a knowledge graph. lcwiki takes it one layer further: it turns your /raw folder into a proper wiki and a graph — so your AI has both long-term memory and a map. Every doc gets a structured article with a 100-token tldr for cheap lookups; every concept gets a standalone page with family aliases; every connection lives in a persistent queryable graph. All three layers are built in one shot by a Claude subagent pass, wrapped with CLI-atomic write-verify so your agent can't silently corrupt the data, and maintained long-term by a self-healing /lcwiki audit with LLM-as-judge checks.

/lcwiki ingest      # drop anything into raw/inbox/, convert + stage
/lcwiki compile     # LLM reads each doc → structured articles + concepts
/lcwiki graph       # build knowledge graph → graph.html
/lcwiki query "what's the budget of project X?"
vault/
├── wiki/
│   ├── articles/*.md     per-doc wiki pages with YAML frontmatter + 100-token tldr
│   └── concepts/*.md     standalone concept pages with 4-section body + aliases
├── graph/
│   ├── graph.html        interactive vis-network graph — click nodes, jump to source
│   ├── graph.json        persistent graph — query weeks later without re-reading
│   └── GRAPH_REPORT.md   god nodes, surprising connections, audit findings
└── meta/
    ├── concepts_index.json   concept aliases → canonical name
    └── source_map.json       sha256 → raw file → generated articles
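A compiled article's frontmatter might look like this. The field names match the schema lcwiki uses (tldr, doc_type, concepts, source_sha256, confidence); every value below is invented for illustration:

```yaml
---
tldr: "Project X proposal: three-phase digital learning rollout, 18-month timeline, budget and KPIs preserved in tables."
doc_type: proposal
concepts: [project-x, digital-learning-platform]
source_sha256: <sha256 of the raw file>
confidence: high
---
```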

Add a .lcwikiignore to skip folders:

# .lcwikiignore
_archive/
drafts/
*.generated.md
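The matching semantics are gitignore-like: a trailing slash skips a directory subtree, anything else matches file names. A rough sketch of that behavior with `fnmatch` (assumed semantics for illustration, not lcwiki's exact implementation):

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

def is_ignored(path: str, patterns: list[str]) -> bool:
    """Rough .lcwikiignore semantics: 'dir/' skips a whole subtree,
    other patterns are glob-matched against the file name."""
    parts = PurePosixPath(path).parts
    for pat in patterns:
        if pat.endswith("/"):
            if pat.rstrip("/") in parts[:-1]:   # any parent directory matches
                return True
        elif fnmatch(parts[-1], pat):           # glob on the file name
            return True
    return False

patterns = ["_archive/", "drafts/", "*.generated.md"]
```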

Why three layers

Traditional vector RAG fails on structured content (proposals, contracts, research reports): chunks break tables across boundaries, embedding recall is noisy on domain-specific terms, and every query re-reads the same chunks and burns tokens summarizing the same prose forever.

lcwiki's fix: read each doc once with an LLM, write a proper wiki, query the wiki.

| Layer | Size | Query pattern |
|-------|------|---------------|
| 1. Article (articles/*.md) | 3–8 KB per doc | "Tell me about this doc" |
| 2. Concept (concepts/*.md) | 1–2 KB per concept | "What is X?" across docs |
| 3. Graph (graph/*.json) | n/a | "How are these connected?" |

Every query uses a token-first fallback: scan every article's ~100-token tldr first (usually <5K total for a whole KB), open the matching article body only if needed (3K), touch raw content only as a last resort. On real proposal-style corpora, this cuts per-query token cost by ~80% vs vanilla RAG while matching or beating accuracy on factual questions like "what's the budget of Project X".
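The fallback order can be sketched in a few lines. The in-memory data model here (dicts with tldr/body/raw fields) is invented for illustration; real lcwiki reads the vault from disk:

```python
def route(query_terms: set[str], articles: list[dict]):
    """Token-first fallback sketch: cheapest layer first, raw content last."""
    for layer in ("tldr", "body", "raw"):   # ~100 tok -> ~3K tok -> full doc
        for a in articles:
            if any(t.lower() in a[layer].lower() for t in query_terms):
                return layer, a["id"]
    return None, None

articles = [
    {"id": "proj-x", "tldr": "Project X proposal, budget $1.2M",
     "body": "Full article body...", "raw": "Raw converted markdown..."},
    {"id": "proj-y", "tldr": "Project Y status report",
     "body": "Milestones and timeline details", "raw": "Appendix data..."},
]
```

In this toy corpus, a budget question never leaves the tldr layer; a timeline question falls through to one article body.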

How it works

lcwiki runs three passes.

First (ingest), a deterministic Python pass converts every file in raw/inbox/ to markdown (docx via python-docx, pdf via pypdf, xlsx via openpyxl, pptx/images/video optional), extracts basic structure (headings, tables, entities) with zero LLM cost, classifies each file as new / updated (same filename, new sha → auto-cleanup old) / skipped / failed, and stages each into staging/pending/. Images are kept inline. Nothing burns tokens yet.
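The new / updated / skipped decision reduces to a filename plus sha256 comparison. A simplified sketch (the real ingest also handles failures, cleanup, and staging):

```python
import hashlib

def sha_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def classify(filename: str, sha256: str, seen: dict[str, str]) -> str:
    """Ingest classification sketch: 'seen' maps a previously ingested
    filename to its content sha256 (invented data model)."""
    old = seen.get(filename)
    if old is None:
        return "new"
    if old == sha256:
        return "skipped"   # identical content, nothing to do
    return "updated"       # same filename, new sha: replace the old article
```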

Second (compile), an LLM subagent reads every staged content.md and produces a structured wiki article (YAML frontmatter with tldr, doc_type, concepts, source_sha256, confidence — and a body that preserves every table, every list item, every data point from the source, not a summary). Concepts are extracted as standalone pages with 4-section bodies (概要 / 关键特征 / 在方案中的应用 / 相关概念, i.e. Overview / Key Characteristics / Use in the Proposal / Related Concepts) and family aliases — "Digital Learning Platform" and "digital-learning-platform" auto-merge into one canonical concept. Every write goes through lcwiki compile-write with a whitelist-schema compile-verify — agents can't invent frontmatter fields; the verify command rejects anything outside the schema.
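The whitelist idea behind compile-verify can be sketched like this. The field sets come from the frontmatter schema above; the length limit and error strings are illustrative, and the real command checks much more (sha format, concept references, body completeness):

```python
ALLOWED = {"tldr", "doc_type", "concepts", "source_sha256", "confidence"}
REQUIRED = {"tldr", "doc_type", "source_sha256"}

def verify_frontmatter(fm: dict) -> list[str]:
    """Whitelist-schema gate sketch: unknown fields are rejected outright,
    required fields must be present, the tldr must stay short."""
    errors = [f"unknown field: {k}" for k in fm if k not in ALLOWED]
    errors += [f"missing field: {k}" for k in REQUIRED if k not in fm]
    if "tldr" in fm and len(fm["tldr"].split()) > 120:
        errors.append("tldr too long")
    return errors
```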

Third (graph), an LLM subagent reads the compiled wiki and emits nodes (documents + concepts), edges (tagged EXTRACTED / INFERRED / AMBIGUOUS with confidence scores — never a default 0.5), and hyperedges (3+ node groupings). The results go through lcwiki graph-run, which builds a NetworkX directed graph, runs Leiden community detection for coloring, and exports interactive graph.html + persistent graph.json + plain-language GRAPH_REPORT.md. Every edge is marked EXTRACTED (found directly in source), INFERRED (reasonable inference with confidence), or AMBIGUOUS (flagged for review). You always know what was found vs guessed.
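The per-edge gate can be sketched as follows; the confidence floor and field names are illustrative, not lcwiki's exact schema:

```python
VALID_TAGS = {"EXTRACTED", "INFERRED", "AMBIGUOUS"}

def check_edge(edge: dict, floor: float = 0.3) -> list[str]:
    """Edge-quality gate sketch: every edge needs a provenance tag and a
    real confidence score; a suspiciously 'default' 0.5 is rejected too."""
    errors = []
    if edge.get("tag") not in VALID_TAGS:
        errors.append("missing or invalid provenance tag")
    conf = edge.get("confidence")
    if not isinstance(conf, (int, float)) or conf < floor:
        errors.append("confidence missing or below floor")
    elif conf == 0.5:
        errors.append("looks like a default confidence, re-score")
    return errors
```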

Query, and /lcwiki audit

/lcwiki query "what's the budget of project X?" runs the three-layer fallback: it first scans every article's tldr field (cheap — ~100 tokens each), opens only the matching article's body if tldrs are insufficient, and only falls through to the raw content.md as a last resort. Token cost per query: ~100–3000 vs 5K–20K+ for vanilla RAG on the same questions.

/lcwiki audit catches the rot that accumulates as you compile more docs: ghost nodes (nodes with no edges), orphan concepts (concepts referenced by nothing), missing source files, edges below confidence threshold. It uses an LLM-as-judge for the subjective calls and always asks for user confirmation before deleting anything. Every edit is logged, every graph change is backed up. The graph stays coherent long after your 50th /lcwiki compile.
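The mechanical half of that audit is simple set arithmetic; a sketch with an invented data model (concept_refs maps a concept to how many articles reference it):

```python
def audit(nodes: set[str], edges: list[tuple[str, str]],
          concept_refs: dict[str, int]) -> dict[str, set[str]]:
    """Deterministic audit checks sketch: ghost nodes have no edges,
    orphan concepts are referenced by zero articles. The subjective
    calls (LLM-as-judge) happen elsewhere."""
    touched = {n for e in edges for n in e}
    return {
        "ghost_nodes": nodes - touched,
        "orphan_concepts": {c for c, refs in concept_refs.items() if refs == 0},
    }
```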

Install

Requires Python 3.11+ and either Claude Code or OpenClaw.

pip install lcwiki
lcwiki install --platform claude   # or --platform claw

Then drop some docs and go:

mkdir -p ~/.claude/lcwiki/raw/inbox
cp *.docx *.pdf ~/.claude/lcwiki/raw/inbox/
# In Claude Code:
/lcwiki ingest
/lcwiki compile
/lcwiki graph
/lcwiki query "what's the budget of project X?"

Open ~/.claude/lcwiki/vault/graph/graph.html in a browser to explore the graph — click any node to jump to its wiki article.

Optional extras

pip install 'lcwiki[leiden]'   # faster community detection (Python < 3.13)
pip install 'lcwiki[pptx]'     # PowerPoint ingestion
pip install 'lcwiki[video]'    # audio/video transcription via faster-whisper
pip install 'lcwiki[mcp]'      # Model Context Protocol server
pip install 'lcwiki[all]'      # everything

How much does it cost

On a test corpus of a few dozen proposal-style docs (~1–3 MB total):

| Step | Cost | Notes |
|------|------|-------|
| Compile | ~$2 (one-time) | per dozen docs, with qwen-plus or claude-sonnet |
| Query | ~$0.01 each | tracked in logs/cost.jsonl; most queries stop at the tldr layer |
| Audit | ~$0.05 | full-graph health check, run weekly |

Actual numbers vary by model, doc complexity, and corpus size — the numbers above are ballpark from internal tests. Your mileage will vary.

The first compile is by far the heaviest step. After that, queries are cheap enough to run in tight loops — your AI assistant can check the wiki dozens of times per conversation without thinking about cost.

Built for AI agents, not humans

Every user-facing operation is an atomic CLI subcommand that the agent invokes. The LLM never writes Python or directly edits JSON state — it calls lcwiki compile-write, lcwiki graph-run, lcwiki audit, and so on. Every write command has a matching *-verify with a whitelist schema: agents cannot invent frontmatter fields, cannot skip required concepts, cannot emit edges below the confidence floor. When agents try to shortcut the process (regex-scanning instead of actually reading the doc), the verify command rejects the output and forces a re-read.

This sounds over-engineered. It isn't. Agents cut corners constantly in ways that silently destroy your graph. The verify-gate is the only thing that made lcwiki's outputs reliable enough to trust.

How it compares

| | lcwiki | graphify | LangChain / LlamaIndex |
|---|--------|----------|------------------------|
| Primary output | wiki + graph | graph only | retrieval pipeline |
| Per-doc structured article | yes | no | no (chunks only) |
| Concept as standalone page | yes (4-section) | no (just label) | no |
| Knowledge graph | yes | yes | optional |
| Token cost per query | ~100–3K | n/a | 5K–20K+ |
| Agent-friendly CLI + verify gate | yes | yes | no (framework) |
| Self-healing audit | yes (LLM-judge) | no | no |
| Works with Claude Code | yes | yes | yes |
| Works with OpenClaw | yes (out of box) | yes | yes |
| MIT license | yes | yes | yes |

lcwiki is not a replacement for LangChain or LlamaIndex — it's a pre-step. You can absolutely point a LangChain pipeline at lcwiki's compiled wiki. Most people won't need to: the tldr + article layer answers 80% of real questions directly, and you can ship a useful AI assistant on it alone.

What's inside

lcwiki/
├── ingest.py           raw file → content.md + structure.json (zero LLM)
├── detect.py           file classification + sha256 dedup
├── convert.py          docx/pdf/xlsx/pptx → markdown
├── structure.py        headings, tables, key terms (zero LLM)
├── compile.py          LLM-driven article + concept generation (with validation)
├── compile_verify.py   whitelist-schema gate for `compile-write`
├── merge.py            concept family merging, source_file auto-heal
├── graph_cmd.py        build graph from LLM extraction
├── graph_verify.py     whitelist-schema gate for `graph-run`
├── audit.py            ghost-node / orphan-concept health check
├── query.py            three-layer token-first retrieval
├── backfill.py         retroactively enrich structure.json with LLM terms
├── _vendored_graphify/ networkx build + leiden clustering + vis-network export
└── skill*.md           agent-runtime skill definitions (Claude Code, OpenClaw)

The _vendored_graphify/ subpackage is safishamsi/graphify (MIT), vendored rather than pip-depended so end users get a single pip install with no surprises.

Roadmap

  • v0.5 — initial public release, three-layer query, audit, subagent-parallel compile
  • v0.6 — compile-write-direct (bypass agent, call LLM API directly for cheaper batch compile)
  • v0.7 — web admin UI (separate repo lcwiki-web)
  • v0.8 — multi-user support, RBAC
  • v1.0 — stable API, long-form doc refinement

See CHANGELOG.md for detailed history.

FAQ

How is this different from LangChain / LlamaIndex? Those are retrieval pipelines. lcwiki is a wiki-builder — a pre-step. You can (and some people will want to) stack a LangChain retriever on top of lcwiki's compiled wiki. Most don't need to.

Why three layers instead of two? We tried two (concept + graph). It couldn't answer per-doc questions like "what's the budget of Project X" because concepts are cross-doc by definition. The article layer is load-bearing.

Does my AI need to remember all the /lcwiki commands? No — that's the whole point of the atomic-CLI design. The skill file tells the AI the commands once; the AI just invokes them by name. lcwiki compile-write --kb ... --task-id ... --article ... is what an agent can learn; ad-hoc Python that writes frontmatter YAML is what an agent keeps screwing up.

Can I run compile offline? ingest, graph, and audit are pure-Python, fully offline. Only compile needs an LLM — it runs inside Claude Code / OpenClaw, which you point at whatever provider you want (Anthropic, OpenAI, Qwen, a local Ollama model, etc.).

What about bigger corpora? Tested on 42-doc corpora so far. Bigger should work — compile parallelizes with subagents (Claude Code) or runs sequentially (OpenClaw). If you hit a wall, file an issue with your doc count and format mix.

Why does this exist? Originally written out of frustration with vector RAG on structured docs. Proposals, contracts, and research reports have critical data points (budgets, dates, KPIs) buried inside tables — and chunking breaks tables. We wanted AI assistants to read each doc once, write a wiki, and answer from that wiki. The wiki is cheaper, more accurate on factual questions, and persistent across sessions.

Contributing

Issues and PRs welcome. See CONTRIBUTING.md for the non-negotiable architecture principles: three-layer, CLI-atomic, verify everything, honest cost numbers.

github.com/LCccode/wikigraph/issues

License

MIT. Vendored _vendored_graphify/ is also MIT, originally by safishamsi/graphify.

Acknowledgments

  • safishamsi/graphify for the graph algorithms and the "graph of your raw folder" idea
  • vis-network for the interactive graph renderer
  • Anthropic Claude Code for the agent runtime that made the atomic-CLI design possible
  • Everyone who filed real issues on the first buggy versions

Made by @LCccode. If lcwiki helps your team, a ⭐ on the repo is the nicest way to say thanks.
