Synthesize Wikipedia-style articles with LLM-powered research and tree search
synth-wiki
An implementation of Andrej Karpathy's idea for an LLM-compiled personal knowledge base. Written in Python.
Drop in your papers, articles, and notes. synth-wiki compiles them into a structured, interlinked wiki, with concepts extracted, cross-references discovered, and everything searchable via a tree-aware search engine.
- Your sources in, a wiki out. Add documents to a folder. The LLM reads, summarizes, extracts concepts, and writes interconnected articles.
- Cross-source syntheses. Automatically discovers patterns, contradictions, and emergent insights across multiple sources, not just per-document summaries.
- Compounding knowledge. Every new source enriches existing articles. The wiki gets smarter as it grows.
- Ask your wiki questions. Structure-aware search powered by TreeSearch. Ask natural-language questions and get cited, highly relevant answers. Use --archive to save valuable Q&A as new wiki pages.
- Native Chinese support. Built-in Jieba tokenization improves retrieval for Chinese documents.
Install
From PyPI
pip install -U synth-wiki
From source
git clone https://github.com/shibing624/synth-wiki.git
cd synth-wiki
pip install -e .
Dependencies
- Python >= 3.12
- click >= 8.1
- pyyaml >= 6.0
- httpx >= 0.27
- pytreesearch >= 0.1 (auto mode tree search)
- loguru >= 0.7
Quickstart
1. Initialize a project
mkdir my-wiki && cd my-wiki
synth-wiki init
This creates:
- ./raw/: drop your source files here
- ./wiki/: compiled wiki output
- ~/.synth_wiki/config.yaml: global config (auto-generated with sensible defaults)
2. Set your API key
The generated config uses an OpenAI-compatible API by default. Set your environment variables:
export OPENAI_API_KEY="sk-xxx"
export OPENAI_BASE_URL="https://api.openai.com/v1" # or any compatible endpoint
Or edit ~/.synth_wiki/config.yaml directly:
api:
  provider: openai-compatible
  api_key: ${OPENAI_API_KEY}
  base_url: ${OPENAI_BASE_URL}
models:
  summarize: gpt-4o-mini
  extract: gpt-4o-mini
  write: gpt-4o-mini
  lint: gpt-4o-mini
  query: gpt-4o-mini
Other supported providers: openai, anthropic, gemini, ollama.
3. Add sources and compile
# Add source files
cp ~/papers/*.pdf raw/
cp ~/articles/*.md raw/
# Compile
synth-wiki compile
# Watch mode (auto-recompile on file changes)
synth-wiki compile --watch
# Search
synth-wiki search "attention mechanism"
Supported Source Formats
Just drop files into your source folder; synth-wiki detects the format automatically.
| Format | Extensions | What gets extracted |
|---|---|---|
| Markdown | .md | Body text with frontmatter parsed separately |
| PDF | .pdf | Full text via PyMuPDF |
| Word | .docx | Document text |
| JSON / JSONL | .json, .jsonl | Parsed and searched structurally |
| Code | .py, .java, .go, .ts | Source code parsed via AST and regex |
| Plain text | .txt, .csv | Raw content |
| Images | .png, .jpg, .svg | Image files |
Commands
| Command | Description |
|---|---|
| synth-wiki init [--name] [--source] [--output] [--vault] | Initialize project (all options optional) |
| synth-wiki compile [--watch] [--dry-run] [--fresh] [--batch] [--no-cache] | Compile sources into wiki articles |
| synth-wiki search "query" [--tags] [--limit] | Search via TreeSearch |
| synth-wiki query "question" | Q&A against the wiki |
| synth-wiki ingest <url\|path> | Add a source |
| synth-wiki lint [--fix] [--pass-name] | Check and fix article quality |
| synth-wiki status | Wiki stats and health |
| synth-wiki doctor | Check config and connection status |
| synth-wiki projects | List all configured projects |
| synth-wiki serve [--transport] [--port] | Start MCP server for IDE/agent integration |
Directory Structure
~/.synth_wiki/                # Global state directory
├── config.yaml               # Global config (shared by all projects)
├── db/
│   └── my-wiki.db            # Project SQLite database (WAL mode)
├── manifests/
│   └── my-wiki.json          # Source manifest and compile status
├── state/
│   └── my-wiki.json          # Compile checkpoint (auto-deleted on success)
└── lintlog/
    └── my-wiki/              # Lint report history
./raw/                        # Source files directory (user-managed)
├── paper-1.md
├── notes.txt
└── captures/                 # URL-ingested files
./wiki/                       # Compiled output directory (auto-generated)
├── SCHEMA.md                 # Domain conventions, tag taxonomy, page thresholds
├── index.md                  # Auto-generated content catalog by type
├── overview.md               # Auto-generated bird's-eye summary of the entire wiki
├── summaries/                # Pass 1: Source summaries
├── concepts/                 # Pass 3: Wiki articles (concepts/techniques/claims)
│   ├── transformer.md
│   └── self-attention.md
├── entities/                 # Entity pages (people, orgs, products, models)
├── comparisons/              # Side-by-side analysis pages
├── syntheses/                # Pass 4: Cross-source synthesis pages
├── connections/              # Cross-concept relation pages (reserved)
├── outputs/                  # Exported artifacts (JSON, graph data, etc.)
├── images/                   # Extracted images
├── archive/
├── prompts/                  # Custom prompts (optional)
└── CHANGELOG.md              # Compile history
Knowledge Capture: Ingest and Learning
Ingest local files
synth-wiki ingest /path/to/document.md
synth-wiki ingest /path/to/notes.txt
The ingest command will:
- Detect file type by extension
- Copy the file to the project source directory
- Compute SHA-256 hash
- Register the source in the manifest as pending compilation
- Print the result (path, type, size)
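The steps above can be sketched in a few lines of standard-library Python. This is an illustration only, not synth-wiki's actual code: the function names, the `TYPE_BY_EXT` map, and the manifest layout (a JSON object keyed by filename) are all assumptions made for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

# Hypothetical extension-to-type map; the real detection table is not shown here.
TYPE_BY_EXT = {".md": "markdown", ".txt": "text", ".pdf": "pdf", ".docx": "docx"}


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large sources are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_file(src: Path, source_dir: Path, manifest_path: Path) -> dict:
    """Copy src into source_dir and record it in a JSON manifest as pending."""
    source_dir.mkdir(parents=True, exist_ok=True)
    dest = source_dir / src.name
    shutil.copy2(src, dest)
    entry = {
        "path": str(dest),
        "type": TYPE_BY_EXT.get(src.suffix.lower(), "unknown"),
        "size": dest.stat().st_size,
        "sha256": sha256_of(dest),
        "status": "pending",  # assumed status label
    }
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest[dest.name] = entry
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return entry


# Demo in a throwaway directory.
workdir = Path(tempfile.mkdtemp())
sample = workdir / "notes.txt"
sample.write_text("hello")
demo_entry = ingest_file(sample, workdir / "raw", workdir / "manifest.json")
print(demo_entry["path"], demo_entry["type"], demo_entry["size"])
```

Storing the hash next to each entry is what lets a later compile decide whether a file actually changed, rather than relying on modification times.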
Ingest URLs
synth-wiki ingest https://example.com/article
URL ingestion will:
- Download the page content via httpx (30s timeout, follows redirects)
- Wrap the content as a Markdown file with YAML frontmatter recording source URL and ingest time
- Save to source directory with a slugified filename
- Register in manifest
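As a rough sketch of the wrapping and naming steps, the snippet below builds the Markdown file for an already-downloaded page. The frontmatter keys (`source_url`, `ingested_at`), the slug rule, and both function names are assumptions for illustration; the download itself is left out so the example runs offline (with httpx it would plausibly be `httpx.get(url, timeout=30, follow_redirects=True)`).

```python
import re
from datetime import datetime, timezone


def slugify(text: str) -> str:
    """Lowercase, turn runs of non-alphanumerics into '-', and trim stray dashes."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "untitled"


def wrap_as_markdown(url: str, body: str, ingested_at: datetime) -> tuple[str, str]:
    """Return (filename, file_text) for a captured page."""
    # Assumption: the filename is derived from the URL without its scheme.
    name = slugify(re.sub(r"^https?://", "", url)) + ".md"
    frontmatter = (
        "---\n"
        f'source_url: "{url}"\n'
        f'ingested_at: "{ingested_at.isoformat()}"\n'
        "---\n\n"
    )
    return name, frontmatter + body


name, text = wrap_as_markdown(
    "https://example.com/article",
    "Page text goes here.",
    datetime(2026, 1, 1, tzinfo=timezone.utc),
)
print(name)
print(text)
```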
Compile Pipeline
Imported sources go through a multi-step compilation:
Source Files (MD/PDF/DOCX/JSON/Code/TXT/Image)
        │
        ▼
┌────────────────┐
│ 1. Diff        │  Compare manifest, detect added/modified/deleted
└───────┬────────┘
        │ change list
        ▼
┌────────────────┐
│ 2. Summarize   │  LLM generates summaries concurrently
└───────┬────────┘
        │ summaries
        ▼
┌────────────────┐
│ 3. Extract     │  LLM batch-extracts concepts, aliases, types
│    Concepts    │
└───────┬────────┘
        │ concepts
        ▼
┌────────────────┐
│ 4. Write       │  LLM writes wiki articles per concept
│    Articles    │  auto-creates [[wikilinks]] and ontology relations
└───────┬────────┘
        │
        ▼
┌────────────────┐
│ 5. Synthesize  │  Cluster summaries by shared concepts,
│                │  generate cross-source analysis pages
└───────┬────────┘
        │
        ▼
┌────────────────┐
│ 6. Overview    │  Generate a bird's-eye summary (overview.md)
└───────┬────────┘
        │
        ▼
┌────────────────┐
│ 7. Images      │  Extract images from source files
└───────┬────────┘
        │
        ▼
Structured Wiki (summaries + articles + syntheses + overview + knowledge graph)
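The Diff step at the top of the pipeline can be pictured as a comparison of two path-to-hash maps. The sketch below assumes the manifest reduces to `{path: content_hash}`; the function name and the returned dictionary shape are illustrative, not the project's API.

```python
def diff_manifest(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify each path as added, modified, or deleted by comparing content hashes."""
    added = sorted(p for p in current if p not in previous)
    deleted = sorted(p for p in previous if p not in current)
    modified = sorted(p for p in current if p in previous and current[p] != previous[p])
    return {"added": added, "modified": modified, "deleted": deleted}


# Hashes are shortened placeholders for readability.
previous = {"raw/a.md": "h1", "raw/b.md": "h2", "raw/c.md": "h3"}
current = {"raw/a.md": "h1", "raw/b.md": "h2-changed", "raw/d.md": "h4"}
changes = diff_manifest(previous, current)
print(changes)
```

Only the paths in the resulting change list need to go through the later LLM stages, which is what keeps recompiles incremental.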
Compilation supports checkpoint resume: if interrupted, the next compile picks up from the last checkpoint. Use --fresh to ignore checkpoints and restart.
Watch Mode
synth-wiki compile --watch enables real-time file watching. When source files are added, modified, or deleted, compilation is automatically triggered.
- Primary mode: Uses watchdog (fsevents on macOS, inotify on Linux) for native file system events
- Fallback mode: If watchdog is not installed, automatically degrades to polling (2-second interval)
- Debounce: 2-second debounce to batch rapid file changes into a single compile
- Concurrency protection: Lock-based guard prevents overlapping compiles
- Initial compile: Runs one compile on startup to catch any missed changes
Install watchdog for best performance:
pip install synth-wiki[watch]
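The debounce and lock behaviour described in this section can be sketched without any file-watching library. In the example below, time is passed in explicitly so the logic is easy to follow and test; the class name and its methods are hypothetical, and the real watcher may be structured quite differently.

```python
import threading


class DebouncedCompiler:
    """Collect file events and run one compile once the event stream goes quiet."""

    def __init__(self, compile_fn, debounce_seconds: float = 2.0):
        self._compile_fn = compile_fn
        self._debounce = debounce_seconds
        self._pending: set[str] = set()
        self._last_event = 0.0
        self._lock = threading.Lock()  # guards against overlapping compiles

    def notify(self, path: str, now: float) -> None:
        """Record a changed path and restart the quiet-period clock."""
        self._pending.add(path)
        self._last_event = now

    def poll(self, now: float) -> bool:
        """Run a compile if changes are pending and the quiet period has elapsed."""
        if not self._pending or now - self._last_event < self._debounce:
            return False
        if not self._lock.acquire(blocking=False):
            return False  # a compile is already running; try again on the next poll
        try:
            batch, self._pending = sorted(self._pending), set()
            self._compile_fn(batch)
            return True
        finally:
            self._lock.release()


runs = []
watcher = DebouncedCompiler(runs.append)
watcher.notify("raw/a.md", now=0.0)
watcher.notify("raw/b.md", now=1.0)
early = watcher.poll(now=2.5)  # only 1.5 s since the last event, so nothing runs
late = watcher.poll(now=3.0)   # 2 s of quiet: one compile covers both files
print(early, late, runs)
```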
Linting
synth-wiki lint runs quality checks across all wiki articles. Use --fix to auto-fix issues where possible.
| Pass | Description | Auto-fix |
|---|---|---|
| completeness | Broken [[wikilinks]] pointing to non-existent articles | No |
| style | Missing YAML frontmatter | Yes |
| orphans | Entities with no relations in the knowledge graph | No |
| consistency | Already-tagged contradicts relations | No |
| impute | Placeholder markers [TODO], [UNKNOWN], [TBD] | No |
| staleness | Articles not updated for 90+ days | No |
| contradiction_detection | Cross-source contradictions (conflicting confidence levels between articles sharing the same source) | No |
synth-wiki lint # Run all passes
synth-wiki lint --fix # Auto-fix where possible
synth-wiki lint --pass-name style # Run a single pass
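To make the completeness pass concrete, here is a minimal sketch of a broken-wikilink check over an in-memory set of articles. It assumes articles are keyed by slug and that link targets map to slugs by lowercasing and hyphenating; both the slug rule and the function names are assumptions, not the linter's real implementation.

```python
import re

# Captures the target of [[Target]], [[Target|label]], or [[Target#section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def slug(title: str) -> str:
    """Assumed slug rule: lowercase with non-alphanumerics collapsed to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def broken_links(articles: dict[str, str]) -> dict[str, list[str]]:
    """Map each article slug to the link targets that have no matching article."""
    known = set(articles)
    report: dict[str, list[str]] = {}
    for name, body in articles.items():
        missing = sorted({slug(target) for target in WIKILINK.findall(body)} - known)
        if missing:
            report[name] = missing
    return report


articles = {
    "transformer": "Builds on [[Self-Attention]] and [[Positional Encoding|position info]].",
    "self-attention": "Core of the [[Transformer]].",
}
link_report = broken_links(articles)
print(link_report)
```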
Ontology: Entity-Relation Knowledge Graph
synth-wiki automatically builds an ontology (knowledge graph) during compilation, connecting concepts, techniques, and sources through typed edges. The knowledge graph is stored in SQLite with CHECK constraints for data integrity.
How it works
After Pass 3 (write_articles), _extract_relations() scans article content for [[wikilinks]] and checks for keywords near the links. If matched, a typed relation edge is created.
For example, an article about Flash Attention containing:
Flash Attention optimizes the memory access pattern of [[Self-Attention]]
Creates the edge: Flash-Attention --optimizes--> Self-Attention
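A simplified version of this keyword-near-link matching might look like the following. The keyword lists are copied from the Relation Types table; the 80-character look-behind window, the slug rule, and the function name `extract_relations` are assumptions, and the project's own `_extract_relations()` may differ.

```python
import re

# Keywords taken from the Relation Types table; the matching rules are assumptions.
RELATION_KEYWORDS = {
    "implements": ["implements", "implementation of"],
    "extends": ["extends", "extension of", "builds on"],
    "optimizes": ["optimizes", "optimization of", "improves upon"],
    "contradicts": ["contradicts", "conflicts with"],
    "prerequisite_of": ["prerequisite", "requires knowledge of"],
    "trades_off": ["trade-off", "tradeoff", "trades off"],
}
WIKILINK = re.compile(r"\[\[([^\]|#]+)[^\]]*\]\]")


def slug(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def extract_relations(source: str, text: str, window: int = 80) -> list[tuple[str, str, str]]:
    """Return (source, relation, target) for keywords found shortly before a link."""
    edges = []
    for match in WIKILINK.finditer(text):
        # Only look at the text immediately preceding the link.
        context = text[max(0, match.start() - window):match.start()].lower()
        for relation, keywords in RELATION_KEYWORDS.items():
            if any(keyword in context for keyword in keywords):
                edges.append((slug(source), relation, slug(match.group(1))))
    return edges


edges = extract_relations(
    "Flash Attention",
    "Flash Attention optimizes the memory access pattern of [[Self-Attention]]",
)
print(edges)
```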
Entity Types
| Type | Description | Created when |
|---|---|---|
| concept | General concept | Default type, created during Pass 3 |
| technique | Specific technique/method | Concept type is "technique" |
| entity | Person, org, product, or model | Concept type is "entity" |
| comparison | Side-by-side analysis | Concept type is "comparison" |
| source | Source file | Auto-created for each source reference |
| claim | Assertion/conclusion | Concept type is "claim" |
| artifact | Output/product | Reserved type |
Relation Types
| Type | Extraction Keywords | Description |
|---|---|---|
| implements | implements, implementation of | A implements B |
| extends | extends, extension of, builds on | A extends B |
| optimizes | optimizes, optimization of, improves upon | A optimizes B |
| contradicts | contradicts, conflicts with | A contradicts B |
| cites | (auto-created) | A cites source B |
| prerequisite_of | prerequisite, requires knowledge of | A is prerequisite of B |
| trades_off | trade-off, tradeoff, trades off | A trades off with B |
| derived_from | (auto-created) | A is derived from B |
Graph Traversal
from synth_wiki.ontology import Store, TraverseOpts, Direction

# db is an open DB handle, e.g. DB.open(paths.db_path("my-project")) as in the Python API section
store = Store(db)

# BFS traversal from an entity, following outbound edges, depth 2
neighbors = store.traverse("flash-attention", TraverseOpts(
    direction=Direction.OUTBOUND,
    max_depth=2,
))

# Inbound edges (who points to this entity)
inbound = store.traverse("self-attention", TraverseOpts(
    direction=Direction.INBOUND,
    max_depth=1,
))

# Bidirectional with relation type filter
both = store.traverse("transformer", TraverseOpts(
    direction=Direction.BOTH,
    max_depth=1,
    relation_type="extends",
))
Cycle Detection
cycles = store.detect_cycles("flash-attention")
# Returns: [["flash-attention", "B", "C", "flash-attention"], ...]
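For intuition about the return shape, the sketch below finds cycles through a start node with a depth-first walk over a plain adjacency dictionary. It is a stand-alone illustration with a hypothetical `find_cycles` function; it says nothing about how `Store.detect_cycles` is implemented against SQLite.

```python
def find_cycles(edges: dict[str, list[str]], start: str) -> list[list[str]]:
    """Return every simple cycle that leaves `start` and comes back to it."""
    cycles: list[list[str]] = []

    def walk(node: str, path: list[str]) -> None:
        for nxt in edges.get(node, []):
            if nxt == start:
                cycles.append(path + [start])  # closed the loop
            elif nxt not in path:  # never revisit a node on the current path
                walk(nxt, path + [nxt])

    walk(start, [start])
    return cycles


graph = {
    "flash-attention": ["self-attention"],
    "self-attention": ["transformer"],
    "transformer": ["flash-attention", "self-attention"],
}
found = find_cycles(graph, "flash-attention")
print(found)
```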
Details: Configurable Relations
Search Quality Powered by TreeSearch
synth-wiki uses a hybrid search pipeline combining TreeSearch auto mode (Best-First tree walk + FTS5) with optional vector semantic search, fused via Reciprocal Rank Fusion (RRF).
How it works
User Query
-> TreeSearch auto mode (tree walk with FTS5 scoring, auto flat/tree routing)
-> Optional vector cosine similarity search (if embedder configured)
-> RRF fusion (K=60, combining TreeSearch rank + vector rank)
-> Tag boost (+3% per matching tag, capped at 15%)
-> Recency boost (14-day half-life, max +5%)
-> Return top-N results sorted by combined score
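The two boost steps can be expressed as a small scoring function. How the boosts combine with the fused score is not spelled out above, so this sketch assumes they are added together and applied as a multiplier on the base score; the function name and parameters are illustrative.

```python
def boosted_score(base: float, matching_tags: int, age_days: float) -> float:
    """Apply the tag and recency boosts to a fused score (combination rule assumed)."""
    # +3% per matching tag, capped at 15%.
    tag_boost = min(0.03 * matching_tags, 0.15)
    # Up to +5% for a brand-new article, halving every 14 days.
    recency_boost = 0.05 * 0.5 ** (age_days / 14.0)
    return base * (1.0 + tag_boost + recency_boost)


# Two matching tags on a 14-day-old article: 6% + 2.5%.
two_tags = boosted_score(1.0, matching_tags=2, age_days=14)
# Ten matching tags on a fresh article: both boosts at their caps.
capped = boosted_score(1.0, matching_tags=10, age_days=0)
print(round(two_tags, 4), round(capped, 4))
```

Because both boosts are small and capped, they can reorder near-ties but are unlikely to lift a weak match over a strong one.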
TreeSearch Auto Mode
Unlike traditional vector RAG, which chops documents into chunks and loses context, synth-wiki uses TreeSearch in auto mode, which searches over each document's tree structure.
- Best-First Tree Walk: Anchor retrieval → expansion → path scoring over document tree structures, not just flat BM25 ranking
- Auto Mode Routing: Automatically switches between Tree search (articles, papers, markdown) and Flat search (code files) based on document source_type
- Document Routing: Multi-document queries first route to the top-K relevant documents via FTS5, then perform deep tree search within each
- No Embeddings Required: Millisecond-level structure-aware matching with intelligent cross-document scoring
- Excellent CJK Support: Integrated jieba tokenizer for high-quality Chinese text retrieval
Vector Search (Optional)
When an embedding provider is configured, synth-wiki generates vector embeddings for summaries and concept articles. At query time, the query is also embedded and compared via brute-force cosine similarity.
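Brute-force cosine search is simple enough to sketch in pure Python. The example below assumes the index is a dictionary of document id to vector and skips vectors whose length differs from the query, mirroring the fallback behaviour described later; the function names are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query: list[float], index: dict[str, list[float]], limit: int = 10):
    """Score every stored vector against the query and return the best matches."""
    scored = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in index.items()
        if len(vec) == len(query)  # skip vectors with mismatched dimensions
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


index = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "stale": [1.0, 0.0, 0.0],  # wrong dimension, ignored
}
hits = vector_search([1.0, 0.0], index)
print(hits)
```

A linear scan like this costs one dot product per stored vector, which is usually fine for a personal wiki with thousands rather than millions of pages.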
Supported embedding providers (auto-detected in cascade order):
- Explicit config: embed.model in config.yaml
- Provider default: OpenAI: text-embedding-3-small, Gemini: gemini-embedding-2-preview, Voyage: voyage-3-lite, Mistral: mistral-embed
- Ollama local: nomic-embed-text (auto-detected if Ollama is running)
- None: TreeSearch-only search (still fully functional)
RRF Fusion
TreeSearch and vector results are fused using Reciprocal Rank Fusion:
score(doc) = 1/(K + treesearch_rank) + 1/(K + vector_rank)
Where K = 60. Documents ranked highly by either method receive a strong combined score.
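The formula translates directly into code. This sketch assumes ranks are 1-based and that a document missing from one list simply contributes no term for that list; the function name `rrf_fuse` is illustrative.

```python
def rrf_fuse(treesearch: list[str], vector: list[str], k: int = 60) -> list[tuple[str, float]]:
    """Fuse two ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in (treesearch, vector):
        for rank, doc in enumerate(ranking, start=1):  # ranks assumed to start at 1
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = rrf_fuse(["a", "b", "c"], ["b", "d", "a"])
for doc, score in fused:
    print(doc, round(score, 6))
```

In this example "b" comes out ahead of "a": it is ranked 2nd and 1st (1/62 + 1/61), while "a" is ranked 1st and 3rd (1/61 + 1/63). Documents that appear in only one list, such as "c" and "d", land below both.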
Fallback Behavior
The search pipeline degrades gracefully:
- No embedder configured → TreeSearch-only. Auto mode tree walk still delivers high quality.
- Empty index → Returns empty results.
- Vector dimensions mismatch → Mismatched vectors are silently skipped.
- Ollama not running → Falls back to API embedding or TreeSearch-only.
Details: Search Quality (EN) | Search Quality (Chinese)
Configuration
Full config example
# Global API config
api:
  provider: openai-compatible   # openai, anthropic, gemini, ollama, openai-compatible
  api_key: ${OPENAI_API_KEY}    # Supports ${ENV_VAR} expansion
  base_url: ${OPENAI_BASE_URL}
  rate_limit: 0                 # Requests per minute, 0 = unlimited
  extra_body: {}

# Models (per compilation stage)
models:
  summarize: gpt-4o-mini        # Pass 1: Summarize
  extract: gpt-4o-mini          # Pass 2: Concept extraction
  write: gpt-4o-mini            # Pass 3: Article writing
  lint: gpt-4o-mini             # Linter
  query: gpt-4o-mini            # Query (reserved)

# Embedding config (optional, auto-cascades if not set)
embed:
  provider: auto                # auto, openai, gemini, voyage, mistral, ollama
  model: ""                     # Empty = use provider default
  dimensions: 0                 # 0 = auto-detect
  api_key: ""                   # Empty = reuse api.api_key
  base_url: ""

# Compiler config
compiler:
  max_parallel: 4               # Max concurrent LLM calls per phase
  debounce_seconds: 2
  summary_max_tokens: 2000
  article_max_tokens: 4000
  auto_commit: true             # Auto git commit after compile
  auto_lint: true               # Auto lint after compile
  mode: ""                      # standard, batch, auto
  prompt_cache: null            # null=true (cache enabled by default)
  page_threshold: 1             # Min source count to create a page (2 = Karpathy rule)

# Search config
search:
  default_limit: 10

# Linter config
linting:
  auto_fix_passes:
    - consistency
    - completeness
    - style
  staleness_threshold_days: 90

# MCP server config
serve:
  transport: stdio              # stdio (for Claude Code / Cursor), sse (for web)
  port: 3333                    # Port for SSE transport

# Language (affects article generation language)
language: zh-CN                 # zh-CN, zh-TW, en, ja, ko

# Project definitions
projects:
  my-wiki:
    description: "Personal knowledge base"
    sources:
      - path: /Users/me/raw
        type: auto
        watch: true
    output: /Users/me/wiki
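The `${ENV_VAR}` expansion mentioned in the config comments could be implemented roughly as below, walking the parsed config and substituting placeholders in every string. The function name and the choice to replace unset variables with an empty string are assumptions; synth-wiki may handle missing variables differently.

```python
import os
import re

ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${NAME} placeholders in strings, dicts, and lists."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # Assumption: an unset variable expands to an empty string.
        return ENV_REF.sub(lambda m: env.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, and None pass through unchanged


raw_config = {
    "api": {"api_key": "${OPENAI_API_KEY}", "base_url": "${OPENAI_BASE_URL}", "rate_limit": 0},
}
expanded = expand_env(raw_config, env={"OPENAI_API_KEY": "sk-xxx"})
print(expanded)
```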
Multi-project config
projects:
  research:
    description: "AI research notes"
    sources:
      - path: ~/research/raw
    output: ~/research/wiki
    models:
      write: gpt-4o             # Better model for research articles
  work:
    description: "Work notes"
    sources:
      - path: ~/work/raw
    output: ~/work/wiki
Use --project to specify which project:
synth-wiki compile --project research
synth-wiki status --project work
synth-wiki search --project research "attention"
If there is only one project, --project can be omitted.
LLM Provider Examples
OpenAI:
api:
  provider: openai
  api_key: ${OPENAI_API_KEY}
models:
  summarize: gpt-4o-mini
  write: gpt-4o
Anthropic:
api:
  provider: anthropic
  api_key: ${ANTHROPIC_API_KEY}
models:
  summarize: claude-sonnet-4
  write: claude-sonnet-4
Gemini:
api:
  provider: gemini
  api_key: ${GEMINI_API_KEY}
models:
  summarize: gemini-2.5-flash
  write: gemini-2.5-flash
OpenAI-Compatible (OpenRouter, Together, Groq, etc.):
api:
  provider: openai-compatible
  base_url: https://openrouter.ai/api/v1
  api_key: ${OPENROUTER_API_KEY}
models:
  summarize: google/gemini-2.5-flash-preview
  write: anthropic/claude-sonnet-4
Ollama (local, no API key needed):
api:
  provider: ollama
  base_url: http://localhost:11434
models:
  summarize: llama3
  write: llama3
Embedding Cascade
synth-wiki uses a 3-level cascade strategy to auto-select embedding provider:
| Provider | Default Model | Dimensions |
|---|---|---|
| openai | text-embedding-3-small | 1536 |
| gemini | gemini-embedding-2-preview | 768 |
| voyage | voyage-3-lite | 1024 |
| mistral | mistral-embed | 1024 |
| ollama | nomic-embed-text | 768 |
Vault Overlay Mode (Obsidian)
If you already use Obsidian, synth-wiki can overlay on your existing vault:
synth-wiki init --name my-vault --vault --source ~/my-vault --output _wiki
- Source files come from existing vault folders
- Output writes to the _wiki/ subdirectory inside the vault
- Obsidian can directly browse compiled output
- [[wikilinks]] are compatible with Obsidian's link format
Using the wiki output as an Obsidian vault:
The compiled output directory (wiki/) works as a standalone Obsidian vault:
- Open Obsidian, choose "Open folder as vault", select your wiki/ directory
- [[wikilinks]] in concept articles render as clickable links
- Graph View visualizes the knowledge network across all concepts
- YAML frontmatter enables Dataview queries (e.g., TABLE tags FROM "concepts" WHERE confidence = "high")
- SCHEMA.md and index.md are auto-generated; index.md serves as a navigable table of contents
- Images extracted during compilation are stored in images/; set this as Obsidian's attachment folder
For best results, install these Obsidian plugins:
- Dataview -- for structured queries across wiki pages
- Graph Analysis -- for deeper knowledge graph exploration
MCP Server (IDE / Agent Integration)
synth-wiki includes a built-in MCP server that exposes all wiki operations as tools. This enables integration with Claude Code, Cursor, Windsurf, and any MCP-compatible client.
Quick Start
# Install MCP support
pip install synth-wiki[mcp]
# Start the server (stdio for Claude Code / Cursor)
synth-wiki serve
# Start with SSE transport (for web clients)
synth-wiki serve --transport sse --port 3333
Claude Code Integration
Add to your Claude Code MCP config (~/.claude.json or project .mcp.json):
{
  "mcpServers": {
    "synth-wiki": {
      "command": "synth-wiki",
      "args": ["serve", "--project", "my-wiki"]
    }
  }
}
Available MCP Tools
| Tool | Description |
|---|---|
| search | Search the wiki with TreeSearch + optional vector reranking |
| query | Ask a question and get an LLM-synthesized answer with citations |
| ingest | Add a source file or URL for compilation |
| compile | Compile sources into wiki articles |
| lint | Run quality checks on wiki articles |
| status | Show wiki statistics and health |
| read_article | Read a specific wiki article by slug name |
| list_articles | List all wiki articles, optionally filtered by type |
Example: Using from Claude Code
Once configured, you can ask Claude:
> Search my wiki for "attention mechanism"
> What does my wiki say about transformers?
> Ingest this URL into my wiki: https://example.com/paper
> Compile my wiki
> Run a lint check on my wiki
Python API
Search
from synth_wiki import DB, MemoryStore, VectorStore, Searcher, SearchOpts
from synth_wiki import paths, load_config
from synth_wiki.embed import new_from_config
cfg = load_config(paths.config_path(), "my-project")
db = DB.open(paths.db_path("my-project"))
mem = MemoryStore(paths.db_path("my-project"))
vec = VectorStore(db)
searcher = Searcher(mem, vec)
# FTS5-only search
results = searcher.search(SearchOpts(query="attention mechanism", limit=10))
# Hybrid search with vector
embedder = new_from_config(cfg)
query_vec = embedder.embed("attention mechanism") if embedder else None
results = searcher.search(SearchOpts(query="attention mechanism", limit=10), query_vec)
for r in results:
    print(f"[{r.score:.4f}] {r.article_path}")
    print(f"  {r.content[:120]}...")
mem.close()
db.close()
Ingest
from synth_wiki.wiki import ingest_path, ingest_url
# Ingest local file
result = ingest_path("my-project", "/path/to/document.md")
print(f"Ingested: {result.source_path} ({result.type}, {result.size} bytes)")
# Ingest URL
result = ingest_url("my-project", "https://example.com/article")
print(f"Ingested: {result.source_path} ({result.type}, {result.size} bytes)")
Ontology
from synth_wiki.ontology import Store, TraverseOpts, Direction
store = Store(db)
# Add/update entity (upsert)
store.add_entity(entity)
# Get entity
entity = store.get_entity("flash-attention")
# List entities (optionally filter by type)
all_concepts = store.list_entities("concept")
# Add relation (upsert, unique on source+target+relation)
store.add_relation(relation)
# Query relations
rels = store.get_relations("flash-attention", Direction.OUTBOUND)
rels = store.get_relations("flash-attention", Direction.BOTH, "optimizes")
# Stats
store.entity_count() # Total entities
store.entity_count("concept") # Entities of specific type
store.relation_count() # Total relations
Acknowledgements
- xoai/sage-wiki: Go implementation of llm-wiki
- Andrej Karpathy's llm-wiki idea: the original inspiration
Community & Support
- GitHub Issues: Open an issue
- WeChat Group: Add WeChat xuming624 with note "nlp" to join the tech discussion group
Citation
If you use synth-wiki in your research, please cite:
@software{xu2026synthwiki,
author = {Xu, Ming},
title = {synth-wiki: LLM-Compiled Personal Knowledge Base},
year = {2026},
publisher = {GitHub},
url = {https://github.com/shibing624/synth-wiki}
}
Contributing
Contributions are welcome! Please submit a Pull Request.
License