Synthesize Wikipedia-style articles with LLM-powered research and tree search

These details have not been verified by PyPI

Project links

Project description

synth-wiki

An implementation of Andrej Karpathy's idea for an LLM-compiled personal knowledge base. Written in Python.

Drop in your papers, articles, and notes. synth-wiki compiles them into a structured, interlinked wiki — with concepts extracted, cross-references discovered, and everything searchable via a powerful tree-aware search engine.

Your sources in, a wiki out. Add documents to a folder. The LLM reads, summarizes, extracts concepts, and writes interconnected articles.
Compounding knowledge. Every new source enriches existing articles. The wiki gets smarter as it grows.
Ask your wiki questions. Enhanced structure-aware search powered by TreeSearch. Ask natural language questions and get cited, highly-relevant answers.
Native Chinese Support. Built-in Jieba tokenization ensures excellent retrieval for Chinese documents.

Install

From PyPI

pip install -U synth-wiki

From source

git clone https://github.com/shibing624/synth-wiki.git
cd synth-wiki
pip install -e .

Dependencies

Python >= 3.12
click >= 8.1
pyyaml >= 6.0
httpx >= 0.27
pytreesearch >= 0.1 (auto mode tree search)
loguru >= 0.7

Quickstart

1. Initialize a project

mkdir my-wiki && cd my-wiki
synth-wiki init

This creates:

./raw/ — drop your source files here
./wiki/ — compiled wiki output
~/.synth_wiki/config.yaml — global config (auto-generated with sensible defaults)

2. Set your API key

The generated config uses OpenAI-compatible API by default. Set your environment variables:

export OPENAI_API_KEY="sk-xxx"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # or any compatible endpoint

Or edit ~/.synth_wiki/config.yaml directly:

api:
  provider: openai-compatible
  api_key: ${OPENAI_API_KEY}
  base_url: ${OPENAI_BASE_URL}

models:
  summarize: gpt-4o-mini
  extract: gpt-4o-mini
  write: gpt-4o-mini
  lint: gpt-4o-mini
  query: gpt-4o-mini

Other supported providers: openai, anthropic, gemini, ollama.

3. Add sources and compile

# Add source files
cp ~/papers/*.pdf raw/
cp ~/articles/*.md raw/

# Compile
synth-wiki compile

# Watch mode (auto-recompile on file changes)
synth-wiki compile --watch

# Search
synth-wiki search "attention mechanism"

Supported Source Formats

Just drop files into your source folder — synth-wiki detects the format automatically.

Format	Extensions	What gets extracted
Markdown	`.md`	Body text with frontmatter parsed separately
PDF	`.pdf`	Full text via PyMuPDF
Word	`.docx`	Document text
JSON / JSONL	`.json`, `.jsonl`	Parsed and searched structurally
Code	`.py`, `.java`, `.go`, `.ts`	Source code parsed via AST and regex
Plain text	`.txt`, `.csv`	Raw content
Images	`.png`, `.jpg`, `.svg`	Image files

Commands

Command	Description
`synth-wiki init [--name] [--source] [--output] [--vault]`	Initialize project (all options optional)
`synth-wiki compile [--watch] [--dry-run] [--fresh] [--batch] [--no-cache]`	Compile sources into wiki articles
`synth-wiki search "query" [--tags] [--limit]`	Search via TreeSearch
`synth-wiki query "question"`	Q&A against the wiki
`synth-wiki ingest <url\|path>`	Add a source
`synth-wiki lint [--fix] [--pass-name]`	Check and fix article quality
`synth-wiki status`	Wiki stats and health
`synth-wiki doctor`	Check config and connection status
`synth-wiki projects`	List all configured projects

Directory Structure

~/.synth_wiki/                    # Global state directory
├── config.yaml                   # Global config (shared by all projects)
├── db/
│   └── my-wiki.db                # Project SQLite database (WAL mode)
├── manifests/
│   └── my-wiki.json              # Source manifest and compile status
├── state/
│   └── my-wiki.json              # Compile checkpoint (auto-deleted on success)
└── lintlog/
    └── my-wiki/                  # Lint report history

./raw/                            # Source files directory (user-managed)
├── paper-1.md
├── notes.txt
└── captures/                     # URL-ingested files

./wiki/                           # Compiled output directory (auto-generated)
├── summaries/                    # Pass 1: Source summaries
├── concepts/                     # Pass 3: Wiki articles
│   ├── transformer.md
│   └── self-attention.md
├── connections/
├── outputs/
├── images/                       # Extracted images
├── archive/
├── prompts/                      # Custom prompts (optional)
└── CHANGELOG.md                  # Compile history

Knowledge Capture: Ingest and Learning

Ingest local files

synth-wiki ingest /path/to/document.md
synth-wiki ingest /path/to/notes.txt

The ingest command will:

Detect file type by extension
Copy the file to the project source directory
Compute SHA-256 hash
Register the source in the manifest as pending compilation
Print the result (path, type, size)

Ingest URLs

synth-wiki ingest https://example.com/article

URL ingestion will:

Download the page content via httpx (30s timeout, follows redirects)
Wrap the content as a Markdown file with YAML frontmatter recording source URL and ingest time
Save to source directory with a slugified filename
Register in manifest

Compile Pipeline

Imported sources go through a 4-step compilation:

Diff — Compare manifest against actual files, detect added/modified/deleted sources
Summarize — LLM generates a summary for each source
Extract Concepts — LLM extracts concepts, aliases, and types from summaries
Write Articles — LLM writes wiki articles for each concept, auto-creating wikilinks and ontology relations

Compilation supports checkpoint resume: if interrupted, the next compile picks up from the last checkpoint. Use --fresh to ignore checkpoints and restart.

Details: Knowledge Capture: Ingest and Learning

Watch Mode

synth-wiki compile --watch enables real-time file watching. When source files are added, modified, or deleted, compilation is automatically triggered.

Primary mode: Uses watchdog (fsevents on macOS, inotify on Linux) for native file system events
Fallback mode: If watchdog is not installed, automatically degrades to polling (2-second interval)
Debounce: 2-second debounce to batch rapid file changes into a single compile
Concurrency protection: Lock-based guard prevents overlapping compiles
Initial compile: Runs one compile on startup to catch any missed changes

Install watchdog for best performance:

pip install synth-wiki[watch]

Details: Installation and Configuration

Ontology: Entity-Relation Knowledge Graph

synth-wiki automatically builds an ontology (knowledge graph) during compilation, connecting concepts, techniques, and sources through typed edges. The knowledge graph is stored in SQLite with CHECK constraints for data integrity.

How it works

After Pass 3 (write_articles), _extract_relations() scans article content for [[wikilinks]] and checks for keywords near the links. If matched, a typed relation edge is created.

For example, an article about Flash Attention containing:

Flash Attention optimizes the memory access pattern of [[Self-Attention]]

Creates the edge: Flash-Attention --optimizes--> Self-Attention

Entity Types

Type	Description	Created when
`concept`	General concept	Default type, created during Pass 3
`technique`	Specific technique/method	Concept type is "technique"
`source`	Source file	Auto-created for each source reference
`claim`	Assertion/conclusion	Concept type is "claim"
`artifact`	Output/product	Reserved type

Relation Types

Type	Extraction Keywords	Description
`implements`	implements, implementation of	A implements B
`extends`	extends, extension of, builds on	A extends B
`optimizes`	optimizes, optimization of, improves upon	A optimizes B
`contradicts`	contradicts, conflicts with	A contradicts B
`cites`	(auto-created)	A cites source B
`prerequisite_of`	prerequisite, requires knowledge of	A is prerequisite of B
`trades_off`	trade-off, tradeoff, trades off	A trades off with B
`derived_from`	(auto-created)	A is derived from B

Graph Traversal

from synth_wiki.ontology import Store, TraverseOpts, Direction

store = Store(db)

# BFS traversal from an entity, following outbound edges, depth 2
neighbors = store.traverse("flash-attention", TraverseOpts(
    direction=Direction.OUTBOUND,
    max_depth=2,
))

# Inbound edges (who points to this entity)
inbound = store.traverse("self-attention", TraverseOpts(
    direction=Direction.INBOUND,
    max_depth=1,
))

# Bidirectional with relation type filter
both = store.traverse("transformer", TraverseOpts(
    direction=Direction.BOTH,
    max_depth=1,
    relation_type="extends",
))

Cycle Detection

cycles = store.detect_cycles("flash-attention")
# Returns: [["flash-attention", "B", "C", "flash-attention"], ...]

Details: Configurable Relations

Search Quality Powered by TreeSearch

synth-wiki uses a hybrid search pipeline combining TreeSearch auto mode (Best-First tree walk + FTS5) with optional vector semantic search, fused via Reciprocal Rank Fusion (RRF).

How it works

User Query
  -> TreeSearch auto mode (tree walk with FTS5 scoring, auto flat/tree routing)
  -> Optional vector cosine similarity search (if embedder configured)
  -> RRF fusion (K=60, combining TreeSearch rank + vector rank)
  -> Tag boost (+3% per matching tag, capped at 15%)
  -> Recency boost (14-day half-life, max +5%)
  -> Return top-N results sorted by combined score

TreeSearch Auto Mode

Unlike traditional vector RAG that chops documents into chunks and loses context, synth-wiki uses TreeSearch in auto mode — the most effective search strategy.

Best-First Tree Walk: Anchor retrieval → expansion → path scoring over document tree structures, not just flat BM25 ranking
Auto Mode Routing: Automatically switches between Tree search (articles, papers, markdown) and Flat search (code files) based on document source_type
Document Routing: Multi-document queries first route to top-K relevant documents via FTS5, then perform deep tree search within each
No Embeddings Required: Millisecond-level structure-aware matching with intelligent cross-document scoring
Excellent CJK Support: Integrated jieba tokenizer for high-quality Chinese text retrieval

Vector Search (Optional)

When an embedding provider is configured, synth-wiki generates vector embeddings for summaries and concept articles. At query time, the query is also embedded and compared via brute-force cosine similarity.

Supported embedding providers (auto-detected in cascade order):

Explicit config — embed.model in config.yaml
Provider default — OpenAI: text-embedding-3-small, Gemini: gemini-embedding-2-preview, Voyage: voyage-3-lite, Mistral: mistral-embed
Ollama local — nomic-embed-text (auto-detected if Ollama is running)
None — TreeSearch-only search (still fully functional)

RRF Fusion

TreeSearch and vector results are fused using Reciprocal Rank Fusion:

score(doc) = 1/(K + treesearch_rank) + 1/(K + vector_rank)

Where K = 60. Documents ranked highly by either method receive a strong combined score.

Fallback Behavior

The search pipeline degrades gracefully:

No embedder configured — TreeSearch-only. Auto mode tree walk still delivers high quality.
Empty index — Returns empty results.
Vector dimensions mismatch — Mismatched vectors are silently skipped.
Ollama not running — Falls back to API embedding or TreeSearch-only.

Details: Search Quality (EN) | 搜索质量 (中文)

Configuration

Full config example

# Global API config
api:
  provider: openai-compatible      # openai, anthropic, gemini, ollama, openai-compatible
  api_key: ${OPENAI_API_KEY}       # Supports ${ENV_VAR} expansion
  base_url: ${OPENAI_BASE_URL}
  rate_limit: 0                    # Requests per minute, 0 = unlimited
  extra_body: {}

# Models (per compilation stage)
models:
  summarize: gpt-4o-mini           # Pass 1: Summarize
  extract: gpt-4o-mini             # Pass 2: Concept extraction
  write: gpt-4o-mini               # Pass 3: Article writing
  lint: gpt-4o-mini                # Linter
  query: gpt-4o-mini               # Query (reserved)

# Embedding config (optional, auto-cascades if not set)
embed:
  provider: auto                   # auto, openai, gemini, voyage, mistral, ollama
  model: ""                        # Empty = use provider default
  dimensions: 0                    # 0 = auto-detect
  api_key: ""                      # Empty = reuse api.api_key
  base_url: ""

# Compiler config
compiler:
  max_parallel: 4                  # Parallel article writes
  debounce_seconds: 2
  summary_max_tokens: 2000
  article_max_tokens: 4000
  auto_commit: true                # Auto git commit after compile
  auto_lint: true                  # Auto lint after compile
  mode: ""                         # standard, batch, auto
  prompt_cache: null               # null=true (cache enabled by default)

# Search config
search:
  default_limit: 10

# Linter config
linting:
  auto_fix_passes:
    - consistency
    - completeness
    - style
  staleness_threshold_days: 90

# Language (affects article generation language)
language: zh-CN                    # zh-CN, zh-TW, en, ja, ko

# Project definitions
projects:
  my-wiki:
    description: "Personal knowledge base"
    sources:
      - path: /Users/me/raw
        type: auto
        watch: true
    output: /Users/me/wiki

Multi-project config

projects:
  research:
    description: "AI research notes"
    sources:
      - path: ~/research/raw
    output: ~/research/wiki
    models:
      write: gpt-4o              # Better model for research articles

  work:
    description: "Work notes"
    sources:
      - path: ~/work/raw
    output: ~/work/wiki

Use --project to specify which project:

synth-wiki compile --project research
synth-wiki status --project work
synth-wiki search --project research "attention"

If there is only one project, --project can be omitted.

LLM Provider Examples

OpenAI:

api:
  provider: openai
  api_key: ${OPENAI_API_KEY}
models:
  summarize: gpt-4o-mini
  write: gpt-4o

Anthropic:

api:
  provider: anthropic
  api_key: ${ANTHROPIC_API_KEY}
models:
  summarize: claude-sonnet-4
  write: claude-sonnet-4

Gemini:

api:
  provider: gemini
  api_key: ${GEMINI_API_KEY}
models:
  summarize: gemini-2.5-flash
  write: gemini-2.5-flash

OpenAI-Compatible (OpenRouter, Together, Groq, etc.):

api:
  provider: openai-compatible
  base_url: https://openrouter.ai/api/v1
  api_key: ${OPENROUTER_API_KEY}
models:
  summarize: google/gemini-2.5-flash-preview
  write: anthropic/claude-sonnet-4

Ollama (local, no API key needed):

api:
  provider: ollama
  base_url: http://localhost:11434
models:
  summarize: llama3
  write: llama3

Embedding Cascade

synth-wiki uses a 3-level cascade strategy to auto-select embedding provider:

Provider	Default Model	Dimensions
openai	text-embedding-3-small	1536
gemini	gemini-embedding-2-preview	768
voyage	voyage-3-lite	1024
mistral	mistral-embed	1024
ollama	nomic-embed-text	768

Vault Overlay Mode (Obsidian)

If you already use Obsidian, synth-wiki can overlay on your existing vault:

synth-wiki init --name my-vault --vault --source ~/my-vault --output _wiki

Source files come from existing vault folders
Output writes to _wiki/ subdirectory inside the vault
Obsidian can directly browse compiled output
[[wikilinks]] are compatible with Obsidian's link format

Python API

Search

from synth_wiki import DB, MemoryStore, VectorStore, Searcher, SearchOpts
from synth_wiki import paths, load_config
from synth_wiki.embed import new_from_config

cfg = load_config(paths.config_path(), "my-project")
db = DB.open(paths.db_path("my-project"))
mem = MemoryStore(paths.db_path("my-project"))
vec = VectorStore(db)
searcher = Searcher(mem, vec)

# FTS5-only search
results = searcher.search(SearchOpts(query="attention mechanism", limit=10))

# Hybrid search with vector
embedder = new_from_config(cfg)
query_vec = embedder.embed("attention mechanism") if embedder else None
results = searcher.search(SearchOpts(query="attention mechanism", limit=10), query_vec)

for r in results:
    print(f"[{r.score:.4f}] {r.article_path}")
    print(f"  {r.content[:120]}...")

mem.close()
db.close()

Ingest

from synth_wiki.wiki import ingest_path, ingest_url

# Ingest local file
result = ingest_path("my-project", "/path/to/document.md")
print(f"Ingested: {result.source_path} ({result.type}, {result.size} bytes)")

# Ingest URL
result = ingest_url("my-project", "https://example.com/article")
print(f"Ingested: {result.source_path} ({result.type}, {result.size} bytes)")

Ontology

from synth_wiki.ontology import Store, TraverseOpts, Direction

store = Store(db)

# Add/update entity (upsert)
store.add_entity(entity)

# Get entity
entity = store.get_entity("flash-attention")

# List entities (optionally filter by type)
all_concepts = store.list_entities("concept")

# Add relation (upsert, unique on source+target+relation)
store.add_relation(relation)

# Query relations
rels = store.get_relations("flash-attention", Direction.OUTBOUND)
rels = store.get_relations("flash-attention", Direction.BOTH, "optimizes")

# Stats
store.entity_count()           # Total entities
store.entity_count("concept")  # Entities of specific type
store.relation_count()         # Total relations

Acknowledgements

xoai/sage-wiki — Go implementation of llm-wiki
Andrej Karpathy's llm-wiki idea — the original inspiration

License

Apache License 2.0

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.3

Apr 13, 2026

0.1.2

Apr 11, 2026

This version

0.1.0

Apr 11, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

synth_wiki-0.1.0.tar.gz (84.9 kB view details)

Uploaded Apr 11, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

synth_wiki-0.1.0-py3-none-any.whl (66.4 kB view details)

Uploaded Apr 11, 2026 Python 3

File details

Details for the file synth_wiki-0.1.0.tar.gz.

File metadata

Download URL: synth_wiki-0.1.0.tar.gz
Upload date: Apr 11, 2026
Size: 84.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for synth_wiki-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`101e748afc811748a77b7194dd304bb6c6b0fc73321436f59b0c56a11fc2ae9b`
MD5	`d297bea3dc7874874a6ae47034924f44`
BLAKE2b-256	`a6f4a1c1a758a1f711881faf6d1b91781abd4144b23ac1de5635989ca29b7566`

See more details on using hashes here.

File details

Details for the file synth_wiki-0.1.0-py3-none-any.whl.

File metadata

Download URL: synth_wiki-0.1.0-py3-none-any.whl
Upload date: Apr 11, 2026
Size: 66.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.5

File hashes

Hashes for synth_wiki-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9e377b808b313aa54ad40287008be249b79452c49deb56a36728202099cd68f7`
MD5	`d674460ba850b9003e0a2257c6c5d638`
BLAKE2b-256	`3719b12ac5fefd5d6297ed781a66248368db7a3dceefde3476f94bd6ae3651ef`

See more details on using hashes here.

synth-wiki 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

synth-wiki

Install

From PyPI

From source

Dependencies

Quickstart

1. Initialize a project

2. Set your API key

3. Add sources and compile

Supported Source Formats

Commands

Directory Structure

Knowledge Capture: Ingest and Learning

Ingest local files

Ingest URLs

Compile Pipeline

Watch Mode

Ontology: Entity-Relation Knowledge Graph

How it works

Entity Types

Relation Types

Graph Traversal

Cycle Detection

Search Quality Powered by TreeSearch

How it works

TreeSearch Auto Mode

Vector Search (Optional)

RRF Fusion

Fallback Behavior

Configuration

Full config example

Multi-project config

LLM Provider Examples

Embedding Cascade

Vault Overlay Mode (Obsidian)

Python API

Search

Ingest

Ontology

Acknowledgements

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes