WikiLoom

LLM-maintained knowledge bases with deterministic linking

WikiLoom turns raw documents into a persistent, compounding knowledge base. Ingest a PDF, markdown file, or URL — the LLM reads the source and writes structured wiki pages with deterministic linking, structural provenance, and human-edit protection. Every operation is committed to git automatically.

Inspired by Andrej Karpathy's LLM wiki gist.

Why WikiLoom vs. naive RAG? Instead of re-embedding documents into an opaque vector store, WikiLoom builds a persistent, human-readable knowledge graph — deterministic wikilinking, structural provenance back to source chunks, and atomic git commits on every operation.

Heads-up: WikiLoom calls paid LLM APIs by default. Anthropic, OpenAI, and Google providers cost money — typically cents per document ingested. A pre-flight budget check refuses runs that would exceed monthly_budget_usd in wikiloom.toml (default $50/mo). For zero-cost local operation, use the ollama provider — see Provider options.

How it works

Ingest pipeline

The LLM handles judgment (reading sources, extracting claims, assessing confidence). Everything after the LLM call is deterministic: linking, backlink graph, index regeneration, git commit. Every WikiLoom command that modifies state auto-commits with a classifying prefix (ingest:, lint:, merge:, etc.) so you never have to type git.
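
Because every auto-commit carries one of these prefixes, plain git tooling can slice the history. For example (standard git commands, nothing WikiLoom-specific):

git log --oneline                      # everything WikiLoom committed
git log --oneline --grep='^ingest:'    # only ingest commits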

Installation

Supported Python versions:

  • Linux, Windows, Apple Silicon Macs: Python 3.10–3.13.
  • Intel Macs: Python 3.10–3.12. onnxruntime (a transitive dependency for embeddings) no longer publishes Intel macOS wheels for Python 3.13+.
  • Python 3.14: not yet supported on any platform — spaCy hasn't published a 3.14 wheel.

pip install wikiloom

# Required for the linking engine
python -m spacy download en_core_web_sm

API keys are managed per-project via a .env file created during wikiloom init (see Quick start). If you prefer shell exports, those still work and take precedence over .env.
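
For example, an exported variable wins over the same key in .env for that shell session (env var name matches the anthropic preset; the key value is a placeholder):

export ANTHROPIC_API_KEY="sk-ant-..."   # takes precedence over .env
wikiloom ingest path/to/paper.pdf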

From source

git clone https://github.com/do-y-lee/wikiloom.git && cd wikiloom
pip install -e ".[dev]"
python -m spacy download en_core_web_sm

Quick start

# 1. Create a project with your preferred LLM provider.
#    Presets: anthropic (default), openai, google, ollama.
#    Init prompts you to paste your API key into .env (skippable).
wikiloom init my-wiki --domain "AI research" --provider anthropic
cd my-wiki

# 2. Ingest a source
wikiloom ingest path/to/paper.pdf

# 3. See what was created
wikiloom status
ls wiki/concepts/ wiki/entities/ wiki/sources/

# 4. Ask a question
wikiloom query "What are the key contributions of this paper?"

# 5. Save the answer as a synthesis page (re-usable, queryable)
wikiloom query --save-last

# 6. Inspect a page's metadata
wikiloom show concepts/transformer

That's it. Every step above auto-commits to git.

Heads-up on first-run downloads: the default fastembed embedding model (~66MB) is downloaded once and cached in a durable per-user location (~/Library/Caches/wikiloom/fastembed on macOS, ~/.cache/wikiloom/fastembed on Linux, %LOCALAPPDATA%\wikiloom\Cache\fastembed on Windows). wikiloom init offers to fetch it up front so the slow step is predictable; if you decline (or skip with --no-interactive), the first command that needs embeddings (ingest, query, related) will download it instead. Subsequent calls reuse the cached weights. To use a different backend or disable embeddings, see Provider options.

Tip on cost: ingest is the token-heavy operation. For a significant saving, configure a cheap model for ingest and a stronger model for query reasoning in wikiloom.toml (see Configuration):

[llm]
default_model = "claude-sonnet-4-6"
ingest_model  = "claude-haiku-4-5-20251001"
query_model   = "claude-sonnet-4-6"

Concepts you should know

Page lifecycle: active, dormant, deprecated

Every page has one of three statuses:

  • active — current, in active use, surfaced everywhere
  • dormant — older than its time window, but still visible and usable. Dormant is informational ("you might want to refresh this"), not a verdict on usefulness. Marking is a user action via wikiloom dormant <page>.
  • deprecated — retired. Page moves to wiki/archive/, hidden from most workflows. Reached via wikiloom merge or wikiloom deprecate. Permanent removal via wikiloom purge (which requires deprecation first).

Lifecycle: active → dormant (optional) → deprecated → purged (gone).
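
In command form, a typical retirement path looks like this (page name hypothetical; all three commands are documented under Commands below):

wikiloom dormant concepts/old-idea      # optional: mark as dormant
wikiloom deprecate concepts/old-idea    # move to wiki/archive/, hide from workflows
wikiloom purge concepts/old-idea        # permanent removal (requires prior deprecation)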

Two layers of human-edit protection

When you edit a page by hand and run wikiloom save:

  1. Commit prefix (human-edit:) — a soft, short-term protection. lint --fix skips the page; auto-tools leave it alone. Cleared by the next auto-action (e.g. a re-ingest).
  2. The <!-- wikiloom:auto --> marker — a durable boundary. Anything above the marker survives every operation, including wikiloom ingest <file> --force (the only command that wipes the auto region).

For normal updates (re-ingesting a different source that updates the page), new content is appended to the auto region — your edits anywhere on the page survive. The marker only matters when you re-synthesize from scratch via --force.

Tip: to pin a permanent personal note, put it above the marker:

# Transformer

> **My note:** the original paper used post-norm; modern impls use pre-norm.

<!-- wikiloom:auto -->

## Architecture

... (LLM-generated content)

Structural provenance

Every chunk of a source document is persisted to a SQLite cache with a stable chunk_id derived from sha256(source_hash + chunk_index). Pages reference their contributing chunks under each entry in their sources frontmatter array — every source dict carries its own chunk_ids list. So you can trace every claim back to a specific chunk of a specific document.

wikiloom show concepts/transformer --field sources   # see contributing sources
wikiloom source <chunk_id>                            # see the original chunk text
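
A minimal sketch of that ID derivation (assuming plain UTF-8 string concatenation and a full hex digest; WikiLoom's actual encoding or truncation may differ):

import hashlib

def chunk_id(source_hash: str, chunk_index: int) -> str:
    # Deterministic: the same source bytes at the same position always
    # produce the same ID, so provenance survives re-ingestion.
    return hashlib.sha256(f"{source_hash}{chunk_index}".encode("utf-8")).hexdigest()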

Auto-commits and wikiloom save

Every command that modifies wiki content auto-commits with a classifying prefix:

| Prefix | Created by |
|---|---|
| init: | wikiloom init |
| ingest: | wikiloom ingest |
| lint: | wikiloom lint --fix |
| relink: / review: / related: | linker workflow commands |
| merge: / deprecate: | page lifecycle commands |
| dormant: | wikiloom dormant mark/unmark |
| human-edit: | you, via wikiloom save after editing pages, wikiloom.toml, or prompts by hand |

Writer commands also block if you have uncommitted edits under wiki/, telling you to run wikiloom save first — so manual page edits never accidentally land inside an ingest: commit. Dirty wikiloom.toml or prompt edits produce a passive nudge but don't block, since they can't collide with an auto-commit's output.

Tiered linking confidence

The linker scores each potential wikilink on a 0–100 scale:

  • High (≥ 95): auto-inserted into the page body
  • Medium (≥ 85): auto-inserted, flagged in backlinks.json
  • Low (≥ 70): deferred to pending.json for review via wikiloom review
  • Below 70: ignored

Configurable in wikiloom.toml under [linking].
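
For example, to make auto-insertion stricter and route more candidates through review (same keys as shown in Configuration below):

[linking]
high_confidence_threshold = 98    # fewer links auto-inserted silently
medium_confidence_threshold = 90
low_confidence_threshold = 75     # anything below this is ignored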

Per-chunk page context (Layer 1)

When ingesting a new source, the synthesis loop embeds each chunk and retrieves the top-K most semantically similar existing pages. The LLM sees this list when deciding whether to UPDATE an existing page or CREATE a new one, which reduces duplicate page creation without any code-side merging.

Disable per-run with wikiloom ingest <file> --no-page-context or per-project via [ingest] use_page_context = false.

Budget enforcement

Before running synthesis, ingest estimates the token cost and refuses if it would exceed [llm] monthly_budget_usd in wikiloom.toml. After the run, if month-to-date spend exceeds the budget, a stderr warning fires (no mid-run abort — pre-flight is the only enforcement point).

Disable with [ingest] enable_budget_check = false.
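
Or raise the cap instead of disabling the check:

[llm]
monthly_budget_usd = 100.0   # pre-flight now refuses only runs that would exceed $100/mo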

Commands

26 commands grouped by purpose. All commands accept --project <path> (defaults to walking upward from the current directory to find wikiloom.toml).

Run wikiloom --help for the command list and wikiloom <command> --help for a specific command's flags (e.g. wikiloom query --help, wikiloom ingest --help).

Project lifecycle

| Command | Description |
|---|---|
| wikiloom init <name> [--domain <text>] [--provider <id>] [--model <id>] [--no-interactive] | Create a new project: directory tree, config, scaffolded indexes, git repo, and per-project README. --provider picks from anthropic (default), openai, google, ollama. An interactive prompt offers to paste your API key into .env; --no-interactive skips it (CI-friendly) |
| wikiloom save [-m "msg"] [--dry-run] | Commit your manual edits with a human-edit: prefix. Covers pages under wiki/, wikiloom.toml, and prompts under .wikiloom/prompts/ — one command for every human-editable file. Auto-bumps frontmatter.modified, freshens dormant → active |
| wikiloom rebuild-cache | Regenerate the SQLite query cache from manifest + frontmatter. Required after switching the [embeddings] provider or model so existing pages get re-embedded in the new vector space; otherwise an occasional recovery tool |
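
For example, after moving from local fastembed to OpenAI embeddings (provider names from Provider options below):

# in wikiloom.toml, change [embeddings] provider = "fastembed" to "openai"
wikiloom rebuild-cache    # re-embeds every existing page in the new vector space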

Ingestion

| Command | Description |
|---|---|
| wikiloom ingest <file-or-url> [--force] [--no-page-context] | Ingest a source, synthesize pages, link, commit. --force re-runs even if the source was already ingested. --no-page-context disables per-chunk semantic retrieval for this run |

What ingests well

| Tier | Formats | Notes |
|---|---|---|
| Best — clean extraction, strong synthesis | .md, .txt, .rst, text-based .pdf, URLs (http://, https://) | Markdown / plain text are native. PDFs need a text layer — scanned PDFs extract as empty (no OCR). The URL extractor strips nav, ads, and boilerplate before synthesis. |
| Good — supported with caveats | .docx, .pptx, code files (.py, .sql, .js, .ts, .tsx, .jsx, .go, .rs, .java, .rb, .cs, .cpp, .c, .sh), config / IaC (.yaml, .yml, .json, .toml, .dockerfile, .tf, .hcl, .proto, .graphql) | Office docs flatten tables / layout and skip embedded images. Code and config ingest as plain text with language context — strong on docs, comments, and schemas; weaker on pure algorithmic code (the LLM tends to re-describe behavior rather than extract domain knowledge). |
| Not supported today | .xls, .xlsx, .csv (large), images (.png, .jpg, .jpeg, .webp, .gif), standalone .html | Excel: export to .csv or .md first. Large CSV tables don't synthesize well — small ones may work as plain text. Images currently emit a placeholder (no vision / OCR). For HTML, host at a URL and ingest the URL instead. |

URL ingestion: wikiloom ingest https://example.com/page works on static HTML sites — documentation, blog posts, Wikipedia, most MkDocs/Docusaurus/Sphinx-rendered docs. The http:// or https:// scheme is required — bare hostnames like example.com/page are treated as local file paths and fail with "No such file." It does not work on:

  • JavaScript-rendered pages (React / Vue / Next.js client-side apps, most modern product pages)
  • Paywalled or login-gated content
  • Sites with bot protection / WAF (most banks, Cloudflare-protected sites)

For unsupported pages, download as PDF and ingest the PDF instead. URL ingests go through the same extract → synthesize → link → commit pipeline as files; dedup keys on the hash of the extracted text so re-ingesting the same URL with unchanged content is a cheap no-op.

Reading and exploring

| Command | Description |
|---|---|
| wikiloom query "<question>" [--detail] [--max-pages N] | Ask a question grounded in wiki content. --detail shows sources, confidence, and last-modified per source |
| wikiloom query --last-detail | Show detail for the most recent query (no LLM call) |
| wikiloom query --save-last | Save the most recent answer as a wiki/syntheses/ page |
| wikiloom queries [--show <id>] [--save <id>] [--all] | Browse the rolling cache of past query runs. Default lists the 20 most recent (id, timestamp, question snippet, confidence). --show prints the full answer + sources for an entry (no LLM call); --save promotes that entry to a synthesis page. Retention controlled by [query] history_size |
| wikiloom show <page> [--field <name>] [--json] | Show a page's frontmatter. --field extracts one field; chunk_ids flattens across sources |
| wikiloom links <page> | Show all pages linked to and from a given page |
| wikiloom related <page> [-n N] [--save] [--link] | Find pages semantically similar to one. --save writes them into frontmatter; --link appends a "Related Pages" wikilink section to the body |
| wikiloom orphans | List pages with no inbound or outbound wikilinks |
| wikiloom duplicates [--review] [--auto-merge] | Find near-duplicate pairs by slug fuzzy match + embedding cosine. --review walks each pair interactively; --auto-merge batches obvious singular/plural variants |
| wikiloom source <chunk_id> | Print the exact source text the LLM saw for a chunk |

Page lifecycle

| Command | Description |
|---|---|
| wikiloom merge <loser> <winner> [--yes] | Combine two pages — LOSER first, WINNER second (matches "merge X into Y"). Union bodies (preserving human regions), rewrite inbound [[loser]] wikilinks to [[winner]], deprecate the loser |
| wikiloom deprecate <page> [--superseded-by <other>] [--yes] | Soft-remove a page: move to wiki/archive/, set status: deprecated. With --superseded-by, also rewrites every inbound [[X]] wikilink across non-archived pages to the replacement |
| wikiloom purge <page> [--yes] | Permanently remove an already-deprecated page (deletes the archive file AND the manifest entry). Requires typed confirmation by default |
| wikiloom dormant | List candidates (active pages past their window) |
| wikiloom dormant --list-marked | List currently-marked dormant pages |
| wikiloom dormant --windows | Show window config by type |
| wikiloom dormant <page> [--unmark] | Manually mark/unmark a page as dormant |
| wikiloom dormant --review | Walk through dormant candidates interactively |

Maintenance

| Command | Description |
|---|---|
| wikiloom lint [--fix] | Run health checks (broken links, missing frontmatter, duplicates, dormant candidates). Default is check-only — prints a report and exits 1 if issues are found. --fix applies auto-repairs (broken links, frontmatter only — never auto-marks dormant) |
| wikiloom relink | Re-run the linker across every page (useful when new pages were added that earlier pages should link to) |
| wikiloom review | List low-confidence link candidates from pending.json |
| wikiloom review --accept-all | Insert every pending link into its source page |
| wikiloom review --clear | Discard all pending candidates |
| wikiloom reindex | Regenerate root and sub-index files |
| wikiloom protect | Scan for pages whose human-edit flag drifted from git history |
| wikiloom protect --sync | Apply git truth to the manifest + frontmatter |

Observability

| Command | Description |
|---|---|
| wikiloom status | Project overview: page counts by type/status, human-edited count, backlinks, chunks, sources, last event, total tokens + cost |
| wikiloom log [-n N] | Recent LLM / system events from wiki/log.md, newest first |
| wikiloom edits [-n N] | Recent human edits committed via wikiloom save (date, author, subject, hash). Complements wikiloom log for multi-user audit |
| wikiloom cost | Token usage and spend breakdown by event type, with monthly budget percentage |

Project structure

my-wiki/
  README.md               # Per-project orientation (domain, commands, workflow)
  wikiloom.toml           # Project config (LLM, budget, thresholds, dormant windows)
  .env                    # Your API key (gitignored, created via init prompt)
  .env.example            # Committed template showing the env var for your provider
  .wikiloom/              # Customizable templates
    schema.md             # Page schema reference
    prompts/
      ingest.md           # Synthesis prompt — iterate this for quality
      query.md            # Query prompt
      lint.md             # (reserved for future use)
    output_formats/
      ingest_response.json   # JSON schema the LLM must match
      query_response.json
  wiki/                   # The wiki itself (markdown + YAML frontmatter)
    index.md              # Root index
    log.md                # Event log (auto-appended)
    concepts/             # Concept pages
    entities/             # People, orgs, products, tools
    sources/              # One page per ingested document
    syntheses/            # Saved query answers
    decisions/            # Reserved for ADR-style decision pages
    archive/              # Deprecated pages
  raw/                    # Copies of ingested source files
    papers/   articles/   images/   code/   misc/
  _registry/              # Derived state (mostly committed; some gitignored)
    manifest.json         # Page registry (committed)
    backlinks.json        # Wikilink graph (committed)
    pending.json          # Low-confidence link candidates (committed)
    sources.json          # Content-addressed source catalog (committed)
    schema_version.json   # Schema marker for future migrations (committed)
    wiki.db               # SQLite query cache + chunks table (gitignored)
    query_history.json    # Rolling cache of past query results (gitignored)
    ingest_state.json     # Per-chunk progress checkpoint (gitignored)

Configuration

wikiloom.toml lives at the project root. All sections optional — defaults are sensible.

[project]
name = "my-wiki"
domain = "AI research"
schema_version = 1

[llm]
provider = "anthropic"
default_model = "claude-sonnet-4-6"   # Fallback for any LLM-backed command
ingest_model  = ""                    # Optional override for `wikiloom ingest`
query_model   = ""                    # Optional override for `wikiloom query`
max_tokens_per_operation = 8000
monthly_budget_usd = 50.0             # Pre-flight refuses runs that exceed this
parse_retry_count    = 2              # Retries when the LLM returns unparseable JSON; set to 0 to disable

[linking]
ner_model = "en_core_web_sm"
auto_create_stubs = false       # Whether to create stub pages for unresolved entities
high_confidence_threshold = 95
medium_confidence_threshold = 85
low_confidence_threshold = 70

[ingest]
max_file_size_mb = 50           # 0 disables
min_extracted_chars = 16        # Reject empty extractions (e.g. scanned PDFs without OCR)
enable_budget_check = true
use_page_context = true         # Per-chunk semantic retrieval before synthesis
page_context_top_k = 10

[dormant]
default_window_days = 90
entity_window_days = 180
concept_window_days = 120
synthesis_window_days = 60

[search]
engine = "grep"

[query]
history_enabled = true          # Cache successful query results in _registry/query_history.json
history_size    = 100           # How many past queries to retain (newest first; older are trimmed)

[embeddings]
provider = "fastembed"          # local, no API key needed
# provider = "openai"           # needs OPENAI_API_KEY
# provider = "sentence-transformers"  # heavier install
enabled = true

Per-page overrides go in the page's frontmatter — for example, dormant_window_days: 365 on a page makes it slower to go dormant.
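
For example, a concept page whose frontmatter starts like this takes 365 days instead of the 120-day concept default to surface as a dormant candidate (the surrounding fields are illustrative; dormant_window_days is the documented override key):

---
type: concept
status: active
dormant_window_days: 365   # overrides [dormant] concept_window_days for this page
---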

Provider options

WikiLoom uses two providers at runtime: an LLM provider for synthesis and query, and an embeddings provider for semantic search and per-chunk page context retrieval. Both are configured in wikiloom.toml and can be swapped without touching code.

LLM providers

wikiloom init accepts --provider with any of the four presets below. Each preset writes the right provider + default model to wikiloom.toml and generates a matching .env.example. Behind the scenes WikiLoom delegates to litellm, so the model naming convention follows litellm's, and other providers that litellm supports may also work via manual config.

| Provider preset | Where it runs | Requirements | Default model |
|---|---|---|---|
| anthropic (default) | Anthropic API | ANTHROPIC_API_KEY | claude-sonnet-4-6 |
| openai | OpenAI API | OPENAI_API_KEY | gpt-5 |
| google | Google AI Studio API (Gemini) | GEMINI_API_KEY | gemini/gemini-2.5-pro |
| ollama | Local machine | Ollama installed + model pulled locally | llama3 |

Anthropic (default):

wikiloom init my-wiki --provider anthropic

[llm]
provider = "anthropic"
default_model = "claude-sonnet-4-6"

OpenAI:

wikiloom init my-wiki --provider openai

[llm]
provider = "openai"
default_model = "gpt-5"

Google (Gemini):

wikiloom init my-wiki --provider google

[llm]
provider = "google"
default_model = "gemini/gemini-2.5-pro"

Ollama (local, no API key, no cost):

# 1. Install Ollama from https://ollama.com and pull a model
ollama pull llama3
ollama serve
# 2. Init with the ollama preset
wikiloom init my-wiki --provider ollama
# 3. Override model if you want something other than llama3
wikiloom init my-wiki --provider ollama --model gemma3

[llm]
provider = "ollama"
default_model = "llama3"

Split-model setup (recommended for cost): configure ingest_model to a cheap model and query_model to a stronger one. wikiloom ingest does bulk text-to-JSON synthesis that Haiku / Flash / mini-class models handle fine; wikiloom query is low-volume and benefits from the frontier reasoning of Sonnet / 2.5-pro / gpt-5.

Embedding providers

| Provider | Where it runs | Requirements | Default model | Disk impact |
|---|---|---|---|---|
| fastembed | Local | Bundled with the default install | BAAI/bge-small-en-v1.5 | ~66MB, cached in user cache dir |
| openai | OpenAI API | OPENAI_API_KEY; pip install openai | text-embedding-3-small | none |
| sentence-transformers | Local | pip install sentence-transformers | all-MiniLM-L6-v2 | ~500MB on first use |

The model field in [embeddings] is optional — omit it to use the provider's default (column above).

Default (fastembed):

[embeddings]
provider = "fastembed"
enabled = true

OpenAI:

[embeddings]
provider = "openai"
model = "text-embedding-3-small"  # optional; defaults to provider default
enabled = true

sentence-transformers:

[embeddings]
provider = "sentence-transformers"
model = "all-MiniLM-L6-v2"        # optional
enabled = true

To disable embeddings entirely (FTS-only search, no semantic retrieval):

[embeddings]
enabled = false

LLM and embeddings providers are independent — you can mix any LLM with any embeddings backend (e.g., Ollama LLM + fastembed embeddings for fully local operation, or Anthropic LLM + OpenAI embeddings).
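
For example, a fully local, zero-API-cost configuration combines the two local backends:

[llm]
provider = "ollama"
default_model = "llama3"

[embeddings]
provider = "fastembed"    # bundled local embeddings, no API key
enabled = true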

Workflows

Ingest a corpus of related documents

for pdf in papers/*.pdf; do
  wikiloom ingest "$pdf"
done

# Then surface duplicates the LLM may have created
wikiloom duplicates --auto-merge      # safe singular/plural pairs
wikiloom duplicates --review          # interactive triage for the rest

Edit a page by hand

$EDITOR wiki/concepts/transformer.md
wikiloom save                         # commits as human-edit:

If you forget to save, the next writer command (ingest, lint --fix, etc.) will block with a friendly error pointing you here.

Find and merge near-duplicates

wikiloom duplicates                   # see suspect pairs with suggested winner
wikiloom merge concepts/transformer-architecture concepts/transformer

Reconcile contradictions

When ingest detects a contradiction between a new source and an existing page, the contradiction is recorded in frontmatter. To inspect and resolve:

wikiloom show concepts/foo --field contradictions
$EDITOR wiki/concepts/foo.md          # pick the right fact, remove the entry
wikiloom save

Recover from an aborted ingest

If wikiloom ingest aborts mid-way (rate limit, credit exhaustion):

# 1. Fix the underlying problem (top up API credits, wait out rate limit)
# 2. Re-run with --force to retry from scratch
wikiloom ingest path/to/paper.pdf --force

--force re-processes all chunks (including ones that succeeded the first time).

Find what already exists before writing manually

wikiloom query "what do we have on transaction posting?"
wikiloom related concepts/transactions     # semantically similar pages
wikiloom links concepts/transactions       # what's linked to/from it

Clean up periodically

wikiloom lint                         # see issues without fixing
wikiloom lint --fix                   # auto-repair what can be fixed
wikiloom dormant                      # see candidates past their window
wikiloom dormant --review             # decide which to mark
wikiloom orphans                      # pages with no links
wikiloom relink                       # re-run linker across all pages

Tips

Ask specific, well-scoped queries. Retrieval is strongest when your question shares concrete terms with page content. "What's the overdraft fee cap?" pulls the right page cleanly; "tell me about banking" returns a noisy mix and comes back with low confidence. Before querying, skim wiki/concepts/index.md and wiki/sources/index.md to see what's actually covered — you'll write sharper questions and know when a gap is a real coverage issue vs. a retrieval miss. If confidence is low, run --detail to see which sources were consulted: tangential sources mean retrieval didn't find the right pages; relevant-but-thin sources mean the wiki genuinely doesn't cover the topic yet.

Use wikiloom show for inspection. Faster than opening files:

wikiloom show concepts/foo --field sources
wikiloom show concepts/foo --field aliases
wikiloom show concepts/foo --json | jq .source_count

wikiloom save is the only git command you need. Don't git commit manually unless you really want to. WikiLoom auto-commits everything else with the right classifying prefix.

Pin permanent notes above the auto marker. This is the only place that survives wikiloom ingest <file> --force.

Ingesting many files at once. wikiloom ingest accepts more than one source. Pick the input mode that fits — they're mutually exclusive and each feeds the same sequential per-file pipeline with a three-bucket (complete / partial / failed) grand summary at the end:

wikiloom ingest a.pdf b.pdf c.pdf          # variadic positional
wikiloom ingest --batch-file paths.txt     # paths from a text file (blanks and '#' comments skipped)
wikiloom ingest --batch-dir ~/docs/        # every file in a directory (non-recursive, sorted)
find ~/docs -name '*.pdf' | wikiloom ingest --batch-file -   # paths from stdin

Failures are isolated per file: a missing path, extraction error, or rate-limit abort on one file doesn't halt the batch. The grand summary lists any partial / failed files with a retry hint.

Batches above 20 files pause for a confirmation prompt with a rough wall-clock estimate (≈5 minutes per file at max_workers=1). Pass --yes / -y to skip the prompt — required for non-interactive contexts like scripts or backgrounded runs.

For long overnight batches, redirect the stream so your terminal is free:

wikiloom ingest --batch-file paths.txt --yes > batch.log 2>&1 &
tail -f batch.log       # watch progress
wikiloom log            # per-ingest summaries, durable in git

Run wikiloom duplicates after every batch ingest. The LLM occasionally creates near-duplicates (pending-transactions vs pending-transactions-banking); catching them early keeps the wiki clean.

Listing commands are pipeable. wikiloom orphans, wikiloom dormant (both candidate and --list-marked views), wikiloom duplicates, wikiloom related <page>, wikiloom links --list, wikiloom log, and wikiloom edits all detect when stdout isn't a terminal and switch to tab-separated output (one line per item, no headers or tips), so shell pipelines work cleanly:

wikiloom dormant | grep concept              # only concept-type candidates
wikiloom dormant | wc -l                     # total candidate count
wikiloom orphans | head -20                  # first 20 orphans
wikiloom dormant --list-marked | cut -f1     # just page_ids
wikiloom duplicates | grep -i auth           # duplicate pairs mentioning "auth"
wikiloom log | grep ingest                   # ingest events only
wikiloom log | awk -F'\t' '{print $1, $5}'   # timestamp and cost columns

Tab-separated keeps column positions stable when fields like titles or descriptions contain spaces, so cut -f and awk -F'\t' work reliably. Action modes (--review, --auto-merge, --save, --link, --accept-all, --clear) keep their interactive or confirmation output intact. The pretty view also stays when you run commands directly in a terminal. Run wikiloom <command> --help for each command's exact column order.

Customize the synthesis prompt. Open .wikiloom/prompts/ingest.md and iterate — every page WikiLoom produces is a function of that prompt + the chunk. The default works but is generic. For domain-specific corpora, tailored prompts produce noticeably better output.

Read wiki/log.md to see what happened. Every operation appends a structured event with timestamps, token usage, and cost. Useful for cost reviews and auditing.

Browse past query answers without re-running them. Every successful wikiloom query run is appended to _registry/query_history.json (newest first, default 100 entries). wikiloom queries lists recent runs; wikiloom queries --show <id> reprints the full answer + sources without an LLM call; wikiloom queries --save <id> promotes any past entry to a synthesis page. Privacy note: the file is gitignored (per-machine cache, not project state) but stored in plaintext on disk and may contain sensitive prompts. Disable with [query] history_enabled = false in wikiloom.toml, or shrink retention via history_size.

Switching LLM providers. Either re-init in a fresh directory with --provider / --model, or edit [llm] provider + default_model (and optionally ingest_model / query_model) in wikiloom.toml directly. WikiLoom uses litellm under the hood, so any provider it supports works. The model name follows litellm's naming convention.

Backup is just git push. The whole project — wiki, manifest, source catalog, configuration — is in git. Push to any remote and you have a full backup.

Development

Running tests

pytest                       # Full suite (live-API tests skipped by default)
pytest -m live               # Run live-API tests (requires ANTHROPIC_API_KEY)
pytest tests/test_llm.py     # Just the LLM client unit tests

The pytest suite is extensive and fully deterministic in the default run; live-API tests live in test_llm_live.py and are skipped unless explicitly requested.

Customizing prompts

Edit files under .wikiloom/prompts/. Each project's prompts override the packaged defaults. The synthesis loop loads from the project first, falls back to the package.

Customizing JSON output schemas

Edit files under .wikiloom/output_formats/. The synthesis loop validates LLM responses against ingest_response.json before accepting them. Tighten the schema to make the LLM's output more reliable.
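
The packaged schema itself isn't reproduced here, but "tightening" generally means adding standard JSON Schema constraints such as required or minItems. A purely hypothetical fragment (field names are not WikiLoom's actual schema):

{
  "type": "object",
  "required": ["pages"],
  "properties": {
    "pages": { "type": "array", "minItems": 1 }
  }
}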

Contributing

Issues and pull requests welcome at github.com/do-y-lee/wikiloom. For PRs: keep diffs focused, land green tests (pytest), and explain the why in the PR body — the what is in the diff. See Development above for local setup.

License

MIT
