WikiLoom

LLM-maintained knowledge bases with deterministic linking

WikiLoom turns raw documents into a persistent, compounding knowledge base. Ingest a PDF, markdown file, or URL — the LLM reads the source and writes structured wiki pages with deterministic linking, structural provenance, and human-edit protection. Every operation is committed to git automatically.

Inspired by Andrej Karpathy's LLM wiki gist.

Why WikiLoom vs. naive RAG? Instead of re-embedding documents into an opaque vector store, WikiLoom builds a persistent, human-readable knowledge graph — deterministic wikilinking, structural provenance back to source chunks, and atomic git commits on every operation.

Heads-up: WikiLoom calls paid LLM APIs by default. Anthropic, OpenAI, and Google providers cost money — typically cents per document ingested. A pre-flight budget check refuses runs that would exceed monthly_budget_usd in wikiloom.toml (default $50/mo). For zero-cost local operation, use the ollama provider — see Provider options.

How it works

Ingest pipeline

The LLM handles judgment (reading sources, extracting claims, assessing confidence). Everything after the LLM call is deterministic: linking, backlink graph, index regeneration, git commit. Every WikiLoom command that modifies state auto-commits with a classifying prefix (ingest:, lint:, merge:, etc.) so you never have to type git.
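
Because every auto-commit carries one of these prefixes, plain git tooling can slice the history. For example (standard git commands, nothing WikiLoom-specific):

git log --oneline                      # everything WikiLoom committed
git log --oneline --grep='^ingest:'    # only ingest commits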

Installation

Supported Python versions:

  • Linux, Windows, Apple Silicon Macs: Python 3.10–3.13.
  • Intel Macs: Python 3.10–3.12. onnxruntime (a transitive dependency for embeddings) no longer publishes Intel macOS wheels for Python 3.13+.
  • Python 3.14: not yet supported on any platform — spaCy hasn't published a 3.14 wheel.

pip install wikiloom

# Required for the linking engine
python -m spacy download en_core_web_sm

API keys are managed per-project via a .env file created during wikiloom init (see Quick start). If you prefer shell exports, those still work and take precedence over .env.
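
For example, an exported variable wins over the same key in .env for that shell session (env var name matches the anthropic preset; the key value is a placeholder):

export ANTHROPIC_API_KEY="sk-ant-..."   # takes precedence over .env
wikiloom ingest path/to/paper.pdf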

From source

git clone https://github.com/do-y-lee/wikiloom.git && cd wikiloom
pip install -e ".[dev]"
python -m spacy download en_core_web_sm

Quick start

# 1. Create a project with your preferred LLM provider.
#    Presets: anthropic (default), openai, google, ollama.
#    Init prompts you to paste your API key into .env (skippable).
wikiloom init my-wiki --domain "AI research" --provider anthropic
cd my-wiki

# 2. Ingest a source
wikiloom ingest path/to/paper.pdf

# 3. See what was created
wikiloom status
ls wiki/concepts/ wiki/entities/ wiki/sources/

# 4. Ask a question
wikiloom query "What are the key contributions of this paper?"

# 5. Save the answer as a synthesis page (re-usable, queryable)
wikiloom query --save-last

# 6. Inspect a page's metadata
wikiloom show concepts/transformer

That's it. Every step above auto-commits to git.

Heads-up on first-run downloads: the default fastembed embedding model (~66MB) is downloaded once and cached in a durable per-user location (~/Library/Caches/wikiloom/fastembed on macOS, ~/.cache/wikiloom/fastembed on Linux, %LOCALAPPDATA%\wikiloom\Cache\fastembed on Windows). wikiloom init offers to fetch it up front so the slow step is predictable; if you decline (or skip with --no-interactive), the first command that needs embeddings (ingest, query, related) will download it instead. Subsequent calls reuse the cached weights. To use a different backend or disable embeddings, see Provider options.

Tip on cost: ingest is the token-heavy operation. For a significant saving, configure a cheap model for ingest and a stronger model for query reasoning in wikiloom.toml (see Configuration):

[llm]
default_model = "claude-sonnet-4-6"
ingest_model  = "claude-haiku-4-5-20251001"
query_model   = "claude-sonnet-4-6"

Concepts you should know

Page lifecycle: active, dormant, deprecated

Every page has one of three statuses:

  • active — current, in active use, surfaced everywhere
  • dormant — older than its time window, but still visible and usable. Dormant is informational ("you might want to refresh this"), not a verdict on usefulness. Marking is a user action via wikiloom dormant <page>.
  • deprecated — retired. Page moves to wiki/archive/, hidden from most workflows. Reached via wikiloom merge or wikiloom deprecate. Permanent removal via wikiloom purge (which requires deprecation first).

Lifecycle: active → dormant (optional) → deprecated → purged (gone).
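
In command form, a typical retirement path looks like this (page name hypothetical; all three commands are documented under Commands below):

wikiloom dormant concepts/old-idea      # optional: mark as dormant
wikiloom deprecate concepts/old-idea    # move to wiki/archive/, hide from workflows
wikiloom purge concepts/old-idea        # permanent removal (requires prior deprecation)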

Two layers of human-edit protection

When you edit a page by hand and run wikiloom save:

  1. Commit prefix (human-edit:) — a soft, short-term protection. lint --fix skips the page; auto-tools leave it alone. Cleared by the next auto-action (e.g. a re-ingest).
  2. The <!-- wikiloom:auto --> marker — a durable boundary. Anything above the marker survives every operation, including wikiloom ingest <file> --force (the only command that wipes the auto region).

For normal updates (re-ingesting a different source that updates the page), new content is appended to the auto region — your edits anywhere on the page survive. The marker only matters when you re-synthesize from scratch via --force.

Tip: to pin a permanent personal note, put it above the marker:

# Transformer

> **My note:** the original paper used post-norm; modern impls use pre-norm.

<!-- wikiloom:auto -->

## Architecture

... (LLM-generated content)

Structural provenance

Every chunk of a source document is persisted to a SQLite cache with a stable chunk_id derived from sha256(source_hash + chunk_index). Pages reference their contributing chunks under each entry in their sources frontmatter array — every source dict carries its own chunk_ids list. So you can trace every claim back to a specific chunk of a specific document.

wikiloom show concepts/transformer --field sources   # see contributing sources
wikiloom source <chunk_id>                            # see the original chunk text
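
A minimal sketch of that ID derivation (assuming plain UTF-8 string concatenation and a full hex digest; WikiLoom's actual encoding or truncation may differ):

import hashlib

def chunk_id(source_hash: str, chunk_index: int) -> str:
    # Deterministic: the same source bytes at the same position always
    # produce the same ID, so provenance survives re-ingestion.
    return hashlib.sha256(f"{source_hash}{chunk_index}".encode("utf-8")).hexdigest()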

Auto-commits and wikiloom save

Every command that modifies wiki content auto-commits with a classifying prefix:

| Prefix | Created by |
|---|---|
| init: | wikiloom init |
| ingest: | wikiloom ingest |
| lint: | wikiloom lint --fix |
| relink: / review: / related: | linker workflow commands |
| merge: / deprecate: | page lifecycle commands |
| dormant: | wikiloom dormant mark/unmark |
| human-edit: | you, via wikiloom save after editing pages, wikiloom.toml, or prompts by hand |

Writer commands also block if you have uncommitted edits under wiki/, telling you to run wikiloom save first — so manual page edits never accidentally land inside an ingest: commit. Dirty wikiloom.toml or prompt edits produce a passive nudge but don't block, since they can't collide with an auto-commit's output.

Tiered linking confidence

The linker scores each potential wikilink on a 0–100 scale:

  • High (≥ 95): auto-inserted into the page body
  • Medium (≥ 85): auto-inserted, flagged in backlinks.json
  • Low (≥ 70): deferred to pending.json for review via wikiloom review
  • Below 70: ignored

Configurable in wikiloom.toml under [linking].
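
For example, to make auto-insertion stricter and route more candidates through review (same keys as shown in Configuration below):

[linking]
high_confidence_threshold = 98    # fewer links auto-inserted silently
medium_confidence_threshold = 90
low_confidence_threshold = 75     # anything below this is ignored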

Per-chunk page context (Layer 1)

When ingesting a new source, the synthesis loop embeds each chunk and retrieves the top-K most semantically similar existing pages. The LLM sees this list when deciding whether to UPDATE an existing page or CREATE a new one, which reduces duplicate page creation without any code-side merging.

Disable per-run with wikiloom ingest <file> --no-page-context or per-project via [ingest] use_page_context = false.

Budget enforcement

Before running synthesis, ingest estimates the token cost and refuses if it would exceed [llm] monthly_budget_usd in wikiloom.toml. After the run, if month-to-date spend exceeds the budget, a stderr warning fires (no mid-run abort — pre-flight is the only enforcement point).

Disable with [ingest] enable_budget_check = false.
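
Or raise the cap instead of disabling the check:

[llm]
monthly_budget_usd = 100.0   # pre-flight now refuses only runs that would exceed $100/mo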

Commands

26 commands grouped by purpose. All commands accept --project <path> (defaults to walking upward from the current directory to find wikiloom.toml).

Run wikiloom --help for the command list and wikiloom <command> --help for a specific command's flags (e.g. wikiloom query --help, wikiloom ingest --help).

Project lifecycle

| Command | Description |
|---|---|
| wikiloom init <name> [--domain <text>] [--provider <id>] [--model <id>] [--no-interactive] | Create a new project: directory tree, config, scaffolded indexes, git repo, and per-project README. --provider picks from anthropic (default), openai, google, ollama. An interactive prompt offers to paste your API key into .env; --no-interactive skips it (CI-friendly) |
| wikiloom save [-m "msg"] [--dry-run] | Commit your manual edits with a human-edit: prefix. Covers pages under wiki/, wikiloom.toml, and prompts under .wikiloom/prompts/ — one command for every human-editable file. Auto-bumps frontmatter.modified, freshens dormant → active |
| wikiloom rebuild-cache | Regenerate the SQLite query cache from manifest + frontmatter. Required after switching the [embeddings] provider or model so existing pages get re-embedded in the new vector space; otherwise an occasional recovery tool |
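
For example, after moving from local fastembed to OpenAI embeddings (provider names from Provider options below):

# in wikiloom.toml, change [embeddings] provider = "fastembed" to "openai"
wikiloom rebuild-cache    # re-embeds every existing page in the new vector space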

Ingestion

| Command | Description |
|---|---|
| wikiloom ingest <file-or-url> [--force] [--no-page-context] | Ingest a source, synthesize pages, link, commit. --force re-runs even if the source was already ingested. --no-page-context disables per-chunk semantic retrieval for this run |

What ingests well

| Tier | Formats | Notes |
|---|---|---|
| Best — clean extraction, strong synthesis | .md, .txt, .rst, text-based .pdf, URLs (http://, https://) | Markdown / plain text are native. PDFs need a text layer — scanned PDFs extract as empty (no OCR). The URL extractor strips nav, ads, and boilerplate before synthesis. |
| Good — supported with caveats | .docx, .pptx, code files (.py, .sql, .js, .ts, .tsx, .jsx, .go, .rs, .java, .rb, .cs, .cpp, .c, .sh), config / IaC (.yaml, .yml, .json, .toml, .dockerfile, .tf, .hcl, .proto, .graphql) | Office docs flatten tables / layout and skip embedded images. Code and config ingest as plain text with language context — strong on docs, comments, and schemas; weaker on pure algorithmic code (the LLM tends to re-describe behavior rather than extract domain knowledge). |
| Not supported today | .xls, .xlsx, .csv (large), images (.png, .jpg, .jpeg, .webp, .gif), standalone .html | Excel: export to .csv or .md first. Large CSV tables don't synthesize well — small ones may work as plain text. Images currently emit a placeholder (no vision / OCR). For HTML, host at a URL and ingest the URL instead. |

URL ingestion: wikiloom ingest https://example.com/page works on static HTML sites — documentation, blog posts, Wikipedia, most MkDocs/Docusaurus/Sphinx-rendered docs. The http:// or https:// scheme is required — bare hostnames like example.com/page are treated as local file paths and fail with "No such file." It does not work on:

  • JavaScript-rendered pages (React / Vue / Next.js client-side apps, most modern product pages)
  • Paywalled or login-gated content
  • Sites with bot protection / WAF (most banks, Cloudflare-protected sites)

For unsupported pages, download as PDF and ingest the PDF instead. URL ingests go through the same extract → synthesize → link → commit pipeline as files; dedup keys on the hash of the extracted text so re-ingesting the same URL with unchanged content is a cheap no-op.

Reading and exploring

| Command | Description |
|---|---|
| wikiloom query "<question>" [--detail] [--max-pages N] | Ask a question grounded in wiki content. --detail shows sources, confidence, and last-modified per source |
| wikiloom query --last-detail | Show detail for the most recent query (no LLM call) |
| wikiloom query --save-last | Save the most recent answer as a wiki/syntheses/ page |
| wikiloom queries [--show <id>] [--save <id>] [--all] | Browse the rolling cache of past query runs. Default lists the 20 most recent (id, timestamp, question snippet, confidence). --show prints the full answer + sources for an entry (no LLM call); --save promotes that entry to a synthesis page. Retention controlled by [query] history_size |
| wikiloom show <page> [--field <name>] [--json] | Show a page's frontmatter. --field extracts one field; chunk_ids flattens across sources |
| wikiloom links <page> | Show all pages linked to and from a given page |
| wikiloom related <page> [-n N] [--save] [--link] | Find pages semantically similar to one. --save writes them into frontmatter; --link appends a "Related Pages" wikilink section to the body |
| wikiloom orphans | List pages with no inbound or outbound wikilinks |
| wikiloom duplicates [--review] [--auto-merge] | Find near-duplicate pairs by slug fuzzy match + embedding cosine. --review walks each pair interactively; --auto-merge batches obvious singular/plural variants |
| wikiloom source <chunk_id> | Print the exact source text the LLM saw for a chunk |

Page lifecycle

| Command | Description |
|---|---|
| wikiloom merge <loser> <winner> [--yes] | Combine two pages — LOSER first, WINNER second (matches "merge X into Y"). Union bodies (preserving human regions), rewrite inbound [[loser]] wikilinks to [[winner]], deprecate the loser |
| wikiloom deprecate <page> [--superseded-by <other>] [--yes] | Soft-remove a page: move to wiki/archive/, set status: deprecated. With --superseded-by, also rewrites every inbound [[X]] wikilink across non-archived pages to the replacement |
| wikiloom purge <page> [--yes] | Permanently remove an already-deprecated page (deletes the archive file AND the manifest entry). Requires typed confirmation by default |
| wikiloom dormant | List candidates (active pages past their window) |
| wikiloom dormant --list-marked | List currently-marked dormant pages |
| wikiloom dormant --windows | Show window config by type |
| wikiloom dormant <page> [--unmark] | Manually mark/unmark a page as dormant |
| wikiloom dormant --review | Walk through dormant candidates interactively |

Maintenance

| Command | Description |
|---|---|
| wikiloom lint [--fix] | Run health checks (broken links, missing frontmatter, duplicates, dormant candidates). Default is check-only — prints a report and exits 1 if issues are found. --fix applies auto-repairs (broken links, frontmatter only — never auto-marks dormant) |
| wikiloom relink | Re-run the linker across every page (useful when new pages were added that earlier pages should link to) |
| wikiloom review | List low-confidence link candidates from pending.json |
| wikiloom review --accept-all | Insert every pending link into its source page |
| wikiloom review --clear | Discard all pending candidates |
| wikiloom reindex | Regenerate root and sub-index files |
| wikiloom protect | Scan for pages whose human-edit flag drifted from git history |
| wikiloom protect --sync | Apply git truth to the manifest + frontmatter |

Observability

| Command | Description |
|---|---|
| wikiloom status | Project overview: page counts by type/status, human-edited count, backlinks, chunks, sources, last event, total tokens + cost |
| wikiloom log [-n N] | Recent LLM / system events from wiki/log.md, newest first |
| wikiloom edits [-n N] | Recent human edits committed via wikiloom save (date, author, subject, hash). Complements wikiloom log for multi-user audit |
| wikiloom cost | Token usage and spend breakdown by event type, with monthly budget percentage |

Project structure

my-wiki/
  README.md               # Per-project orientation (domain, commands, workflow)
  wikiloom.toml           # Project config (LLM, budget, thresholds, dormant windows)
  .env                    # Your API key (gitignored, created via init prompt)
  .env.example            # Committed template showing the env var for your provider
  .wikiloom/              # Customizable templates
    schema.md             # Page schema reference
    prompts/
      ingest.md           # Synthesis prompt — iterate this for quality
      query.md            # Query prompt
      lint.md             # (reserved for future use)
    output_formats/
      ingest_response.json   # JSON schema the LLM must match
      query_response.json
  wiki/                   # The wiki itself (markdown + YAML frontmatter)
    index.md              # Root index
    log.md                # Event log (auto-appended)
    concepts/             # Concept pages
    entities/             # People, orgs, products, tools
    sources/              # One page per ingested document
    syntheses/            # Saved query answers
    decisions/            # Reserved for ADR-style decision pages
    archive/              # Deprecated pages
  raw/                    # Copies of ingested source files
    papers/   articles/   images/   code/   misc/
  _registry/              # Derived state (mostly committed; some gitignored)
    manifest.json         # Page registry (committed)
    backlinks.json        # Wikilink graph (committed)
    pending.json          # Low-confidence link candidates (committed)
    sources.json          # Content-addressed source catalog (committed)
    schema_version.json   # Schema marker for future migrations (committed)
    wiki.db               # SQLite query cache + chunks table (gitignored)
    query_history.json    # Rolling cache of past query results (gitignored)
    ingest_state.json     # Per-chunk progress checkpoint (gitignored)

Configuration

wikiloom.toml lives at the project root. All sections optional — defaults are sensible.

[project]
name = "my-wiki"
domain = "AI research"
schema_version = 1

[llm]
provider = "anthropic"
default_model = "claude-sonnet-4-6"   # Fallback for any LLM-backed command
ingest_model  = ""                    # Optional override for `wikiloom ingest`
query_model   = ""                    # Optional override for `wikiloom query`
max_tokens_per_operation = 8000
monthly_budget_usd = 50.0             # Pre-flight refuses runs that exceed this
parse_retry_count    = 2              # Retries when the LLM returns unparseable JSON; set to 0 to disable

[linking]
ner_model = "en_core_web_sm"
auto_create_stubs = false       # Whether to create stub pages for unresolved entities
high_confidence_threshold = 95
medium_confidence_threshold = 85
low_confidence_threshold = 70

[ingest]
max_file_size_mb = 50           # 0 disables
min_extracted_chars = 16        # Reject empty extractions (e.g. scanned PDFs without OCR)
enable_budget_check = true
use_page_context = true         # Per-chunk semantic retrieval before synthesis
page_context_top_k = 10

[dormant]
default_window_days = 90
entity_window_days = 180
concept_window_days = 120
synthesis_window_days = 60

[search]
engine = "grep"

[query]
history_enabled = true          # Cache successful query results in _registry/query_history.json
history_size    = 100           # How many past queries to retain (newest first; older are trimmed)

[embeddings]
provider = "fastembed"          # local, no API key needed
# provider = "openai"           # needs OPENAI_API_KEY
# provider = "sentence-transformers"  # heavier install
enabled = true

Per-page overrides go in the page's frontmatter — for example, dormant_window_days: 365 on a page makes it slower to go dormant.
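
For example, a concept page whose frontmatter starts like this takes 365 days instead of the 120-day concept default to surface as a dormant candidate (the surrounding fields are illustrative; dormant_window_days is the documented override key):

---
type: concept
status: active
dormant_window_days: 365   # overrides [dormant] concept_window_days for this page
---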

Provider options

WikiLoom uses two providers at runtime: an LLM provider for synthesis and query, and an embeddings provider for semantic search and per-chunk page context retrieval. Both are configured in wikiloom.toml and can be swapped without touching code.

LLM providers

wikiloom init accepts --provider with any of the four presets below. Each preset writes the right provider + default model to wikiloom.toml and generates a matching .env.example. Behind the scenes WikiLoom delegates to litellm, so the model naming convention follows litellm's, and other providers that litellm supports may also work via manual config.

| Provider preset | Where it runs | Requirements | Default model |
|---|---|---|---|
| anthropic (default) | Anthropic API | ANTHROPIC_API_KEY | claude-sonnet-4-6 |
| openai | OpenAI API | OPENAI_API_KEY | gpt-5 |
| google | Google AI Studio API (Gemini) | GEMINI_API_KEY | gemini/gemini-2.5-pro |
| ollama | Local machine | Ollama installed + model pulled locally | llama3 |

Anthropic (default):

wikiloom init my-wiki --provider anthropic

[llm]
provider = "anthropic"
default_model = "claude-sonnet-4-6"

OpenAI:

wikiloom init my-wiki --provider openai

[llm]
provider = "openai"
default_model = "gpt-5"

Google (Gemini):

wikiloom init my-wiki --provider google

[llm]
provider = "google"
default_model = "gemini/gemini-2.5-pro"

Ollama (local, no API key, no cost):

# 1. Install Ollama from https://ollama.com and pull a model
ollama pull llama3
ollama serve
# 2. Init with the ollama preset
wikiloom init my-wiki --provider ollama
# 3. Override model if you want something other than llama3
wikiloom init my-wiki --provider ollama --model gemma3

[llm]
provider = "ollama"
default_model = "llama3"

Split-model setup (recommended for cost): configure ingest_model to a cheap model and query_model to a stronger one. wikiloom ingest does bulk text-to-JSON synthesis that Haiku / Flash / mini-class models handle fine; wikiloom query is low-volume and benefits from the frontier reasoning of Sonnet / 2.5-pro / gpt-5.

Embedding providers

| Provider | Where it runs | Requirements | Default model | Disk impact |
|---|---|---|---|---|
| fastembed | Local | Bundled with the default install | BAAI/bge-small-en-v1.5 | ~66MB, cached in user cache dir |
| openai | OpenAI API | OPENAI_API_KEY; pip install openai | text-embedding-3-small | none |
| sentence-transformers | Local | pip install sentence-transformers | all-MiniLM-L6-v2 | ~500MB on first use |

The model field in [embeddings] is optional — omit it to use the provider's default (column above).

Default (fastembed):

[embeddings]
provider = "fastembed"
enabled = true

OpenAI:

[embeddings]
provider = "openai"
model = "text-embedding-3-small"  # optional; defaults to provider default
enabled = true

sentence-transformers:

[embeddings]
provider = "sentence-transformers"
model = "all-MiniLM-L6-v2"        # optional
enabled = true

To disable embeddings entirely (FTS-only search, no semantic retrieval):

[embeddings]
enabled = false

LLM and embeddings providers are independent — you can mix any LLM with any embeddings backend (e.g., Ollama LLM + fastembed embeddings for fully local operation, or Anthropic LLM + OpenAI embeddings).
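
For example, a fully local, zero-API-cost configuration combines the two local backends:

[llm]
provider = "ollama"
default_model = "llama3"

[embeddings]
provider = "fastembed"    # bundled local embeddings, no API key
enabled = true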

Workflows

Ingest a corpus of related documents

for pdf in papers/*.pdf; do
  wikiloom ingest "$pdf"
done

# Then surface duplicates the LLM may have created
wikiloom duplicates --auto-merge      # safe singular/plural pairs
wikiloom duplicates --review          # interactive triage for the rest

Edit a page by hand

$EDITOR wiki/concepts/transformer.md
wikiloom save                         # commits as human-edit:

If you forget to save, the next writer command (ingest, lint --fix, etc.) will block with a friendly error pointing you here.

Find and merge near-duplicates

wikiloom duplicates                   # see suspect pairs with suggested winner
wikiloom merge concepts/transformer-architecture concepts/transformer

Reconcile contradictions

When ingest detects a contradiction between a new source and an existing page, the contradiction is recorded in frontmatter. To inspect and resolve:

wikiloom show concepts/foo --field contradictions
$EDITOR wiki/concepts/foo.md          # pick the right fact, remove the entry
wikiloom save

Recover from an aborted ingest

If wikiloom ingest aborts mid-way (rate limit, credit exhaustion):

# 1. Fix the underlying problem (top up API credits, wait out rate limit)
# 2. Re-run with --force to retry from scratch
wikiloom ingest path/to/paper.pdf --force

--force re-processes all chunks (including ones that succeeded the first time).

Find what already exists before writing manually

wikiloom query "what do we have on transaction posting?"
wikiloom related concepts/transactions     # semantically similar pages
wikiloom links concepts/transactions       # what's linked to/from it

Clean up periodically

wikiloom lint                         # see issues without fixing
wikiloom lint --fix                   # auto-repair what can be fixed
wikiloom dormant                      # see candidates past their window
wikiloom dormant --review             # decide which to mark
wikiloom orphans                      # pages with no links
wikiloom relink                       # re-run linker across all pages

Tips

Ask specific, well-scoped queries. Retrieval is strongest when your question shares concrete terms with page content. "What's the overdraft fee cap?" pulls the right page cleanly; "tell me about banking" returns a noisy mix and comes back with low confidence. Before querying, skim wiki/concepts/index.md and wiki/sources/index.md to see what's actually covered — you'll write sharper questions and know when a gap is a real coverage issue vs. a retrieval miss. If confidence is low, run --detail to see which sources were consulted: tangential sources mean retrieval didn't find the right pages; relevant-but-thin sources mean the wiki genuinely doesn't cover the topic yet.

Use wikiloom show for inspection. Faster than opening files:

wikiloom show concepts/foo --field sources
wikiloom show concepts/foo --field aliases
wikiloom show concepts/foo --json | jq .source_count

wikiloom save is the only git command you need. Don't git commit manually unless you really want to. WikiLoom auto-commits everything else with the right classifying prefix.

Pin permanent notes above the auto marker. This is the only place that survives wikiloom ingest <file> --force.

Ingesting many files at once. wikiloom ingest accepts more than one source. Pick the input mode that fits — they're mutually exclusive and each feeds the same sequential per-file pipeline with a three-bucket (complete / partial / failed) grand summary at the end:

wikiloom ingest a.pdf b.pdf c.pdf          # variadic positional
wikiloom ingest --batch-file paths.txt     # paths from a text file (blanks and '#' comments skipped)
wikiloom ingest --batch-dir ~/docs/        # every file in a directory (non-recursive, sorted)
find ~/docs -name '*.pdf' | wikiloom ingest --batch-file -   # paths from stdin

Failures are isolated per file: a missing path, extraction error, or rate-limit abort on one file doesn't halt the batch. The grand summary lists any partial / failed files with a retry hint.

Batches above 20 files pause for a confirmation prompt with a rough wall-clock estimate (≈5 minutes per file at max_workers=1). Pass --yes / -y to skip the prompt — required for non-interactive contexts like scripts or backgrounded runs.

For long overnight batches, redirect the stream so your terminal is free:

wikiloom ingest --batch-file paths.txt --yes > batch.log 2>&1 &
tail -f batch.log       # watch progress
wikiloom log            # per-ingest summaries, durable in git

Run wikiloom duplicates after every batch ingest. The LLM occasionally creates near-duplicates (pending-transactions vs pending-transactions-banking); catching them early keeps the wiki clean.

Listing commands are pipeable. wikiloom orphans, wikiloom dormant (both candidate and --list-marked views), wikiloom duplicates, wikiloom related <page>, wikiloom links --list, wikiloom log, and wikiloom edits all detect when stdout isn't a terminal and switch to tab-separated output (one line per item, no headers or tips), so shell pipelines work cleanly:

wikiloom dormant | grep concept              # only concept-type candidates
wikiloom dormant | wc -l                     # total candidate count
wikiloom orphans | head -20                  # first 20 orphans
wikiloom dormant --list-marked | cut -f1     # just page_ids
wikiloom duplicates | grep -i auth           # duplicate pairs mentioning "auth"
wikiloom log | grep ingest                   # ingest events only
wikiloom log | awk -F'\t' '{print $1, $5}'   # timestamp and cost columns

Tab-separated keeps column positions stable when fields like titles or descriptions contain spaces, so cut -f and awk -F'\t' work reliably. Action modes (--review, --auto-merge, --save, --link, --accept-all, --clear) keep their interactive or confirmation output intact. The pretty view also stays when you run commands directly in a terminal. Run wikiloom <command> --help for each command's exact column order.

Customize the synthesis prompt. Open .wikiloom/prompts/ingest.md and iterate — every page WikiLoom produces is a function of that prompt + the chunk. The default works but is generic. For domain-specific corpora, tailored prompts produce noticeably better output.

Read wiki/log.md to see what happened. Every operation appends a structured event with timestamps, token usage, and cost. Useful for cost reviews and auditing.

Browse past query answers without re-running them. Every successful wikiloom query run is appended to _registry/query_history.json (newest first, default 100 entries). wikiloom queries lists recent runs; wikiloom queries --show <id> reprints the full answer + sources without an LLM call; wikiloom queries --save <id> promotes any past entry to a synthesis page. Privacy note: the file is gitignored (per-machine cache, not project state) but stored in plaintext on disk and may contain sensitive prompts. Disable with [query] history_enabled = false in wikiloom.toml, or shrink retention via history_size.

Switching LLM providers. Either re-init in a fresh directory with --provider / --model, or edit [llm] provider + default_model (and optionally ingest_model / query_model) in wikiloom.toml directly. WikiLoom uses litellm under the hood, so any provider it supports works. The model name follows litellm's naming convention.

Backup is just git push. The whole project — wiki, manifest, source catalog, configuration — is in git. Push to any remote and you have a full backup.

Development

Running tests

pytest                       # Full suite (live-API tests skipped by default)
pytest -m live               # Run live-API tests (requires ANTHROPIC_API_KEY)
pytest tests/test_llm.py     # Just the LLM client unit tests

The pytest suite is extensive and fully deterministic in the default run; live-API tests live in test_llm_live.py and are skipped unless explicitly requested.

Customizing prompts

Edit files under .wikiloom/prompts/. Each project's prompts override the packaged defaults. The synthesis loop loads from the project first, falls back to the package.

Customizing JSON output schemas

Edit files under .wikiloom/output_formats/. The synthesis loop validates LLM responses against ingest_response.json before accepting them. Tighten the schema to make the LLM's output more reliable.
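
The packaged schema itself isn't reproduced here, but "tightening" generally means adding standard JSON Schema constraints such as required or minItems. A purely hypothetical fragment (field names are not WikiLoom's actual schema):

{
  "type": "object",
  "required": ["pages"],
  "properties": {
    "pages": { "type": "array", "minItems": 1 }
  }
}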

Contributing

Issues and pull requests welcome at github.com/do-y-lee/wikiloom. For PRs: keep diffs focused, land green tests (pytest), and explain the why in the PR body — the what is in the diff. See Development above for local setup.

License

MIT
