OpenZIM MCP - ZIM MCP Server that enables AI models to access and search ZIM format knowledge bases offline


OpenZIM MCP Server

Transform static ZIM archives into dynamic knowledge engines for AI models

🆕 NEW in v1.1.0: Structured tool output. All 17 JSON-returning tools now emit MCP structuredContent alongside the legacy text envelope, which ends double-stringified JSON and its escaping noise. The release also fixes namespace handling for new-scheme archives (list_namespaces / browse_namespace / walk_namespace were silently broken on Wikipedia-style ZIMs) and adds pagination for extract_article_links, case-insensitive find_entry_by_title with proper scoring, and CORS support for browser MCP clients. Details are in the v1.1.0 section.

Looking for the v1.0.0 highlights? Streamable HTTP transport, batch entry retrieval, and per-entry resources are documented in the v1.0.0 section.

Dual Mode Support: Choose between Simple mode (1 intelligent natural language tool, default) or Advanced mode (21 specialized tools, plus 3 MCP prompts and 3 MCP resources) to match your LLM's capabilities.

Built for LLM Intelligence

OpenZIM MCP transforms static ZIM archives into dynamic knowledge engines for Large Language Models. Unlike basic file readers, this tool provides intelligent, structured access that LLMs need to effectively navigate and understand vast knowledge repositories.

Why LLMs Love OpenZIM MCP:

Smart Navigation: Browse by namespace (articles, metadata, media) instead of blind searching
Context-Aware Discovery: Get article structure, relationships, and metadata for deeper understanding
Intelligent Search: Advanced filtering, auto-complete suggestions, and relevance-ranked results
Performance Optimized: Cached operations and pagination prevent timeouts on massive archives
Relationship Mapping: Extract internal/external links to understand content connections

Whether you're building a research assistant, knowledge chatbot, or content analysis system, OpenZIM MCP gives your LLM the structured access patterns it needs to unlock the full potential of offline knowledge archives. No more fumbling through raw text dumps!

OpenZIM MCP is a modern, secure, and high-performance MCP (Model Context Protocol) server that enables AI models to access and search ZIM format knowledge bases offline.

ZIM (Zeno IMproved) is an open file format developed by the openZIM project, designed specifically for offline storage and access to website content. The format supports high compression rates using Zstandard compression (default since 2021) and enables fast full-text searching, making it ideal for storing entire Wikipedia content and other large reference materials in relatively compact files. The openZIM project is sponsored by Wikimedia CH and supported by the Wikimedia Foundation, ensuring the format's continued development and adoption for offline knowledge access, especially in environments without reliable internet connectivity.

Features

Dual Mode Support: Choose between Simple mode (1 intelligent natural language tool, default) or Advanced mode (21 specialized tools)
Streamable HTTP Transport: 🆕 Run as a long-running service over HTTP — bearer-token auth, CORS, health endpoints, multi-arch Docker image, and resource subscriptions
Batch Entry Retrieval: 🆕 Fetch up to 50 entries per call with get_zim_entries — pairs naturally with HTTP, where round-trip cost matters
Per-Entry MCP Resources: 🆕 Stream individual entries via zim://{name}/entry/{path} with native MIME types — browse HTML, PDFs, and images directly
Resource Subscriptions: 🆕 Clients subscribe to zim://files and zim://{name} and receive notifications/resources/updated when archives change
Multi-Archive Search: Search every ZIM file at once with search_all — no need to know which archive holds the answer
MCP Prompts: Pre-built workflow slash commands (/research, /summarize, /explore) that orchestrate multi-step ZIM operations
Find Entries by Title: Resolve titles to entry paths instantly with find_entry_by_title — case-insensitive, optionally cross-file
Binary Content Retrieval: Extract PDFs, images, videos, and other embedded media for multi-agent workflows
Security First: Comprehensive input validation and path traversal protection
High Performance: Intelligent caching and optimized ZIM file operations
Smart Retrieval: Automatic fallback from direct access to search-based retrieval for reliable entry access
Well Tested: 80%+ test coverage with a comprehensive test suite
Modern Architecture: Modular design with dependency injection
Type Safe: Full type annotations throughout the codebase
Configurable: Flexible configuration with validation
Observable: Structured logging and health monitoring

What's new in v1.2.0

Compact mode for `zim_query` (default in simple mode)

Simple mode is intended for small / on-device LLMs whose response budgets are overwhelmed by Wikipedia-scale tool output. A typical extract_article_links response on a "Photosynthesis"-class article is ~36 KB; a single search response with five 3,000-char snippets is ~15 KB; a structure response with 10 sections × preview chunks is ~17 KB. A small LLM makes little use of that prose; it mostly consumes context.

zim_query now accepts a compact: bool = True parameter that switches five intents from JSON to a flat markdown rendering, truncates search snippets to 250 chars, strips Wikipedia-style markdown link syntax ([text](href "tooltip") → text) from article bodies, and applies a hard 6,000-char cap on the final response. On a typical Wikipedia query path the response shrinks ~3-6× while preserving every navigation hook a small LLM uses to drive a follow-up tool call.
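The link stripping and truncation described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the regex, the helper names (strip_links, truncate), and the ellipsis marker are assumptions; only the 250-char and 6,000-char limits come from the text.

```python
import re

# Hypothetical pattern for [text](href) and [text](href "tooltip").
# The server's actual regex is not shown in this document.
_MD_LINK = re.compile(r'\[([^\]\[]*)\]\([^()\s]*(?:\s+"[^"]*")?\)')

SNIPPET_LIMIT = 250    # per-snippet cap stated above
RESPONSE_LIMIT = 6000  # hard cap on the final response stated above


def strip_links(text: str) -> str:
    """Replace each markdown link with its visible text."""
    return _MD_LINK.sub(r"\1", text)


def truncate(text: str, limit: int) -> str:
    """Cut text to at most `limit` characters, marking the cut with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[: limit - 1].rstrip() + "…"


body = 'See [light](A/Light "Light") and [water](A/Water).'
print(truncate(strip_links(body), RESPONSE_LIMIT))
```

Keeping only the link text preserves the article names a small model needs for a follow-up query while dropping the href and tooltip payload.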

Migration note

This is a behavior change for callers that programmatically parsed the legacy JSON shapes returned by query="links in X", query="structure of X", query="find article titled X", query="articles related to X", query="walk namespace X", or query="list namespaces". The previous shapes are still available — pass compact=False to opt out:

zim_query("links in Photosynthesis", options={"compact": False})

Other tool improvements

tell_me_about auto-fetches the article on a strong title match instead of returning a low-confidence search list, and returns the lead section + a section TOC rather than the full body.
The bare-topic gate now accepts non-Latin script topic names (Chinese, Cyrillic, Arabic, Devanagari, Hebrew) — previously the ASCII-only tokenizer silently rejected them and 量子力学 fell through to a low-confidence search.
Conversational filler / meta-instructions ("do both", "try again", "test this tool", "ok") return a short playbook of starter queries instead of a 200k-hit search dominated by stop-word collisions.
0-result searches surface the recovery paths (suggestions for ..., find article titled ...) inline.
The four search-style intents share a compact pagination footer in rendered output.
Hallucinated zim_file_path values ("wikipedia.zim" against an archive at /data/wiki_en.zim) are resolved by basename match rather than rejected with "Access denied".

Polish

ReDoS protection: the markdown link-strip and snippet-truncation regexes now run through the same threading-based timeout wrapper used by the intent parser, so an adversarial unclosed [text](URL cannot cause catastrophic backtracking.
The compact rendering layer moved to its own module (openzim_mcp.compact_renderers) — simple_tools.py is now focused on intent dispatch.

What's new in v2.0.0a2

Phase B of the v2 effort. Introduces the shared response contract for all list-returning tools — a wire-format break from v1.x. Clients must be updated before upgrading.

Response contract (v2)

Every list-returning tool returns the same five contract keys:

| Key | Type | Meaning |
| --- | --- | --- |
| `results` | `list[T]` | The page of items. Empty list on zero hits. |
| `next_cursor` | `str \| null` | Opaque base64-JSON cursor. Pass back as `cursor=` to fetch the next page. `null` on the last page. |
| `total` | `int \| null` | Total count across all pages. `null` when not knowable mid-scan (e.g., `walk_namespace`). |
| `done` | `bool` | `true` when no more pages exist. Always co-varies with `next_cursor`: `done=true` ⟺ `next_cursor=null`. |
| `page_info` | `{offset, limit, returned_count, total_is_lower_bound?}` | Pagination state for this page. |

Plus the Phase A _meta envelope as a sibling.

Pagination input

Every paged tool accepts cursor (preferred) or offset (convenience). If both are supplied, cursor wins.
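A client-side loop over this contract might look like the sketch below. The call_tool callable is a stand-in for whatever your MCP client uses to invoke a tool and return the parsed structured result; it is an assumption, not part of OpenZIM MCP.

```python
from typing import Any, Callable, Iterator


def iter_results(call_tool: Callable[..., dict[str, Any]], **params: Any) -> Iterator[Any]:
    """Yield every item from a paged tool by following next_cursor until done."""
    cursor = None
    while True:
        args = dict(params)
        if cursor is not None:
            args["cursor"] = cursor  # cursor is the preferred pagination input
        page = call_tool(**args)
        yield from page["results"]
        # done=true and next_cursor=null always coincide; check both defensively.
        if page["done"] or page["next_cursor"] is None:
            return
        cursor = page["next_cursor"]
```

Because the cursor is opaque, the loop only passes it back unchanged and never inspects it.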

Cursor format

Cursors are URL-safe base64-encoded JSON of {v: 1, t: <tool_name>, s: <state>}. The cursor is tool-bound — passing one tool's cursor to another raises a clear error. Cursors are opaque — clients should not interpret them.
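For readers implementing a compatible server or a test double, the format can be sketched as follows. The helper names and error messages are hypothetical, and whether the real server keeps or strips base64 padding is not stated here; this sketch keeps it. Clients should still treat cursors as opaque.

```python
import base64
import json
from typing import Any


def encode_cursor(tool: str, state: dict[str, Any]) -> str:
    """Pack {v, t, s} as URL-safe base64 JSON (padding kept; an assumption)."""
    payload = json.dumps({"v": 1, "t": tool, "s": state}, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_cursor(cursor: str, expected_tool: str) -> dict[str, Any]:
    """Decode a cursor and reject it if it was issued by a different tool."""
    try:
        data = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    except (ValueError, UnicodeError) as exc:  # bad base64, bad JSON, non-ASCII
        raise ValueError("malformed cursor") from exc
    if not isinstance(data, dict) or data.get("v") != 1:
        raise ValueError("unsupported cursor version")
    if data.get("t") != expected_tool:
        raise ValueError(f"cursor was issued by {data.get('t')!r}, not {expected_tool!r}")
    return data["s"]
```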

Tools without natural pagination

Tools like find_entry_by_title, get_search_suggestions, and list_zim_files return all matches in one call. They still emit the contract: done=true, next_cursor=null, total=len(results).

`extract_article_links` is scoped by `kind`

extract_article_links returns one category per call (kind=internal by default). Use kind=external or kind=media for the other categories. Per-category counts surface in category_totals: {internal, external, media}.

Wire-format break note

v2.0.0a2 is a wire-format break from v1.x — see CHANGELOG for the per-tool key renames.

What's new in v2.0.0a1

First v2 pre-release and Phase A of the multi-phase v2 effort. All changes are additive at the tool-signature layer; the only behavior change is a small compact-mode prose change for empty search results.

Response metadata (`_meta` envelope)

Every dict-returning tool now includes a _meta key:

{
  "tokens_est": 4283,
  "chars": 17034,
  "truncated": true,
  "more_at_offset": 17000,
  "total_chars": 87421,
  "suggestions": [
    {"type": "alt_spelling", "value": "Photosynthesis"}
  ],
  "reason": "0_hits"
}

tokens_est is a tiktoken cl100k_base estimate plus a 5% pad. Use it for context budgeting; it's accurate to ±10% across common tokenizers. suggestions and reason are present only on empty / low-confidence results.
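One way a client could use these fields for context budgeting is sketched below. The function name and the heuristic (assume the next chunk costs about as much as the current one) are illustrative assumptions; only the _meta keys come from the example above.

```python
from typing import Any, Optional


def next_offset_within_budget(meta: dict[str, Any], remaining_tokens: int) -> Optional[int]:
    """Return the offset to request next, or None if the caller should stop.

    remaining_tokens is the budget available before counting this response.
    """
    if not meta.get("truncated"):
        return None  # the whole body was already returned
    used = meta["tokens_est"]
    # Heuristic: only continue if another chunk of similar size still fits.
    if remaining_tokens - used < used:
        return None
    return meta.get("more_at_offset")


meta = {"tokens_est": 4283, "truncated": True, "more_at_offset": 17000}
print(next_offset_within_budget(meta, remaining_tokens=10000))
```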

Compact-mode prose footer

Simple-mode responses end with a single-line markdown blockquote:

> ~4.2K tokens · 17K of 87K chars · pass `offset=17000` for more

Empty results render the suggestions inline:

> No results. Try: `suggestions for Photosynthesis` · `search photosynthesis chlorophyll` · or try ZIM `wikipedia_en_all`

Set OPENZIM_MCP_META__FOOTER_ENABLED=false to suppress.

Compact-mode infobox & table handling

In compact=True (the simple-mode default):

Wikipedia-style .infobox / .vcard tables become a Markdown KV list prepended to the body. Capped at 30 rows; configurable via OPENZIM_MCP_CONTENT__INFOBOX_KV_LIMIT.
Tables with more than 8 rows or 600 characters of text become a placeholder: [Table N: 47 rows × 6 cols — pass compact=False to expand]. Thresholds configurable via OPENZIM_MCP_CONTENT__TABLE_ROW_THRESHOLD and OPENZIM_MCP_CONTENT__TABLE_CHAR_THRESHOLD.
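The table rule above amounts to a simple either-threshold check. The sketch below reads the two documented environment variables directly with os.environ; the real server presumably goes through its own configuration layer, and the helper names here are hypothetical.

```python
import os


def _int_env(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to the default."""
    raw = os.environ.get(name)
    try:
        return int(raw) if raw is not None else default
    except ValueError:
        return default


def should_collapse_table(rows: int, text_chars: int) -> bool:
    """True when a table exceeds either documented compact-mode threshold."""
    row_limit = _int_env("OPENZIM_MCP_CONTENT__TABLE_ROW_THRESHOLD", 8)
    char_limit = _int_env("OPENZIM_MCP_CONTENT__TABLE_CHAR_THRESHOLD", 600)
    return rows > row_limit or text_chars > char_limit
```

Both comparisons are strict, matching "more than 8 rows or 600 characters": a table with exactly 8 rows and 600 characters is kept.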

compact=False retains behavior byte-identical to v1.2.0.

Typo-tolerant title lookup

find_entry_by_title falls back to single-edit variants (transposition, single-character deletion) when neither direct path lookup nor the libzim suggestion index returns a high-confidence match. Fuzzy-corrected hits have their score multiplied by 0.85 (configurable via OPENZIM_MCP_SEARCH__FUZZY_TITLE_SCORE_PENALTY) so that exact matches always rank higher. The fallback triggers only for queries of at least 4 characters (OPENZIM_MCP_SEARCH__FUZZY_TITLE_MIN_QUERY_LEN).
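To make "single-edit variants" concrete, here is a minimal sketch that generates the two edit types named above. The function name and the ordering of candidates are assumptions; how the server then looks up and ranks each variant is not shown.

```python
def single_edit_variants(title: str, min_len: int = 4) -> list[str]:
    """Return strings one adjacent transposition or one deletion away from title."""
    if len(title) < min_len:
        return []  # too short for the fuzzy fallback
    seen = {title}
    variants: list[str] = []
    for i in range(len(title)):
        candidates = [title[:i] + title[i + 1:]]  # delete character i
        if i + 1 < len(title):
            # swap characters i and i+1
            candidates.append(title[:i] + title[i + 1] + title[i] + title[i + 2:])
        for candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                variants.append(candidate)
    return variants


print("Photosynthesis" in single_edit_variants("Photosynthseis"))
```

The candidate set grows linearly with the title length, so trying each variant against a direct path lookup stays cheap.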

v2 Phase A env vars

| Env var | Default | Purpose |
| --- | --- | --- |
| `OPENZIM_MCP_META__FOOTER_ENABLED` | `true` | Append prose footer in compact mode |
| `OPENZIM_MCP_META__TOKENIZER_ENCODING` | `cl100k_base` | tiktoken encoding for `tokens_est` |
| `OPENZIM_MCP_SEARCH__STRUCTURED_SUGGESTIONS_LIMIT` | `5` | Cap on `_meta.suggestions[]` length |
| `OPENZIM_MCP_SEARCH__FUZZY_TITLE_MIN_QUERY_LEN` | `4` | Minimum query length for fuzzy fallback |
| `OPENZIM_MCP_SEARCH__FUZZY_TITLE_SCORE_PENALTY` | `0.85` | Score multiplier for fuzzy hits |
| `OPENZIM_MCP_CONTENT__TABLE_ROW_THRESHOLD` | `8` | Replace tables with more rows in compact mode |
| `OPENZIM_MCP_CONTENT__TABLE_CHAR_THRESHOLD` | `600` | Replace tables with more chars in compact mode |
| `OPENZIM_MCP_CONTENT__INFOBOX_KV_LIMIT` | `30` | Cap on infobox rows extracted |

What's new in v1.1.0

Structured tool output

The 17 JSON-returning tools now emit MCP structuredContent alongside the legacy content[].text envelope. Old clients keep parsing the text JSON; new clients read the dict directly. The biggest beneficiary is search_all, whose per_file[].result field used to be a pre-rendered markdown blob escaped twice through json.dumps — it's now a real nested dict.

The four prose/markdown tools (search_zim_file, search_with_filters, get_zim_entry, get_main_page) and the simple-mode zim_query stay on -> str by design.

Namespace handling, fixed

In new-scheme ZIM archives (the modern format used by current Kiwix Wikipedia builds), libzim's iterable surface only exposes the C namespace and reaches metadata through archive.metadata_keys. The previous code parsed the first character of each entry path as the namespace, so Evolution looked like namespace 'E', Bob_Dylan like 'B', favicon.png like 'F' — including emoji buckets like '🐜'. search_with_filters(namespace='C') was silently dropping ~95% of legitimate hits.

list_namespaces, browse_namespace, walk_namespace, and the namespace= filter now branch on archive.has_new_namespace_scheme: new-scheme C uses entry_count as an authoritative total, M is enumerated from metadata_keys, W is surfaced via has_main_entry / has_illustration. Old-scheme archives are unaffected.
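The difference between the old and new behavior can be illustrated without libzim. In this sketch the new_scheme flag stands in for archive.has_new_namespace_scheme, and both function names are hypothetical; the old-scheme branch assumes paths of the form 'A/Some_Article'.

```python
def naive_namespace(path: str) -> str:
    """The earlier approach: treat the first character of the path as the namespace."""
    return path[:1]


def namespace_of(path: str, new_scheme: bool) -> str:
    """Branch on the archive scheme instead of guessing from the path."""
    if new_scheme:
        # New-scheme archives expose only content entries when iterated.
        return "C"
    head, sep, _ = path.partition("/")
    # Old-scheme paths carry a one-letter namespace prefix, e.g. 'A/Some_Article'.
    return head if sep and len(head) == 1 else ""


print(naive_namespace("Evolution"), namespace_of("Evolution", new_scheme=True))
```

With the naive version every distinct first letter becomes its own bucket, which is why a namespace='C' filter dropped most legitimate hits.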

Pagination for `extract_article_links`

extract_article_links previously dumped every internal/external/media link in one call. On a heavily-linked Wikipedia article (~6k links) that was ~400 KB and overflowed the response token budget. The tool now accepts limit / offset / kind parameters; full counts ship in total_internal_links / total_external_links / total_media_links so callers can size the next page. Parsed extraction is cached once per entry and sliced in-memory (~40× speedup on cached pages).

Smarter `find_entry_by_title`

The fast path was case-sensitive, so "evolution" against an archive titled "Evolution" missed and fell through to suggestion fallback with a hardcoded score: 0.8. Now the fast path tries five case variants × C/A namespaces; suggestion results get rank-derived scores in (0, 0.95] so an exact case-insensitive match (promoted to 1.0) always outranks partials.

Per-entry resource size caps

zim://{name}/entry/{path} now caps text bodies at 256 KB UTF-8 with a notice pointing at get_zim_entry for paged reads. Oversize binary bodies are refused (not silently clipped, since a sliced PDF/PNG won't open) — callers should use get_binary_entry, which has explicit max_size_bytes and a truncated flag.
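Capping text by UTF-8 byte length needs care so that a multi-byte character is not cut in half. A minimal sketch, assuming 256 KB means 256 × 1024 bytes (the exact constant and the wording of the notice are not given here):

```python
TEXT_CAP_BYTES = 256 * 1024  # assumption: 256 KB interpreted as 262,144 bytes


def cap_utf8(text: str, cap: int = TEXT_CAP_BYTES) -> tuple[str, bool]:
    """Return (possibly shortened text, truncated flag) within a UTF-8 byte cap."""
    raw = text.encode("utf-8")
    if len(raw) <= cap:
        return text, False
    # errors="ignore" drops a trailing partial multi-byte sequence left by the slice.
    return raw[:cap].decode("utf-8", errors="ignore"), True


print(cap_utf8("héllo", cap=2))
```

The same trick does not work for binary bodies, which is why oversize PDFs and images are refused rather than clipped.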

Simple mode actually works

_register_simple_tools was also calling _register_advanced_tools, so simple-mode clients received every advanced tool's schema in the prompt anyway — defeating the entire point of the mode and inflating prefill into the multi-thousand-token range. Fixed: simple mode now registers exactly one tool (zim_query). Confirmed against llama.cpp's MCP webui — single-turn prompt size dropped from ~6,200 tokens to ~1,100.

CORS for browser MCP clients

Two additions to the HTTP transport's CORS layer:

MCP-Protocol-Version is now in allow_headers. Browser MCP clients send this on every post-init request per the MCP spec; without it, the second preflight returned 400 Disallowed CORS headers and the connection dropped.
DELETE is now in allow_methods. The MCP streamable-HTTP SDK uses DELETE for explicit session termination.

Polish

get_server_health now reports a real started_at and uptime_seconds instead of "unknown".
Configuration redaction format changed from the misleading ...data to unambiguous <redacted>/data.
Server-tools timestamps go through a single _utc_now_iso() helper so a response no longer mixes timezone-aware UTC with naive local.
zim_query("") rejects empty input upfront with example queries instead of falling through to a no-op search.
get_search_suggestions schema now documents the 2-character minimum.
The "cache hit rate is low" warning waits until ≥50 accesses before commenting (previously fired at 22% during normal session warm-up).
get_zim_entry's truncation tail now reads "of body content" so callers can tell the limit applies to the body, not the wrapper headers.

What's new in v1.0.0

Streamable HTTP transport

Run OpenZIM MCP as a long-running service. Pass --transport http (or set OPENZIM_MCP_TRANSPORT=http) and the server boots a Starlette app on 127.0.0.1:8000 by default with:

Bearer-token auth — set OPENZIM_MCP_AUTH_TOKEN; comparison is timing-safe and the attempted token is never logged.
Safe-default startup check — the server refuses to bind a non-localhost host without a token. (Bind 127.0.0.1 for local-only access; put a reverse proxy in front for TLS.)
CORS allow-list — explicit origins via OPENZIM_MCP_CORS_ORIGINS; wildcard * is rejected at startup.
Health endpoints — /healthz (liveness) and /readyz (at least one allowed dir is readable). Both exempt from auth so probes work cleanly.
Multi-arch Docker image — ghcr.io/cameronrye/openzim-mcp:1.1.2, builds for linux/amd64 and linux/arm64, runs as non-root.
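The first two items can be sketched with the standard library alone. This is an illustration of the technique, not the server's middleware: the function names, the set of hosts treated as local, and the error message are assumptions.

```python
import hmac
from typing import Optional

LOCAL_HOSTS = {"127.0.0.1", "::1", "localhost"}  # assumed definition of "localhost"


def check_startup(host: str, token: Optional[str]) -> None:
    """Refuse to bind a non-local host when no auth token is configured."""
    if host not in LOCAL_HOSTS and not token:
        raise RuntimeError("refusing to bind a non-localhost host without an auth token")


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header in constant time."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # hmac.compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(
        presented.strip().encode("utf-8"), expected_token.encode("utf-8")
    )
```

Note that neither function logs the presented token, in line with the behavior described above.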

Legacy SSE transport is also available via --transport sse (or OPENZIM_MCP_TRANSPORT=sse) for clients that haven't migrated to streamable-HTTP. SSE does not apply the bearer-token / CORS / health-endpoint middleware, so the server refuses to start with --transport sse bound to anything other than 127.0.0.1/::1/localhost. For exposed deployments use --transport http.

Batch entry retrieval

get_zim_entries fetches up to 50 entries in one call. Per-entry failures don't abort the batch — each result includes its index from the input order plus either content (success) or error (failure). Different zim_file_path values are allowed in one batch, so a multi-archive workflow can fan out from a single search. Single-archive batches can pass bare path strings paired with a top-level zim_file_path default, so the call site stays flat instead of dict-heavy.

Per-entry MCP resources

zim://{name}/entry/{path} exposes individual entries with their native MIME type:

HTML and text entries return text bodies (text/html, text/plain, application/json, ...).
Binary entries (images, PDFs) return raw bytes (FastMCP base64-wraps them).

Encoding requirement: clients MUST URL-encode / as %2F in the {path} segment. FastMCP's URI template engine treats / as a segment separator, so a literal slash won't route. Example: zim://wikipedia_en/entry/C%2FClimate_change. (This is a constraint of the current mcp[cli] SDK.)
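In Python, urllib.parse.quote with safe="" produces the required encoding. The entry_uri helper name is hypothetical.

```python
from urllib.parse import quote


def entry_uri(name: str, path: str) -> str:
    """Build a per-entry resource URI with '/' in the path encoded as %2F."""
    # safe="" makes quote() encode '/' too; by default it would be left as-is.
    return f"zim://{name}/entry/{quote(path, safe='')}"


print(entry_uri("wikipedia_en", "C/Climate_change"))
# zim://wikipedia_en/entry/C%2FClimate_change
```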

Resource subscriptions

Subscribe to zim://files or zim://{name} and the server emits notifications/resources/updated whenever the directory contents change or a .zim file is replaced. Polling interval is configurable (OPENZIM_MCP_WATCH_INTERVAL_SECONDS, default 5 s) and the feature can be disabled with OPENZIM_MCP_SUBSCRIPTIONS_ENABLED=false. Implementation note: this depends on a private FastMCP attribute (_mcp_server) for handler registration.

Polish & fixes

Smarter archive handling

get_related_articles resolves relative hrefs against the source entry's directory and identifies the content namespace correctly on domain-scheme archives (previously returned nothing).
Suggestion fallback uses SuggestionSearcher(archive).suggest(text) (the prior archive.suggest() call didn't exist).
list_zim_files gains a case-insensitive name_filter substring argument; one shared cache slot regardless of filter value.
search_zim_file accepts an opaque cursor parameter; passing the cursor alone resumes pagination without restating the query.

Cleaner content extraction

Heading-id resolution falls through id → mw-headline anchor → preceding <a name=""> → slug, returning (id, source) so consumers can distinguish real anchors from synthetic slugs.
Summary extraction skips USWDS banners and skip-nav blocks above the first <h1> (MedlinePlus / NIH / NIST style sites).
Link extraction drops non-navigable schemes (javascript:, mailto:, tel:, data:, blob:, vbscript:).
Per-entry paths sanitized in get_zim_entries.
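The scheme filter from the list above can be approximated as follows. The function name and the simple split-on-colon parsing are assumptions; the scheme list is the one given above.

```python
NON_NAVIGABLE = {"javascript", "mailto", "tel", "data", "blob", "vbscript"}


def is_navigable(href: str) -> bool:
    """False for links whose scheme cannot be followed as a document."""
    scheme, sep, _ = href.strip().partition(":")
    if not sep:
        return True  # no scheme at all: a relative link
    return scheme.lower() not in NON_NAVIGABLE


links = ["../Light", "https://example.org", "javascript:void(0)", "mailto:a@example.org"]
print([href for href in links if is_navigable(href)])
```

A relative path that happens to contain a colon is still kept, because the text before the colon is not one of the blocked schemes.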

Server hygiene

__version__ reads from importlib.metadata; serverInfo.version reports openzim-mcp's actual version (no longer the FastMCP SDK default).
HTTP transport's subscription watcher starts via wrapped lifespan.
Per-entry zim:// returns libzim's native MIME (was returning a placeholder).

Streamlined scope

v1.0.0 reduces the advanced-mode tool surface from 27 to 21 by removing administrative/inspection helpers that didn't pull their weight: warm_cache, cache_stats, cache_clear, get_random_entry, diagnose_server_state, and resolve_server_conflicts. The cache itself remains; the explicit management tools were dropped. Multi-instance conflict tracking was removed entirely — instance_tracker.py is gone — which means HTTP server instances coexist freely without configuration warnings.

Review pass

End-to-end review pass before tagging: tightened path/PID redaction in error and diagnostics responses, locked OPTIONS /mcp behind auth, fixed cache poisoning on transient libzim errors, resolved redirects before rendering with cycle detection, preserved Unicode in heading slugs (Arabic, Chinese, Cyrillic, Japanese), made rate-limiting atomic, and split zim_operations.py into a zim/ package via mixin classes.

What's new in v0.9.0

Multi-archive search

search_all queries every ZIM file in your allowed directories at once and merges the results — no need to know which archive holds the answer.

MCP Prompts

Three pre-built workflows you can invoke as slash commands in MCP-aware clients:

/research <topic> — search across all archives, then drill into top hits
/summarize <zim_file_path> <entry_path> — TOC + summary + key links
/explore <zim_file_path> — high-level briefing of a ZIM's contents

Find entries by title

find_entry_by_title resolves a title (or partial title) to one or more entry paths, with case-insensitive matching. Cheaper than full-text search when you already know the article name.

Power-user tools

walk_namespace — deterministic cursor-paginated namespace iteration (vs. browse_namespace which samples)
get_related_articles — outbound link-graph neighbours of a given entry

MCP Resources

First use of the MCP resources primitive — your client's resource browser and @-mention picker now see ZIM files directly:

zim://files — index of all available ZIM files
zim://{name} — overview of one ZIM (metadata, namespaces, main page preview)
zim://{name}/entry/{path} (new in 1.0.0) — single entry served with native MIME type (clients must URL-encode / as %2F in the path segment)

Reliability fixes

Namespace listing now deterministically surfaces minority namespaces (M, W, X, I) that random sampling could miss
Search filtering uses streaming scan instead of a hard 1000-hit cap (rare-mime-type filters now return matches that were previously hidden)
Error messages route by failure mode first (no more "check disk space" for "entry not found")

Quick Start

Installation

# Install from PyPI as an isolated CLI tool (recommended)
uv tool install openzim-mcp

# Or install into your current environment with pip
pip install openzim-mcp

Development Installation

For contributors and developers:

# Clone the repository
git clone https://github.com/cameronrye/openzim-mcp.git
cd openzim-mcp

# Install dependencies
uv sync

# Install development dependencies
uv sync --dev

Prepare ZIM Files

Download ZIM files (e.g., Wikipedia, Wiktionary, etc.) from the Kiwix Library and place them in a directory:

mkdir -p ~/zim-files
# Download ZIM files to ~/zim-files/

Running the Server

# Simple mode (default) - 1 intelligent natural language tool
openzim-mcp /path/to/zim/files
python -m openzim_mcp /path/to/zim/files

# Advanced mode - all 21 specialized tools
openzim-mcp --mode advanced /path/to/zim/files
python -m openzim_mcp --mode advanced /path/to/zim/files

# For development (from source)
uv run python -m openzim_mcp /path/to/zim/files
uv run python -m openzim_mcp --mode advanced /path/to/zim/files

# Or using make (development)
make run ZIM_DIR=/path/to/zim/files

Tool Modes

OpenZIM MCP supports two modes:

Simple Mode (default): Provides 1 intelligent tool (zim_query) that accepts natural language queries
Advanced Mode: Exposes all 21 specialized MCP tools for maximum control

MCP Configuration

Add the appropriate snippet to your MCP client's config file (claude_desktop_config.json, Cursor's MCP settings, etc.). The mcpServers wrapper is required by Claude Desktop, Cursor, and most other MCP clients.

Simple Mode (default):

{
  "mcpServers": {
    "openzim-mcp": {
      "command": "openzim-mcp",
      "args": ["/path/to/zim/files"]
    }
  }
}

Advanced Mode:

{
  "mcpServers": {
    "openzim-mcp-advanced": {
      "command": "openzim-mcp",
      "args": ["--mode", "advanced", "/path/to/zim/files"]
    }
  }
}

Alternative configuration using Python module:

{
  "mcpServers": {
    "openzim-mcp": {
      "command": "python",
      "args": [
        "-m",
        "openzim_mcp",
        "/path/to/zim/files"
      ]
    }
  }
}

For development (from source):

{
  "mcpServers": {
    "openzim-mcp": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/openzim-mcp",
        "run",
        "python",
        "-m",
        "openzim_mcp",
        "/path/to/zim/files"
      ]
    }
  }
}

Development

Running Tests

# Run all tests
make test

# Run tests with coverage
make test-cov

# Run specific test file
uv run pytest tests/test_security.py -v

# Run tests with ZIM test data (comprehensive testing)
make test-with-zim-data

# Run integration tests only
make test-integration

# Run tests that require ZIM test data
make test-requires-zim-data

ZIM Test Data Integration

OpenZIM MCP integrates with the official zim-testing-suite for comprehensive testing with real ZIM files:

# Download essential test files (basic testing)
make download-test-data

# Download all test files (comprehensive testing)
make download-test-data-all

# List available test files
make list-test-data

# Clean downloaded test data
make clean-test-data

The test data includes:

Basic files: Small ZIM files for essential testing
Real content: Actual Wikipedia/Wikibooks content for integration testing
Invalid files: Malformed ZIM files for error handling testing
Special cases: Embedded content, split files, and edge cases

Test files are automatically organized by category and priority level.

Code Quality

# Format code
make format

# Run linting
make lint

# Type checking
make type-check

# Run all checks
make check

Project Structure

openzim-mcp/
├── openzim_mcp/                # Main package
│   ├── __init__.py             # Package init, exports __version__ via importlib.metadata
│   ├── __main__.py             # Module entry point (`python -m openzim_mcp`)
│   ├── main.py                 # CLI entry point and arg parsing
│   ├── server.py               # MCP server setup, transport selection
│   ├── http_app.py             # Streamable HTTP / SSE transport, auth, CORS, health
│   ├── config.py               # Pydantic config + env var bindings
│   ├── defaults.py             # Default values and tunables
│   ├── security.py             # Path validation, traversal protection, sanitization
│   ├── error_messages.py       # User-facing error message catalog
│   ├── exceptions.py           # Custom exception hierarchy
│   ├── cache.py                # LRU cache with TTL
│   ├── rate_limiter.py         # Per-client + global token-bucket rate limiting
│   ├── content_processor.py    # HTML→text, heading-id, link extraction
│   ├── async_operations.py     # asyncio helpers and timeouts
│   ├── timeout_utils.py        # Timeout primitives
│   ├── subscriptions.py        # MtimeWatcher and SubscriberRegistry
│   ├── simple_tools.py         # Simple-mode `zim_query` tool
│   ├── intent_parser.py        # Natural-language intent parsing
│   ├── types.py                # Shared TypedDicts
│   ├── constants.py            # Shared constants
│   ├── zim_operations.py       # Backward-compat shim re-exporting from zim/ package
│   ├── zim/                    # ZIM access (split from monolithic zim_operations.py)
│   │   ├── __init__.py         # ZimOperations facade composed of mixins
│   │   ├── archive.py          # Archive open/close, file listing, name resolution
│   │   ├── content.py          # Entry retrieval, summaries, batch get
│   │   ├── namespace.py        # Namespace listing, browse, walk
│   │   ├── search.py           # Full-text + suggestion search; cursor pagination
│   │   └── structure.py        # Article structure, links, related articles
│   └── tools/                  # MCP tool registrations
│       ├── __init__.py
│       ├── file_tools.py       # list_zim_files
│       ├── content_tools.py    # get_zim_entry, get_zim_entries
│       ├── search_tools.py     # search_zim_file, search_all, find_entry_by_title
│       ├── navigation_tools.py # browse_namespace, walk_namespace, search_with_filters, get_search_suggestions
│       ├── structure_tools.py  # get_article_structure, extract_article_links, get_entry_summary, get_table_of_contents, get_binary_entry
│       ├── metadata_tools.py   # get_zim_metadata, get_main_page, list_namespaces
│       ├── server_tools.py     # get_server_health, get_server_configuration
│       ├── resource_tools.py   # MCP resources (zim://files, zim://{name}/...)
│       └── prompts.py          # MCP prompts (/research, /summarize, /explore)
├── tests/                      # Test suite (pytest)
├── website/                    # GitHub Pages site source
├── pyproject.toml              # Project configuration
├── Makefile                    # Development commands
├── Dockerfile                  # Multi-stage container build
└── README.md                   # This file

API Reference

Available Tools

list_zim_files - List all ZIM files in allowed directories

Optional parameters:

name_filter (string, default: ""): Case-insensitive substring; only files whose filename contains it are returned. Empty string lists everything. Useful for narrowing large listings (e.g. "wikipedia", "nginx").

search_zim_file - Search within ZIM file content

Required parameters:

zim_file_path (string): Path to the ZIM file
query (string): Search query term — required unless cursor is provided.

Optional parameters:

limit (integer, default: 10): Maximum number of results to return
offset (integer, default: 0): Starting offset for results (for pagination)
cursor (string): Opaque pagination token from a previous result's next_cursor. When provided, overrides offset/limit with the values encoded in the token, and supplies query if it was not given explicitly. Cursors are only valid for the query they were issued for.

get_zim_entry - Get detailed content of a specific entry in a ZIM file

Required parameters:

zim_file_path (string): Path to the ZIM file
entry_path (string): Entry path, e.g., 'A/Some_Article'

Optional parameters:

max_content_length (integer, default: 100000, minimum: 1000): Maximum length of returned content

Smart Retrieval Features:

Automatic Fallback: If direct path access fails, automatically searches for the entry and uses the exact path found
Path Mapping Cache: Caches successful path mappings for improved performance on repeated access
Enhanced Error Guidance: Provides clear guidance when entries cannot be found, suggesting alternative approaches
Transparent Operation: Works seamlessly regardless of path encoding differences (spaces vs underscores, URL encoding, etc.)

get_zim_entries - Batch retrieve multiple ZIM entries in one call

Pairs naturally with HTTP transport, where round-trip cost matters. Up to 50 entries per batch. Each entry resolves independently — per-entry failures do not abort the batch.

Required parameters:

entries (list): Either a list of entry-path strings (resolved against the default zim_file_path) or a list of {zim_file_path, entry_path} dicts (for multi-archive batches). Limit: 50 per batch.

Optional parameters:

zim_file_path (string): Default archive path; required if entries are bare strings, optional when each dict carries its own.
max_content_length (integer): Per-entry max content length.

Returns: JSON {"results": [...], "succeeded": N, "failed": N}. Each result includes index (input order), success, and either content or error.

Notes: Rate limit is charged per entry, not per batch (anti-bypass).
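
A client that needs more than 50 entries has to split the work into several batches. The sketch below shows one way to do that and to merge the per-batch results back into the caller's order. It assumes a hypothetical call_tool(name, arguments) callable that returns the decoded JSON response; fake_call_tool stands in for a real MCP client so the example runs offline.

```python
MAX_BATCH = 50  # documented per-batch limit


def fetch_all(call_tool, zim_file_path, paths):
    """Fetch any number of entries in batches of at most MAX_BATCH."""
    merged = {"results": [], "succeeded": 0, "failed": 0}
    for start in range(0, len(paths), MAX_BATCH):
        batch = paths[start:start + MAX_BATCH]
        reply = call_tool(
            "get_zim_entries",
            {"zim_file_path": zim_file_path, "entries": batch},
        )
        for item in reply["results"]:
            # index is relative to the batch; shift it to the caller's order
            merged["results"].append({**item, "index": start + item["index"]})
        merged["succeeded"] += reply["succeeded"]
        merged["failed"] += reply["failed"]
    return merged


def fake_call_tool(name, arguments):
    """Stand-in server: every path succeeds unless it ends with 'missing'."""
    results = []
    for i, path in enumerate(arguments["entries"]):
        if path.endswith("missing"):
            results.append({"index": i, "success": False, "error": "not found"})
        else:
            results.append({"index": i, "success": True, "content": f"# {path}"})
    ok = sum(1 for r in results if r["success"])
    return {"results": results, "succeeded": ok, "failed": len(results) - ok}


paths = [f"C/Article_{i}" for i in range(119)] + ["C/missing"]
merged = fetch_all(fake_call_tool, "/path/to/file.zim", paths)
print(merged["succeeded"], merged["failed"])
```

Because a failed entry does not abort its batch, the caller only needs to inspect success on each result rather than retry whole batches.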

get_zim_metadata - Get ZIM file metadata from M namespace entries

Required parameters:

zim_file_path (string): Path to the ZIM file

Returns: JSON string containing ZIM metadata including entry counts, archive information, and metadata entries like title, description, language, creator, etc.

get_main_page - Get the main page entry from W namespace

Required parameters:

zim_file_path (string): Path to the ZIM file

Returns: Main page content or information about the main page entry.

list_namespaces - List available namespaces and their entry counts

Required parameters:

zim_file_path (string): Path to the ZIM file

Returns: JSON string containing namespace information with entry counts, descriptions, and sample entries for each namespace (C, M, W, X, etc.).

browse_namespace - Browse entries in a specific namespace with pagination

Required parameters:

zim_file_path (string): Path to the ZIM file
namespace (string): Namespace to browse (C, M, W, X, A, I, etc.)

Optional parameters:

limit (integer, default: 50, range: 1-200): Maximum number of entries to return
offset (integer, default: 0): Starting offset for pagination

Returns: JSON string containing namespace entries with titles, content previews, and pagination information.

walk_namespace - Deterministic cursor-paginated namespace iteration

Unlike browse_namespace (which samples and may cap at 200 entries on large archives), walk_namespace scans the archive by entry ID from cursor onward. Pair the returned next_cursor with a follow-up call to walk the rest. done: true indicates iteration is complete. Use this for exhaustive enumeration — e.g. dumping every M/* metadata entry, or finding an entry whose path doesn't follow common patterns.

Required parameters:

zim_file_path (string): Path to the ZIM file
namespace (string): Namespace to walk (C, M, W, X, A, I, etc.)

Optional parameters:

cursor (integer, default: 0): Entry ID to resume from
limit (integer, default: 200, range: 1–500): Max entries per page

Returns: JSON with entries, next_cursor, and done flag.
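
The cursor loop described above can be sketched as a generator. This assumes a hypothetical call_tool(name, arguments) callable returning the decoded JSON response; fake_call_tool simulates a five-entry namespace so the loop can be exercised without a server.

```python
def walk_all(call_tool, zim_file_path, namespace, limit=200):
    """Yield every entry in a namespace by following next_cursor until done."""
    cursor = 0
    while True:
        page = call_tool("walk_namespace", {
            "zim_file_path": zim_file_path,
            "namespace": namespace,
            "cursor": cursor,
            "limit": limit,
        })
        yield from page["entries"]
        if page["done"]:
            break
        cursor = page["next_cursor"]


# Illustrative data only; real entries carry more fields.
FAKE_ENTRIES = [{"path": f"M/Key_{i}"} for i in range(5)]


def fake_call_tool(name, arguments):
    """Stand-in server that pages through FAKE_ENTRIES by position."""
    start, limit = arguments["cursor"], arguments["limit"]
    chunk = FAKE_ENTRIES[start:start + limit]
    end = start + len(chunk)
    done = end >= len(FAKE_ENTRIES)
    return {"entries": chunk, "next_cursor": None if done else end, "done": done}


walked = list(walk_all(fake_call_tool, "/path/to/file.zim", "M", limit=2))
print(len(walked))
```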

search_with_filters - Search within ZIM file content with advanced filters

Required parameters:

zim_file_path (string): Path to the ZIM file
query (string): Search query term

Optional parameters:

namespace (string): Optional namespace filter (C, M, W, X, etc.)
content_type (string): Optional content type filter (text/html, text/plain, etc.)
limit (integer, default: 10, range: 1-100): Maximum number of results to return
offset (integer, default: 0): Starting offset for pagination

Returns: Filtered search results with namespace and content type information.

search_all - Search across every ZIM file in the allowed directories

Returns merged per-file results so the caller doesn't need to know which file holds the information. Files that can't be searched (corrupt, no full-text index) are skipped without aborting the rest.

Required parameters:

query (string): Search query term

Optional parameters:

limit_per_file (integer, default: 5, range: 1–50): Max hits per ZIM file
limit (integer): Alias for limit_per_file. If both are provided, limit_per_file wins.

Returns: JSON containing per-file result groups and counts of files searched, files-with-results, and files that failed.
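
The skip-on-failure behaviour and the limit alias can be illustrated with a small sketch. Everything here is an assumption made for illustration: search_one is a hypothetical per-file search callable, the result field names are invented, and clamping an out-of-range limit (rather than rejecting it) is a simplification.

```python
def search_all(search_one, files, query, limit_per_file=None, limit=None):
    """Search each file independently; a failing file never aborts the rest."""
    # limit is an alias; limit_per_file wins when both are given
    if limit_per_file is None:
        limit_per_file = limit if limit is not None else 5
    limit_per_file = max(1, min(50, limit_per_file))

    report = {"results": [], "files_searched": 0,
              "files_with_results": 0, "files_failed": 0}
    for path in files:
        try:
            hits = search_one(path, query, limit_per_file)
        except Exception:
            report["files_failed"] += 1  # corrupt file or missing index
            continue
        report["files_searched"] += 1
        if hits:
            report["files_with_results"] += 1
            report["results"].append({"zim_file_path": path, "hits": hits})
    return report


def fake_search(path, query, limit):
    """Stand-in search: one file is broken, one has no hits."""
    if path == "broken.zim":
        raise OSError("no full-text index")
    if path == "empty.zim":
        return []
    return [f"{query} hit {i}" for i in range(limit)]


report = search_all(fake_search, ["wiki.zim", "broken.zim", "empty.zim"],
                    "biology", limit_per_file=2, limit=10)
print(report["files_searched"], report["files_failed"])
```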

find_entry_by_title - Resolve a title to one or more entry paths

Cheaper than full-text search when the caller knows the article title. Tries an exact normalized C/<Title> match first (fast path), then falls back to libzim's title-indexed suggestion search.

Required parameters:

zim_file_path (string): Path to the ZIM file (used unless cross_file=true)
title (string): Title or partial title to resolve (case-insensitive)

Optional parameters:

cross_file (boolean, default: false): If true, search across all allowed ZIM files
limit (integer, default: 10, range: 1–50): Max results to return

Returns: JSON with query, ranked results, fast_path_hit flag, and files_searched count.

get_search_suggestions - Get search suggestions and auto-complete

Required parameters:

zim_file_path (string): Path to the ZIM file
partial_query (string): Partial search query (minimum 2 characters)

Optional parameters:

limit (integer, default: 10, range: 1-50): Maximum number of suggestions to return

Returns: JSON string containing search suggestions based on article titles and content.

get_article_structure - Extract article structure and metadata

Required parameters:

zim_file_path (string): Path to the ZIM file
entry_path (string): Entry path, e.g., 'C/Some_Article'

Returns: JSON string containing article structure including headings, sections, metadata, and word count.

extract_article_links - Extract internal and external links from an article

Required parameters:

zim_file_path (string): Path to the ZIM file
entry_path (string): Entry path, e.g., 'C/Some_Article'

Returns: JSON string containing categorized links (internal, external, media) with titles and metadata.

get_related_articles - Find articles related to a given entry via outbound links

Composes extract_article_links and deduplicates internal links, returning up to limit outbound targets. (Inbound discovery was removed — it required a bounded full-archive scan that was too expensive for interactive use; reach for full-text search instead.)

Required parameters:

zim_file_path (string): Path to the ZIM file
entry_path (string): Source entry, e.g. 'C/Some_Article'

Optional parameters:

limit (integer, default: 10, range: 1–100): Max results

Returns: JSON with results, a list of up to limit deduplicated outbound link targets.

get_entry_summary - Get a concise article summary

Required parameters:

zim_file_path (string): Path to the ZIM file
entry_path (string): Entry path, e.g., 'C/Some_Article'

Optional parameters:

max_words (integer, default: 200, range: 10-1000): Maximum number of words in the summary

Returns: JSON string containing a concise summary extracted from the article's opening paragraphs, with metadata including title, word count, and truncation status.

Features:

Extracts opening paragraphs while removing infoboxes, navigation, and sidebars
Provides quick article overview without loading full content
Useful for LLMs to understand article context before deciding to read more

get_table_of_contents - Extract hierarchical table of contents

Required parameters:

zim_file_path (string): Path to the ZIM file
entry_path (string): Entry path, e.g., 'C/Some_Article'

Returns: JSON string containing a hierarchical tree structure of article headings (h1-h6), suitable for navigation and content overview.

Features:

Hierarchical tree structure with nested children
Includes heading levels, text, and anchor IDs
Provides heading count and maximum depth statistics
Enables LLMs to navigate directly to specific sections
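
To make the tree shape concrete, here is a minimal sketch of nesting a flat, document-ordered heading list by level. It is not the server's implementation; build_toc is a hypothetical name, and max_depth is assumed to mean nesting depth rather than the largest heading level.

```python
def build_toc(headings):
    """Nest a flat heading list into a tree using a stack of open ancestors."""
    roots = []
    stack = []  # currently open headings, shallowest first
    max_depth = 0
    for heading in headings:
        node = {
            "level": heading["level"],
            "text": heading["text"],
            "id": heading["id"],
            "children": [],
        }
        # Close any heading at the same or a deeper level
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
        max_depth = max(max_depth, len(stack))
    return {"toc": roots, "heading_count": len(headings), "max_depth": max_depth}


flat = [
    {"level": 1, "text": "Evolution", "id": "evolution"},
    {"level": 2, "text": "History", "id": "history"},
    {"level": 2, "text": "Mechanisms", "id": "mechanisms"},
]
toc = build_toc(flat)
print([child["text"] for child in toc["toc"][0]["children"]])
```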

get_binary_entry - Retrieve binary content from a ZIM entry

Required parameters:

zim_file_path (string): Path to the ZIM file
entry_path (string): Entry path, e.g., 'I/image.png' or 'I/document.pdf'

Optional parameters:

max_size_bytes (integer, default: 10MB): Maximum size of content to return. Entries larger than this return metadata only.
include_data (boolean, default: true): Whether to include base64-encoded data. Set to false to retrieve metadata only.

Returns:

JSON string containing:

path: Entry path in ZIM file
title: Entry title
mime_type: Content type (e.g., "application/pdf", "image/png")
size: Size in bytes
size_human: Human-readable size (e.g., "1.5 MB")
encoding: "base64" when data is included, null otherwise
data: Base64-encoded content (if include_data=true and under size limit)
truncated: Boolean indicating if content exceeded size limit

Use Cases:

Retrieve PDFs for processing with PDF parsing tools
Extract images for vision models or OCR tools
Get video/audio files for transcription services
Enable multi-agent workflows with specialized content processors
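
On the client side, the data field has to be base64-decoded before it can be handed to a PDF parser or vision model. The sketch below assumes the response arrives as a JSON string with the fields listed above; extract_bytes is a hypothetical helper and the sample response is fabricated for illustration.

```python
import base64
import json


def extract_bytes(response_text):
    """Return the decoded payload, or None when only metadata came back."""
    entry = json.loads(response_text)
    if entry.get("encoding") != "base64" or not entry.get("data"):
        return None  # include_data=false, or the entry exceeded the size limit
    return base64.b64decode(entry["data"])


# Illustrative response, not captured from a real server.
sample = json.dumps({
    "path": "I/document.pdf",
    "title": "document.pdf",
    "mime_type": "application/pdf",
    "size": 8,
    "encoding": "base64",
    "data": base64.b64encode(b"%PDF-1.7").decode("ascii"),
    "truncated": False,
})
payload = extract_bytes(sample)
print(payload)
```

Checking encoding before decoding keeps the metadata-only case (encoding is null) from raising on a missing data field.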

Examples

Listing ZIM files

{
  "name": "list_zim_files"
}

Response:

Found 1 ZIM files in 1 directories:

[
  {
    "name": "wikipedia_en_100_2025-08.zim",
    "path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "directory": "C:\\zim",
    "size": "310.77 MB",
    "modified": "2025-09-11T10:20:50.148427"
  }
]
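
The request objects in these examples carry the name and arguments fields that an MCP client places in the params of a tools/call request. As a sketch of the full JSON-RPC 2.0 envelope a client would send, assuming a hypothetical tools_call helper:

```python
import itertools
import json

_request_ids = itertools.count(1)


def tools_call(name, arguments=None):
    """Wrap a tool name and arguments in a JSON-RPC 2.0 tools/call request."""
    params = {"name": name}
    if arguments:
        params["arguments"] = arguments
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": params,
    }


request = tools_call("search_zim_file", {"query": "biology", "limit": 3})
print(json.dumps(request, indent=2))
```

Most MCP client libraries build this envelope themselves, so in practice only the name and arguments need to be supplied.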

Searching ZIM files

{
  "name": "search_zim_file",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "query": "biology",
    "limit": 3
  }
}

Response:

Found 51 matches for "biology", showing 1-3:

## 1. Taxonomy (biology)
Path: Taxonomy_(biology)
Snippet: #  Taxonomy (biology) Part of a series on
---
Evolutionary biology
Darwin's finches by John Gould

  * Index
  * Introduction
  * [Main](Evolution "Evolution")
  * Outline

## 2. Protein
Path: Protein
Snippet: #  Protein A representation of the 3D structure of the protein myoglobin showing turquoise α-helices. This protein was the first to have its structure solved by X-ray crystallography. Toward the right-center among the coils, a prosthetic group called a heme group (shown in gray) with a bound oxygen molecule (red).

## 3. Ant
Path: Ant
Snippet: #  Ant Ants
Temporal range: Late Aptian – Present
---
Fire ants
[Scientific classification](Taxonomy_\(biology\) "Taxonomy \(biology\)")
Kingdom:  | [Animalia](Animal "Animal")
Phylum:  | [Arthropoda](Arthropod "Arthropod")
Class:  | [Insecta](Insect "Insect")
Order:  | Hymenoptera
Infraorder:  | Aculeata
Superfamily:  |
Latreille, 1809[1]
Family:  |
Latreille, 1809

Getting ZIM entries

{
  "name": "get_zim_entry",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "entry_path": "Protein",
    "max_content_length": 1500
  }
}

Response:

# Protein

Path: Protein
Type: text/html
## Content

#  Protein

A representation of the 3D structure of the protein myoglobin showing turquoise α-helices. This protein was the first to have its structure solved by X-ray crystallography. Toward the right-center among the coils, a prosthetic group called a heme group (shown in gray) with a bound oxygen molecule (red).

**Proteins** are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another. Proteins differ from one another primarily in their sequence of amino acids, which is dictated by the nucleotide sequence of their genes, and which usually results in protein folding into a specific 3D structure that determines its activity.

A linear chain of amino acid residues is called a polypeptide. A protein contains at least one long polypeptide. Short polypeptides, containing less than 20–30 residues, are rarely considered to be proteins and are commonly called peptides.

... [Content truncated, total of 56,202 characters, only showing first 1,500 characters] ...

Smart Retrieval in Action

Example: Automatic path resolution

{
  "name": "get_zim_entry",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "entry_path": "C/Test Article"
  }
}

Response (showing smart retrieval working):

# Test Article

Requested Path: C/Test Article
Actual Path: C/Test_Article
Type: text/html

## Content

# Test Article

This article demonstrates the smart retrieval system automatically handling
path encoding differences. The system tried "C/Test Article" directly,
then automatically searched and found "C/Test_Article".

... [Content continues] ...

get_server_health - Get server health and statistics

No parameters required.

Returns:

Overall status (healthy / warning / error)
Cache performance metrics (hits, misses, hit rate, size)
Directory and ZIM-file accessibility checks
Recommendations and warnings
Sanitized configuration summary

Example Response:

{
  "timestamp": "2026-05-03T10:42:11.123456",
  "status": "healthy",
  "server_name": "openzim-mcp",
  "uptime_info": {
    "process_id": "[REDACTED]",
    "started_at": "2026-05-03T10:30:00"
  },
  "configuration": {
    "allowed_directories": 1,
    "cache_enabled": true,
    "config_hash": "abc12345..."
  },
  "cache_performance": {
    "enabled": true,
    "size": 4,
    "max_size": 100,
    "hit_rate": 0.62
  },
  "health_checks": {
    "directories_accessible": 1,
    "zim_files_found": 3,
    "permissions_ok": true
  },
  "recommendations": [],
  "warnings": []
}

get_server_configuration - Get detailed server configuration

No parameters required.

Returns: Comprehensive server configuration plus diagnostics. Sensitive fields (PIDs, raw filesystem paths) are redacted/sanitized — diagnostic output is intended to be safe to paste into bug reports.

Example Response:

{
  "configuration": {
    "server_name": "openzim-mcp",
    "allowed_directories": ["[REDACTED]/zim"],
    "allowed_directories_count": 1,
    "cache_enabled": true,
    "cache_max_size": 100,
    "cache_ttl_seconds": 3600,
    "content_max_length": 100000,
    "content_snippet_length": 1000,
    "search_default_limit": 10,
    "config_hash": "abc12345...",
    "server_pid": "[REDACTED]"
  },
  "diagnostics": {
    "validation_status": "ok",
    "warnings": [],
    "recommendations": []
  },
  "timestamp": "2026-05-03T10:42:11.123456"
}

Additional Search Examples

Computer-related search:

{
  "name": "search_zim_file",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "query": "computer",
    "limit": 2
  }
}

Response:

Found 39 matches for "computer", showing 1-2:

## 1. Video game
Path: Video_game
Snippet: #  Video game First-generation _Pong_ console at the Computerspielemuseum Berlin
---
Platforms

## 2. Protein
Path: Protein
Snippet: #  Protein A representation of the 3D structure of the protein myoglobin showing turquoise α-helices. This protein was the first to have its structure solved by X-ray crystallography. Toward the right-center among the coils, a prosthetic group called a heme group (shown in gray) with a bound oxygen molecule (red).

Getting detailed content:

{
  "name": "get_zim_entry",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "entry_path": "Evolution",
    "max_content_length": 1500
  }
}

Response:

# Evolution

Path: Evolution
Type: text/html
## Content

#  Evolution

Part of the Biology series on
---
****
Mechanisms and processes

  * Adaptation
  * Genetic drift
  * Gene flow
  * History of life
  * Maladaptation
  * Mutation
  * Natural selection
  * Neutral theory
  * Population genetics
  * Speciation

... [Content truncated, total of 110,237 characters, only showing first 1,500 characters] ...

Advanced Knowledge Retrieval Examples

Getting ZIM metadata:

{
  "name": "get_zim_metadata",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim"
  }
}

Response:

{
  "entry_count": 100000,
  "all_entry_count": 120000,
  "article_count": 80000,
  "media_count": 20000,
  "metadata_entries": {
    "Title": "Wikipedia (English)",
    "Description": "Wikipedia articles in English",
    "Language": "eng",
    "Creator": "Kiwix",
    "Date": "2025-08-15"
  }
}

Browsing a namespace:

{
  "name": "browse_namespace",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "namespace": "C",
    "limit": 5,
    "offset": 0
  }
}

Response:

{
  "namespace": "C",
  "total_in_namespace": 80000,
  "offset": 0,
  "limit": 5,
  "returned_count": 5,
  "has_more": true,
  "entries": [
    {
      "path": "C/Biology",
      "title": "Biology",
      "content_type": "text/html",
      "preview": "Biology is the scientific study of life..."
    }
  ]
}

Filtered search:

{
  "name": "search_with_filters",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "query": "evolution",
    "namespace": "C",
    "content_type": "text/html",
    "limit": 3
  }
}

Getting article structure:

{
  "name": "get_article_structure",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "entry_path": "C/Evolution"
  }
}

Response:

{
  "title": "Evolution",
  "path": "C/Evolution",
  "content_type": "text/html",
  "headings": [
    {"level": 1, "text": "Evolution", "id": "evolution"},
    {"level": 2, "text": "History", "id": "history"},
    {"level": 2, "text": "Mechanisms", "id": "mechanisms"}
  ],
  "sections": [
    {
      "title": "Evolution",
      "level": 1,
      "content_preview": "Evolution is the change in heritable traits...",
      "word_count": 150
    }
  ],
  "word_count": 5000
}

Getting article summary:

{
  "name": "get_entry_summary",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "entry_path": "C/Evolution",
    "max_words": 100
  }
}

Response:

{
  "title": "Evolution",
  "path": "C/Evolution",
  "content_type": "text/html",
  "summary": "Evolution is the change in heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed from parent to offspring during reproduction...",
  "word_count": 100,
  "is_truncated": true
}

Getting table of contents:

{
  "name": "get_table_of_contents",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "entry_path": "C/Evolution"
  }
}

Response:

{
  "title": "Evolution",
  "path": "C/Evolution",
  "content_type": "text/html",
  "toc": [
    {
      "level": 1,
      "text": "Evolution",
      "id": "evolution",
      "children": [
        {
          "level": 2,
          "text": "History of evolutionary thought",
          "id": "history",
          "children": []
        },
        {
          "level": 2,
          "text": "Mechanisms",
          "id": "mechanisms",
          "children": []
        }
      ]
    }
  ],
  "heading_count": 15,
  "max_depth": 4
}

Getting search suggestions:

{
  "name": "get_search_suggestions",
  "arguments": {
    "zim_file_path": "C:\\zim\\wikipedia_en_100_2025-08.zim",
    "partial_query": "bio",
    "limit": 5
  }
}

Response:

{
  "partial_query": "bio",
  "suggestions": [
    {"text": "Biology", "path": "C/Biology", "type": "title_start_match"},
    {"text": "Biochemistry", "path": "C/Biochemistry", "type": "title_start_match"},
    {"text": "Biodiversity", "path": "C/Biodiversity", "type": "title_start_match"}
  ],
  "count": 3
}

Server Management and Diagnostics Examples

Getting server health:

{
  "name": "get_server_health"
}

Response:

{
  "status": "healthy",
  "server_name": "openzim-mcp",
  "uptime_info": {
    "process_id": "[REDACTED]",
    "started_at": "2026-05-03T10:30:00"
  },
  "cache_performance": {
    "enabled": true,
    "size": 15,
    "max_size": 100,
    "hit_rate": 0.85
  }
}

ZIM Entry Retrieval Best Practices

Smart Retrieval System

OpenZIM MCP implements an intelligent entry retrieval system that automatically handles path encoding inconsistencies common in ZIM files:

How It Works:

Direct Access First: Attempts to retrieve the entry using the provided path exactly as given
Automatic Fallback: If direct access fails, automatically searches for the entry using various search terms
Path Mapping Cache: Caches successful path mappings to improve performance for repeated access
Enhanced Error Guidance: Provides clear guidance when entries cannot be found

Benefits for LLM Users:

Transparent Operation: No need to understand ZIM path encoding complexities
Single Tool Call: Eliminates the need for manual search-first methodology
Reliable Results: Consistent success across different path formats (spaces vs underscores, URL encoding, etc.)
Performance Optimized: Cached mappings improve repeated access speed

Example Scenarios Handled Automatically:

C/Test Article → C/Test_Article (space to underscore conversion)
C/Café → C/Caf%C3%A9 (URL encoding differences)
A/Some-Page → A/Some_Page (hyphen to underscore conversion)
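
The scenarios above can be approximated with a few string transformations. Note the substitution: the server's fallback runs a search, whereas this sketch simply tries common spelling variants of the requested path and remembers the one that worked. candidate_paths and resolve are hypothetical names, and exists stands in for an archive lookup.

```python
from urllib.parse import quote, unquote


def candidate_paths(entry_path):
    """Return the requested path followed by common encoding variants."""
    namespace, sep, name = entry_path.partition("/")
    variants = [
        name,
        name.replace(" ", "_"),
        name.replace("-", "_"),
        quote(name, safe="_()"),  # percent-encode non-ASCII and spaces
        unquote(name),            # and the reverse direction
    ]
    ordered = []
    for variant in variants:
        path = f"{namespace}{sep}{variant}"
        if path not in ordered:
            ordered.append(path)
    return ordered


def resolve(entry_path, exists, cache):
    """Try the path as given, then its variants; cache the successful mapping."""
    if entry_path in cache:
        return cache[entry_path]
    for path in candidate_paths(entry_path):
        if exists(path):
            cache[entry_path] = path
            return path
    return None


archive = {"C/Test_Article", "C/Caf%C3%A9"}
cache = {}
print(resolve("C/Test Article", archive.__contains__, cache))
print(resolve("C/Café", archive.__contains__, cache))
```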

Usage Recommendations

For Direct Entry Access:

{
  "name": "get_zim_entry",
  "arguments": {
    "zim_file_path": "/path/to/file.zim",
    "entry_path": "C/Article_Name"
  }
}

When Entry Not Found: The system will automatically provide guidance:

Entry not found: 'C/Article_Name'.
The entry path may not exist in this ZIM file.
Try using search_zim_file() to find available entries,
or browse_namespace() to explore the file structure.

Important Notes and Limitations

Content Length Requirements

The max_content_length parameter for get_zim_entry must be at least 1000 characters
Content longer than the specified limit will be truncated with a note showing the total character count
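
A minimal sketch of these two rules, assuming the truncation note follows the wording seen in the example responses earlier in this document; truncate_content is a hypothetical name, not the server's function.

```python
MIN_CONTENT_LENGTH = 1000  # documented minimum for max_content_length


def truncate_content(text, max_content_length=100000):
    """Cut text to the limit and append a note with the total length."""
    if max_content_length < MIN_CONTENT_LENGTH:
        raise ValueError(
            f"max_content_length must be at least {MIN_CONTENT_LENGTH}"
        )
    if len(text) <= max_content_length:
        return text
    note = (
        f"... [Content truncated, total of {len(text):,} characters, "
        f"only showing first {max_content_length:,} characters] ..."
    )
    return f"{text[:max_content_length]}\n\n{note}"


shortened = truncate_content("x" * 56202, max_content_length=1500)
print(shortened[-95:])
```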

Search Behavior

Search results may include articles that contain the search terms in various contexts
Results are ranked by relevance but may not always be directly related to the primary meaning of the search term
Search snippets provide a preview of the content but may not show the exact location where the search term appears

File Format Support

Currently supports ZIM files (Zeno IMproved format)
Tested with Wikipedia ZIM files (e.g., wikipedia_en_100_2025-08.zim)
File paths must be properly escaped in JSON (use \\ for Windows paths)
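
Building requests with a JSON library avoids escaping Windows paths by hand. The short example below uses Python's standard json module to show that single backslashes in the path become doubled in the serialized request and survive a round trip.

```python
import json

# Raw string: the path contains single backslashes
zim_path = r"C:\zim\wikipedia_en_100_2025-08.zim"

request = {"name": "get_zim_metadata", "arguments": {"zim_file_path": zim_path}}
encoded = json.dumps(request)
print(encoded)  # backslashes appear doubled in the JSON text

# Decoding restores the original single-backslash path
decoded_path = json.loads(encoded)["arguments"]["zim_file_path"]
```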

Configuration

OpenZIM MCP supports configuration through environment variables with the OPENZIM_MCP_ prefix:

# Cache configuration
export OPENZIM_MCP_CACHE__ENABLED=true
export OPENZIM_MCP_CACHE__MAX_SIZE=200
export OPENZIM_MCP_CACHE__TTL_SECONDS=7200

# Content configuration
export OPENZIM_MCP_CONTENT__MAX_CONTENT_LENGTH=200000
export OPENZIM_MCP_CONTENT__SNIPPET_LENGTH=2000
export OPENZIM_MCP_CONTENT__DEFAULT_SEARCH_LIMIT=20

# Logging configuration
export OPENZIM_MCP_LOGGING__LEVEL=DEBUG
export OPENZIM_MCP_LOGGING__FORMAT="%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Server configuration
export OPENZIM_MCP_SERVER_NAME=my_openzim_mcp_server

Configuration Options

Setting	Default	Description
`OPENZIM_MCP_TOOL_MODE`	`simple`	Tool surface: `simple` (one `zim_query` tool) or `advanced` (21 specialized tools). Controlled by `--tool-mode` on the CLI as well.
`OPENZIM_MCP_TRANSPORT`	`stdio`	Transport protocol: `stdio`, `http`, or `sse`.
`OPENZIM_MCP_HOST`	`127.0.0.1`	HTTP/SSE bind host. Non-loopback hosts require `OPENZIM_MCP_AUTH_TOKEN`.
`OPENZIM_MCP_PORT`	`8000`	HTTP/SSE bind port.
`OPENZIM_MCP_AUTH_TOKEN`	(unset)	Bearer token required when binding HTTP/SSE to a non-loopback interface.
`OPENZIM_MCP_CORS_ORIGINS`	(empty)	JSON array of allowed CORS origins for the HTTP transport. Wildcard `*` is rejected.
`OPENZIM_MCP_ALLOWED_HOSTS`	(empty)	JSON array of public-facing hostnames the HTTP transport accepts in the `Host` header (e.g. `["mcp.example.com"]`). Loopback is always allowed; this extends it for reverse-proxy and Tailscale-serve deployments. Wildcard `*` is rejected.
`OPENZIM_MCP_SUBSCRIPTIONS_ENABLED`	`true`	Enable MCP resource subscriptions (HTTP transport only). When `false`, `subscribe` calls succeed but no updates fire.
`OPENZIM_MCP_WATCH_INTERVAL_SECONDS`	`5`	Polling interval (1–60s) for the subscription mtime watcher.
`OPENZIM_MCP_CACHE__ENABLED`	`true`	Enable/disable caching
`OPENZIM_MCP_CACHE__MAX_SIZE`	`100`	Maximum cache entries
`OPENZIM_MCP_CACHE__TTL_SECONDS`	`3600`	Cache TTL in seconds
`OPENZIM_MCP_CONTENT__MAX_CONTENT_LENGTH`	`100000`	Max content length
`OPENZIM_MCP_CONTENT__SNIPPET_LENGTH`	`1000`	Max snippet length
`OPENZIM_MCP_CONTENT__DEFAULT_SEARCH_LIMIT`	`10`	Default search result limit
`OPENZIM_MCP_LOGGING__LEVEL`	`INFO`	Logging level
`OPENZIM_MCP_LOGGING__FORMAT`	`%(asctime)s - %(name)s - %(levelname)s - %(message)s`	Log message format
`OPENZIM_MCP_SERVER_NAME`	`openzim-mcp`	Server instance name
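
The variable names suggest a convention in which the OPENZIM_MCP_ prefix is stripped and a double underscore separates a settings group from its field. The sketch below illustrates that reading only; it is an assumption, not the server's loader. Values are left as strings, and parse_origins shows how a JSON-array setting such as OPENZIM_MCP_CORS_ORIGINS could be validated.

```python
import json

PREFIX = "OPENZIM_MCP_"


def load_settings(environ):
    """Group prefixed variables into nested dicts, splitting on '__'."""
    settings = {}
    for key, value in environ.items():
        if not key.startswith(PREFIX):
            continue
        parts = key[len(PREFIX):].lower().split("__")
        node = settings
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value  # left as a string in this sketch
    return settings


def parse_origins(raw):
    """Parse a JSON array of origins, rejecting the wildcard."""
    origins = json.loads(raw) if raw else []
    if "*" in origins:
        raise ValueError("wildcard origins are rejected")
    return origins


env = {
    "OPENZIM_MCP_CACHE__MAX_SIZE": "200",
    "OPENZIM_MCP_CACHE__TTL_SECONDS": "7200",
    "OPENZIM_MCP_SERVER_NAME": "my_openzim_mcp_server",
    "HOME": "/home/user",
}
settings = load_settings(env)
print(settings)
```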

Security Features

Path Traversal Protection: Secure path validation prevents access outside allowed directories
Input Sanitization: All user inputs are validated and sanitized
Resource Management: Proper cleanup of ZIM archive resources
Error Handling: Sanitized error messages prevent information disclosure
Type Safety: Full type annotations prevent type-related vulnerabilities
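
As an illustration of the path traversal check, the following sketch resolves a candidate path and accepts it only if it falls inside an allowed directory. It is a generic pattern built on pathlib, not the project's validator; is_allowed is a hypothetical name and the paths are examples.

```python
from pathlib import Path


def is_allowed(candidate, allowed_dirs):
    """True if candidate resolves to a location inside an allowed directory."""
    resolved = Path(candidate).resolve()  # collapses ".." and follows symlinks
    for allowed in allowed_dirs:
        root = Path(allowed).resolve()
        # Compare path components, not string prefixes
        if resolved == root or root in resolved.parents:
            return True
    return False


print(is_allowed("/srv/zim/wikipedia.zim", ["/srv/zim"]))
print(is_allowed("/srv/zim/../secrets/key.pem", ["/srv/zim"]))
```

Comparing resolved path components avoids the classic prefix mistake in which /srv/zim-other would pass a naive startswith("/srv/zim") test.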

Performance Features

Intelligent Caching: LRU cache with TTL for frequently accessed content
Resource Pooling: Efficient ZIM archive management
Optimized Content Processing: Fast HTML to text conversion
Lazy Loading: Components initialized only when needed
Memory Management: Proper cleanup and resource management
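
To show what an LRU cache with TTL involves, here is a compact sketch built on OrderedDict. It is not the project's cache: the class name is hypothetical, the defaults mirror the documented configuration defaults, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Least-recently-used cache whose entries also expire after a TTL."""

    def __init__(self, max_size=100, ttl_seconds=3600, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        item = self._data.get(key)
        if item is None or item[0] <= self.clock():
            self._data.pop(key, None)  # drop the expired entry, if any
            self.misses += 1
            return None
        self._data.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return item[1]

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


now = [0.0]  # fake clock so expiry can be simulated
cache = TTLCache(max_size=2, ttl_seconds=10, clock=lambda: now[0])
cache.set("C/Protein", "protein text")
cache.set("C/Ant", "ant text")
cache.get("C/Protein")             # refreshes C/Protein
cache.set("C/Evolution", "evo")    # evicts C/Ant, the least recently used
```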

Testing

The project includes comprehensive testing with 80%+ coverage using both mock data and real ZIM files:

Test Categories

Unit Tests: Individual component testing with mocks
Integration Tests: End-to-end functionality testing with real ZIM files
Security Tests: Path traversal and input validation testing
Performance Tests: Cache and resource management testing
Format Compatibility: Testing with various ZIM file formats and versions
Error Handling: Testing with invalid and malformed ZIM files

Test Infrastructure

OpenZIM MCP uses a hybrid testing approach:

Mock-based tests: Fast unit tests using mocked libzim components
Real ZIM file tests: Integration tests using official zim-testing-suite files
Automatic test data management: Download and organize test files as needed

Test Data Sources

Built-in test data: Basic test files included in the repository
zim-testing-suite integration: Official test files from the OpenZIM project
Environment variable support: ZIM_TEST_DATA_DIR for custom test data locations

# Run tests with coverage report
make test-cov

# View coverage report (open is the macOS command; use xdg-open on Linux)
open htmlcov/index.html

# Run comprehensive tests with real ZIM files
make test-with-zim-data

Test Markers

Tests are organized with pytest markers:

@pytest.mark.requires_zim_data: Tests requiring ZIM test data files
@pytest.mark.integration: Integration tests
@pytest.mark.slow: Long-running tests

Monitoring

OpenZIM MCP provides built-in monitoring capabilities:

Health Checks: Server health and status monitoring
Cache Metrics: Cache hit rates and performance statistics
Structured Logging: JSON-formatted logs for easy parsing
Error Tracking: Comprehensive error logging and tracking

Versioning

This project uses Semantic Versioning with automated version management through release-please.

Automated Releases

Version bumps and releases are automated based on Conventional Commits:

feat: - New features (minor version bump)
fix: - Bug fixes (patch version bump)
feat!: or BREAKING CHANGE: - Breaking changes (major version bump)
perf: - Performance improvements (patch version bump)
docs:, style:, refactor:, test:, chore: - No version bump

Release Process

The project uses a consolidated release system with automatic validation:

Automatic (Recommended): Push conventional commits → Release Please creates PR → Merge PR → Automatic release
Manual: Use GitHub Actions UI for direct control over releases
Emergency: Push tags directly for critical fixes

Key Features:

Zero-touch releases from main branch
Automatic version synchronization validation
Comprehensive testing before every release
Improved error handling and rollback capabilities
Branch protection prevents broken releases

The release flow is implemented in .github/workflows/release-please.yml and .github/workflows/release.yml.

Commit Message Format

<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

Examples:

feat: add search suggestions endpoint
fix: resolve path traversal vulnerability
feat!: change API response format
docs: update installation instructions
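
The mapping from commit type to version bump listed under Automated Releases can be sketched as a small parser. This is an illustration of the rules as stated in this document, not release-please's own logic; version_bump and the regular expression are assumptions.

```python
import re

# <type>[optional scope][!]: <description>
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<bang>!)?: (?P<desc>.+)$"
)


def version_bump(message):
    """Return 'major', 'minor', 'patch', or None for a commit message."""
    header = message.splitlines()[0] if message else ""
    match = HEADER.match(header)
    if match is None:
        return None  # not a conventional commit
    if match["bang"] or "BREAKING CHANGE:" in message:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] in ("fix", "perf"):
        return "patch"
    return None  # docs, style, refactor, test, chore


for msg in (
    "feat: add search suggestions endpoint",
    "fix: resolve path traversal vulnerability",
    "feat!: change API response format",
    "docs: update installation instructions",
):
    print(msg, "->", version_bump(msg))
```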

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes
Run tests (make check)
Use conventional commit messages (git commit -m 'feat: add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Guidelines

Follow PEP 8 style guidelines
Add type hints to all functions
Write tests for new functionality
Update documentation as needed
Use conventional commit messages for automatic versioning
Ensure all tests pass before submitting

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Kiwix for the ZIM format and libzim library
MCP for the Model Context Protocol
The open-source community for the excellent libraries used in this project

Made with ❤️ by Cameron Rye


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

openzim_mcp-2.0.0a12.tar.gz (574.7 kB view details)

Uploaded May 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

openzim_mcp-2.0.0a12-py3-none-any.whl (320.1 kB view details)

Uploaded May 14, 2026 Python 3

File details

Details for the file openzim_mcp-2.0.0a12.tar.gz.

File metadata

Download URL: openzim_mcp-2.0.0a12.tar.gz
Upload date: May 14, 2026
Size: 574.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for openzim_mcp-2.0.0a12.tar.gz
Algorithm	Hash digest
SHA256	`f8d5cb548426ca2ae5b5c71c19828f09a127d966167f3591d21cac0f02ba6d71`
MD5	`fa496ba2a65df20b41eede5165409eef`
BLAKE2b-256	`491b5d454d5c5bf33a3a59e447c3953c5cdfcdc9ac16246073084f421ea63e08`

See more details on using hashes here.

Provenance

The following attestation bundles were made for openzim_mcp-2.0.0a12.tar.gz:

Publisher: release.yml on cameronrye/openzim-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openzim_mcp-2.0.0a12.tar.gz
- Subject digest: f8d5cb548426ca2ae5b5c71c19828f09a127d966167f3591d21cac0f02ba6d71
- Sigstore transparency entry: 1530155860
- Sigstore integration time: May 14, 2026
Source repository:
- Permalink: cameronrye/openzim-mcp@de5c40815a0fabce77dc5542dfbd76e551f04487
- Branch / Tag: refs/tags/v2.0.0a12
- Owner: https://github.com/cameronrye
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@de5c40815a0fabce77dc5542dfbd76e551f04487
- Trigger Event: push

File details

Details for the file openzim_mcp-2.0.0a12-py3-none-any.whl.

File metadata

Download URL: openzim_mcp-2.0.0a12-py3-none-any.whl
Upload date: May 14, 2026
Size: 320.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for openzim_mcp-2.0.0a12-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e353b75c71c0d052e0df75071c267ff8b9272de7cbdcfc3eb3702ed219c19008`
MD5	`c6aca0609c2d5fb589f1b4fd30e90e00`
BLAKE2b-256	`c698473d62094e069d63145ef9e62ebaa48f2fb6d6cdde64ed1a7bab90ee891b`

See the pip documentation on hash-checking mode for more details on using hashes.

Provenance

The following attestation bundles were made for openzim_mcp-2.0.0a12-py3-none-any.whl:

Publisher: release.yml on cameronrye/openzim-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openzim_mcp-2.0.0a12-py3-none-any.whl
- Subject digest: e353b75c71c0d052e0df75071c267ff8b9272de7cbdcfc3eb3702ed219c19008
- Sigstore transparency entry: 1530155985
- Sigstore integration time: May 14, 2026
Source repository:
- Permalink: cameronrye/openzim-mcp@de5c40815a0fabce77dc5542dfbd76e551f04487
- Branch / Tag: refs/tags/v2.0.0a12
- Owner: https://github.com/cameronrye
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@de5c40815a0fabce77dc5542dfbd76e551f04487
- Trigger Event: push

openzim-mcp 2.0.0a12

Project description

OpenZIM MCP Server

Built for LLM Intelligence

Features

What's new in v1.2.0

Compact mode for zim_query (default in simple mode)

Migration note

Other tool improvements

Polish

What's new in v2.0.0a2

Response contract (v2)

Pagination input

Cursor format

Tools without natural pagination

extract_article_links requires kind

Wire-format break note

What's new in v2.0.0a1

Response metadata (_meta envelope)

Compact-mode prose footer

Compact-mode infobox & table handling

Typo-tolerant title lookup

v2 Phase A env vars

What's new in v1.1.0

Structured tool output

Namespace handling, fixed

Pagination for extract_article_links

Smarter find_entry_by_title

Per-entry resource size caps

Simple mode actually works

CORS for browser MCP clients

Polish

What's new in v1.0.0

Streamable HTTP transport

Batch entry retrieval

Per-entry MCP resources

Resource subscriptions

Polish & fixes

What's new in v0.9.0

Multi-archive search

MCP Prompts

Find entries by title

Power-user tools

MCP Resources

Reliability fixes

Quick Start

Installation

Development Installation

Prepare ZIM Files

Running the Server

Tool Modes

MCP Configuration

Development

Running Tests

ZIM Test Data Integration

Code Quality

Project Structure

API Reference

Available Tools

list_zim_files - List all ZIM files in allowed directories

search_zim_file - Search within ZIM file content

get_zim_entry - Get detailed content of a specific entry in a ZIM file

get_zim_entries - Batch retrieve multiple ZIM entries in one call

get_zim_metadata - Get ZIM file metadata from M namespace entries

get_main_page - Get the main page entry from W namespace

list_namespaces - List available namespaces and their entry counts

browse_namespace - Browse entries in a specific namespace with pagination

walk_namespace - Deterministic cursor-paginated namespace iteration

search_with_filters - Search within ZIM file content with advanced filters

search_all - Search across every ZIM file in the allowed directories

find_entry_by_title - Resolve a title to one or more entry paths

get_search_suggestions - Get search suggestions and auto-complete
