MCP server for reading and searching EPUB/PDF documents
mcp-ebook-read
A local MCP server for Codex to read and retrieve content from EPUB/PDF documents.
One-Command Docker Setup
Qdrant (required)
docker rm -f qdrant 2>/dev/null || true && docker run -d --name qdrant -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.16.3
GROBID (required by startup preflight and document_ingest_pdf_paper)
docker rm -f grobid 2>/dev/null || true && docker run -d --name grobid -p 8070:8070 lfoppiano/grobid:0.8.0
Verify Services
curl -sS http://localhost:6333/collections
curl -sS http://localhost:8070/api/isalive
Expected:
- Qdrant returns JSON with "status":"ok"
- GROBID returns true
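The two checks above can also be scripted. The following is a minimal sketch, not part of the package: the helper names `qdrant_ok`, `grobid_ok`, and `fetch` are hypothetical, and the only behavior assumed is what is stated above (Qdrant answers with JSON containing "status":"ok", GROBID answers with true).

```python
import json
import urllib.request


def qdrant_ok(body: str) -> bool:
    """Return True when the response body is JSON with "status": "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def grobid_ok(body: str) -> bool:
    """Return True when the isalive endpoint answered with the literal true."""
    return body.strip() == "true"


def fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the decoded body (raises on connection errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With both containers running, `qdrant_ok(fetch("http://localhost:6333/collections"))` and `grobid_ok(fetch("http://localhost:8070/api/isalive"))` should both evaluate to True.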
Run MCP Server (PyPI via uvx)
QDRANT_URL=http://localhost:6333 GROBID_URL=http://localhost:8070 GROBID_TIMEOUT_SECONDS=120 uvx mcp-ebook-read
If startup preflight fails, the server exits with a structured error payload on stderr that includes missing env vars and setup hints.
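As an illustration of the fail-fast behavior described above, here is a rough sketch of an env-var preflight. It is not the server's actual code: the payload field names (`error`, `missing_env`, `hints`), the exit code 1, and the function names are all assumptions, and the real preflight also checks that both services are reachable, which this sketch leaves out.

```python
import json
import sys
from typing import Mapping, Optional

REQUIRED_ENV_VARS = ("QDRANT_URL", "GROBID_URL")


def preflight_error(env: Mapping[str, str]) -> Optional[dict]:
    """Return a structured error payload, or None when nothing is missing."""
    missing = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
    if not missing:
        return None
    return {
        "error": "preflight_failed",  # hypothetical field name and value
        "missing_env": missing,       # hypothetical field name
        "hints": [f"Set {name} before starting the server." for name in missing],
    }


def run_preflight(env: Mapping[str, str], stream=sys.stderr) -> int:
    """Write the payload to the stream and return a process exit code."""
    payload = preflight_error(env)
    if payload is None:
        return 0
    print(json.dumps(payload), file=stream)
    return 1  # assumed non-zero exit code; the real value is not documented here
```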
First Run Recommendation
Before configuring this MCP server inside an MCP client, run it once manually from a terminal:
QDRANT_URL=http://localhost:6333 GROBID_URL=http://localhost:8070 GROBID_TIMEOUT_SECONDS=120 uvx mcp-ebook-read
This resolves and caches the runtime dependencies ahead of time, which helps avoid a long delay the first time the MCP client activates the server.
To refresh the uvx-cached package to the latest published version, run:
QDRANT_URL=http://localhost:6333 GROBID_URL=http://localhost:8070 GROBID_TIMEOUT_SECONDS=120 uvx mcp-ebook-read@latest
If you installed the tool persistently via uv tool install, use uv tool upgrade mcp-ebook-read instead.
Environment Variables
Required:
- QDRANT_URL (for example http://127.0.0.1:6333)
- GROBID_URL (for example http://127.0.0.1:8070)
Optional:
- GROBID_TIMEOUT_SECONDS (default 20; recommended 120 for large papers)
- QDRANT_COLLECTION (default mcp_ebook_read_chunks)
- QDRANT_TIMEOUT_SECONDS (default 10)
- FASTEMBED_MODEL (FastEmbed model override)
- FASTEMBED_CACHE_PATH (FastEmbed cache root override; defaults to ~/Library/Caches/mcp-ebook-read/fastembed on macOS and $XDG_CACHE_HOME/mcp-ebook-read/fastembed or ~/.cache/mcp-ebook-read/fastembed elsewhere)
- DOCLING_FORMULA_ENRICHMENT (true by default)
- PDF_FORMULA_REQUIRE_ENGINE (true by default)
- PDF_FORMULA_BATCH_SIZE (auto by default; or an explicit integer)
- PDF_DOCLING_NUM_THREADS (override Docling CPU threads)
- PDF_DOCLING_BATCH_SIZE (override Docling OCR/layout/table batch sizes together)
- PDF_DOCLING_DEVICE (override Docling accelerator device, for example auto or cpu)
- PDF_DOCLING_TUNING_PROFILE_PATH (override the local autotune profile JSON path)
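To make the defaults above concrete, the sketch below resolves a handful of these variables the way the list describes them. The `Settings` class and both function names are hypothetical, and the server's real loader may differ; only the variable names, default values, and cache locations come from the list.

```python
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    qdrant_url: str
    grobid_url: str
    grobid_timeout_seconds: float
    qdrant_collection: str
    qdrant_timeout_seconds: float
    fastembed_cache_path: Path


def default_fastembed_cache(env: Mapping[str, str], platform: str, home: Path) -> Path:
    """Per-user cache root: ~/Library/Caches on macOS, XDG or ~/.cache elsewhere."""
    if platform == "darwin":
        return home / "Library" / "Caches" / "mcp-ebook-read" / "fastembed"
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else home / ".cache"
    return base / "mcp-ebook-read" / "fastembed"


def load_settings(
    env: Mapping[str, str],
    platform: str = sys.platform,
    home: Optional[Path] = None,
) -> Settings:
    """Read required variables (KeyError if absent) and apply documented defaults."""
    home = home or Path.home()
    override = env.get("FASTEMBED_CACHE_PATH")
    cache = Path(override) if override else default_fastembed_cache(env, platform, home)
    return Settings(
        qdrant_url=env["QDRANT_URL"],
        grobid_url=env["GROBID_URL"],
        grobid_timeout_seconds=float(env.get("GROBID_TIMEOUT_SECONDS", "20")),
        qdrant_collection=env.get("QDRANT_COLLECTION", "mcp_ebook_read_chunks"),
        qdrant_timeout_seconds=float(env.get("QDRANT_TIMEOUT_SECONDS", "10")),
        fastembed_cache_path=cache,
    )
```

Calling `load_settings(os.environ)` (after `import os`) would resolve the values for the current shell.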
Persistence Model
- Persistence is sidecar-based and auto-routed by document location.
- For each document, MCP writes state to <document_dir>/.mcp-ebook-read/.
- Sidecar contains:
  - catalog.db
  - docs/<doc_id>/reading/reading.md
  - docs/<doc_id>/assets/...
  - docs/<doc_id>/evidence/...
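The layout above can be expressed as a small path helper. This is only a sketch of the directory convention listed here: the function names are hypothetical, the example `doc_id` is made up, and how the server derives a real `doc_id` is not covered.

```python
from pathlib import PurePosixPath

SIDECAR_DIRNAME = ".mcp-ebook-read"


def sidecar_dir(document_path: str) -> PurePosixPath:
    """Sidecar directory that sits next to the document."""
    return PurePosixPath(document_path).parent / SIDECAR_DIRNAME


def sidecar_layout(document_path: str, doc_id: str) -> dict:
    """Map each documented sidecar artifact to its path for one document."""
    root = sidecar_dir(document_path)
    doc_root = root / "docs" / doc_id
    return {
        "catalog": root / "catalog.db",
        "reading": doc_root / "reading" / "reading.md",
        "assets": doc_root / "assets",
        "evidence": doc_root / "evidence",
    }
```

For a document at /books/ml/paper.pdf, the sidecar directory would be /books/ml/.mcp-ebook-read, shared by every document in that folder.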
Notes
- Use library_scan to discover .pdf/.epub files under a root and register updates/removals.
- After a fresh server restart, call library_scan(root=...) or storage_list_sidecars(root=...) before using tools that only take doc_id.
- Use search for global semantic retrieval and read for locator-based chunk windows.
- Startup preflight is fail-fast and requires both Qdrant and GROBID to be configured and reachable.
- FastEmbed model cache defaults to a stable per-user cache directory under mcp-ebook-read/fastembed instead of the system temp directory.
- FastEmbed startup now performs bounded retries and clears broken per-model cache state before retrying when the local cache is corrupted or a transient download failure leaves incomplete model files behind.
- Use document_ingest_pdf_book to queue a background ingest job for a PDF book.
- Use document_ingest_epub_book to queue a background ingest job for an EPUB book.
- Use document_ingest_pdf_paper to queue a background ingest job for a PDF paper. Docling remains the canonical page-aware outline; GROBID enriches paper metadata and title.
- Use document_ingest_status to poll the current status of one ingest job (or the latest job for a document).
- Use document_ingest_list_jobs to inspect recent ingest job history for one document.
- Use document_autotune_pdf_parser before a long PDF ingest when you want to benchmark a few Docling thread/batch profiles on sampled pages and persist the best local profile for later runs.
- Use search_in_outline_node when you need chapter-scoped retrieval (recommended for reading workflows).
- Use get_outline to fetch document outline nodes before chapter/formula/image scoped reading.
- Use read_outline_node to read a chapter/outline node directly without locator stitching.
- Use render_pdf_page for PDF evidence rendering.
- PDF image extraction is on-demand: ingest does not pre-extract PDF images.
- Use pdf_list_images to trigger/list extracted PDF figure/table images (optionally scoped to one outline node).
- Use pdf_read_image to get one extracted PDF image path plus nearby text context.
- Use pdf_book_list_formulas/pdf_book_read_formula for formula-centric reading on PDF books.
- Use pdf_paper_list_formulas/pdf_paper_read_formula for formula-centric reading on PDF papers.
- Use epub_list_images to list extracted EPUB images (optionally scoped to one outline node).
- Use epub_read_image to get one EPUB image path plus nearby text context.
- Use storage_list_sidecars to inspect sidecar persistence under a root.
- Use storage_delete_document to remove one document's persisted state.
- Use storage_cleanup_sidecars to prune missing docs/orphan artifacts and compact catalogs.
- For large papers, increase GROBID_TIMEOUT_SECONDS (for example 120) to reduce timeout failures.
- PDF ingest now uses a mixed formula pipeline:
  - Docling structure extraction with do_formula_enrichment.
  - Pix2Text as the primary formula recovery engine.
  - Fail-fast when formula markers exist but Pix2Text is unavailable.
- Optional formula env controls:
  - DOCLING_FORMULA_ENRICHMENT (true by default)
  - PDF_FORMULA_REQUIRE_ENGINE (true by default)
  - PDF_FORMULA_BATCH_SIZE (auto by default; auto-detected from CPU and memory, or set an explicit integer)
- Optional Docling performance controls:
  - document_autotune_pdf_parser benchmarks a sampled subset of one PDF and writes the selected profile to a local JSON cache.
  - By default the tuning profile lives at ~/Library/Caches/mcp-ebook-read/docling_pdf_tuning.json on macOS and $XDG_CACHE_HOME/mcp-ebook-read/docling_pdf_tuning.json (or ~/.cache/...) elsewhere.
  - PDF_DOCLING_NUM_THREADS and PDF_DOCLING_BATCH_SIZE override the cached profile when you need a fixed setting.
- Sidecar cleanup is explicit:
  - library_scan no longer triggers threshold-based auto compaction.
  - Use storage_cleanup_sidecars(..., compact_catalog=true) when you want compaction.
- Ingest is now asynchronous by design:
  - the document_ingest_* tools submit work and return immediately with job_id/doc_id;
  - poll document_ingest_status(doc_id=..., job_id=...) until status becomes succeeded or failed;
  - use document_ingest_list_jobs(doc_id=...) when you need recent history or lost the latest job_id.
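The submit-then-poll pattern for asynchronous ingest can be sketched as follows. This is a client-side illustration under stated assumptions: `call_tool` is a hypothetical callable standing in for whatever your MCP client uses to invoke a tool, and the result is assumed to be a dict with a `status` key. Only the tool name, its `doc_id`/`job_id` arguments, and the terminal values succeeded and failed come from the notes above.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATUSES = {"succeeded", "failed"}


def wait_for_ingest(
    call_tool: Callable[[str, Dict[str, Any]], Dict[str, Any]],
    doc_id: str,
    job_id: str,
    interval_seconds: float = 2.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll document_ingest_status until the job succeeds, fails, or times out."""
    deadline = clock() + timeout_seconds
    while True:
        result = call_tool(
            "document_ingest_status", {"doc_id": doc_id, "job_id": job_id}
        )
        if result.get("status") in TERMINAL_STATUSES:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"ingest job {job_id} did not finish in time")
        sleep(interval_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; in normal use the defaults are enough.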
No-Label Formula Benchmark
Use your own non-scanned PDF corpus as a no-label regression baseline (without manual annotations).
uvx mcp-ebook-formula-benchmark \
--samples-dir /ABSOLUTE/PATH/TO/pdf-formula-benchmark-corpus \
--passes 2 \
--max-unresolved-rate 0.15 \
--min-latex-valid-rate 0.85 \
--min-stability-rate 1.0
Output is JSON with per-document metrics and a threshold pass/fail flag. Exit code is 0 when thresholds pass, otherwise 2.
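To show how the three thresholds in the command relate to the exit code, here is a small sketch. It is an assumption-laden illustration, not the benchmark's implementation: the function and class names are hypothetical, the comparisons are assumed to be inclusive, and how the tool aggregates rates across documents and passes is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults mirror the values passed on the command line above.
    max_unresolved_rate: float = 0.15
    min_latex_valid_rate: float = 0.85
    min_stability_rate: float = 1.0


def thresholds_pass(
    unresolved_rate: float,
    latex_valid_rate: float,
    stability_rate: float,
    thresholds: Thresholds = Thresholds(),
) -> bool:
    """True when every rate is on the right side of its threshold (inclusive)."""
    return (
        unresolved_rate <= thresholds.max_unresolved_rate
        and latex_valid_rate >= thresholds.min_latex_valid_rate
        and stability_rate >= thresholds.min_stability_rate
    )


def exit_code(passed: bool) -> int:
    """Documented convention: 0 when thresholds pass, otherwise 2."""
    return 0 if passed else 2
```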
Claude Code MCP Configuration (JSON via uvx)
You can register this server in a Claude Code-compatible mcpServers JSON config.
Published package
{
  "mcpServers": {
    "mcp-ebook-read": {
      "command": "uvx",
      "args": [
        "mcp-ebook-read"
      ],
      "env": {
        "QDRANT_URL": "http://127.0.0.1:6333",
        "QDRANT_COLLECTION": "mcp_ebook_read_chunks",
        "GROBID_URL": "http://127.0.0.1:8070",
        "GROBID_TIMEOUT_SECONDS": "120"
      }
    }
  }
}
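If you prefer to generate this config rather than hand-edit it, a sketch like the one below produces the same structure. The function name `build_mcp_config` is hypothetical; the keys and values mirror the JSON above. Note that the timeout is converted to a string, matching the example, where every env value is a string.

```python
import json


def build_mcp_config(
    qdrant_url: str,
    grobid_url: str,
    grobid_timeout_seconds: int = 120,
    collection: str = "mcp_ebook_read_chunks",
) -> dict:
    """Build an mcpServers entry equivalent to the JSON example."""
    return {
        "mcpServers": {
            "mcp-ebook-read": {
                "command": "uvx",
                "args": ["mcp-ebook-read"],
                "env": {
                    "QDRANT_URL": qdrant_url,
                    "QDRANT_COLLECTION": collection,
                    "GROBID_URL": grobid_url,
                    # Env values are strings in the example config.
                    "GROBID_TIMEOUT_SECONDS": str(grobid_timeout_seconds),
                },
            }
        }
    }


def render_mcp_config(config: dict) -> str:
    """Serialize the config with two-space indentation."""
    return json.dumps(config, indent=2)
```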
Security note
- Do not put real passwords, API keys, or tokens directly in committed JSON files.
- Use environment variables or secret managers, and keep example values as placeholders only.
Codex MCP Configuration (TOML)
You can also configure MCP servers in Codex using TOML style (for example in a Codex MCP config file).
Example
[mcp_servers.mcp-ebook-read]
command = "uvx"
args = [ "mcp-ebook-read" ]
startup_timeout_sec = 60
[mcp_servers.mcp-ebook-read.env]
QDRANT_URL = "http://127.0.0.1:6333"
QDRANT_COLLECTION = "mcp_ebook_read_chunks"
GROBID_URL = "http://127.0.0.1:8070"
GROBID_TIMEOUT_SECONDS = "120"