docshelf-mcp
Put your manuals on a shelf, hand the AI the index.
___ __ ____ ____ _ _ ____ __ ____
/ __)/ \(_ _)/ ___)/ )( \( __)( ) ( __)
( (_ \( O ) )( \___ \) __ ( ) _) / (_/\ ) _)
\___/ \__/ (__) (____/\_)(_/(____)\____/(__)
MCP server for AI-friendly doc shelves
An MCP server that turns a folder of PDFs and Markdown into a chat-project-friendly document collection: AI agents see a single INDEX.md and pull individual sections by raw GitHub URL on demand — instead of choking on a 4 MB datasheet.
Why?
You have 30 hardware manuals, or 200 cooking recipes, or a stack of research PDFs.
You want Claude / ChatGPT / whatever to be able to answer questions across them — but:
- ❌ You can't dump 80 MB of PDFs into a chat project. It won't fit, and you'd burn the context window even if it did.
- ❌ You can manually copy-paste the relevant pages, but only after you remember which manual mentioned the thing you need.
- ❌ Long files mean retrieval is wasteful — the model loads the whole RouterOS guide just to answer a question about VLANs.
docshelf-mcp solves it like this:
- You drop a PDF onto the shelf.
- The shelf converts it to Markdown, splits big files chapter-by-chapter, and regenerates a navigation INDEX.md.
- You commit and push to a public GitHub repo.
- Add only INDEX.md to your Claude project. When the model needs a section, it fetches it via raw.githubusercontent.com.
Result: a 5 KB index pointing at a 50 MB collection. The model reads exactly the chapter it needs.
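For illustration, a generated INDEX.md entry might look roughly like this (repo, branch, and section names here are made up; the real generator's exact layout may differ):

```markdown
# My HomeLab Docs

## routers

### Mikrotik RouterOS — full manual
Official RouterOS reference, split by chapter.
- [001 Overview](https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/routers/mikrotik-routeros/001-overview.md)
- [002 Bridging](https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/routers/mikrotik-routeros/002-bridging.md)
```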
📦 Install
From PyPI (once the first tagged release is published):
# uv (recommended)
uv pip install docshelf-mcp
# or plain pip
pip install docshelf-mcp
Or straight from main (always-latest, no PyPI required):
pip install "git+https://github.com/ignatenkofi/docshelf-mcp"
Optional high-quality PDF engine (pulls ~2 GB of PyTorch — only if you need it):
pip install "docshelf-mcp[high-quality]"
📋 Project Prompt
Drop this into the Custom Instructions of any Claude project that consumes
a docshelf-style INDEX.md:
This project uses the docshelf pattern.
INDEX.md is the entry point. When answering: read INDEX → fetch ONLY the needed section file via its GitHub raw URL (use WebFetch / fetch / curl). Don't load full source files into context. For large manuals split into chapters, follow INDEX → chapter SUBINDEX → section file.
Medium (~150 words) and full (~400 words) versions, plus how-to snippets for
Claude Code, Claude Desktop, and the Anthropic API, live in
docs/PROJECT_PROMPT.md.
Quickstart (Python library)
from docshelf_mcp import Shelf
shelf = Shelf("~/Documents/my-homelab-docs").init(
name="My HomeLab Docs",
remote="https://github.com/me/my-homelab-docs",
default_categories=["routers", "switches", "psu", "motherboards"],
)
shelf.add_document(
"~/Downloads/MIKROTIK_RouterOS.pdf",
category="routers",
title="Mikrotik RouterOS — full manual",
description="Official RouterOS reference, split by chapter.",
)
# → docs/routers/mikrotik-routeros-full-manual.md + docs/routers/.../001-..md, 002-..md, ...
# → INDEX.md is regenerated automatically.
Then in the shelf directory: git add . && git commit -m "docs: add RouterOS" && git push.
In your Claude project, attach only INDEX.md. Done.
Quickstart (MCP server)
1. Add to Claude Desktop
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%/Claude/claude_desktop_config.json (Windows):
{
"mcpServers": {
"docshelf": {
"command": "docshelf-mcp",
"env": {
"DOCSHELF_ROOT": "/Users/me/Documents/my-homelab-docs"
}
}
}
}
Restart Claude Desktop. You now have six new tools available:
| Tool | What it does |
|---|---|
| docshelf_init_shelf | Bootstrap a new shelf directory. |
| docshelf_add_document | Add a PDF/MD file. Converts, splits, re-indexes. |
| docshelf_rebuild_index | Regenerate INDEX.md from disk. |
| docshelf_search | Plain-text search across the shelf, with raw URLs. |
| docshelf_list_documents | List documents by category. |
| docshelf_convert_pdf | Standalone PDF → Markdown (no shelf). |
2. Add to Claude Code
claude mcp add docshelf -- docshelf-mcp
# Optional: set the default shelf
claude mcp add docshelf --env DOCSHELF_ROOT=/path/to/shelf -- docshelf-mcp
3. Test from the command line
# Sanity check — should print the server version then wait on stdin
docshelf-mcp
The shelf layout
my-shelf/
├── .docshelf.json ← shelf metadata: name, remote, category order
├── INDEX.md ← auto-generated navigation (your chat-project file)
├── .gitignore
└── docs/
├── routers/
│ ├── .meta.json ← per-document title/description overrides
│ ├── mikrotik-routeros.md (full document, lightly cleaned)
│ └── mikrotik-routeros/ (auto-split sections)
│ ├── 001-overview.md
│ ├── 002-bridging.md
│ └── 003-firewall.md
└── switches/
└── cudy-gs1010pe.md
Everything in docs/ is committed; everything is fetchable via raw URL once you push to GitHub.
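The path-to-URL mapping is mechanical. A minimal illustrative sketch (raw_url is not part of the docshelf_mcp API; it assumes an HTTPS GitHub remote and that you fetch from the branch you pushed):

```python
def raw_url(remote: str, branch: str, rel_path: str) -> str:
    """Map a shelf-relative path to its raw.githubusercontent.com URL.

    `remote` is an HTTPS GitHub remote like https://github.com/me/my-shelf.
    """
    owner_repo = remote.removeprefix("https://github.com/").removesuffix(".git")
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch}/{rel_path}"

print(raw_url("https://github.com/me/my-homelab-docs", "main",
              "docs/switches/cudy-gs1010pe.md"))
# → https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/switches/cudy-gs1010pe.md
```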
How splitting works
A document is split when both conditions hold:
- UTF-8 size > 50 KB (configurable via .docshelf.json: split_threshold_bytes).
- The document has at least two ## (H2) headings.
The splitter:
- Cleans PDF-extraction noise (collapses runaway blank lines, demotes CLI dumps mistaken for H1s).
- Slices on H2 boundaries.
- Names files NNN-<slug>.md so they sort naturally and survive title changes.
- Wipes the previous split directory before regenerating — fully idempotent.
If you want to keep a document whole, pass split=False.
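The two conditions above can be sketched in a few lines. This mirrors the documented rule only; it is not the library's actual implementation:

```python
import re

DEFAULT_SPLIT_THRESHOLD = 50 * 1024  # bytes; overridable via split_threshold_bytes

def should_split(markdown: str, threshold: int = DEFAULT_SPLIT_THRESHOLD) -> bool:
    # Condition 1: UTF-8 size above the threshold.
    if len(markdown.encode("utf-8")) <= threshold:
        return False
    # Condition 2: at least two H2 headings to slice on.
    return len(re.findall(r"^## ", markdown, flags=re.MULTILINE)) >= 2
```

A document failing either test (a small file, or one giant blob without H2s) stays whole, which is also what split=False forces.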
Examples
See the examples/ directory for three concrete use cases:
- examples/homelab/ — original use case, hardware manuals for a home lab.
- examples/recipes/ — a cookbook with one recipe per file.
- examples/research-papers/ — academic PDFs with abstracts in .meta.json.
Each example shows the directory layout and the INDEX.md you'd end up with.
Optional: high-quality PDF conversion
The default engine (pymupdf4llm) is fast and good enough for ~95% of technical documents. For papers with complex tables, math, or scanned content, install the marker-pdf backend:
pip install "docshelf-mcp[high-quality]"
Then pass quality="high":
shelf.add_document("paper.pdf", category="research", title="...", quality="high")
⚠️ marker-pdf pulls in PyTorch (~2 GB) and is significantly slower (10–60 s per document on CPU). The library import is deferred — if you don't use quality="high", the dependency is never loaded.
FAQ
Why GitHub raw URLs and not embeddings / RAG? Because it's dead simple, costs nothing to host, and the AI is already good at chasing links. You can layer embedding search on top later if you want — the on-disk shape is a normal git repo.
Does this work with private repos?
Not for the raw-URL trick — raw.githubusercontent.com won't serve them without auth. The local search tool works fine on private shelves; you just lose the "AI fetches sections directly" benefit. Make the doc repo public (separate from your code repo).
Do I have to use GitHub?
No. The shelf is just a directory. If you don't set a github_remote, INDEX.md still gets generated — entries just won't have URLs. You can host the static files anywhere that serves raw text (S3, Cloudflare R2, GitLab raw, Gitea, …) and post-process URLs yourself.
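Post-processing the URLs can be as small as a string replacement over the generated INDEX.md. A hypothetical helper (rehost_index is not part of the library):

```python
from pathlib import Path

def rehost_index(index_path: str, old_base: str, new_base: str) -> None:
    """Rewrite raw-URL prefixes in a generated INDEX.md to another static host."""
    p = Path(index_path)
    p.write_text(
        p.read_text(encoding="utf-8").replace(old_base, new_base),
        encoding="utf-8",
    )
```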
Does it edit the source PDFs?
No. PDFs are converted on add_document and the source is left in place. The shelf only writes inside its own directory.
What about non-English documents?
Slugify is Unicode-aware (NFKD-normalized, with \w under re.UNICODE). Cyrillic / CJK titles slug down to ASCII-ish forms; the body Markdown is preserved as-is.
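As a rough sketch of that slug rule (a hypothetical reimplementation, shown only to illustrate the NFKD + Unicode-\w behaviour, not the library's exact code):

```python
import re
import unicodedata

def slugify(title: str) -> str:
    # NFKD-normalize, then keep runs of Unicode word characters;
    # punctuation (including em-dashes) becomes a separator.
    normalized = unicodedata.normalize("NFKD", title)
    return "-".join(w.lower() for w in re.findall(r"\w+", normalized))

print(slugify("Mikrotik RouterOS — full manual"))  # → mikrotik-routeros-full-manual
```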
Can I use it without MCP?
Yes — from docshelf_mcp import Shelf and use the class directly. See docs/USAGE.md.
Limitations
- Public GitHub only for the raw-URL trick (or whatever public static host you wire up).
- Single repo per shelf. If you outgrow one repo, run multiple shelves and attach multiple INDEX.md files.
- Heuristic splitting. The PDF→Markdown extract isn't always clean enough to split cleanly. For pathological cases (some 4+ MB datasheets), keep the file whole and rely on docshelf_search.
- No automatic git commit. Tools regenerate INDEX.md on disk, but the caller (you, or an agent) is responsible for git add / commit / push. This is intentional — staying out of git's way keeps the tool safe to call from agents.
Demo
A short walkthrough video / GIF is planned: https://github.com/ignatenkofi/docshelf-mcp/blob/main/docs/demo.md (coming soon)
Architecture
For a deeper dive, see docs/ARCHITECTURE.md — module layout, data flow, design rationale.
Contributing
Bug reports and PRs welcome. To set up a dev env:
git clone https://github.com/ignatenkofi/docshelf-mcp
cd docshelf-mcp
uv pip install -e ".[dev]"
ruff check src tests
pytest -v
License
MIT — see LICENSE.
Origin
docshelf-mcp started life as a 350-line Python script (homelab-encyclopedia.py) that managed a single homelab manuals repo. The split / index / clean logic is the same code, generalised to work for any category-organised document collection.