MCP server for managing AI-friendly document collections — convert PDFs, split by chapter, index for chat projects.

Maintainers

ignatenkofi

docshelf-mcp

Put your manuals on a shelf, hand the AI the index.

   ___  __  ____  ____  _  _  ____  __    ____
  / __)/  \(_  _)/ ___)/ )( \(  __)(  )  (  __)
 ( (_ \(  O ) )(  \___ \) __ ( ) _) / (_/\ ) _)
  \___/ \__/ (__) (____/\_)(_/(____)\____/(__)
       MCP server for AI-friendly doc shelves

An MCP server that turns a folder of PDFs and Markdown into a chat-project-friendly document collection: AI agents see a single INDEX.md and pull individual sections by raw GitHub URL on demand — instead of choking on a 4 MB datasheet.

Why?

You have 30 hardware manuals, or 200 cooking recipes, or a stack of research PDFs.

You want Claude / ChatGPT / whatever to be able to answer questions across them — but:

❌ You can't dump 80 MB of PDFs into a chat project. It won't fit, and you'd burn the context window even if it did.
❌ You can manually copy-paste the relevant pages, but only after you remember which manual mentioned the thing you need.
❌ Long files mean retrieval is wasteful — the model loads the whole RouterOS guide just to answer a question about VLANs.

docshelf-mcp solves it like this:

1. You drop a PDF onto the shelf.
2. The shelf converts it to Markdown, splits big files chapter-by-chapter, and regenerates a navigation INDEX.md.
3. You commit and push to a public GitHub repo.
4. Add only INDEX.md to your Claude project. When the model needs a section, it fetches it via raw.githubusercontent.com.

Result: a 5 KB index pointing at a 50 MB collection. The model reads exactly the chapter it needs.
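As a rough illustration of the last step, the sketch below shows how a repo URL and a file path map onto a raw.githubusercontent.com address. The helper name `raw_url` and the default branch `main` are assumptions made for this example; they are not taken from the docshelf-mcp source.

```python
from urllib.parse import quote, urlparse


def raw_url(remote: str, path: str, branch: str = "main") -> str:
    """Map a GitHub repo URL plus a repo-relative path to its raw-content URL."""
    parsed = urlparse(remote)
    # Expect https://github.com/<owner>/<repo>, optionally ending in ".git" or "/".
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "github.com" or len(parts) < 2:
        raise ValueError(f"not a GitHub repository URL: {remote!r}")
    owner, repo = parts[0], parts[1].removesuffix(".git")
    # Percent-encode the path but keep "/" as the separator.
    encoded = quote(path.lstrip("/"))
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{encoded}"


print(raw_url(
    "https://github.com/me/my-homelab-docs",
    "docs/routers/mikrotik-routeros/002-bridging.md",
))
```

An index entry only needs to carry one such URL per section, which is why the index itself can stay small.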

📦 Install

From PyPI:

# uv (recommended)
uv pip install docshelf-mcp

# or plain pip
pip install docshelf-mcp

Or straight from main (always-latest, no PyPI required):

pip install "git+https://github.com/ignatenkofi/docshelf-mcp"

Optional high-quality PDF engine (pulls ~2 GB of PyTorch — only if you need it):

pip install "docshelf-mcp[high-quality]"

📋 Project Prompt

Drop this into the Custom Instructions of any Claude project that consumes a docshelf-style INDEX.md:

This project uses the docshelf pattern. INDEX.md is the entry point. When answering: read INDEX → fetch ONLY the needed section file via its GitHub raw URL (use WebFetch / fetch / curl). Don't load full source files into context. For large manuals split into chapters, follow INDEX → chapter SUBINDEX → section file.

Medium (~150 words) and full (~400 words) versions, plus how-to snippets for Claude Code, Claude Desktop, and the Anthropic API, live in docs/PROJECT_PROMPT.md.
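The read-INDEX-then-fetch flow in the prompt can be sketched in a few lines. This is only an illustration: the layout of `sample_index` (one Markdown link per line) is an assumption, since the exact INDEX.md format is not shown here, and `find_sections` and `fetch_section` are hypothetical helper names.

```python
import re
from urllib.request import urlopen

# Matches Markdown links of the form [text](https://...).
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def find_sections(index_md: str, keyword: str) -> list[tuple[str, str]]:
    """Return (title, url) pairs from index lines that mention the keyword."""
    hits = []
    for line in index_md.splitlines():
        if keyword.lower() in line.lower():
            hits.extend(LINK_RE.findall(line))
    return hits


def fetch_section(url: str) -> str:
    """Download one section file; call this only for the section you need."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


# Assumed index layout, for illustration only.
sample_index = """\
## routers
- [Bridging](https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/routers/mikrotik-routeros/002-bridging.md) - bridges and VLANs
- [Firewall](https://raw.githubusercontent.com/me/my-homelab-docs/main/docs/routers/mikrotik-routeros/003-firewall.md) - filter rules
"""
print(find_sections(sample_index, "vlan"))
```

Only the matching section would then be passed to `fetch_section`; the rest of the collection never enters the context.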

Quickstart (Python library)

from docshelf_mcp import Shelf

shelf = Shelf("~/Documents/my-homelab-docs").init(
    name="My HomeLab Docs",
    remote="https://github.com/me/my-homelab-docs",
    default_categories=["routers", "switches", "psu", "motherboards"],
)

shelf.add_document(
    "~/Downloads/MIKROTIK_RouterOS.pdf",
    category="routers",
    title="Mikrotik RouterOS — full manual",
    description="Official RouterOS reference, split by chapter.",
)
# → docs/routers/mikrotik-routeros-full-manual.md  +  docs/routers/.../001-..md, 002-..md, ...
# → INDEX.md is regenerated automatically.

Then in the shelf directory: git add . && git commit -m "docs: add RouterOS" && git push.

In your Claude project, attach only INDEX.md. Done.

Quickstart (MCP server)

1. Add to Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "docshelf": {
      "command": "docshelf-mcp",
      "env": {
        "DOCSHELF_ROOT": "/Users/me/Documents/my-homelab-docs"
      }
    }
  }
}

Restart Claude Desktop. You now have six new tools available:

| Tool | What it does |
| --- | --- |
| `docshelf_init_shelf` | Bootstrap a new shelf directory. |
| `docshelf_add_document` | Add a PDF/MD file. Converts, splits, re-indexes. |
| `docshelf_rebuild_index` | Regenerate `INDEX.md` from disk. |
| `docshelf_search` | Plain-text search across the shelf, with raw URLs. |
| `docshelf_list_documents` | List documents by category. |
| `docshelf_convert_pdf` | Standalone PDF → Markdown (no shelf). |
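If you would rather script the config edit from step 1 than do it by hand, something like the following could work. It is a sketch under assumptions: `register_docshelf` is a hypothetical helper, not part of docshelf-mcp, and it merges the entry so that other MCP servers already in the file are kept.

```python
import json
from pathlib import Path


def register_docshelf(config_path: Path, shelf_root: str) -> dict:
    """Add (or update) the docshelf entry while keeping other MCP servers intact."""
    config = {}
    if config_path.exists():
        # Treat an empty file like an empty JSON object.
        config = json.loads(config_path.read_text(encoding="utf-8") or "{}")
    servers = config.setdefault("mcpServers", {})
    servers["docshelf"] = {
        "command": "docshelf-mcp",
        "env": {"DOCSHELF_ROOT": shelf_root},
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example call (macOS path from step 1), left commented out on purpose:
# register_docshelf(
#     Path.home() / "Library/Application Support/Claude/claude_desktop_config.json",
#     "/Users/me/Documents/my-homelab-docs",
# )
```

Claude Desktop still needs a restart after the file changes.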

2. Add to Claude Code

claude mcp add docshelf -- docshelf-mcp
# Or, to set a default shelf, register the server with DOCSHELF_ROOT instead
claude mcp add docshelf --env DOCSHELF_ROOT=/path/to/shelf -- docshelf-mcp

3. Test from the command line

# Sanity check — should print the server version then wait on stdin
docshelf-mcp
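For a slightly deeper check than watching the process wait on stdin, you could send it an MCP `initialize` request yourself. The sketch below assumes the server speaks the standard MCP stdio transport (newline-delimited JSON-RPC) and that `docshelf-mcp` is on your PATH; `probe` and the client name `docshelf-probe` are made up for this example, and the loop has no timeout.

```python
import json
import subprocess


def build_initialize_request(request_id: int = 1) -> str:
    """Serialize an MCP initialize request as one newline-terminated JSON line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            "clientInfo": {"name": "docshelf-probe", "version": "0.0.1"},
        },
    }
    return json.dumps(message) + "\n"


def probe(command: str = "docshelf-mcp") -> dict:
    """Start the server, send initialize, and return the matching JSON reply."""
    proc = subprocess.Popen(
        [command], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    try:
        proc.stdin.write(build_initialize_request())
        proc.stdin.flush()
        for line in proc.stdout:
            try:
                reply = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip any non-JSON banner line
            if isinstance(reply, dict) and reply.get("id") == 1:
                return reply
        raise RuntimeError("server exited without answering initialize")
    finally:
        proc.kill()
        proc.wait()
```

Calling `probe()` should return a dictionary with a `result` key if the server starts correctly; the exact fields depend on the server.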

The shelf layout

my-shelf/
├── .docshelf.json        ← shelf metadata: name, remote, category order
├── INDEX.md              ← auto-generated navigation (your chat-project file)
├── .gitignore
└── docs/
    ├── routers/
    │   ├── .meta.json    ← per-document title/description overrides
    │   ├── mikrotik-routeros.md       (full document, lightly cleaned)
    │   └── mikrotik-routeros/         (auto-split sections)
    │       ├── 001-overview.md
    │       ├── 002-bridging.md
    │       └── 003-firewall.md
    └── switches/
        └── cudy-gs1010pe.md

Everything in docs/ is committed; everything is fetchable via raw URL once you push to GitHub.
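Because the shelf is a plain directory tree, it can be inspected with nothing but the standard library. The following sketch walks the layout above and groups top-level documents by category; `list_documents` here is a hypothetical stand-alone function, not the `docshelf_list_documents` tool itself.

```python
from pathlib import Path


def list_documents(shelf_root: Path) -> dict[str, list[str]]:
    """Map each category directory under docs/ to its top-level .md files."""
    docs = shelf_root / "docs"
    result = {}
    for category in sorted(p for p in docs.iterdir() if p.is_dir()):
        # Only top-level files count as documents; split sections live one level down.
        result[category.name] = sorted(f.name for f in category.glob("*.md"))
    return result
```

For the tree shown above this would give `routers` one document and `switches` one document, with the `001-...` section files left out.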

How splitting works

A document is split when both conditions hold:

- UTF-8 size > 50 KB (configurable via .docshelf.json:split_threshold_bytes).
- The document has at least two ## (H2) headings.
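The two conditions can be expressed as a small predicate. This is a sketch, not the library's code: `should_split` is a hypothetical name, and reading "50 KB" as 50 × 1024 bytes is an assumption.

```python
import re

DEFAULT_THRESHOLD = 50 * 1024  # assumption: "50 KB" read as 50 KiB
H2_RE = re.compile(r"^## \S", re.MULTILINE)  # matches "## Title", not "### Title"


def should_split(markdown: str, threshold_bytes: int = DEFAULT_THRESHOLD) -> bool:
    """True when the text is over the size threshold and has at least two H2s."""
    size = len(markdown.encode("utf-8"))  # size in bytes, not characters
    h2_count = len(H2_RE.findall(markdown))
    return size > threshold_bytes and h2_count >= 2
```

Measuring the encoded bytes matters for non-ASCII text: a Cyrillic document reaches the threshold with roughly half as many characters as an ASCII one.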

The splitter:

- Cleans PDF-extraction noise (collapses runaway blank lines, demotes CLI dumps mistaken for H1s).
- Slices on H2 boundaries.
- Names files NNN-<slug>.md so they sort naturally and survive title changes.
- Wipes the previous split directory before regenerating, so the operation is fully idempotent.

If you want to keep a document whole, pass split=False.
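To make the slicing and naming steps concrete, here is a minimal in-memory sketch. It is not the library's splitter: `slugify` and `split_on_h2` are simplified stand-ins, the H1-demotion and directory-wiping steps are left out, and `## ` lines inside fenced code blocks are not treated specially.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Lowercase, NFKD-normalize, drop accents, and join words with hyphens."""
    normalized = unicodedata.normalize("NFKD", title).lower()
    # Remove combining marks so that "è" becomes a plain "e".
    stripped = "".join(c for c in normalized if not unicodedata.combining(c))
    return re.sub(r"[^\w]+", "-", stripped).strip("-") or "section"


def split_on_h2(markdown: str) -> dict[str, str]:
    """Return {filename: content}, one entry per H2 section."""
    # Collapse runs of 3+ newlines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", markdown)
    # Split immediately before each line that starts an H2 heading.
    chunks = re.split(r"(?m)^(?=## \S)", cleaned)
    files = {}
    number = 0
    for chunk in chunks:
        if not chunk.startswith("## "):
            continue  # text before the first H2 is dropped in this sketch
        number += 1
        title = chunk.splitlines()[0][3:].strip()
        files[f"{number:03d}-{slugify(title)}.md"] = chunk.rstrip() + "\n"
    return files
```

The zero-padded `NNN-` prefix is what keeps the files in document order when they are sorted by name.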

Examples

See the examples/ directory for three concrete use cases:

- examples/homelab/: the original use case, hardware manuals for a home lab.
- examples/recipes/: a cookbook with one recipe per file.
- examples/research-papers/: academic PDFs with abstracts in .meta.json.

Each example shows the directory layout and the INDEX.md you'd end up with.

Optional: high-quality PDF conversion

The default engine (pymupdf4llm) is fast and good enough for ~95% of technical documents. For papers with complex tables, math, or scanned content, install the marker-pdf backend:

pip install "docshelf-mcp[high-quality]"

Then pass quality="high":

shelf.add_document("paper.pdf", category="research", title="...", quality="high")

⚠️ marker-pdf pulls in PyTorch (~2 GB) and is significantly slower (10–60 s per document on CPU). The library import is deferred — if you don't use quality="high", the dependency is never loaded.
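The deferred-import idea can be illustrated as follows. This is a generic sketch of the pattern, not the project's actual code: `load_backend` is a hypothetical helper, and the import names `pymupdf4llm` and `marker` are assumed to be the module names of the two engines.

```python
import importlib

# Assumption: import names of the two engines. Nothing is imported at module load.
BACKENDS = {"standard": "pymupdf4llm", "high": "marker"}


def load_backend(quality: str = "standard", backends: dict[str, str] = BACKENDS):
    """Import the engine for the requested quality only when it is needed."""
    try:
        module_name = backends[quality]
    except KeyError:
        raise ValueError(f"unknown quality: {quality!r}") from None
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"quality={quality!r} needs the {module_name!r} package, "
            "which is not installed"
        ) from exc
```

With this shape, a user who never asks for `quality="high"` never pays the import cost of the heavy dependency, and a user who asks for it without installing the extra gets a readable error instead of a bare `ImportError`.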

FAQ

Why GitHub raw URLs and not embeddings / RAG? Because it's dead simple, costs nothing to host, and the AI is already good at chasing links. You can layer embedding search on top later if you want — the on-disk shape is a normal git repo.

Does this work with private repos? Not for the raw-URL trick: raw.githubusercontent.com won't serve them without auth. The local search tool works fine on private shelves; you just lose the "AI fetches sections directly" benefit. To keep that benefit, put the docs in their own public repo, separate from your code repo.

Do I have to use GitHub? No. The shelf is just a directory. If you don't set a github_remote, INDEX.md still gets generated — entries just won't have URLs. You can host the static files anywhere that serves raw text (S3, Cloudflare R2, GitLab raw, Gitea, …) and post-process URLs yourself.

Does it edit the source PDFs? No. PDFs are converted on add_document and the source is left in place. The shelf only writes inside its own directory.

What about non-English documents? Slugify is Unicode-aware (NFKD-normalized, with \w under re.UNICODE). Cyrillic / CJK titles slug down to ASCII-ish forms; the body Markdown is preserved as-is.

Can I use it without MCP? Yes — from docshelf_mcp import Shelf and use the class directly. See docs/USAGE.md.

Limitations

- Public GitHub only for the raw-URL trick (or whatever public static host you wire up).
- Single repo per shelf. If you outgrow one repo, run multiple shelves and attach multiple INDEX.mds.
- Heuristic splitting. The PDF→Markdown extract isn't always clean enough to split cleanly. For pathological cases (some 4+ MB datasheets), keep the file whole and rely on docshelf_search.
- No automatic git commit. Tools regenerate INDEX.md on disk, but the caller (you, or an agent) is responsible for git add / commit / push. This is intentional: staying out of git's way keeps the tool safe to call from agents.
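Since publishing is left to the caller, a thin wrapper around the three git commands is one way to automate it. This is a sketch of caller-side glue, not something docshelf-mcp provides; `publish_commands` and `publish` are hypothetical names.

```python
import subprocess
from pathlib import Path


def publish_commands(message: str) -> list[list[str]]:
    """The three git invocations, in order, as argument lists (no shell)."""
    return [
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "push"],
    ]


def publish(shelf_dir: Path, message: str) -> None:
    """Run the commands inside the shelf directory, stopping at the first failure."""
    for argv in publish_commands(message):
        subprocess.run(argv, cwd=shelf_dir, check=True)
```

Note that `git commit` exits with a non-zero status when there is nothing to commit, so `publish` raises `CalledProcessError` in that case; a caller that wants a silent no-op would need to handle it.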

Demo

A short walkthrough video / GIF is planned: https://github.com/ignatenkofi/docshelf-mcp/blob/main/docs/demo.md (coming soon)

Architecture

For a deeper dive, see docs/ARCHITECTURE.md — module layout, data flow, design rationale.

Contributing

Bug reports and PRs welcome. To set up a dev env:

git clone https://github.com/ignatenkofi/docshelf-mcp
cd docshelf-mcp
uv pip install -e ".[dev]"
ruff check src tests
pytest -v

License

MIT — see LICENSE.

Origin

docshelf-mcp started life as a 350-line Python script (homelab-encyclopedia.py) that managed a single homelab manuals repo. The split / index / clean logic is the same code, generalised to work for any category-organised document collection.

Release history

This version: 0.2.0 (May 14, 2026)

Download files

Source Distribution

docshelf_mcp-0.2.0.tar.gz (39.6 kB)

Uploaded May 14, 2026 Source

Built Distribution

docshelf_mcp-0.2.0-py3-none-any.whl (26.5 kB)

Uploaded May 14, 2026 Python 3

File details

Details for the file docshelf_mcp-0.2.0.tar.gz.

File metadata

Download URL: docshelf_mcp-0.2.0.tar.gz
Upload date: May 14, 2026
Size: 39.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docshelf_mcp-0.2.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `a33c21f54dcec4940a67344cd9f295f9d34d04a0bb74d39098f947afa8eb8222` |
| MD5 | `7b5483c4977a4ffa0f9451216b74ea3c` |
| BLAKE2b-256 | `23152f05682fe87f60b5713771b8bdeb5e54e4d8fad9dc7be6e7bd1c481647c8` |

Provenance

The following attestation bundles were made for docshelf_mcp-0.2.0.tar.gz:

Publisher: release.yml on ignatenkofi/docshelf-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docshelf_mcp-0.2.0.tar.gz
- Subject digest: a33c21f54dcec4940a67344cd9f295f9d34d04a0bb74d39098f947afa8eb8222
- Sigstore transparency entry: 1539191037
- Sigstore integration time: May 14, 2026
Source repository:
- Permalink: ignatenkofi/docshelf-mcp@0b499e334fa852dad1a8dde5da38a370750133f6
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ignatenkofi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0b499e334fa852dad1a8dde5da38a370750133f6
- Trigger Event: push

File details

Details for the file docshelf_mcp-0.2.0-py3-none-any.whl.

File metadata

Download URL: docshelf_mcp-0.2.0-py3-none-any.whl
Upload date: May 14, 2026
Size: 26.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for docshelf_mcp-0.2.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `e804d2d3fa85e36735bca4d78b6e6a2440d239e349c258ae12dcb742472e510c` |
| MD5 | `524e4172ab04f3311d6031f606c8c946` |
| BLAKE2b-256 | `14f2094899a73a98c3c64f3a4309ba31358738c45fdccd85f1133378d6c99414` |

Provenance

The following attestation bundles were made for docshelf_mcp-0.2.0-py3-none-any.whl:

Publisher: release.yml on ignatenkofi/docshelf-mcp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docshelf_mcp-0.2.0-py3-none-any.whl
- Subject digest: e804d2d3fa85e36735bca4d78b6e6a2440d239e349c258ae12dcb742472e510c
- Sigstore transparency entry: 1539191194
- Sigstore integration time: May 14, 2026
Source repository:
- Permalink: ignatenkofi/docshelf-mcp@0b499e334fa852dad1a8dde5da38a370750133f6
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/ignatenkofi
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@0b499e334fa852dad1a8dde5da38a370750133f6
- Trigger Event: push
