docpull

Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.

docpull uses async HTTP (not Playwright) to fetch server-rendered pages, extracts main content, and writes clean Markdown with source-URL frontmatter — in seconds, with a small install footprint. It won't render JavaScript, but for the large class of docs that don't need it (API references, Python/Go stdlib, most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a fast, auditable, sandbox-friendly way to pipe documentation into an LLM context, a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and CRLF-injection protections are on by default — a necessity when an AI agent is choosing the URLs.

Install

pip install docpull

# Optional extras
pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
pip install 'docpull[mcp]'           # run as an MCP server for AI agents
pip install 'docpull[all]'           # everything above

Quick start

# Crawl and save Markdown
docpull https://docs.example.com

# One page, no crawl — the fast path for agents
docpull https://docs.example.com/guide --single

# LLM-ready NDJSON with 4k-token chunks streamed to stdout
docpull https://docs.example.com --profile llm --stream | jq .

# Mirror a site for offline use
docpull https://docs.example.com --profile mirror --cache

Framework-aware extraction

docpull inspects each page before running the generic extractor and can pull content directly from framework data feeds:

| Framework  | Strategy                                                  |
|------------|-----------------------------------------------------------|
| Next.js    | Parses __NEXT_DATA__ JSON                                 |
| Mintlify   | __NEXT_DATA__ with Mintlify tagging                       |
| OpenAPI    | Renders openapi.json / swagger.json into Markdown         |
| Docusaurus | Detected and tagged; generic extractor produces Markdown  |
| Sphinx     | Detected and tagged; generic extractor produces Markdown  |

JS-only SPAs with no server-rendered content are detected and skipped with a clear reason (or, with --strict-js-required, reported as an error so agents can route elsewhere).
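
For a sense of the Next.js strategy: server-rendered Next.js pages embed their page props as JSON in a script tag with id __NEXT_DATA__, so the content can be read without executing any JavaScript. A minimal illustrative sketch — next_data is a hypothetical helper, not docpull's actual extractor:

# Illustrative only: pull the embedded page-props JSON out of a
# server-rendered Next.js page without running any JavaScript.
import json
import re

def next_data(html: str) -> dict | None:
    m = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(m.group(1)) if m else None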

Agent-friendly features

  • --single — fetch a single URL without discovery. Designed for tool loops.
  • --stream — NDJSON one-record-per-line, flushed on every page, pipeable.
  • --max-tokens-per-file N — split each page into token-bounded chunks on heading boundaries (exact counts with tiktoken, estimate without; see the sketch after this list).
  • --emit-chunks — write one file or record per chunk instead of per page.
  • --strict-js-required — hard-fail on JS-only pages instead of silently skipping.
  • --extractor trafilatura — swap in trafilatura for sites where the default heuristics struggle.
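
A minimal sketch of what heading-boundary chunking can look like, assuming tiktoken from the [llm] extra; chunk_markdown and the 4,000-token budget are illustrative, not docpull internals:

# Illustrative heading-boundary chunker, assuming tiktoken is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_markdown(md: str, max_tokens: int = 4000) -> list[str]:
    # Split on headings, keeping each heading with the section below it.
    sections, current = [], []
    for line in md.splitlines(keepends=True):
        if line.startswith("#") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    # Greedily pack whole sections into chunks under the token budget.
    chunks, buf, used = [], [], 0
    for sec in sections:
        n = len(enc.encode(sec))
        if buf and used + n > max_tokens:
            chunks.append("".join(buf))
            buf, used = [], 0
        buf.append(sec)
        used += n
    if buf:
        chunks.append("".join(buf))
    return chunks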

Python API

from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)
print(ctx.markdown[:500])

Async streaming:

import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())

Single-page from an agent tool:

from docpull import Fetcher, DocpullConfig

async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""
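
From a synchronous host with no running event loop, the coroutine can be driven with asyncio.run (the URL below is just an example):

import asyncio

print(asyncio.run(tool_call("https://docs.python.org/3/library/asyncio.html")))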

Profiles

docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
docpull https://site.com --profile llm      # NDJSON + chunks + metadata.
docpull https://site.com --profile mirror   # Full archive, polite, cached.
docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.

MCP server

docpull ships an MCP (Model Context Protocol) server so AI agents can call it directly over stdio:

pip install 'docpull[mcp]'
docpull mcp  # starts the stdio server

Add to Claude Desktop or Claude Code manually:

{
  "mcpServers": {
    "docpull": {
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}

Or, if you use Claude Code, install the plugin instead — it bundles the MCP server, five slash commands (/docs-add, /docs-search, /docs-list, /docs-refresh, /docs-remove), and a meta-skill that teaches Claude when to reach for docpull automatically:

# 1. Install docpull with the MCP extra (required for the plugin)
pip install 'docpull[mcp]'
# 2. Then in Claude Code:
/plugin marketplace add raintree-technology/docpull
/plugin install docpull@docpull

See plugin/README.md for details.

Tools exposed (8 total — read tools advertise readOnlyHint so hosts that auto-approve safe tools won't prompt):

Read:

  • fetch_url(url, max_tokens?) — one-shot fetch, no crawl. HTTPS-only, SSRF-validated.
  • list_sources(category?) — show available aliases (react, nextjs, fastapi, …)
  • list_indexed() — what has been fetched locally, with last-fetched age
  • grep_docs(pattern, library?, limit?, context?) — regex search across fetched Markdown (length-capped + wall-clock budgeted to mitigate ReDoS)
  • read_doc(library, path, line_start?, line_end?) — read a specific cached file, optionally line-sliced

Write:

  • ensure_docs(source, force?, profile?) — fetch a named library (cached 7 days). Forwards progress to clients that supply a progressToken.
  • add_source(name, url, description?, category?, max_pages?, force?) — register a user alias (HTTPS-only, atomic write to sources.yaml).
  • remove_source(name, delete_cache?) — drop a user alias and (optionally) its cached docs.

All tools that carry data also return structuredContent validated against an outputSchema for clients that prefer typed output.
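
As a quick way to exercise the server, here is a minimal stdio client sketch using the official MCP Python SDK (pip install mcp is assumed; the tool name comes from the list above):

# Illustrative MCP stdio client: launch `docpull mcp` as a subprocess
# and call its fetch_url tool via the official MCP Python SDK.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="docpull", args=["mcp"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "fetch_url", {"url": "https://docs.python.org/3/"}
            )
            print(result.content)

asyncio.run(main())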

User-defined sources live in ~/.config/docpull-mcp/sources.yaml:

sources:
  mydocs:
    url: https://docs.example.com
    description: My internal docs
    category: internal
    maxPages: 200

About the mcp/ directory in this repo

The mcp/ directory at the repo root is a separate TypeScript + Bun MCP server backed by PostgreSQL with pgvector for semantic search. It is not the Python MCP server described above, which ships in the docpull package (pip install 'docpull[mcp]') and is the right choice for almost every user. The mcp/ tree is mirrored to its own repo at raintree-technology/docpull-mcp; unless you specifically need pgvector-backed semantic search, ignore it and use docpull mcp.

Output

Markdown files with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
source_type: "nextjs"
---

# Getting Started

NDJSON (one record per page or chunk):

{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}

Security

  • HTTPS-only, mandatory robots.txt compliance
  • SSRF protection: blocks private/internal network IPs, DNS rebinding via connect-time address pinning
  • XXE protection via defusedxml on sitemaps
  • Path traversal and CRLF header injection guards
  • Auth headers stripped on cross-origin redirects

When running with --proxy, DNS pinning is delegated to the proxy. Pass --require-pinned-dns to refuse this configuration and keep the connector-level SSRF guarantees in effect.
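
For intuition, the SSRF check boils down to resolving the target and refusing any non-public address before connecting. A simplified sketch — assert_public is illustrative; docpull's real guard operates at the connector level and additionally pins the resolved address to defeat DNS rebinding:

# Simplified illustration of an SSRF address check; not docpull's code.
import ipaddress
import socket

def assert_public(host: str) -> str:
    addr = ipaddress.ip_address(socket.getaddrinfo(host, 443)[0][4][0])
    if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
        raise ValueError(f"refusing non-public address {addr} for {host}")
    return str(addr)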

Options

Run docpull --help for the full list. Highlights:

Core:
  --profile {rag,mirror,quick,llm,custom}
  --single                Fetch one URL (no crawl)
  --format {markdown,json,ndjson,sqlite}
  --stream                Stream NDJSON to stdout

LLM / chunking:
  --max-tokens-per-file N
  --tokenizer NAME        tiktoken encoding (default cl100k_base)
  --emit-chunks           One file/record per chunk

Content extraction:
  --extractor {default,trafilatura}
  --no-special-cases      Disable framework extractors
  --strict-js-required    Error on JS-only pages

Cache:
  --cache                 Enable incremental updates
  --cache-dir DIR
  --cache-ttl DAYS

Performance

End-to-end numbers from tests/benchmarks/test_10k_pages.py against a synthetic 10,000-page localhost site (RAG profile, max_concurrent=50, HTTP keep-alive, 5% injected duplicate content):

| Metric                             | Value                 |
|------------------------------------|-----------------------|
| Total wall time                    | ~27 s                 |
| Discovery (sitemap parse)          | ~80 ms                |
| Fetch + convert + save             | ~27 s                 |
| Per-page latency (p50 / p95 / p99) | ~2.6 / 4.6 / 5.3 ms   |
| Peak RSS delta from baseline       | ~28 MB                |
| Cache manifest size on disk        | ~3.4 MB               |
| Duplicates detected (5% injected)  | 499 / 500             |

Reproduce with make benchmark (requires aiohttp; runs the gated benchmark in tests/benchmarks/ and prints a JSON line you can pipe into trend tooling).

Troubleshooting

docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading
docpull URL --preview-urls    # List URLs without fetching

License

MIT
