docpull

Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.

docpull uses async HTTP (not Playwright) to fetch server-rendered pages, extracts main content, and writes clean Markdown with source-URL frontmatter — in seconds, with a small install footprint. It won't render JavaScript, but for the large class of docs that don't need it (API references, Python/Go stdlib, most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a fast, auditable, sandbox-friendly way to pipe documentation into an LLM context, a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and CRLF-injection protections are on by default — a necessity when an AI agent is choosing the URLs.

Install

pip install docpull

# Optional extras
pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
pip install 'docpull[mcp]'           # run as an MCP server for AI agents
pip install 'docpull[all]'           # everything above

Quick start

# Crawl and save Markdown
docpull https://docs.example.com

# One page, no crawl — the fast path for agents
docpull https://docs.example.com/guide --single

# LLM-ready NDJSON with 4k-token chunks streamed to stdout
docpull https://docs.example.com --profile llm --stream | jq .

# Mirror a site for offline use
docpull https://docs.example.com --profile mirror --cache

Framework-aware extraction

docpull inspects each page before running the generic extractor and can pull content directly from framework data feeds:

| Framework  | Strategy                                                  |
|------------|-----------------------------------------------------------|
| Next.js    | Parses __NEXT_DATA__ JSON                                 |
| Mintlify   | __NEXT_DATA__ with Mintlify tagging                       |
| OpenAPI    | Renders openapi.json / swagger.json into Markdown         |
| Docusaurus | Detected and tagged; generic extractor produces Markdown  |
| Sphinx     | Detected and tagged; generic extractor produces Markdown  |

JS-only SPAs with no server-rendered content are detected and skipped with a clear reason (or, with --strict-js-required, reported as an error so agents can route elsewhere).
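
For a sense of the Next.js strategy: server-rendered Next.js pages embed their page props as JSON in a script tag with id __NEXT_DATA__, so the content can be read without executing any JavaScript. A minimal illustrative sketch — next_data is a hypothetical helper, not docpull's actual extractor:

# Illustrative only: pull the embedded page-props JSON out of a
# server-rendered Next.js page without running any JavaScript.
import json
import re

def next_data(html: str) -> dict | None:
    m = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(m.group(1)) if m else None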

Agent-friendly features

  • --single — fetch a single URL without discovery. Designed for tool loops.
  • --stream — NDJSON one-record-per-line, flushed on every page, pipeable.
  • --max-tokens-per-file N — split each page into token-bounded chunks on heading boundaries (exact counts with tiktoken, estimate without; see the sketch after this list).
  • --emit-chunks — write one file or record per chunk instead of per page.
  • --strict-js-required — hard-fail on JS-only pages instead of silently skipping.
  • --extractor trafilatura — swap in trafilatura for sites where the default heuristics struggle.
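
A minimal sketch of what heading-boundary chunking can look like, assuming tiktoken from the [llm] extra; chunk_markdown and the 4,000-token budget are illustrative, not docpull internals:

# Illustrative heading-boundary chunker, assuming tiktoken is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_markdown(md: str, max_tokens: int = 4000) -> list[str]:
    # Split on headings, keeping each heading with the section below it.
    sections, current = [], []
    for line in md.splitlines(keepends=True):
        if line.startswith("#") and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))

    # Greedily pack whole sections into chunks under the token budget.
    chunks, buf, used = [], [], 0
    for sec in sections:
        n = len(enc.encode(sec))
        if buf and used + n > max_tokens:
            chunks.append("".join(buf))
            buf, used = [], 0
        buf.append(sec)
        used += n
    if buf:
        chunks.append("".join(buf))
    return chunks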

Python API

from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)
print(ctx.markdown[:500])

Async streaming:

import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())

Single-page from an agent tool:

from docpull import Fetcher, DocpullConfig

async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""
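
From a synchronous host with no running event loop, the coroutine can be driven with asyncio.run (the URL below is just an example):

import asyncio

print(asyncio.run(tool_call("https://docs.python.org/3/library/asyncio.html")))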

Profiles

docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
docpull https://site.com --profile llm      # NDJSON + chunks + metadata.
docpull https://site.com --profile mirror   # Full archive, polite, cached.
docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.

MCP server

docpull ships an MCP (Model Context Protocol) server so AI agents can call it directly over stdio:

pip install 'docpull[mcp]'
docpull mcp  # starts the stdio server

Add to Claude Desktop or Claude Code manually:

{
  "mcpServers": {
    "docpull": {
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}

Or, if you use Claude Code, install the plugin instead — it bundles the MCP server, five slash commands (/docs-add, /docs-search, /docs-list, /docs-refresh, /docs-remove), and a meta-skill that teaches Claude when to reach for docpull automatically:

# 1. Install docpull with the MCP extra (required for the plugin)
pip install 'docpull[mcp]'
# 2. Then in Claude Code:
/plugin marketplace add raintree-technology/docpull
/plugin install docpull@docpull

See plugin/README.md for details.

Tools exposed (8 total — read tools advertise readOnlyHint so hosts that auto-approve safe tools won't prompt):

Read:

  • fetch_url(url, max_tokens?) — one-shot fetch, no crawl. HTTPS-only, SSRF-validated.
  • list_sources(category?) — show available aliases (react, nextjs, fastapi, …)
  • list_indexed() — what has been fetched locally, with last-fetched age
  • grep_docs(pattern, library?, limit?, context?) — regex search across fetched Markdown (length-capped + wall-clock budgeted to mitigate ReDoS)
  • read_doc(library, path, line_start?, line_end?) — read a specific cached file, optionally line-sliced

Write:

  • ensure_docs(source, force?, profile?) — fetch a named library (cached 7 days). Forwards progress to clients that supply a progressToken.
  • add_source(name, url, description?, category?, max_pages?, force?) — register a user alias (HTTPS-only, atomic write to sources.yaml).
  • remove_source(name, delete_cache?) — drop a user alias and (optionally) its cached docs.

All tools that carry data also return structuredContent validated against an outputSchema for clients that prefer typed output.
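
As a quick way to exercise the server, here is a minimal stdio client sketch using the official MCP Python SDK (pip install mcp is assumed; the tool name comes from the list above):

# Illustrative MCP stdio client: launch `docpull mcp` as a subprocess
# and call its fetch_url tool via the official MCP Python SDK.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="docpull", args=["mcp"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "fetch_url", {"url": "https://docs.python.org/3/"}
            )
            print(result.content)

asyncio.run(main())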

User-defined sources live in ~/.config/docpull-mcp/sources.yaml:

sources:
  mydocs:
    url: https://docs.example.com
    description: My internal docs
    category: internal
    maxPages: 200

About the mcp/ directory in this repo

The mcp/ directory at the repo root is a separate TypeScript + Bun MCP server backed by PostgreSQL with pgvector for semantic search. It is not the Python MCP server described above, which ships in the docpull package (pip install 'docpull[mcp]') and is the right choice for almost every user. The mcp/ tree is mirrored to its own repo at raintree-technology/docpull-mcp; unless you specifically need pgvector-backed semantic search, ignore it and use docpull mcp.

Output

Markdown files with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
source_type: "nextjs"
---

# Getting Started

NDJSON (one record per page or chunk):

{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}

Security

  • HTTPS-only, mandatory robots.txt compliance
  • SSRF protection: blocks private/internal network IPs, DNS rebinding via connect-time address pinning
  • XXE protection via defusedxml on sitemaps
  • Path traversal and CRLF header injection guards
  • Auth headers stripped on cross-origin redirects

When running with --proxy, DNS pinning is delegated to the proxy. Pass --require-pinned-dns to refuse this configuration and keep the connector-level SSRF guarantees in effect.
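
For intuition, the SSRF check boils down to resolving the target and refusing any non-public address before connecting. A simplified sketch — assert_public is illustrative; docpull's real guard operates at the connector level and additionally pins the resolved address to defeat DNS rebinding:

# Simplified illustration of an SSRF address check; not docpull's code.
import ipaddress
import socket

def assert_public(host: str) -> str:
    addr = ipaddress.ip_address(socket.getaddrinfo(host, 443)[0][4][0])
    if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
        raise ValueError(f"refusing non-public address {addr} for {host}")
    return str(addr)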

Options

Run docpull --help for the full list. Highlights:

Core:
  --profile {rag,mirror,quick,llm,custom}
  --single                Fetch one URL (no crawl)
  --format {markdown,json,ndjson,sqlite}
  --stream                Stream NDJSON to stdout

LLM / chunking:
  --max-tokens-per-file N
  --tokenizer NAME        tiktoken encoding (default cl100k_base)
  --emit-chunks           One file/record per chunk

Content extraction:
  --extractor {default,trafilatura}
  --no-special-cases      Disable framework extractors
  --strict-js-required    Error on JS-only pages

Cache:
  --cache                 Enable incremental updates
  --cache-dir DIR
  --cache-ttl DAYS

Performance

End-to-end numbers from tests/benchmarks/test_10k_pages.py against a synthetic 10,000-page localhost site (RAG profile, max_concurrent=50, HTTP keep-alive, 5% injected duplicate content):

| Metric                             | Value                 |
|------------------------------------|-----------------------|
| Total wall time                    | ~27 s                 |
| Discovery (sitemap parse)          | ~80 ms                |
| Fetch + convert + save             | ~27 s                 |
| Per-page latency (p50 / p95 / p99) | ~2.6 / 4.6 / 5.3 ms   |
| Peak RSS delta from baseline       | ~28 MB                |
| Cache manifest size on disk        | ~3.4 MB               |
| Duplicates detected (5% injected)  | 499 / 500             |

Reproduce with make benchmark (requires aiohttp; runs the gated benchmark in tests/benchmarks/ and prints a JSON line you can pipe into trend tooling).

Troubleshooting

docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading
docpull URL --preview-urls    # List URLs without fetching

License

MIT
