
docpull

Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.

Python 3.10+ · MIT License


docpull uses async HTTP (not Playwright) to fetch server-rendered pages, extracts main content, and writes clean Markdown with source-URL frontmatter — in seconds, with a small install footprint. It won't render JavaScript, but for the large class of docs that don't need it (API references, Python/Go stdlib, most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a fast, auditable, sandbox-friendly way to pipe documentation into an LLM context, a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and CRLF-injection protections are on by default — a necessity when an AI agent is choosing the URLs.

Install

pip install docpull

# Optional extras
pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
pip install 'docpull[mcp]'           # run as an MCP server for AI agents
pip install 'docpull[all]'           # everything above

Quick start

# Crawl and save Markdown
docpull https://docs.example.com

# One page, no crawl — the fast path for agents
docpull https://docs.example.com/guide --single

# LLM-ready NDJSON with 4k-token chunks streamed to stdout
docpull https://docs.example.com --profile llm --stream | jq .

# Mirror a site for offline use
docpull https://docs.example.com --profile mirror --cache

Framework-aware extraction

docpull inspects each page before running the generic extractor and can pull content directly from framework data feeds:

Framework    Strategy
Next.js      Parses __NEXT_DATA__ JSON
Mintlify     __NEXT_DATA__ with Mintlify tagging
OpenAPI      Renders openapi.json / swagger.json into Markdown
Docusaurus   Detected and tagged; generic extractor produces Markdown
Sphinx       Detected and tagged; generic extractor produces Markdown

JS-only SPAs with no server-rendered content are detected and skipped with a clear reason (or, with --strict-js-required, reported as an error so agents can route elsewhere).

Agent-friendly features

  • --single — fetch a single URL without discovery. Designed for tool loops.
  • --stream — NDJSON output, one record per line, flushed after every page, pipeable.
  • --max-tokens-per-file N — split each page into token-bounded chunks on heading boundaries (exact counts with tiktoken, estimate without).
  • --emit-chunks — write one file or record per chunk instead of per page.
  • --strict-js-required — hard-fail on JS-only pages instead of silently skipping.
  • --extractor trafilatura — swap in trafilatura for sites where the default heuristics struggle.
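The chunking behavior described above can be approximated in plain Python. This is an illustrative sketch, not docpull's actual implementation: it splits Markdown on heading boundaries and packs sections into chunks under a token budget, using a rough chars/4 estimate in place of tiktoken.

```python
import re

def estimate_tokens(text: str) -> int:
    # Rough fallback when tiktoken is unavailable: ~4 characters per token.
    return max(1, len(text) // 4)

def chunk_markdown(markdown: str, max_tokens: int) -> list[str]:
    # Split before each heading line so sections stay intact.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for section in sections:
        if not section.strip():
            continue
        tokens = estimate_tokens(section)
        # Start a new chunk when adding this section would exceed the budget.
        if current and current_tokens + tokens > max_tokens:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += tokens
    if current:
        chunks.append("".join(current))
    return chunks

doc = "# A\n" + "x" * 400 + "\n# B\n" + "y" * 400 + "\n"
parts = chunk_markdown(doc, max_tokens=120)
print(len(parts))  # 2
```

A section larger than the budget still becomes its own chunk here; real tokenizer-accurate splitting (what --tokenizer enables) would count tokens with tiktoken instead of estimating.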

Python API

from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)
print(ctx.markdown[:500])

Async streaming:

import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())

Single-page from an agent tool:

from docpull import Fetcher, DocpullConfig

async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""

Profiles

docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
docpull https://site.com --profile llm      # NDJSON + chunks + metadata.
docpull https://site.com --profile mirror   # Full archive, polite, cached.
docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.

MCP server

docpull ships an MCP (Model Context Protocol) server so AI agents can call it directly over stdio:

pip install 'docpull[mcp]'
docpull mcp  # starts the stdio server

Add to Claude Desktop or Claude Code:

{
  "mcpServers": {
    "docpull": {
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}

Tools exposed:

  • fetch_url(url, max_tokens?) — one-shot fetch, no crawl
  • ensure_docs(source, force?) — fetch a named library (cached 7 days)
  • list_sources(category?) — show available aliases (react, nextjs, fastapi, …)
  • list_indexed() — what has been fetched locally
  • grep_docs(pattern, library?) — regex search across fetched Markdown

User-defined sources live in ~/.config/docpull-mcp/sources.yaml:

sources:
  mydocs:
    url: https://docs.example.com
    description: My internal docs
    category: internal
    maxPages: 200

Output

Markdown files with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
source_type: "nextjs"
---

# Getting Started
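Because the frontmatter is flat key/value pairs, it can be split off with a few lines of stdlib Python. A minimal sketch (a real pipeline would use a proper YAML parser):

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split '---'-delimited frontmatter from the Markdown body."""
    meta: dict[str, str] = {}
    if not text.startswith("---\n"):
        return meta, text
    header, _, body = text[4:].partition("\n---\n")
    for line in header.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return meta, body.lstrip("\n")

doc = '---\ntitle: "Getting Started"\nsource: https://docs.example.com/guide\n---\n\n# Getting Started\n'
meta, body = split_frontmatter(doc)
print(meta["title"])  # Getting Started
```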

NDJSON (one record per page or chunk):

{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}

Security

  • HTTPS-only, mandatory robots.txt compliance
  • SSRF protection: blocks private/internal network IPs, DNS rebinding
  • XXE protection via defusedxml on sitemaps
  • Path traversal and CRLF header injection guards
  • Auth headers stripped on cross-origin redirects
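The private-address half of SSRF protection comes down to rejecting non-public IPs. An illustrative stand-alone check (not docpull's code) using the stdlib ipaddress module:

```python
import ipaddress

def is_forbidden_ip(addr: str) -> bool:
    """Reject addresses an SSRF-safe fetcher should never connect to."""
    ip = ipaddress.ip_address(addr)
    return (
        ip.is_private      # RFC 1918 ranges, e.g. 10.0.0.0/8
        or ip.is_loopback   # 127.0.0.0/8, ::1
        or ip.is_link_local # 169.254.0.0/16 (cloud metadata lives here)
        or ip.is_reserved
        or ip.is_multicast
    )

for addr in ("93.184.216.34", "10.0.0.5", "127.0.0.1", "169.254.169.254"):
    print(addr, is_forbidden_ip(addr))
```

Note that checking the hostname's resolved address once is not enough: defeating DNS rebinding also requires pinning the validated IP at connect time, which is why docpull lists it as a separate protection.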

Options

Run docpull --help for the full list. Highlights:

Core:
  --profile {rag,mirror,quick,llm,custom}
  --single                Fetch one URL (no crawl)
  --format {markdown,json,ndjson,sqlite}
  --stream                Stream NDJSON to stdout

LLM / chunking:
  --max-tokens-per-file N
  --tokenizer NAME        tiktoken encoding (default cl100k_base)
  --emit-chunks           One file/record per chunk

Content extraction:
  --extractor {default,trafilatura}
  --no-special-cases      Disable framework extractors
  --strict-js-required    Error on JS-only pages

Cache:
  --cache                 Enable incremental updates
  --cache-dir DIR
  --cache-ttl DAYS

Troubleshooting

docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading
docpull URL --preview-urls    # List URLs without fetching


License

MIT
