Pull documentation from the web and convert to clean markdown

docpull

Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.

Python 3.10+ · License: MIT

docpull uses async HTTP (not Playwright) to fetch server-rendered pages, extracts main content, and writes clean Markdown with source-URL frontmatter — in seconds, with a small install footprint. It won't render JavaScript, but for the large class of docs that don't need it (API references, Python/Go stdlib, most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a fast, auditable, sandbox-friendly way to pipe documentation into an LLM context, a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and CRLF-injection protections are on by default — a necessity when an AI agent is choosing the URLs.

Install

pip install docpull

# Optional extras
pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
pip install 'docpull[mcp]'           # run as an MCP server for AI agents
pip install 'docpull[all]'           # everything above

Quick start

# Crawl and save Markdown
docpull https://docs.example.com

# One page, no crawl — the fast path for agents
docpull https://docs.example.com/guide --single

# LLM-ready NDJSON with 4k-token chunks streamed to stdout
docpull https://docs.example.com --profile llm --stream | jq .

# Mirror a site for offline use
docpull https://docs.example.com --profile mirror --cache

Framework-aware extraction

docpull inspects each page before running the generic extractor and can pull content directly from framework data feeds:

| Framework | Strategy |
| --- | --- |
| Next.js | Parses __NEXT_DATA__ JSON |
| Mintlify | __NEXT_DATA__ with Mintlify tagging |
| OpenAPI | Renders openapi.json / swagger.json into Markdown |
| Docusaurus | Detected and tagged; generic extractor produces Markdown |
| Sphinx | Detected from generator metadata / Read the Docs hosts and tagged; generic extractor produces Markdown |
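
To illustrate the Next.js row: Next.js pages that are server-rendered with the Pages Router embed their page data in a `<script id="__NEXT_DATA__">` tag. The sketch below pulls that JSON out using only the standard library. It is a minimal illustration of the idea, not docpull's extractor; the names `NextDataParser` and `extract_next_data` and the sample HTML are hypothetical.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">, if present."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text, so the JSON is left untouched.
        if self._inside:
            self._parts.append(data)

    def payload(self):
        raw = "".join(self._parts).strip()
        return json.loads(raw) if raw else None


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ object, or None when the tag is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    return parser.payload()


# Hypothetical page body used only for this demonstration.
sample = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Getting Started"}}}'
    "</script></body></html>"
)
data = extract_next_data(sample)
print(data["props"]["pageProps"]["title"])
```

Where the useful content sits inside the payload varies by site, so a real extractor still needs per-framework logic after this step.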

JS-only SPAs with no server-rendered content are detected and skipped with a clear reason (or, with --strict-js-required, reported as an error so agents can route elsewhere).

Agent-friendly features

  • --single — fetch a single URL without discovery. Designed for tool loops.
  • --stream — NDJSON one-record-per-line, flushed on every page, pipeable.
  • --max-tokens-per-file N — split each page into token-bounded chunks on heading boundaries (exact counts with tiktoken, estimate without).
  • --emit-chunks — write one file or record per chunk instead of per page.
  • --strict-js-required — hard-fail on JS-only pages instead of silently skipping.
  • --extractor trafilatura — swap in trafilatura for sites where the default heuristics struggle.
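
The idea behind --max-tokens-per-file (token-bounded chunks cut on heading boundaries, with an estimate when tiktoken is not installed) can be sketched as follows. This is an illustrative approximation, not docpull's implementation: the four-characters-per-token estimate and the function names are assumptions.

```python
import re


def estimate_tokens(text: str) -> int:
    # Assumption: roughly four characters per token when no tokenizer is available.
    return max(1, len(text) // 4)


def split_on_headings(markdown: str) -> list[str]:
    # Split immediately before each ATX heading so a heading stays with its section.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part for part in parts if part.strip()]


def chunk_markdown(markdown: str, max_tokens: int) -> list[str]:
    """Greedily pack whole sections into chunks of at most max_tokens (estimated)."""
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for section in split_on_headings(markdown):
        cost = estimate_tokens(section)
        if current and used + cost > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(section)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks


doc = "# A\n" + "a" * 40 + "\n## B\n" + "b" * 40 + "\n## C\n" + "c" * 40 + "\n"
for index, chunk in enumerate(chunk_markdown(doc, max_tokens=25)):
    print(index, estimate_tokens(chunk), chunk.splitlines()[0])
```

Two limitations of this sketch: a single section larger than the budget is emitted as one oversized chunk, and a line starting with `# ` inside a fenced code block is treated as a heading.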

Python API

from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)
print(ctx.markdown[:500])

Async streaming:

import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())

Single-page from an agent tool:

from docpull import Fetcher, DocpullConfig

async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""

Profiles

docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
docpull https://site.com --profile llm      # NDJSON + chunks + metadata; JS-only pages skip unless --strict-js-required is passed.
docpull https://site.com --profile mirror   # Full archive, polite, cached, hierarchical paths.
docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.

Parallel context packs

docpull[parallel] adds an optional source-discovery and research layer on top of Parallel web intelligence APIs. Use the core crawler when you already know the docs URL. Use Parallel packs when an agent needs current web sources found, extracted, researched, organized, and loaded as durable local context.

Parallel handles live Search, Extract, Task, Entity Search, FindAll, TaskGroup, and Monitor calls. docpull turns those results into local, inspectable packs with Markdown, NDJSON, manifests, hashes, source indexes, workflow metadata, and an AGENT_CONTEXT.md load plan so agents do not need to reverse-engineer raw API responses before using the pack.

Live calls use your own Parallel API key. docpull never proxies requests through a shared account or writes the key into pack artifacts. For normal PyPI installs, docpull parallel init stores the key in a local secrets file with restricted permissions so it works across shells and agent sessions. Pack artifacts can still contain source content, workflow inputs/outputs, selected URLs, and metadata, so treat them as research data.

pip install 'docpull[parallel]'

# Recommended local setup. Prompts securely and writes ~/.config/docpull/secrets.env.
docpull parallel init
docpull parallel auth --json

# Optional project-local setup for agent worktrees. Writes .env.local and
# adds it to .gitignore.
docpull parallel init --project

# CI can still use a normal environment secret instead.
export PARALLEL_API_KEY="<your-parallel-api-key>"

docpull parallel context-pack "Compare AI web-search APIs for agents" \
  --query "AI web search API" \
  --query "agent web extraction API" \
  --include-domain parallel.ai \
  --exclude-domain onparallel.com \
  --output-dir ./packs/ai-web-search \
  --mode advanced \
  --extract-limit 8 \
  --max-estimated-cost 0.05 \
  --task-brief

docpull parallel auth checks local SDK/key presence without printing the key value and reports the key source (env, project_env, user_config, or missing). It does not make a live Parallel call or prove the key is valid. Use --json when an agent or CI job needs machine-readable configuration status.

The pack contains:

  • AGENT_CONTEXT.md — agent load plan with source order, pack signals, warnings, errors, and artifact map.
  • documents.ndjson — chunked records for agents and RAG pipelines.
  • corpus.manifest.json — stable docpull record manifest.
  • parallel.pack.json — Parallel workflow metadata, selected URLs, errors, IDs, and usage.
  • sources.md and sources/*.md — human-readable sources and extracted Markdown.
  • brief.md when --task-brief is enabled.

That makes the Parallel layer more than a pass-through wrapper: Parallel finds and enriches live web sources, while docpull normalizes them into a repeatable context pack that can be scored, diffed, audited, and loaded offline.

For planning without spending credits, use --dry-run. For repeatable packs, use --include-domain, --exclude-domain, --after-date, and --max-search-results to pin the source policy that appears later in parallel.pack.json. Use --fetch-max-age-seconds, --fetch-timeout-seconds, --disable-cache-fallback, --excerpt-chars-per-result, and --location when you need Parallel's freshness, excerpt-size, and geo controls.

Offline fixtures are supported for demos and tests without a Parallel account:

docpull parallel demo --output-dir ./packs/demo

When working from the source checkout, the same fixture can also be imported directly with docpull parallel import ./docs/examples/parallel-search-extract.json.

The core workflow is Search + Extract + optional Task. Broader pack workflows cover Search-only snapshots, docs discovery plans, Extract-known-URL packs, core-fetch-with-Parallel-fallback packs, diff briefs, Task lifecycle snapshots, Entity Search, FindAll lifecycle actions, TaskGroup batches, Monitor metadata/events, API specs, and local source scoring. See docs/parallel.md for the Parallel product cross-reference and the boundary between docpull, Parallel MCP, and planned workflows.

Additional Parallel-backed artifact commands:

# Readiness and API-key setup checks.
docpull parallel auth
docpull parallel init
docpull parallel init --project

# Search + Extract + optional Task brief.
docpull parallel context-pack "Compare AI web-search APIs for agents" \
  --query "AI web search API" \
  --include-domain parallel.ai \
  --dry-run

# Search-only, docs-discovery, known-URL extract, fallback, and structured Task packs.
docpull parallel search-pack "Parallel Search API docs" --query "Parallel Search API" --dry-run
docpull parallel discover-docs "Find Parallel API docs" \
  --query "Parallel Search API docs" \
  --include-domain docs.parallel.ai \
  --dry-run
docpull parallel extract-pack https://docs.parallel.ai/api-reference/search/search --dry-run
docpull parallel fallback-pack https://docs.parallel.ai/api-reference/search/search --dry-run
docpull parallel task-pack "Research Parallel Search API" \
  --source-include-domain docs.parallel.ai \
  --output-schema-json '{"type":"object","properties":{"summary":{"type":"string"}},"required":["summary"]}' \
  --dry-run

# Entity dossiers, FindAll lifecycle, batch enrichment, monitors, and API specs.
docpull parallel entity-pack "AI developer infrastructure companies" --entity-type companies
docpull parallel findall-pack "AI developer infrastructure companies" --generator preview --dry-run
docpull parallel findall-ingest-pack "Find companies with public API docs"
docpull parallel findall-result-pack findall_123 --output-dir ./packs/findall-result
docpull parallel taskgroup-pack ./companies.json --prompt-template "Research {company}" --dry-run
docpull parallel taskgroup-pack ./companies.json --prompt-template "Research {company}" --wait
docpull parallel monitor-pack create "New vendor pricing changes" --dry-run
docpull parallel monitor-pack create --type snapshot --task-run-id run_123 --dry-run
docpull parallel monitor-pack list --status active --output-dir ./packs/monitors
docpull parallel monitor-pack events monitor_123 --cursor next_cursor --output-dir ./packs/events
docpull parallel api-pack https://docs.parallel.ai/public-openapi.json --output-dir ./packs/parallel-openapi
docpull parallel run ./docs/examples/parallel-llms-api-pack.yaml --dry-run
docpull parallel run ./docs/examples/parallel-openapi-api-pack.yaml --dry-run
docpull pack score ./packs/parallel-openapi
docpull pack sources ./packs/parallel-openapi --require-domain docs.parallel.ai
docpull pack diff ./packs/old ./packs/new --markdown ./packs/changes.md
docpull parallel diff-brief ./packs/old ./packs/new --dry-run

MCP server

docpull ships an MCP (Model Context Protocol) server so AI agents can call it directly over stdio:

pip install 'docpull[mcp]'
docpull mcp  # starts the stdio server

Claude Code:

claude mcp add --transport stdio docpull -- docpull mcp

Cursor (.cursor/mcp.json in a project, or ~/.cursor/mcp.json globally):

{
  "mcpServers": {
    "docpull": {
      "type": "stdio",
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}

Claude Desktop uses the same mcpServers shape in claude_desktop_config.json.
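
If you script your editor setup, the mcpServers entry above can be merged into an existing config file without disturbing other servers. The helper below is a hypothetical convenience, not something docpull ships; it assumes the target file is either absent or contains valid JSON, and the demo writes to a temporary directory rather than a real config path.

```python
import json
import tempfile
from pathlib import Path

# Same entry as the Cursor example above.
DOCPULL_SERVER = {"type": "stdio", "command": "docpull", "args": ["mcp"]}


def add_docpull_server(config_path: Path) -> dict:
    """Insert or overwrite the "docpull" entry, keeping any other servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})["docpull"] = DOCPULL_SERVER
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / ".cursor" / "mcp.json"
    path.parent.mkdir()
    path.write_text('{"mcpServers": {"other": {"command": "example"}}}')
    merged = add_docpull_server(path)
    print(sorted(merged["mcpServers"]))
```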

Or, if you use Claude Code, install the plugin instead — it bundles the MCP server, five slash commands (/docs-add, /docs-search, /docs-list, /docs-refresh, /docs-remove), and a meta-skill that teaches Claude when to reach for docpull automatically:

# 1. Install docpull with the MCP extra (required for the plugin)
pip install 'docpull[mcp]'
# 2. Then in Claude Code:
/plugin marketplace add raintree-technology/docpull
/plugin install docpull@docpull

See plugin/README.md for details.

Tools exposed (12 total — read tools advertise readOnlyHint so hosts that auto-approve safe tools won't prompt):

Read:

  • fetch_url(url, max_tokens?) — one-shot fetch, no crawl. HTTPS-only, SSRF-validated.
  • list_sources(category?) — show available aliases (react, nextjs, parallel, fastapi, …)
  • list_indexed() — what has been fetched locally, with last-fetched age
  • grep_docs(pattern, library?, limit?, context?) — regex search across fetched Markdown (length-capped + wall-clock budgeted to mitigate ReDoS)
  • read_doc(library, path, line_start?, line_end?) — read a specific cached file, optionally line-sliced
  • pack_score(pack_dir, required_domains?) — score a local context pack for agent readiness.
  • pack_diff(old_pack_dir, new_pack_dir) — compare pack URL sets and content hashes.

Write:

  • ensure_docs(source, force?, profile?) — fetch a named library (cached 7 days). Forwards progress to clients that supply a progressToken.
  • parallel_context_pack(...) — build or dry-run a Parallel Search + Extract pack without shelling out.
  • parallel_api_pack(source, kind?, output_dir?) — build a pack from llms.txt or OpenAPI.
  • add_source(name, url, description?, category?, max_pages?, force?) — register a user alias (HTTPS-only, atomic write to sources.yaml).
  • remove_source(name, delete_cache?) — drop a user alias and (optionally) its cached docs.

All schema-backed tools return structuredContent validated against an outputSchema for clients that prefer typed output. fetch_url intentionally returns Markdown text directly.
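
The ReDoS mitigation noted for grep_docs (length caps plus a wall-clock budget) follows a common pattern, sketched here. The specific limits, the function name, and the choice to cap both pattern and line length are assumptions for illustration; docpull's actual limits and mechanism are not documented on this page.

```python
import re
import time


def budgeted_grep(pattern, lines, *, max_pattern_len=200, max_line_len=2000,
                  budget_seconds=2.0, limit=50):
    """Regex search that stops when the time budget or the hit limit is reached."""
    if len(pattern) > max_pattern_len:
        raise ValueError("pattern too long")
    regex = re.compile(pattern)
    deadline = time.monotonic() + budget_seconds
    hits = []
    for number, line in enumerate(lines, start=1):
        if len(hits) >= limit or time.monotonic() > deadline:
            break
        # Truncating the input bounds the work any single match attempt can do.
        if regex.search(line[:max_line_len]):
            hits.append((number, line))
    return hits


text = ["# Install", "pip install docpull", "## Usage", "docpull --help"]
print(budgeted_grep(r"^#+ ", text))
```

The deadline is only checked between lines, because Python's `re` module cannot interrupt a match in progress. The input-length cap is what keeps a single pathological match bounded.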

User-defined sources live in ~/.config/docpull-mcp/sources.yaml:

sources:
  mydocs:
    url: https://docs.example.com
    description: My internal docs
    category: internal
    maxPages: 200

Supported MCP path

The supported MCP server is the Python stdio server started by docpull mcp. That is the only MCP path covered by the docpull package release contract and the one agents, plugin users, Claude Code, Cursor, and Claude Desktop should use.

This repository also contains an mcp/ directory with an internal TypeScript + Bun lab for PostgreSQL/pgvector semantic search. It is not shipped by the Python package, is not documented as a user install path, and should be ignored unless you are explicitly developing that lab.

Output

Markdown files with YAML frontmatter:

---
title: "Getting Started"
source: https://docs.example.com/guide
source_type: "nextjs"
---

# Getting Started

NDJSON (one record per page or chunk):

{"document_id": "doc_...", "chunk_id": "chunk_...", "url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
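
A consumer can rebuild per-document chunk order from records shaped like the one above. This sketch relies only on the `document_id` and `chunk_index` fields shown in that record; the sample lines and IDs are made up for the demonstration.

```python
import json
from collections import defaultdict
from typing import Iterable


def group_chunks(lines: Iterable[str]) -> dict[str, list[dict]]:
    """Group NDJSON records by document_id, ordered by chunk_index."""
    docs: dict[str, list[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        docs[record["document_id"]].append(record)
    for records in docs.values():
        records.sort(key=lambda record: record["chunk_index"])
    return dict(docs)


# Made-up records; real ones carry more fields (url, title, hash, ...).
sample_lines = [
    '{"document_id": "doc_a", "chunk_index": 1, "content": "second"}',
    '{"document_id": "doc_a", "chunk_index": 0, "content": "first"}',
    '{"document_id": "doc_b", "chunk_index": 0, "content": "only"}',
]
for doc_id, records in group_chunks(sample_lines).items():
    print(doc_id, [record["content"] for record in records])
```

Because the function accepts any iterable of lines, passing `sys.stdin` instead of `sample_lines` lets it sit at the end of a `docpull ... --stream` pipe.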

Every output format also writes corpus.manifest.json next to the generated documents. The manifest records the run identity, output format, stable document_id / chunk_id values, content hashes, relative output paths, and chunk counts so regenerated corpora can be diffed and cited by agents.
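
As a rough illustration of diffing two generations of a corpus, the sketch below compares the `url` and `hash` fields of NDJSON records. It works on the record fields shown above rather than on corpus.manifest.json, whose schema is not reproduced here; for real packs, docpull pack diff is the supported route. Function names and sample data are hypothetical.

```python
import json


def index_by_url(ndjson_text: str) -> dict[str, set[str]]:
    """Map each URL to the set of content hashes of its records."""
    index: dict[str, set[str]] = {}
    for line in ndjson_text.splitlines():
        if line.strip():
            record = json.loads(line)
            index.setdefault(record["url"], set()).add(record["hash"])
    return index


def diff_corpora(old_text: str, new_text: str) -> dict[str, list[str]]:
    """Report URLs that were added, removed, or whose hashes changed."""
    old, new = index_by_url(old_text), index_by_url(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(u for u in old.keys() & new.keys() if old[u] != new[u]),
    }


# Made-up records for the demonstration.
old_run = '{"url": "https://docs.example.com/a", "hash": "h1"}\n' \
          '{"url": "https://docs.example.com/b", "hash": "h2"}\n'
new_run = '{"url": "https://docs.example.com/a", "hash": "h1-edited"}\n' \
          '{"url": "https://docs.example.com/c", "hash": "h3"}\n'
print(diff_corpora(old_run, new_run))
```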

Security

  • HTTPS-only, mandatory robots.txt compliance
  • SSRF protection: blocks private/internal network IPs, DNS rebinding via connect-time address pinning
  • XXE protection via defusedxml on sitemaps
  • Path traversal and CRLF header injection guards
  • Auth headers stripped on cross-origin redirects

When running with --proxy, DNS pinning is delegated to the proxy. Pass --require-pinned-dns to refuse this configuration and keep the connector-level SSRF guarantees in effect.
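
The general shape of connect-time address pinning is: resolve the hostname once, reject the request if any resolved address is not publicly routable, then connect to the validated address rather than resolving the name again. The sketch below shows the validation half with the standard library. It is a simplified illustration under those assumptions, not docpull's connector; `is_public_ip` and `resolve_and_pin` are hypothetical names.

```python
import ipaddress
import socket


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (not loopback, private, link-local)."""
    return ipaddress.ip_address(address).is_global


def resolve_and_pin(host: str, port: int = 443) -> list[str]:
    """Resolve once and return addresses that are safe to connect to.

    A caller would then open the socket to one of the returned IPs directly,
    so a second DNS answer cannot swap in an internal address (DNS rebinding).
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    blocked = [address for address in addresses if not is_public_ip(address)]
    if blocked:
        raise ValueError(f"refusing non-public address(es) for {host}: {blocked}")
    return addresses


for candidate in ("127.0.0.1", "10.0.0.5", "169.254.169.254", "8.8.8.8"):
    print(candidate, is_public_ip(candidate))
```

Rejecting the whole request when any resolved address is non-public is the conservative choice: it avoids depending on which record the connector happens to pick.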

Options

Run docpull --help for the full list. Highlights:

Core:
  --profile {rag,mirror,quick,llm}
  --single                Fetch one URL (no crawl)
  --format {markdown,json,ndjson,sqlite}
  --stream                Stream NDJSON to stdout

LLM / chunking:
  --max-tokens-per-file N
  --tokenizer NAME        tiktoken encoding (default cl100k_base)
  --emit-chunks           One file/record per chunk

Content extraction:
  --extractor {default,trafilatura}
  --no-special-cases      Disable framework extractors
  --strict-js-required    Error on JS-only pages

Cache:
  --cache                 Enable incremental updates
  --cache-dir DIR
  --cache-ttl DAYS

Crawl:
  --max-concurrent N      Global request concurrency
  --per-host-concurrent N Per-host request concurrency

Performance

End-to-end numbers from tests/benchmarks/test_10k_pages.py against a synthetic 10,000-page localhost site (RAG profile, max_concurrent=50, per_host_concurrent=50, HTTP keep-alive, 5% injected duplicate content). The benchmark emits progress every 1,000 pages plus a final JSON report for trend tooling. Wall time varies by machine; the value below is the latest local audit run.

| Metric | Value |
| --- | --- |
| Total wall time | ~309 s |
| Pages fetched / skipped / failed | 9,501 / 499 / 0 |
| Time to first saved page | ~119 ms |
| Per-page latency p50 / p95 / p99 | ~0 / 168 / 271 ms |
| Peak RSS delta from baseline | ~93 MB |
| Cache manifest size on disk | ~8.9 MB |
| Duplicates detected (5% injected) | 499 / 500 |

Reproduce with make benchmark (requires aiohttp; runs the gated benchmark in tests/benchmarks/ and prints progress plus a JSON line you can pipe into trend tooling).
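
If you collect your own per-page timings, p50 / p95 / p99 figures like those in the table can be derived with the standard library. This is a generic sketch; the benchmark's own aggregation method is not described here, so the interpolation method below is an assumption and small differences from its report are possible.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50, p95 and p99 using inclusive linear interpolation."""
    # n=100 yields 99 cut points; cut point k (1-based) is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic samples: 1 ms through 101 ms.
samples = [float(value) for value in range(1, 102)]
print(latency_percentiles(samples))
```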

Troubleshooting

docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading
docpull URL --preview-urls    # List URLs without fetching

License

MIT

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docpull-4.1.0.tar.gz (235.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

docpull-4.1.0-py3-none-any.whl (205.0 kB view details)

Uploaded Python 3

File details

Details for the file docpull-4.1.0.tar.gz.

File metadata

  • Download URL: docpull-4.1.0.tar.gz
  • Upload date:
  • Size: 235.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for docpull-4.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 8238d6f3efd3df2cb246c94241c207862a1f3ea2f6b5104893d508f2fc3f157a |
| MD5 | bcf9adaad2062795d276386bb3296c18 |
| BLAKE2b-256 | c02a23335122a48b36a0fc89e71a15e77b44cb14aeb4d8baa14b989ff188aeff |

Provenance

The following attestation bundles were made for docpull-4.1.0.tar.gz:

Publisher: publish.yml on raintree-technology/docpull

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docpull-4.1.0-py3-none-any.whl.

File metadata

  • Download URL: docpull-4.1.0-py3-none-any.whl
  • Upload date:
  • Size: 205.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for docpull-4.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 2fd0ef2502820b9216d9aaf7572085c587a93489e385050f23f3cf1c1132785d |
| MD5 | 618b05314f7ba47649a49df492f1938b |
| BLAKE2b-256 | 71053d9d1548eab69184eecc48603766ab7fcd1863cb6c431fa796c4b550177d |

Provenance

The following attestation bundles were made for docpull-4.1.0-py3-none-any.whl:

Publisher: publish.yml on raintree-technology/docpull

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
