Convert public web pages into clean Markdown for AI agents and RAG

docpull

Turn public web sources into local Markdown, NDJSON, and agent-ready context packs. Browser-free by default.

docpull is a Python CLI, SDK, and MCP server that fetches public or explicitly authorized static/server-rendered web pages and converts them into clean, auditable local artifacts for LLMs, retrieval-augmented generation (RAG), offline research, and agent workflows.

DocPull is local-first: direct fetching, sitemap/link discovery, extraction, indexing, pack intelligence, and local agent-browser rendering can run with no provider account and no required API spend. Tavily, Exa, Parallel, and cloud renderers are optional escalation paths when local and open-source routes are not enough.

DocPull exposes the same core workflows through CLI, Python SDK, and MCP, with each surface optimized for its user. The Surface Contract defines how those surfaces align and where they intentionally differ.

Web-source ingestion is the core workflow. Documentation is one high-value lane, not the product boundary. docpull works best on static or server-rendered pages such as blogs, API references, OpenAPI specs, changelogs, vendor pages, product pages, filings, docs sites, and other pages where the useful content is available in HTML or embedded page data.

docpull is browser-free by default. JS-only pages are skipped with a clear reason unless you explicitly opt into the local agent-browser renderer. See Scraping Boundary and Alternatives for the full boundary.

Install

pip install docpull

Install optional extras as needed:

pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
pip install 'docpull[mcp]'           # stdio MCP server
pip install 'docpull[serve]'         # local pack JSON server runner
pip install 'docpull[parallel]'      # Parallel context packs
pip install 'docpull[observability]' # Raindrop benchmark tracing
pip install 'docpull[e2b]'           # E2B cloud sandbox renderer SDK
pip install 'docpull[all]'           # all optional extras

Browser rendering is an explicit external extension, not part of the base install. Install an agent-browser compatible CLI separately, put it on PATH, or set DOCPULL_AGENT_BROWSER_BIN=/path/to/agent-browser. Verify the runtime with docpull render --check. Render targets must use HTTPS except for localhost/loopback HTTP during local testing, and DocPull keeps renderer action permissions locked down to HTML retrieval only.
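
The render-target rule above (HTTPS everywhere, plain HTTP only for localhost or loopback) can be illustrated with a small standalone check. This is a minimal sketch using only the standard library; the function name is_allowed_render_target is hypothetical and this is not docpull's own validation code.

```python
import ipaddress
from urllib.parse import urlsplit


def is_allowed_render_target(url: str) -> bool:
    """Return True for HTTPS URLs, or plain HTTP aimed at localhost/loopback.

    Hypothetical helper that mirrors the rule described in the text.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        return False
    if parts.scheme == "https":
        return True
    if parts.scheme != "http":
        return False
    if host == "localhost":
        return True
    try:
        # IP literals such as 127.0.0.1 or ::1 count as loopback.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so this is a non-localhost hostname over HTTP.
        return False
```

A check like this is useful in a wrapper script that wants to reject a URL before handing it to docpull render.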

For stronger isolation, cloud runtimes are available explicitly: docpull render URL --runtime vercel uses the Vercel Sandbox CLI and Vercel auth, while docpull render URL --runtime e2b uses the E2B Python SDK and E2B_API_KEY. These are never enabled by default. All runtimes execute the same agent-browser --json renderer contract. Use --cloud-max-estimated-cost to set a local per-render budget guard, and use --cloud-agent-browser-install skip with a prebuilt sandbox/template that already includes agent-browser. For E2B, pass --template or set DOCPULL_E2B_TEMPLATE to use that prebuilt environment.

Free-First Budgets

Use --budget 0 when a run must not make paid-capable provider or cloud calls:

docpull https://docs.example.com --budget 0 -o ./docs/example
docpull discover scan https://docs.example.com -o ./packs/discovery
docpull render https://example.com/app --runtime local --budget 0
docpull providers context-pack "Find official docs" --provider all --dry-run --budget 0 --json
docpull benchmark quick --zero-dollar --target-set zero-dollar --provider all

Under a zero budget, local cache, direct HTTP, sitemap/static-link discovery, local extraction, local indexing, pack analysis, monitors, and local agent-browser rendering remain allowed. Live Tavily, Exa, Parallel, Vercel Sandbox, and E2B calls are blocked before execution. Runs involving a budget or paid-capable route write run.accounting.json with non-secret route, cost, HTTP/cache, browser, and blocked-action metadata.
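
To make the zero-budget behavior concrete, here is a minimal sketch of a guard that lets local routes through and blocks paid-capable routes before execution. The class name BudgetGuard, the route labels, and the cost arithmetic are all illustrative assumptions; docpull's real policy engine and the run.accounting.json format are not reproduced here.

```python
from dataclasses import dataclass, field

# Illustrative route labels for the paid-capable routes named in the text.
# These strings are assumptions, not docpull identifiers.
PAID_CAPABLE = {"tavily", "exa", "parallel", "vercel-sandbox", "e2b"}


@dataclass
class BudgetGuard:
    """Hypothetical guard: free routes always pass, paid routes need budget."""

    budget: float
    spent: float = 0.0
    blocked: list[str] = field(default_factory=list)

    def allow(self, route: str, estimated_cost: float = 0.0) -> bool:
        if route not in PAID_CAPABLE:
            return True  # local cache, direct HTTP, local rendering, ...
        if self.budget <= 0 or self.spent + estimated_cost > self.budget:
            self.blocked.append(route)  # record the blocked action
            return False
        self.spent += estimated_cost
        return True
```

With BudgetGuard(budget=0), a call such as allow("direct-http") returns True, while allow("tavily", 0.01) returns False and is recorded in blocked.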

Use docpull discover scan URL to build a provider-free discovery pack from open site hints: llms.txt, RSS/Atom feeds, OpenAPI specs, sitemap indexes, and public GitHub docs trees. It writes the same candidate_sources.ndjson contract as provider imports and URL/sitemap files, so the next step is still docpull discover select or docpull discover fetch.
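
A sitemap is one of the open site hints listed above. As a rough illustration of that single step, the sketch below pulls the URL entries out of a sitemap document with the standard library. The function name sitemap_locs is hypothetical and the sample XML is made up; the sketch does not reproduce docpull's discovery logic, its candidate_sources.ndjson contract, or its XXE hardening, so do not feed it untrusted XML as-is.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text: str) -> list[str]:
    """Return every <loc> value from a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text]


# Made-up sample input for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/</loc></url>
  <url><loc>https://docs.example.com/auth</loc></url>
</urlset>"""

urls = sitemap_locs(SAMPLE)
```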

When a zero-dollar benchmark or local run is partial, DocPull reports the lowest-friction escalation path before spending money: local --render fallback first, BYOK providers next, and cloud rendering only when local rendering or infrastructure is the blocker. Benchmark reports include suggested commands, estimated paid request counts, and estimated paid cost guards before any provider or cloud call is made.

The zero-dollar benchmark target set is the Phase 2 measurement matrix. It keeps the existing docs/provider targets and adds JS-heavy docs, pricing, filings, feeds, sitemaps, and search-to-evidence tasks. The report classifies each target as complete_for_0, complete_with_local_browser, partial_for_0, requires_provider, requires_cloud_browser, or blocked_by_policy.

Open Source And Hosted Boundary

The open-source package owns local fetching, local rendering adapters, provider-free discovery, extraction, indexing, packs, diffs, monitors, MCP, BYOK providers, budget policy, accounting, and benchmarks.

A hosted DocPull product, if offered, should sell managed execution: always-on schedules, browser/proxy infrastructure, persistent auth profiles, queues, alerts, dashboards, collaboration, retention, SSO, audit logs, SLAs, and bundled provider billing. The hosted boundary does not change the OSS default: no hidden paid calls, no CAPTCHA bypass, no stealth scraping, and no claim of a proprietary web-scale index.

30-Second Usage

docpull https://www.python.org/blogs/ --single -o ./python-news

Example output:

python-news/
  index.md
  corpus.manifest.json

Markdown includes source metadata and readable page content:

---
title: "Blogs"
source: https://www.python.org/blogs/
source_type: "html"
---

# Blogs

News from the Python Software Foundation, Python core developers, and the
wider Python community.
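
Downstream scripts often need the source metadata separately from the page body. The following is a minimal sketch that splits the front matter shown above from the Markdown content. The function name split_front_matter is hypothetical, and the parser only handles flat key: value lines like those in the example; nested YAML would need a real YAML parser.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a leading '---' block of flat key: value lines from the body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the Markdown body.
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # no closing delimiter: treat the whole text as body


page = '''---
title: "Blogs"
source: https://www.python.org/blogs/
source_type: "html"
---

# Blogs

News from the Python Software Foundation.
'''

meta, body = split_front_matter(page)
```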

Stream chunked NDJSON for agents and RAG:

docpull https://www.python.org/blogs/ \
  --single \
  --profile llm \
  --stream | jq .

Each line is a JSON document:

{"schema_version":1,"document_id":"doc_...","chunk_id":"chunk_...","url":"https://www.python.org/blogs/","title":"Blogs","content":"News from the Python Software Foundation...","source_type":"html","chunk_index":0,"token_count":842}
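
Because each line is an independent JSON document, a consumer can process the stream line by line. Below is a minimal sketch that groups chunks by document and totals their token counts, using only the field names shown in the record above. The function name summarize_chunks is hypothetical, and the sample records use made-up IDs and URLs.

```python
import json
from collections import defaultdict
from typing import Iterable


def summarize_chunks(lines: Iterable[str]) -> dict[str, dict]:
    """Group NDJSON chunk records by document_id and total their tokens."""
    docs: dict[str, dict] = defaultdict(
        lambda: {"url": None, "chunks": 0, "tokens": 0}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        doc = docs[record["document_id"]]
        doc["url"] = record["url"]
        doc["chunks"] += 1
        doc["tokens"] += record.get("token_count", 0)
    return dict(docs)


# Made-up sample records for illustration only.
sample = [
    '{"schema_version":1,"document_id":"doc_a","chunk_id":"chunk_1","url":"https://example.com/a","chunk_index":0,"token_count":10}',
    '{"schema_version":1,"document_id":"doc_a","chunk_id":"chunk_2","url":"https://example.com/a","chunk_index":1,"token_count":5}',
    '{"schema_version":1,"document_id":"doc_b","chunk_id":"chunk_3","url":"https://example.com/b","chunk_index":0,"token_count":7}',
]
summary = summarize_chunks(sample)
```

In a pipeline, the same function can be given sys.stdin instead of a list so that it reads the output of a --stream run directly.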

Common Workflows

# Crawl a public web section and write Markdown files
docpull https://www.python.org/blogs/ -o ./python-news

# Stream LLM-ready NDJSON chunks from a source
docpull https://www.python.org/blogs/ --profile llm --stream | jq .

# Write SQLite with an FTS5 search index
docpull https://www.python.org/blogs/ --format sqlite -o ./python-news-db

# Build an Open Knowledge Format (OKF) bundle for portable source packs
docpull https://example.com --profile okf -o ./site-okf

# Turn a source corpus into agent-ready skills/rules
docpull https://sdk.vercel.ai \
  --skill vercel-ai \
  --skill-agent all \
  --skill-description "Vercel AI SDK source reference"

Local-first parity workflows mirror common hosted search/extract/crawl/research API shapes while writing auditable files instead of relying on a hosted index:

# Normalize candidate URLs without fetching content
docpull map urls ./urls.txt -o ./packs/map

# Extract known URLs into a local pack
docpull extract-pack ./urls.txt -o ./packs/extract

# Select mapped candidates and fetch them
docpull crawl-pack ./packs/map --select top:10 -o ./packs/crawl

# Answer/research from an existing local pack with lifecycle artifacts
docpull research-pack ./packs/crawl \
  --objective "Summarize auth and webhook behavior" \
  --schema ./output.schema.json

# Build a cited entity/list pack from existing evidence
docpull entities-pack ./packs/crawl --limit 100

More examples live in CLI Recipes.

With an explicit --skill-agent, docpull stores the scraped corpus under .docpull/skills/<name>/references and creates agent-specific wrappers that point at that corpus. --skill-agent claude writes a Claude Code skill under .claude/skills/<name>/, --skill-agent codex writes a Codex skill under .agents/skills/<name>/ with agents/openai.yaml, and --skill-agent cursor writes a Cursor project rule at .cursor/rules/<name>.mdc. Use --skill-agent all to create all three. If you pass --output-dir, docpull stages the generated corpus there; explicit --skill-agent targets still write their active agent wrappers.
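
The path layout described above can be written down as a small lookup, which is handy when a script needs to check which wrappers a run should have produced. This is a sketch based only on the paths named in the paragraph; the function name skill_paths is hypothetical and is not part of the docpull SDK.

```python
from pathlib import Path


def skill_paths(name: str, agent: str = "all", root: Path = Path(".")) -> dict[str, Path]:
    """Return the corpus path plus the wrapper path(s) for the chosen agent."""
    wrappers = {
        "claude": root / ".claude" / "skills" / name,
        "codex": root / ".agents" / "skills" / name,
        "cursor": root / ".cursor" / "rules" / f"{name}.mdc",
    }
    if agent != "all" and agent not in wrappers:
        raise ValueError(f"unknown agent: {agent}")
    selected = wrappers if agent == "all" else {agent: wrappers[agent]}
    corpus = root / ".docpull" / "skills" / name / "references"
    return {"corpus": corpus, **selected}


paths = skill_paths("vercel-ai")
```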

Use docpull when you need to:

- Convert public web sources (docs, blogs, API references, vendor pages, product pages, changelogs, filings, and OpenAPI specs) into Markdown or chunked NDJSON for LLM and RAG pipelines.
- Give an agent a local tool for fetching, caching, grepping, and reading web sources.
- Build repeatable context packs with stable IDs, hashes, manifests, and source metadata.
- Mirror public web content for offline work while preserving attribution.

Why docpull?

docpull is designed for agent and RAG workflows, not just downloading pages.

Need	docpull gives you
Clean Markdown	Article-focused extraction with source metadata
LLM chunks	NDJSON streaming and optional token-aware chunking
Repeatability	Stable document IDs, chunk IDs, hashes, and manifests
Offline work	Cached archives and mirrored source artifacts
Agent access	Local CLI, Python SDK, and stdio MCP server
Downstream exports	JSONL, Sheets CSV/TSV, n8n JSON, Vercel AI JSON, CrewAI JSON, warehouse NDJSON, optional Parquet, and agent skills
Safer fetching	HTTPS defaults, robots.txt compliance, SSRF protections, and redirect guards

Supported Sources

docpull uses async HTTP instead of browser automation by default and includes special handling for common web, documentation, and API surfaces.

Source shape	Support
Static HTML / SSR pages	Extracts article, main, or document regions
Next.js / Mintlify	Parses static HTML and `__NEXT_DATA__` when available
OpenAPI / Swagger	Renders specs into Markdown
Docusaurus / Sphinx / MkDocs	Extracts static article or document regions
VitePress / VuePress / Astro Starlight	Extracts static content regions
GitBook / ReadMe.io	Extracts available article or content regions
Redoc / Scalar	Extracts static API reference regions
JS-only apps	Skipped unless useful content is present in HTML or embedded data

Use --strict-js-required when an agent should treat JS-only pages as hard errors instead of normal skips.

Output Formats

Output	Use it for
Markdown	Local readable source snapshots with YAML frontmatter
NDJSON	Streamed records or chunked records for agents and RAG
SQLite	Local retrieval with an FTS5 index
OKF	Portable Open Knowledge Format bundles with indexes and manifests
Archive / mirror	Cached offline source snapshots
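
For readers unfamiliar with FTS5, the sketch below shows what full-text retrieval over chunk records looks like with Python's built-in sqlite3 module. The table name chunks and its columns are assumptions made for this example, not the schema docpull writes; inspect the generated database for the real layout. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

# In-memory database with a hypothetical FTS5 table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(url, title, content)")
conn.executemany(
    "INSERT INTO chunks (url, title, content) VALUES (?, ?, ?)",
    [
        ("https://example.com/auth", "Auth", "API keys and OAuth tokens"),
        ("https://example.com/webhooks", "Webhooks", "Each webhook delivery is signed"),
    ],
)


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list[tuple]:
    """Run an FTS5 MATCH query and return (url, title), best match first."""
    rows = conn.execute(
        "SELECT url, title FROM chunks WHERE chunks MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


hits = search(conn, "webhook")
```

The default FTS5 tokenizer does not stem, so a query for webhook matches the token webhook but not webhooks.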

Every file-backed run writes corpus.manifest.json with stable document IDs, chunk IDs, hashes, output paths, and chunk counts. See Corpus Manifest.
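
Stable IDs matter because they let two runs over the same source be compared record by record. The sketch below shows one way such IDs could be derived deterministically from a URL and chunk content. The hashing scheme is an assumption for illustration, not docpull's algorithm; only the doc_ and chunk_ prefixes come from the example record earlier in this README.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def document_id(url: str) -> str:
    """Hypothetical scheme: derive the document ID from the source URL."""
    return "doc_" + hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def chunk_id(doc_id: str, index: int, text: str) -> str:
    """Hypothetical scheme: ID depends on document, position, and content."""
    key = f"{doc_id}:{index}:{content_hash(text)}"
    return "chunk_" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


doc = document_id("https://www.python.org/blogs/")
first = chunk_id(doc, 0, "News from the Python Software Foundation")
```

Under a scheme like this, an unchanged page yields identical IDs on every run, and an edited chunk yields a new chunk ID, which makes diffs between runs cheap to compute.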

Profiles

docpull https://site.com --profile rag        # Default. Dedup + metadata.
docpull https://site.com --profile llm        # NDJSON chunks for agents/RAG.
docpull https://site.com --profile okf        # Portable Open Knowledge Format bundle.
docpull https://site.com --profile mirror     # Cached archive.
docpull https://site.com --profile quick      # Small sampling crawl.
docpull https://site.com --profile sec-filing # EDGAR-friendly evidence chunks.

Run docpull --help for the full option list.

When Not to Use docpull

docpull intentionally does not use a browser unless rendering is explicitly enabled. It is not the right tool for:

- JS-only pages that require complex browser workflows beyond static rendered HTML.
- Authenticated dashboards or private apps.
- Pages behind CAPTCHA or bot challenges.
- Workflows that require clicking, scrolling, or browser state.

For those cases, use browser automation, such as Playwright, then pass rendered HTML or exported content into your pipeline. For simple public JS-rendered pages, docpull render and --render fallback provide an explicit local fallback without changing the default crawler behavior. The fallback requires the optional external agent-browser backend.

How It Compares

Tool type	Best for	Tradeoff
`wget` / site mirroring	Downloading raw files	Not agent/RAG-oriented
Browser automation	JS-heavy pages and interactions	Slower, heavier, more stateful
Hosted extraction APIs	Managed extraction at scale	External dependency and cost
docpull	Local public web-source extraction and context packs	No JavaScript rendering by default

Python SDK

Fetch a single page:

from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title)
print(ctx.markdown[:500])

Run an async crawl and report progress events:

import asyncio
from docpull import Fetcher, DocpullConfig, EventType, ProfileName

async def main():
    cfg = DocpullConfig(url="https://example.com/blog", profile=ProfileName.LLM)
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            # Print one line per progress event.
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")

asyncio.run(main())

MCP Server

docpull can run as a stdio MCP server for agent clients:

pip install 'docpull[mcp]'
docpull mcp

Claude Code:

claude mcp add --transport stdio docpull -- docpull mcp

Cursor and Claude Desktop use the same mcpServers shape:

{
  "mcpServers": {
    "docpull": {
      "type": "stdio",
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}
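
If a client config file already lists other servers, the docpull entry should be merged in rather than overwriting the file. Here is a minimal sketch of that merge on an already-loaded config dictionary. The function name add_docpull_server is hypothetical, and the location of the config file depends on the client, so it is left out.

```python
import json


def add_docpull_server(config: dict) -> dict:
    """Return a copy of an MCP client config with the docpull entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["docpull"] = {"type": "stdio", "command": "docpull", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


# Made-up existing config for illustration only.
existing = {"mcpServers": {"other": {"command": "other-server"}}}
updated = add_docpull_server(existing)
rendered = json.dumps(updated, indent=2)
```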

The supported MCP path is the Python stdio server started by docpull mcp. The repository's mcp/ directory is an internal TypeScript/Bun lab and is not part of the package release contract.

Advanced Workflows

- docpull[parallel] can discover, extract, enrich, score, diff, and archive live web sources with your own Parallel API key. See Parallel Integration.
- Local pack intelligence can build citation maps, extract cited entities, search pack records, write provider-free research briefs, build cited source graphs, or prepare the full sidecar bundle. The commands are docpull pack citations, docpull pack entities, docpull pack search, docpull pack brief, docpull graph build, docpull graph query, and docpull pack prepare.
- Local-first expansion commands add policy files, discovery packs, refresh reports, audits, cited answers, exports, a localhost pack server, explicit rendering, authenticated-source checks, and cron-friendly monitors. The commands are docpull policy, docpull discover, docpull refresh, docpull pack audit, docpull answer-pack, docpull export, docpull serve, docpull share, docpull render, docpull auth check, and docpull monitor.
- docpull export writes local files for OpenAI vector JSONL, LangChain, LlamaIndex, DSPy, Sheets CSV/TSV, n8n workflow JSON, Vercel AI SDK JSON, CrewAI JSON, warehouse NDJSON, optional Parquet via docpull[parquet], and Codex/Claude/Cursor agent references.
- Optional provider workflows can use Parallel, Tavily, and Exa when configured. Tavily and Exa are available through docpull providers ... and first-class aliases such as docpull tavily context-pack, docpull exa context-pack, docpull exa extract-pack, and docpull tavily map-pack. Use docpull providers capabilities to see the shared baseline and provider-only extensions. For agent or CI logs, use docpull providers auth --json --require-ready --redact-paths for offline local readiness, then docpull providers probe --json --require-verified --redact-paths when explicit live key validation is intended. Successful provider context-pack runs are post-processed into the same local pack intelligence artifacts. See CLI Recipes.
- SEC filing evidence packs use rule profiles such as vendor-dependency-rules.yml.

Security Defaults

- HTTPS-only fetching with robots.txt compliance.
- SSRF protections, private network blocking, DNS rebinding protection, and connect-time address pinning.
- XXE protection for sitemaps.
- Path traversal and CRLF header injection guards.
- Auth headers stripped on cross-origin redirects.

When running with --proxy, DNS pinning is delegated to the proxy. Pass --require-pinned-dns to refuse that configuration.
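
To show what stripping auth headers on cross-origin redirects means in practice, here is a minimal sketch that compares the origin (scheme, host, port) of the original and redirected URLs. The helper names and the exact set of sensitive headers are assumptions for illustration; docpull's own implementation may differ.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Assumed set of credential-bearing headers, for this example only.
SENSITIVE = {"authorization", "cookie", "proxy-authorization"}


def origin(url: str) -> tuple:
    """Return (scheme, host, port), filling in the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def headers_for_redirect(headers: dict, from_url: str, to_url: str) -> dict:
    """Keep all headers on same-origin redirects, drop credentials otherwise."""
    if origin(from_url) == origin(to_url):
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE}


sent = {"Authorization": "Bearer example-token", "Accept": "text/html"}
same = headers_for_redirect(sent, "https://a.example/x", "https://a.example:443/y")
cross = headers_for_redirect(sent, "https://a.example/x", "https://b.example/y")
```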

Troubleshooting

docpull --doctor
docpull render --check
docpull URL --verbose
docpull URL --dry-run
docpull URL --preview-urls

Documentation

- CLI Recipes: common commands and advanced workflows.
- Scraping Boundary: what docpull does and does not fetch.
- Alternatives: when to use browser automation or hosted extraction.
- Corpus Manifest: stable IDs, hashes, and source maps.
- Surface Contract: how the CLI, Python SDK/API, and MCP surfaces align.
- Parallel Integration: live-source context pack workflows.
- Changelog: release history.

License

MIT

Release history

This version

5.0.2

Jun 24, 2026

5.0.1

Jun 24, 2026

5.0.0

Jun 22, 2026

4.4.1

Jun 17, 2026

4.4.0

Jun 17, 2026

4.3.1

Jun 15, 2026

4.3.0

Jun 15, 2026

4.2.0

Jun 8, 2026

4.1.0

Jun 8, 2026

4.0.1

Jun 6, 2026

4.0.0

Jun 4, 2026

3.0.2

May 29, 2026

3.0.1

May 29, 2026

3.0.0

Apr 27, 2026

2.5.1

Apr 26, 2026

2.5.0

Apr 26, 2026

2.4.0

Apr 26, 2026

2.3.0

Apr 24, 2026

2.2.0

Dec 15, 2025

2.0.0

Nov 29, 2025

1.5.0

Nov 28, 2025

1.3.0

Nov 20, 2025

1.2.1

Nov 17, 2025

1.2.0

Nov 16, 2025

1.1.0

Nov 14, 2025

1.0.2

Nov 14, 2025

1.0.1

Nov 14, 2025

Download files

Source Distribution

docpull-5.0.2.tar.gz (488.2 kB)

Uploaded Jun 24, 2026 Source

Built Distribution

docpull-5.0.2-py3-none-any.whl (416.7 kB)

Uploaded Jun 24, 2026 Python 3

File details

Details for the file docpull-5.0.2.tar.gz.

File metadata

Download URL: docpull-5.0.2.tar.gz
Upload date: Jun 24, 2026
Size: 488.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for docpull-5.0.2.tar.gz
Algorithm	Hash digest
SHA256	`23da7477bdde3483b46936d9f475121bb4a30b31ea2ae1992a43d8daf91c6e6f`
MD5	`25a9b85d16347479b56a988ad33fae94`
BLAKE2b-256	`5f20e893cb4ede40ba028c2b4c475ebc86d770fde1a53128d1daebd8ab7c35e4`

Provenance

The following attestation bundles were made for docpull-5.0.2.tar.gz:

Publisher: publish.yml on raintree-technology/docpull

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docpull-5.0.2.tar.gz
- Subject digest: 23da7477bdde3483b46936d9f475121bb4a30b31ea2ae1992a43d8daf91c6e6f
- Sigstore transparency entry: 1935292283
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: raintree-technology/docpull@2e9898237f0526df16d01cf2f6b9356f8b2568c1
- Branch / Tag: refs/tags/v5.0.2
- Owner: https://github.com/raintree-technology
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2e9898237f0526df16d01cf2f6b9356f8b2568c1
- Trigger Event: push

File details

Details for the file docpull-5.0.2-py3-none-any.whl.

File metadata

Download URL: docpull-5.0.2-py3-none-any.whl
Upload date: Jun 24, 2026
Size: 416.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for docpull-5.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`975e2a1f4bf61df7fe48429cf48c71551a9c0243b7536756ee095523946fa0a5`
MD5	`80d35133773238b41b97fe7f3501a73a`
BLAKE2b-256	`fa7806a88f34804357573da4fd9b40b24219137f134cdd3485eb5f6b95350009`

Provenance

The following attestation bundles were made for docpull-5.0.2-py3-none-any.whl:

Publisher: publish.yml on raintree-technology/docpull

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docpull-5.0.2-py3-none-any.whl
- Subject digest: 975e2a1f4bf61df7fe48429cf48c71551a9c0243b7536756ee095523946fa0a5
- Sigstore transparency entry: 1935292312
- Sigstore integration time: Jun 24, 2026
Source repository:
- Permalink: raintree-technology/docpull@2e9898237f0526df16d01cf2f6b9356f8b2568c1
- Branch / Tag: refs/tags/v5.0.2
- Owner: https://github.com/raintree-technology
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@2e9898237f0526df16d01cf2f6b9356f8b2568c1
- Trigger Event: push
