
pulldown

Pull down web pages as clean Markdown for LLM agents.

  • HTTP-first with browser-like defaults
  • Optional Chromium rendering for JS-heavy pages
  • Four detail levels: minimal, readable, full, raw
  • Brotli-compressed responses decoded out of the box in core installs
  • Concurrent batch fetching with fetch_many()
  • Bounded site crawling with robots.txt support and per-domain politeness
  • Validator-based caching (ETag / Last-Modified) with atomic writes
  • SSRF guards: private/loopback/metadata addresses blocked by default
  • Response size caps and transient-error retries
  • CLI, Python API, and MCP server

Install

pip install pulldown                 # core
pip install 'pulldown[render]'       # + Playwright (Chromium rendering)
pip install 'pulldown[mcp]'          # + MCP server
pip install 'pulldown[all]'          # everything

Core installs include Brotli support, so br-compressed HTML is decoded before minimal, readable, full, or raw processing.

Core installs also include lxml_html_clean, the package that recent lxml versions split their HTML cleaner into, avoiding the ImportError some agent sandboxes hit on older releases.

For rendered pages, also run playwright install chromium once.

Quick Start

CLI

pulldown get https://example.com
pulldown get https://example.com --detail minimal
pulldown get https://example.com --render --scroll 3
pulldown crawl https://docs.example.com --max-pages 20 --delay-ms 200
pulldown bench https://example.com --runs 5
pulldown cache stats

Python

import asyncio
from pulldown import fetch, fetch_many, crawl, Detail, PageCache

async def main():
    # Single fetch
    result = await fetch("https://example.com", detail=Detail.readable)
    print(result.title, result.content)

    # Batch fetch with caching
    cache = PageCache(ttl=3600)
    results = await fetch_many(
        ["https://a.com", "https://b.com"],
        concurrency=5,
        cache=cache,
        retries=2,
    )

    # Crawl a docs site
    crawl_result = await crawl(
        "https://docs.example.com/",
        max_pages=50,
        max_depth=2,
        respect_robots=True,
        per_domain_delay_ms=200,
    )
    markdown = crawl_result.to_markdown()

asyncio.run(main())
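
The retries parameter above implies conventional transient-error handling. A generic jittered-backoff sketch (an assumed helper, not pulldown's internals):

```python
import asyncio
import random

async def with_retries(op, retries: int = 2, base_delay: float = 0.5):
    """Run a zero-argument async operation, retrying failures with
    jittered exponential backoff; the last failure is re-raised."""
    for attempt in range(retries + 1):
        try:
            return await op()
        except Exception:
            if attempt == retries:
                raise
            # Exponential backoff with jitter to avoid thundering herds.
            await asyncio.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```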

MCP

Add to your client config (e.g. Claude Desktop):

{
  "mcpServers": {
    "pulldown": {
      "command": "python",
      "args": ["-m", "pulldown.mcp_server"],
      "env": {
        "PULLDOWN_CACHE_DIR": "~/.cache/pulldown"
      }
    }
  }
}

Environment variables:

| Variable | Default | Purpose |
| --- | --- | --- |
| MCP_TRANSPORT | stdio | stdio or http |
| MCP_HOST | 127.0.0.1 | Bind address for HTTP transport |
| MCP_PORT | 8080 | Port for HTTP transport |
| PULLDOWN_CACHE_DIR | unset | Enable caching to this directory |
| PULLDOWN_CACHE_TTL | 3600 | Cache TTL in seconds |
| PULLDOWN_ALLOW_PRIVATE | 0 | Set to 1 to allow private addresses |
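
The defaults in the table can be resolved with a small helper; a sketch assuming exactly the variable names and defaults listed:

```python
import os

def mcp_config(env=os.environ) -> dict:
    """Resolve MCP server settings from the environment, falling back to
    the documented defaults when a variable is unset."""
    return {
        "transport": env.get("MCP_TRANSPORT", "stdio"),
        "host": env.get("MCP_HOST", "127.0.0.1"),
        "port": int(env.get("MCP_PORT", "8080")),
        "cache_dir": env.get("PULLDOWN_CACHE_DIR"),  # None means caching off
        "cache_ttl": int(env.get("PULLDOWN_CACHE_TTL", "3600")),
        "allow_private": env.get("PULLDOWN_ALLOW_PRIVATE", "0") == "1",
    }
```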

Detail Levels

| Level | Output | Best for |
| --- | --- | --- |
| minimal | Title + plain text | Lowest-token summarisation |
| readable (default) | Clean Markdown with links | RAG, reading, structured landing pages |
| full | Full-page Markdown including chrome | Pages without a clear article body |
| raw | Untouched HTML | Custom parsing downstream |
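
As a rough illustration of what the minimal level's title-plus-text shape implies, here is a stdlib-only extraction sketch (a stand-in for the idea, not pulldown's extractor):

```python
from html.parser import HTMLParser

class MinimalExtractor(HTMLParser):
    """Collect the page title and visible body text, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.parts = []
        self._mode = None  # "title", "skip", or None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._mode = "title"
        elif tag in ("script", "style"):
            self._mode = "skip"

    def handle_endtag(self, tag):
        if tag in ("title", "script", "style"):
            self._mode = None

    def handle_data(self, data):
        if self._mode == "skip":
            return
        if self._mode == "title":
            self.title += data
        elif data.strip():
            self.parts.append(data.strip())

def minimal(html: str) -> tuple[str, str]:
    """Return (title, plain text) for an HTML document."""
    p = MinimalExtractor()
    p.feed(html)
    return p.title.strip(), " ".join(p.parts)
```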

Security

pulldown refuses to fetch URLs that resolve to private, loopback, link-local, or cloud-metadata addresses by default. This prevents LLM-driven SSRF into internal services (e.g., AWS metadata at 169.254.169.254, Redis on localhost:6379). Override with allow_private_addresses=True if you understand the risk.

Responses larger than 10 MiB are rejected by default (configurable via the max_bytes parameter).

Only http and https schemes are accepted; file:, ftp:, etc. are rejected.
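
Taken together, these checks amount to scheme filtering plus resolved-address filtering. A simplified sketch of such a guard (not pulldown's implementation):

```python
import ipaddress
import socket
from urllib.parse import urlsplit

def check_url_safety(url: str, allow_private: bool = False) -> None:
    """Raise ValueError for non-http(s) schemes or hosts that resolve to
    private, loopback, link-local, or reserved addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    # Check every resolved address: a public hostname can be pinned to a
    # private IP, so validating the name alone is not enough.
    for info in socket.getaddrinfo(parts.hostname, None):
        addr = ipaddress.ip_address(info[4][0])
        if allow_private:
            continue
        if (addr.is_private or addr.is_loopback
                or addr.is_link_local or addr.is_reserved):
            # Covers e.g. the 169.254.169.254 metadata endpoint (link-local)
            # and localhost services such as Redis on 127.0.0.1:6379.
            raise ValueError(f"blocked address for {parts.hostname}: {addr}")
```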

License

MIT
