Fetch web pages as clean Markdown for LLM agents. HTTP-first, optional Chromium rendering, CLI + Python + MCP.

pulldown

Pull down web pages as clean Markdown for LLM agents.

  • HTTP-first with browser-like defaults
  • Optional Chromium rendering for JS-heavy pages
  • Four detail levels: minimal, readable, full, raw
  • Concurrent batch fetching with fetch_many()
  • Bounded site crawling with robots.txt support and per-domain politeness
  • Validator-based caching (ETag / Last-Modified) with atomic writes
  • SSRF guards: private/loopback/metadata addresses blocked by default
  • Response size caps and transient-error retries
  • CLI, Python API, and MCP server
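
The validator-based caching and atomic writes listed above can be illustrated with the standard library alone. This is a minimal sketch of the general technique, not pulldown's implementation; the helper names conditional_headers and atomic_write_text are hypothetical and do not come from the package.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build revalidation headers from validators stored with a cached response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def atomic_write_text(path: Path, text: str) -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file so a failed write leaves no debris behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With headers like these, a server that still has the same version of the page can answer 304 Not Modified, and the cached body is reused without downloading it again.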

Install

pip install pulldown                 # core
pip install 'pulldown[render]'       # + Playwright (Chromium rendering)
pip install 'pulldown[mcp]'          # + MCP server
pip install 'pulldown[all]'          # everything

For rendered pages, also run playwright install chromium once.

Quick Start

CLI

pulldown get https://example.com
pulldown get https://example.com --detail minimal
pulldown get https://example.com --render --scroll 3
pulldown crawl https://docs.example.com --max-pages 20 --delay-ms 200
pulldown bench https://example.com --runs 5
pulldown cache stats

Python

import asyncio
from pulldown import fetch, fetch_many, crawl, Detail, PageCache

async def main():
    # Single fetch
    result = await fetch("https://example.com", detail=Detail.readable)
    print(result.title, result.content)

    # Batch fetch with caching
    cache = PageCache(ttl=3600)
    results = await fetch_many(
        ["https://a.com", "https://b.com"],
        concurrency=5,
        cache=cache,
        retries=2,
    )

    # Crawl a docs site
    crawl_result = await crawl(
        "https://docs.example.com/",
        max_pages=50,
        max_depth=2,
        respect_robots=True,
        per_domain_delay_ms=200,
    )
    markdown = crawl_result.to_markdown()

asyncio.run(main())
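
To make the respect_robots and per_domain_delay_ms options more concrete, here is a rough sketch of what a robots.txt check and a per-host delay can look like using only the standard library. It is an illustration under stated assumptions, not pulldown's crawler: build_robots, DomainThrottle, the sample robots.txt text, and the "example-bot" user agent are all hypothetical.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class DomainThrottle:
    """Track the last request time per host and report how long to wait."""

    def __init__(self, delay_ms: int) -> None:
        self.delay_s = delay_ms / 1000
        self._last: dict[str, float] = {}

    def wait_time(self, url: str, now: float) -> float:
        """Return the seconds to sleep before requesting url at time now."""
        host = urlsplit(url).hostname or ""
        last = self._last.get(host)
        wait = 0.0 if last is None else max(0.0, last + self.delay_s - now)
        # Record when this request will actually be sent.
        self._last[host] = now + wait
        return wait
```

In an async crawler the caller would typically pass time.monotonic() as now and await asyncio.sleep() on the returned value, so that requests to different hosts are not held up by each other.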

MCP

Add the following to your MCP client config (e.g. Claude Desktop):

{
  "mcpServers": {
    "pulldown": {
      "command": "python",
      "args": ["-m", "pulldown.mcp_server"],
      "env": {
        "PULLDOWN_CACHE_DIR": "~/.cache/pulldown"
      }
    }
  }
}

Environment variables:

Variable                Default    Purpose
MCP_TRANSPORT           stdio      stdio or http
MCP_HOST                127.0.0.1  Bind address for HTTP transport
MCP_PORT                8080       Port for HTTP transport
PULLDOWN_CACHE_DIR      unset      Enable caching to this directory
PULLDOWN_CACHE_TTL      3600       Cache TTL in seconds
PULLDOWN_ALLOW_PRIVATE  0          Set to 1 to allow private addresses
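
As an illustration of how the variables and defaults in this table fit together, the following sketch reads them into a typed settings object. McpSettings and load_settings are hypothetical names invented for this example; how the MCP server actually parses its environment is not shown in this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class McpSettings:
    transport: str
    host: str
    port: int
    cache_dir: Optional[str]
    cache_ttl: int
    allow_private: bool


def load_settings(env: Mapping[str, str] = os.environ) -> McpSettings:
    """Read the documented variables, falling back to the documented defaults."""
    transport = env.get("MCP_TRANSPORT", "stdio")
    if transport not in ("stdio", "http"):
        raise ValueError(f"MCP_TRANSPORT must be 'stdio' or 'http', got {transport!r}")
    return McpSettings(
        transport=transport,
        host=env.get("MCP_HOST", "127.0.0.1"),
        port=int(env.get("MCP_PORT", "8080")),
        # An unset or empty cache directory means caching stays disabled.
        cache_dir=env.get("PULLDOWN_CACHE_DIR") or None,
        cache_ttl=int(env.get("PULLDOWN_CACHE_TTL", "3600")),
        # Assumption: only the exact value "1" opts in to private addresses.
        allow_private=env.get("PULLDOWN_ALLOW_PRIVATE", "0") == "1",
    )
```

Passing a plain dict instead of os.environ keeps the function easy to test, for example load_settings({"MCP_TRANSPORT": "http", "MCP_PORT": "9000"}).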

Detail Levels

Level     Output                           Best for
minimal   Title + plain text               Lowest-token summarisation
readable  Article Markdown with links      RAG, reading (default)
full      Full-page Markdown incl. chrome  Pages without clear article body
raw       Untouched HTML                   Custom parsing downstream

Security

pulldown refuses to fetch URLs that resolve to private, loopback, link-local, or cloud-metadata addresses by default. This prevents LLM-driven SSRF into internal services (e.g., AWS metadata at 169.254.169.254, Redis on localhost:6379). Override with allow_private_addresses=True if you understand the risk.
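
The kind of check described above can be sketched with the standard ipaddress and socket modules. This is an assumption-laden illustration of the general approach, not pulldown's guard: is_blocked_ip, resolve_host, and check_host are hypothetical names, and the exact set of address classes pulldown rejects may differ.

```python
import ipaddress
import socket
from typing import Callable, List


def is_blocked_ip(address: str) -> bool:
    """Return True for addresses that should not be reachable by default."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers the 169.254.169.254 metadata address
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolve_host(host: str, port: int) -> List[str]:
    """Resolve a hostname to the sorted, de-duplicated addresses it maps to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def check_host(
    host: str,
    port: int,
    resolver: Callable[[str, int], List[str]] = resolve_host,
) -> List[str]:
    """Resolve host and refuse it if any resolved address is blocked."""
    addresses = resolver(host, port)
    blocked = [addr for addr in addresses if is_blocked_ip(addr)]
    if blocked:
        raise PermissionError(f"{host} resolves to blocked address(es): {blocked}")
    return addresses
```

A check like this is only effective if the connection is then made to one of the validated addresses; resolving the name a second time at connect time leaves room for DNS rebinding.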

Responses above 10 MiB are rejected by default (max_bytes parameter).
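
A size cap of this kind is usually enforced while streaming rather than after the whole body is in memory. The sketch below shows the idea on a generic binary stream; read_capped and ResponseTooLarge are hypothetical names, and only the 10 MiB default and the max_bytes parameter name are taken from the text above.

```python
import io
from typing import BinaryIO

DEFAULT_MAX_BYTES = 10 * 1024 * 1024  # 10 MiB, the default stated above


class ResponseTooLarge(Exception):
    """Raised when a body exceeds the configured byte limit."""


def read_capped(
    stream: BinaryIO,
    max_bytes: int = DEFAULT_MAX_BYTES,
    chunk_size: int = 64 * 1024,
) -> bytes:
    """Read a stream in chunks, aborting as soon as the limit is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            # Stop early instead of buffering an arbitrarily large body.
            raise ResponseTooLarge(f"body exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

For example, read_capped(io.BytesIO(b"x" * 100), max_bytes=50) raises ResponseTooLarge, while a body of exactly max_bytes is accepted.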

Only http and https schemes are accepted; file:, ftp:, etc. are rejected.
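
A scheme allowlist like the one described can be expressed in a few lines with urllib.parse. This is a sketch of the general check, not the library's code; require_http_url is a hypothetical helper name.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = frozenset({"http", "https"})


def require_http_url(url: str) -> str:
    """Return url unchanged if it is an http(s) URL with a host, else raise."""
    parts = urlsplit(url)
    # urlsplit lowercases the scheme, so "HTTP://..." is also accepted.
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {parts.scheme or '(none)'}")
    if not parts.hostname:
        raise ValueError("URL has no host")
    return url
```

Using an allowlist rather than a blocklist means any scheme not explicitly named, including ones added to URL libraries later, is rejected by default.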

License

MIT

Download files

Source Distribution

pulldown-0.2.0.tar.gz (96.6 kB view details)

Uploaded Source

Built Distribution

pulldown-0.2.0-py3-none-any.whl (26.4 kB view details)

Uploaded Python 3

File details

Details for the file pulldown-0.2.0.tar.gz.

File metadata

  • Download URL: pulldown-0.2.0.tar.gz
  • Upload date:
  • Size: 96.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pulldown-0.2.0.tar.gz
Algorithm Hash digest
SHA256 ade82a5f72c68d3433d3c4992005e4b43c377f04a5b23a2b5fbb23b7e6e32eea
MD5 c1becd1dfb72906aaacff20387c64305
BLAKE2b-256 45a8f07ae3b7ab52d93f51ca748edb02286b98e23dc2aec6d5dfa708b1ef5671

Provenance

The following attestation bundles were made for pulldown-0.2.0.tar.gz:

Publisher: publish.yml on anthony-maio/pulldown

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pulldown-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: pulldown-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 26.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pulldown-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e42961757de947d60f150603c57e78871263bb93d1a66b0b1a70db7944d6fb1e
MD5 4281aafe17f1ca7701c3244a445cde30
BLAKE2b-256 23712643f74bacb159a2d533d36f267999cf2e740254d319805bfd4b1d68ef33

Provenance

The following attestation bundles were made for pulldown-0.2.0-py3-none-any.whl:

Publisher: publish.yml on anthony-maio/pulldown
