# pulldown
Pull down web pages as clean Markdown for LLM agents.
- HTTP-first with browser-like defaults
- Optional Chromium rendering for JS-heavy pages
- Four detail levels: `minimal`, `readable`, `full`, `raw`
- Core installs decode Brotli-compressed pages correctly
- Concurrent batch fetching with `fetch_many()`
- Bounded site crawling with `robots.txt` support and per-domain politeness
- Validator-based caching (ETag / Last-Modified) with atomic writes
- SSRF guards: private/loopback/metadata addresses blocked by default
- Response size caps and transient-error retries
- CLI, Python API, and MCP server
## Install
```shell
pip install pulldown            # core
pip install 'pulldown[render]'  # + Playwright (Chromium rendering)
pip install 'pulldown[mcp]'     # + MCP server
pip install 'pulldown[all]'     # everything
```
Core installs include Brotli support, so `br`-compressed HTML is decoded before `minimal`, `readable`, `full`, or `raw` processing. Core installs also include `lxml_html_clean`, avoiding the missing-helper import issue some agent sandboxes hit on older releases.

For rendered pages, also run `playwright install chromium` once.
## Quick Start
### CLI
```shell
pulldown get https://example.com
pulldown get https://example.com --detail minimal
pulldown get https://example.com --render --scroll 3
pulldown crawl https://docs.example.com --max-pages 20 --delay-ms 200
pulldown bench https://example.com --runs 5
pulldown cache stats
```
### Python
```python
import asyncio

from pulldown import fetch, fetch_many, crawl, Detail, PageCache

async def main():
    # Single fetch
    result = await fetch("https://example.com", detail=Detail.readable)
    print(result.title, result.content)

    # Batch fetch with caching
    cache = PageCache(ttl=3600)
    results = await fetch_many(
        ["https://a.com", "https://b.com"],
        concurrency=5,
        cache=cache,
        retries=2,
    )

    # Crawl a docs site
    crawl_result = await crawl(
        "https://docs.example.com/",
        max_pages=50,
        max_depth=2,
        respect_robots=True,
        per_domain_delay_ms=200,
    )
    markdown = crawl_result.to_markdown()

asyncio.run(main())
```
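The validator-based caching mentioned in the feature list relies on HTTP conditional requests. Here is an illustrative sketch of that mechanism using only the standard library; it is not pulldown's actual cache code, and `CachedEntry` and `conditional_headers` are hypothetical names used for the example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedEntry:
    """Hypothetical cached page together with its HTTP validators."""
    body: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None

def conditional_headers(entry: Optional[CachedEntry]) -> dict:
    """Build If-None-Match / If-Modified-Since headers from stored validators.

    A server that still holds the same representation answers 304 Not
    Modified, letting the client reuse the cached body instead of
    re-downloading it.
    """
    headers = {}
    if entry is not None:
        if entry.etag:
            headers["If-None-Match"] = entry.etag
        if entry.last_modified:
            headers["If-Modified-Since"] = entry.last_modified
    return headers

# First fetch: nothing cached yet, so no conditional headers are sent.
first = conditional_headers(None)

# Revalidation: both stored validators are echoed back to the server.
entry = CachedEntry(
    body="<html>…</html>",
    etag='"abc123"',
    last_modified="Tue, 01 Jan 2030 00:00:00 GMT",
)
revalidate = conditional_headers(entry)
```

On a 304 response a cache like `PageCache` can serve the stored body; on a 200 it stores the new body and validators atomically.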
### MCP
Add to your client config (e.g. Claude Desktop):
```json
{
  "mcpServers": {
    "pulldown": {
      "command": "python",
      "args": ["-m", "pulldown.mcp_server"],
      "env": {
        "PULLDOWN_CACHE_DIR": "~/.cache/pulldown"
      }
    }
  }
}
```
Environment variables:
| Variable | Default | Purpose |
|---|---|---|
| `MCP_TRANSPORT` | `stdio` | `stdio` or `http` |
| `MCP_HOST` | `127.0.0.1` | Bind address for HTTP transport |
| `MCP_PORT` | `8080` | Port for HTTP transport |
| `PULLDOWN_CACHE_DIR` | unset | Enable caching to this directory |
| `PULLDOWN_CACHE_TTL` | `3600` | Cache TTL in seconds |
| `PULLDOWN_ALLOW_PRIVATE` | `0` | Set to `1` to allow private addresses |
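As a config sketch combining the variables above with the module path from the client config (assuming the server reads these variables at startup), the HTTP transport with caching could be launched like this:

```shell
MCP_TRANSPORT=http \
MCP_HOST=127.0.0.1 \
MCP_PORT=8080 \
PULLDOWN_CACHE_DIR=~/.cache/pulldown \
python -m pulldown.mcp_server
```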
## Detail Levels
| Level | Output | Best for |
|---|---|---|
| `minimal` | Title + plain text | Lowest-token summarisation |
| `readable` | Clean Markdown with links | RAG, reading, structured landing pages (default) |
| `full` | Full-page Markdown incl. chrome | Pages without clear article body |
| `raw` | Untouched HTML | Custom parsing downstream |
## Security
pulldown refuses to fetch URLs that resolve to private, loopback, link-local, or cloud-metadata addresses by default. This prevents LLM-driven SSRF into internal services (e.g., AWS metadata at `169.254.169.254`, Redis on `localhost:6379`). Override with `allow_private_addresses=True` if you understand the risk.

Responses above 10 MiB are rejected by default (`max_bytes` parameter). Only `http` and `https` schemes are accepted; `file:`, `ftp:`, etc. are rejected.
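The address guard can be pictured as follows; this is an illustrative sketch built on the standard library's `ipaddress` module, not pulldown's actual check, and `is_blocked_address` is a hypothetical name:

```python
import ipaddress

def is_blocked_address(ip: str) -> bool:
    """Return True for addresses an SSRF guard refuses by default:
    private ranges, loopback, and link-local (the last covers the
    cloud-metadata endpoint 169.254.169.254)."""
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

A real guard must resolve the hostname first and check every resolved address, since `http://internal.example/` can point anywhere; checking only the literal URL string is not enough.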
## License
MIT