
http2md

A CLI tool to fetch web pages and convert them to Markdown using Playwright.

Installation

pip install http2md
http2md install

Docker

You can use http2md via Docker without installing Python or system dependencies.

  1. Build the image (first time only):

    docker-compose build
    
  2. Run the crawler:

    # Crawl and save to ./out_docker/
    docker-compose run --rm http2md https://example.com --outdir out
    
    • The ./out_docker directory on your host is mounted to /app/out inside the container.
    • Command arguments (--depth, --tqdm, etc.) are passed directly to http2md.

Usage

# Basic usage (converts to Markdown)
http2md https://example.com

# Write the Markdown output to a file
http2md https://example.com -o output.md

# Output raw HTML
http2md https://example.com --html

# Wait for a specific element before extracting
http2md https://spa-site.com --wait-for ".content"

# Increase timeout for slow sites (default: 30000ms)
http2md https://slow-site.com --timeout 60000

# Use specific wait strategy
http2md https://fast-site.com --wait-until load

# Synchronous crawl: site to depth 2, save to ./docs/
http2md https://react.dev  --depth 2 --outdir ./docs

# Async crawl with concurrency increased to 10
http2md async https://react.dev --jobs 10 --outdir ./docs

CLI Options

usage: http2md [-h] [--html]
               [--wait-until {auto,load,domcontentloaded,networkidle,commit}]
               [--timeout TIMEOUT] [--wait-for WAIT_FOR] [-o OUT]
               [url]

Convert HTTP content to Markdown. Supports:
- Headings, lists, code blocks, tables
- Links (static and dynamic)
- Images (with alt text)
- Formatting (bold, italic, strikethrough)

positional arguments:
  url                   URL to process

options:
  -h, --help            show this help message and exit
  --html                Output raw HTML instead of Markdown
  --wait-until          Wait strategy (default: auto)
  --timeout TIMEOUT     Timeout in milliseconds (default: 30000)
  --wait-for WAIT_FOR   CSS selector to wait for before extracting content
  -o, --out OUT         Output file path

Wait Strategies

| Strategy | Description |
|---|---|
| auto | Combined: tries networkidle, falls back on timeout (default) |
| load | Wait for the load event |
| domcontentloaded | Wait for the DOM to be ready |
| networkidle | Wait for no network activity (500ms) |
| commit | Return immediately after response headers |
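The auto strategy's try-then-fall-back behavior can be sketched in plain Python. This is a hypothetical illustration, not http2md's actual code: `goto` stands in for Playwright's `page.goto`, and falling back to `load` is an assumption about what "falls back on timeout" means.

```python
def fetch_auto(goto, url, timeout=30000):
    """Hypothetical sketch of the 'auto' strategy: prefer networkidle,
    fall back to a laxer wait if the network never settles in time."""
    try:
        return goto(url, wait_until="networkidle", timeout=timeout)
    except TimeoutError:
        # Pages that poll or stream never go network-idle; settle for 'load'.
        return goto(url, wait_until="load", timeout=timeout)

# Simulate a page whose network never goes idle:
def fake_goto(url, wait_until, timeout):
    if wait_until == "networkidle":
        raise TimeoutError("network never idle")
    return f"{url} loaded via {wait_until}"

print(fetch_auto(fake_goto, "https://example.com"))
```

Note that real Playwright raises its own `TimeoutError` type (`playwright.sync_api.TimeoutError`), not the builtin used here.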

Browser Presets

For sites with anti-bot protection (Cloudflare, etc.), use --preset:

# Sites with basic bot detection
http2md https://protected-site.com --preset stealth

# Cloudflare-protected sites (adds extra wait for JS challenge)
http2md https://cloudflare-site.com --preset cloudflare

# Crawling with preset
http2md https://protected-site.com --depth 2 --preset cloudflare --outdir ./out

# Async crawling with preset
http2md async https://protected-site.com --depth 2 --preset cloudflare -j 5

| Preset | User-Agent | Viewport | Extra Wait | Use Case |
|---|---|---|---|---|
| stealth | Chrome/Windows | 1920×1080 | 0s | Basic bot detection |
| cloudflare | Chrome/Windows | 1920×1080 | 5s | Cloudflare JS challenge |
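Conceptually, each preset just bundles a few browser-context settings. A hedged sketch of what the table amounts to (the key names here are assumptions for illustration, not http2md's real configuration):

```python
# Hypothetical representation of the preset table above.
PRESETS = {
    "stealth": {
        "user_agent": "Chrome on Windows",
        "viewport": {"width": 1920, "height": 1080},
        "extra_wait_s": 0,
    },
    "cloudflare": {
        "user_agent": "Chrome on Windows",
        "viewport": {"width": 1920, "height": 1080},
        "extra_wait_s": 5,  # leave time for the JS challenge to resolve
    },
}
```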

Python API

You can also use http2md directly from Python:

from http2md.crawler import fetch_html
from markdownify import markdownify as md

# Fetch raw HTML
html = fetch_html("https://example.com")

# Convert to Markdown
markdown = md(html)
print(markdown)

# With options
html = fetch_html(
    "https://spa-site.com",
    wait_until="networkidle",  # or "auto", "load", "domcontentloaded"
    timeout=60000,             # 60 seconds
    wait_for=".content"        # CSS selector to wait for
)

Site Crawling

Crawl entire websites to a specified depth:

# Crawl site to depth 2, save to ./docs/
http2md https://react.dev  --depth 2 --outdir ./docs

# Only crawl /api/* pages
http2md https://react.dev  --depth 3 --include "/api/*"

# Exclude images and static files
http2md https://react.dev  --depth 2 --exclude "*.png" --exclude "*.css"

# Quiet mode (no progress output)
http2md https://react.dev  --depth 1 --outdir ./out -q

Parallel Crawling (Fast Mode)

Use the async command to enable parallel downloading (up to 5-10x faster):

# Run with 5 concurrent jobs (default)
http2md async https://react.dev  --depth 2 --outdir ./docs

# Increase concurrency to 10
http2md async https://react.dev --jobs 10 --outdir ./docs
  • Note: This mode uses asyncio and reuses the browser instance, making it much faster but potentially less stable on extremely complex sites.
  • Standard mode (http2md <url>) remains synchronous and uses a fresh browser for every page (slower but maximum isolation/reliability).

Why use Async Mode?

The async implementation (crawler_async.py) is designed for performance:

  1. Architecture: Uses asyncio and playwright.async_api.
  2. Resource Efficiency: Reuses a single BrowserContext across multiple pages instead of launching a new browser for every URL.
  3. Concurrency: Uses a worker pool to fetch multiple pages in parallel (controlled by --jobs).
  4. Speed: Can be 5-10x faster than the synchronous mode, especially on larger sites.
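The worker-pool pattern described above can be sketched with plain asyncio, independent of http2md's internals. A semaphore caps how many fetches run at once; the fetch itself is stubbed here (a real implementation would go through `playwright.async_api`):

```python
import asyncio

async def crawl_parallel(urls, jobs=5):
    """Sketch of a bounded worker pool: at most `jobs` fetches
    run concurrently (not http2md's actual code)."""
    sem = asyncio.Semaphore(jobs)

    async def worker(url):
        async with sem:
            # Stand-in for a real page fetch.
            await asyncio.sleep(0.01)
            return url, "<html>stub</html>"

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(worker(u) for u in urls))

urls = [f"https://example.com/page/{i}" for i in range(12)]
results = asyncio.run(crawl_parallel(urls, jobs=4))
print(len(results))  # 12
```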

Crawling Options

| Option | Description |
|---|---|
| --depth N | Crawl depth (0=single page, 1=links from page, etc.) |
| --outdir DIR | Output directory for crawled pages |
| --include PATTERN | Include URLs matching glob pattern (repeatable) |
| --exclude PATTERN | Exclude URLs matching glob pattern (repeatable) |
| --no-same-domain | Allow following links to other domains |
| --tqdm | Use tqdm progress bar |
| -q, --quiet | Suppress progress output |
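Include/exclude filtering of this kind can be illustrated with the standard library's fnmatch; `url_allowed` is a hypothetical helper for intuition, not http2md's actual implementation:

```python
from fnmatch import fnmatch

def url_allowed(url, include=None, exclude=None):
    """Hypothetical filter: a URL must match at least one include
    pattern (if any are given) and no exclude pattern."""
    if include and not any(fnmatch(url, pat) for pat in include):
        return False
    return not any(fnmatch(url, pat) for pat in (exclude or []))

print(url_allowed("https://react.dev/api/hooks", include=["*/api/*"]))  # True
print(url_allowed("https://react.dev/logo.png", exclude=["*.png"]))     # False
```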

Advanced Link Extraction

http2md automatically handles Single Page Applications (SPAs) and dynamic content:

  1. JavaScript Execution: It executes JavaScript to render the page fully.
  2. Auto-Scrolling: It automatically attempts to scroll to the bottom of the page to trigger lazy-loading of content.
  3. Dynamic Links: It extracts links from the rendered DOM (using page.evaluate), not just the static HTML. This ensures links generated by JavaScript are found.

Note: Sites using non-standard navigation (e.g., onclick on div elements instead of <a> tags) may still have limited crawlability.
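For intuition, extracting `<a href>` links from already-rendered HTML and resolving them against the page URL looks roughly like this standard-library sketch (http2md itself pulls links from the live DOM via `page.evaluate` rather than parsing HTML text):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

parser = LinkExtractor("https://example.com/docs/")
parser.feed('<a href="intro.html">Intro</a> <a href="/api/ref">API</a>')
print(parser.links)
```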

Python API for Crawling

from http2md.crawler_site import crawl_site

def on_progress(url, status, current, total, html=None, markdown=None):
    print(f"[{current}/{total}] {status}: {url}")
    if html:
        print(f"  Downloaded {len(html)} bytes")

results = crawl_site(
    "https://react.dev",
    depth=2,
    outdir="./output",
    callback=on_progress,
    include=["*/api/*"],
    exclude=["*.png"]
)

Using tqdm for Progress

from http2md.crawler_site import crawl_site
from tqdm import tqdm

pbar = tqdm(unit="pages")

def tqdm_callback(url, status, current, total, html=None, markdown=None):
    pbar.total = total
    if status == "fetching":
        pbar.set_description(f"Fetching {url[:50]}")
    elif status == "done" or status.startswith("skipped"):
        pbar.update(1)
    pbar.refresh()

crawl_site(
    "https://docs.example.com",
    depth=2,
    callback=tqdm_callback
)
pbar.close()

Python API (Async)

For maximum performance in your own scripts, use crawl_site_async:

import asyncio
from http2md.crawler_async import crawl_site_async

async def main():
    results = await crawl_site_async(
        "https://react.dev",
        depth=2,
        jobs=10,  # 10 concurrent requests
        outdir="./output_async"
    )
    print(f"Crawled {len(results)} pages")

if __name__ == "__main__":
    asyncio.run(main())
