# http2md
A CLI tool to fetch web pages and convert them to Markdown using Playwright.
## Installation

```bash
pip install http2md
http2md install
```
## Docker

You can use http2md via Docker without installing Python or system dependencies.

- Build the image (first time only):

  ```bash
  docker-compose build
  ```

- Run the crawler:

  ```bash
  # Crawl and save to ./out_docker/
  docker-compose run --rm http2md https://example.com --outdir out
  ```

- The `./out_docker` directory on your host is mounted to `/app/out` inside the container.
- Command arguments (`--depth`, `--tqdm`, etc.) are passed directly to `http2md`.
## Usage

```bash
# Basic usage (convert to Markdown, print to stdout)
http2md https://example.com

# Write the Markdown to a file
http2md https://example.com -o output.md

# Output raw HTML
http2md https://example.com --html

# Wait for a specific element before extracting
http2md https://spa-site.com --wait-for ".content"

# Increase timeout for slow sites (default: 30000 ms)
http2md https://slow-site.com --timeout 60000

# Use a specific wait strategy
http2md https://fast-site.com --wait-until load

# Sync: crawl a site to depth 2 and save to ./docs/
http2md https://react.dev --depth 2 --outdir ./docs

# Async: increase concurrency to 10
http2md async https://react.dev --jobs 10 --outdir ./docs
```
## CLI Options

```text
usage: http2md [-h] [--html]
               [--wait-until {auto,load,domcontentloaded,networkidle,commit}]
               [--timeout TIMEOUT] [--wait-for WAIT_FOR] [-o OUT]
               [url]

Convert HTTP content to Markdown. Supports:
- Headings, lists, code blocks, tables
- Links (static and dynamic)
- Images (with alt text)
- Formatting (bold, italic, strikethrough)

positional arguments:
  url                  URL to process

options:
  -h, --help           show this help message and exit
  --html               Output raw HTML instead of Markdown
  --wait-until         Wait strategy (default: auto)
  --timeout TIMEOUT    Timeout in milliseconds (default: 30000)
  --wait-for WAIT_FOR  CSS selector to wait for before extracting content
  -o, --out OUT        Output file path
```
## Wait Strategies

| Strategy | Description |
|---|---|
| `auto` | Combined: tries `networkidle`, falls back on timeout (default) |
| `load` | Wait for the `load` event |
| `domcontentloaded` | Wait for the DOM to be ready |
| `networkidle` | Wait for no network activity (500 ms) |
| `commit` | Return immediately after response headers |
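The try-`networkidle`-then-fall-back behavior of the `auto` strategy can be approximated with plain Playwright. The following is a minimal, illustrative sketch of that pattern, not http2md's actual implementation:

```python
# Sketch of an "auto"-style wait: prefer networkidle, but settle for whatever
# has rendered when the timeout hits. Illustrative only, not http2md's code.
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def fetch_with_auto_wait(url: str, timeout: int = 30000) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        try:
            # Best case: the page goes network-idle within the timeout.
            page.goto(url, wait_until="networkidle", timeout=timeout)
        except PlaywrightTimeout:
            # Fallback: keep whatever content has loaded so far.
            pass
        html = page.content()
        browser.close()
        return html
```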
## Python API

You can also use http2md directly from Python:

```python
from http2md.crawler import fetch_html
from markdownify import markdownify as md

# Fetch raw HTML
html = fetch_html("https://example.com")

# Convert to Markdown
markdown = md(html)
print(markdown)

# With options
html = fetch_html(
    "https://spa-site.com",
    wait_until="networkidle",  # or "auto", "load", "domcontentloaded"
    timeout=60000,             # 60 seconds
    wait_for=".content",       # CSS selector to wait for
)
```
## Site Crawling

Crawl entire websites to a specified depth:

```bash
# Crawl a site to depth 2 and save to ./docs/
http2md https://react.dev --depth 2 --outdir ./docs

# Only crawl /api/* pages
http2md https://react.dev --depth 3 --include "/api/*"

# Exclude images and static files
http2md https://react.dev --depth 2 --exclude "*.png" --exclude "*.css"

# Quiet mode (no progress output)
http2md https://react.dev --depth 1 --outdir ./out -q
```
## Parallel Crawling (Fast Mode)

Use the `async` command to enable parallel downloading (up to 5-10x faster):

```bash
# Run with 5 concurrent jobs (default)
http2md async https://react.dev --depth 2 --outdir ./docs

# Increase concurrency to 10
http2md async https://react.dev --jobs 10 --outdir ./docs
```

- Note: This mode uses `asyncio` and reuses the browser instance, making it much faster but potentially less stable on extremely complex sites.
- Standard mode (`http2md <url>`) remains synchronous and uses a fresh browser for every page (slower but maximum isolation/reliability).
## Why use Async Mode?

The async implementation (`crawler_async.py`) is designed for performance:

- Architecture: Uses `asyncio` and `playwright.async_api`.
- Resource Efficiency: Reuses a single `BrowserContext` across multiple pages instead of launching a new browser for every URL.
- Concurrency: Uses a worker pool to fetch multiple pages in parallel (controlled by `--jobs`); see the sketch after this list.
- Speed: Can be 5-10x faster than the synchronous mode, especially on larger sites.
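A rough sketch of that worker-pool pattern, using `asyncio` and `playwright.async_api` with one shared browser context. This is illustrative only and is not the actual `crawler_async.py`; the helper name and structure here are assumptions:

```python
# Sketch: a pool of workers pulls URLs from a queue and shares one BrowserContext,
# so the browser is launched once instead of once per page. Hypothetical helper,
# not http2md's real implementation.
import asyncio
from playwright.async_api import async_playwright

async def fetch_all(urls: list[str], jobs: int = 5) -> dict[str, str]:
    results: dict[str, str] = {}
    queue: asyncio.Queue[str] = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()  # reused by every worker

        async def worker() -> None:
            while True:
                try:
                    url = queue.get_nowait()
                except asyncio.QueueEmpty:
                    return
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="networkidle")
                    results[url] = await page.content()
                finally:
                    await page.close()

        # --jobs corresponds to how many of these workers run concurrently.
        await asyncio.gather(*(worker() for _ in range(jobs)))
        await browser.close()

    return results
```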
## Crawling Options

| Option | Description |
|---|---|
| `--depth N` | Crawl depth (0 = single page, 1 = links from the page, etc.) |
| `--outdir DIR` | Output directory for crawled pages |
| `--include PATTERN` | Include URLs matching glob pattern (repeatable) |
| `--exclude PATTERN` | Exclude URLs matching glob pattern (repeatable) |
| `--no-same-domain` | Allow following links to other domains |
| `--tqdm` | Use a tqdm progress bar |
| `-q, --quiet` | Suppress progress output |
## Advanced Link Extraction

http2md automatically handles Single Page Applications (SPAs) and dynamic content:

- JavaScript Execution: It executes JavaScript to render the page fully.
- Auto-Scrolling: It automatically attempts to scroll to the bottom of the page to trigger lazy-loading of content.
- Dynamic Links: It extracts links from the rendered DOM (using `page.evaluate`), not just the static HTML. This ensures links generated by JavaScript are found; a sketch of this approach follows below.

Note: Sites using non-standard navigation (e.g., `onclick` handlers on `div` elements instead of `<a>` tags) may still have limited crawlability.
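A minimal sketch of that approach with plain Playwright (render, auto-scroll, then pull hrefs out of the live DOM). This is illustrative only; the helper name is hypothetical and this is not http2md's actual extraction code:

```python
# Sketch: render the page, scroll to trigger lazy loading, then collect links
# from the rendered DOM with page.evaluate. Hypothetical helper, not http2md's source.
from playwright.sync_api import sync_playwright

def extract_rendered_links(url: str) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        # Scroll to the bottom so lazy-loaded sections get a chance to render.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1000)

        # Read hrefs from the rendered DOM, which includes JS-generated <a> tags.
        links = page.evaluate(
            "() => Array.from(document.querySelectorAll('a[href]'), a => a.href)"
        )
        browser.close()
        return links
```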
## Python API for Crawling

```python
from http2md.crawler_site import crawl_site

def on_progress(url, status, current, total, html=None, markdown=None):
    print(f"[{current}/{total}] {status}: {url}")
    if html:
        print(f"  Downloaded {len(html)} bytes")

results = crawl_site(
    "https://react.dev",
    depth=2,
    outdir="./output",
    callback=on_progress,
    include=["*/api/*"],
    exclude=["*.png"],
)
```
## Using tqdm for Progress

```python
from http2md.crawler_site import crawl_site
from tqdm import tqdm

pbar = tqdm(unit="pages")

def tqdm_callback(url, status, current, total, html=None, markdown=None):
    pbar.total = total
    if status == "fetching":
        pbar.set_description(f"Fetching {url[:50]}")
    elif status == "done" or status.startswith("skipped"):
        pbar.update(1)
    pbar.refresh()

crawl_site(
    "https://docs.example.com",
    depth=2,
    callback=tqdm_callback,
)

pbar.close()
```
## Python API (Async)

For maximum performance in your own scripts, use `crawl_site_async`:

```python
import asyncio
from http2md.crawler_async import crawl_site_async

async def main():
    results = await crawl_site_async(
        "https://react.dev",
        depth=2,
        jobs=10,  # 10 concurrent requests
        outdir="./output_async",
    )
    print(f"Crawled {len(results)} pages")

if __name__ == "__main__":
    asyncio.run(main())
```
## Download files
### Source Distribution
Details for the file `http2md-0.9.tar.gz`.

File metadata:

- Download URL: http2md-0.9.tar.gz
- Upload date:
- Size: 14.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `6e82450d9feda23d761e9218e3b48338fde9a8ce29c850e4638ff03acb2eedb1` |
| MD5 | `7c3e3c12c5ef448a63d109158adc9329` |
| BLAKE2b-256 | `5f973e14c7422207c95c2ed5e516e4c95c2a38d7f9e9a878684ee98f94f93d03` |
### Built Distribution

Details for the file `http2md-0.9-py3-none-any.whl`.

File metadata:

- Download URL: http2md-0.9-py3-none-any.whl
- Upload date:
- Size: 14.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9ac2bf3518f6cd55208e9a1623260b04a6169a20262d5b351bd1532dacf430df` |
| MD5 | `dca843232383ed0165b26dc4f05f177c` |
| BLAKE2b-256 | `b7ffe4b3904370b683574f8fa9c0b778b58a2229927d490537f79b9997929561` |