html to markdown
HtmlQuill
Convert HTML or a URL to Markdown.
Installation
pip install htmlquill
Optional Playwright backend:
pip install "htmlquill[browser]"
playwright install chromium
CLI usage
# Auto-save using the first Markdown heading
htmlquill convert https://example.com/article
# Manual output path
htmlquill convert https://example.com/article -o article.md
# Preview generated filename without saving
htmlquill convert https://example.com/article --filename-only
# Print Markdown content without saving
htmlquill convert https://example.com/article --stdout
# Save generated filename to a target directory
htmlquill convert https://example.com/article --output-dir notes
# Limit generated filename stem length
htmlquill convert https://example.com/article --filename-max-length 60
# Inspect effective config
htmlquill config show https://example.com
# Initialize config and inspect paths
htmlquill config init
htmlquill config path
# Run diagnostics
htmlquill doctor
# Count generated Markdown structure
htmlquill analyse example.md
# Preview Markdown in the terminal
htmlquill preview example.md
htmlquill SOURCE is retained as shorthand for htmlquill convert SOURCE; it now follows the same auto-save behavior unless --stdout is used.
Command overview
- htmlquill convert SOURCE [options]
- htmlquill config path|show|init|validate
- htmlquill auth path|show|init
- htmlquill doctor [--url URL] [--fetch] [--json] [--strict]
- htmlquill analyse SOURCE (alias: htmlquill analyze SOURCE)
- htmlquill preview SOURCE
Convert options
| Option | Description |
|---|---|
| SOURCE | URL (https://...), HTML file path, or - for stdin |
| -o, --output PATH | Manual output file path. Overrides generated filename. |
| --stdout | Print converted Markdown to stdout and do not save. |
| --filename-only | Print resolved output filename and do not save. |
| --filename-max-length N | Max generated filename stem length, excluding .md. Default: 80. |
| --output-dir DIR | Directory for generated output files. Default: current directory. |
| --force | Overwrite generated output target instead of adding a numeric suffix. |
| --timeout | HTTP timeout override in seconds |
| --user-agent | Custom HTTP User-Agent header |
| --browser | Fetching mode override: auto, requests, playwright, chromium |
| --config PATH | Use this config file |
| --no-config | Disable config loading |
| --auth-file PATH | Use this auth file |
| --no-auth | Disable auth loading |
| --profile NAME | Force a named auth profile |
| --print-config | Deprecated; use htmlquill config show URL |
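The filename-related options (--filename-max-length, --force, and the numeric suffix) can be pictured with a minimal sketch. This is not HtmlQuill's implementation: the function names stem_from_markdown and resolve_target, the slug rules, the "untitled" fallback, and the "-2", "-3" suffix format are all assumptions made for illustration.

```python
import re
from pathlib import Path


def stem_from_markdown(markdown: str, max_length: int = 80, fallback: str = "untitled") -> str:
    """Derive a filename stem from the first Markdown heading (hypothetical rules)."""
    match = re.search(r"^#{1,6}\s+(.+?)\s*$", markdown, flags=re.MULTILINE)
    title = match.group(1) if match else fallback
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    stem = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Truncate the stem only; the .md extension is not counted.
    return stem[:max_length].rstrip("-") or fallback


def resolve_target(output_dir: Path, stem: str, force: bool = False) -> Path:
    """Pick an output path, adding a numeric suffix unless force is set."""
    target = output_dir / f"{stem}.md"
    if force or not target.exists():
        return target
    counter = 2  # assumed suffix scheme: name-2.md, name-3.md, ...
    while (output_dir / f"{stem}-{counter}.md").exists():
        counter += 1
    return output_dir / f"{stem}-{counter}.md"
```

Use htmlquill convert SOURCE --filename-only to see the name the tool itself would choose.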
Browser mode details
- auto (default): tries requests first; on HTTP 403 or a detected challenge page, falls back to system Chromium, then Playwright.
- requests: plain HTTP via requests.
- chromium: uses system Chromium via subprocess.
- playwright: uses Playwright Chromium (optional dependency).
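The auto fallback chain can be sketched roughly as follows. This is an illustration under stated assumptions, not HtmlQuill's code: FetchResult, looks_blocked, and fetch_auto are hypothetical names, and the backends are stand-in callables so the example runs without network access or a browser.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Two of the challenge markers from the example config in this README.
CHALLENGE_MARKERS = (
    "Performing security verification",
    "blocked by network security",
)


@dataclass
class FetchResult:
    status: int
    html: str


def looks_blocked(result: FetchResult, markers: Iterable[str] = CHALLENGE_MARKERS) -> bool:
    """Treat HTTP 403 or any known marker in the body as a blocked response."""
    return result.status == 403 or any(m in result.html for m in markers)


def fetch_auto(url: str, backends: list[tuple[str, Callable[[str], FetchResult]]]) -> tuple[str, FetchResult]:
    """Try each backend in order and return the first response that is not blocked."""
    last = None
    for name, fetch in backends:
        result = fetch(url)
        if not looks_blocked(result):
            return name, result
        last = (name, result)
    if last is None:
        raise ValueError("no backends configured")
    return last  # every backend was blocked; hand back the final attempt


# Stand-in backends: the plain HTTP fetch is blocked, the browser fetch succeeds.
def fake_requests(url: str) -> FetchResult:
    return FetchResult(403, "")


def fake_chromium(url: str) -> FetchResult:
    return FetchResult(200, "<h1>ok</h1>")


name, result = fetch_auto(
    "https://example.com",
    [("requests", fake_requests), ("chromium", fake_chromium)],
)
```

Here name ends up as "chromium" because the first backend returned 403. Pass --browser requests, chromium, or playwright to skip the chain and force a single mode.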
Configuration files
htmlquill resolves config file paths in this order:
- --config PATH
- HTMLQUILL_CONFIG
- $XDG_CONFIG_HOME/htmlquill/config.toml
- ~/.config/htmlquill/config.toml
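As a rough sketch, the lookup order above can be expressed as an ordered list of candidate paths. The function name candidate_config_paths is hypothetical, and the README does not say whether HtmlQuill takes the first candidate that is set or the first that exists on disk, so this example only builds the list.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def candidate_config_paths(
    cli_path: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> list[Path]:
    """List config file candidates from highest to lowest priority."""
    candidates = []
    if cli_path:  # --config PATH
        candidates.append(Path(cli_path))
    if env.get("HTMLQUILL_CONFIG"):  # environment variable
        candidates.append(Path(env["HTMLQUILL_CONFIG"]))
    if env.get("XDG_CONFIG_HOME"):  # XDG location, only when the variable is set
        candidates.append(Path(env["XDG_CONFIG_HOME"]) / "htmlquill" / "config.toml")
    candidates.append(Path("~/.config/htmlquill/config.toml"))  # final default
    return [p.expanduser() for p in candidates]
```

Run htmlquill config path to see which file the tool actually resolves.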
Example config.toml:
version = 1
[defaults]
adapter = "html"
browser = "auto"
timeout = 30.0
fail_on_challenge = true
fallback_on_challenge = true
[paths]
auth_file = "~/.config/htmlquill/auth.json"
[challenge]
markers = [
"Performing security verification",
"verifies you are not a bot",
"You've been blocked by network security",
"blocked by network security",
"If you think you've been blocked by mistake, file a ticket",
]
[sites."medium.com"]
browser = "chromium"
timeout = 60.0
auth = "medium"
Authentication
HtmlQuill supports browser-state auth profiles through auth.json.
Use this when a site works in an already-authenticated browser session and you want HtmlQuill to reuse that state.
Auth file resolution order:
- --auth-file PATH
- HTMLQUILL_AUTH
- [paths].auth_file from config
- $XDG_CONFIG_HOME/htmlquill/auth.json or ~/.config/htmlquill/auth.json
Example auth.json:
{
  "version": 1,
  "profiles": {
    "medium": {
      "kind": "browser_state",
      "playwright_storage_state": "~/.config/htmlquill/auth/medium.storage-state.json",
      "chromium_user_data_dir": "~/.config/htmlquill/chromium/medium"
    }
  }
}
Security notes:
- Do not commit auth files, storage-state files, or browser profile directories.
- Recommended auth file permissions: chmod 600 ~/.config/htmlquill/auth.json.
- Recommended browser profile directory permissions: chmod 700 ~/.config/htmlquill/chromium/medium.
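The recommended permissions can also be checked or applied from Python. The helpers below, is_private and restrict, are hypothetical and not part of HtmlQuill; they rely on POSIX permission bits, so they are not meaningful on Windows.

```python
import stat
from pathlib import Path


def is_private(path: Path) -> bool:
    """Return True when group and others have no permission bits on path."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def restrict(path: Path) -> None:
    """Apply 700 to directories and 600 to files, matching the notes above."""
    path.chmod(0o700 if path.is_dir() else 0o600)
```

A typical use would be to call is_private on the auth file and each profile directory, and restrict on any path that fails the check.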
HtmlQuill no longer ships a Reddit API/OAuth adapter. Reddit URLs are processed through the normal HTML fetch path, the same as other URLs. If Reddit returns a network-security or login interstitial, use a browser-based fetch profile, retry later, or export/save the page manually. htmlquill auth login reddit is intentionally not available.
Library usage
from htmlquill import html_to_markdown, url_to_markdown
markdown = html_to_markdown("<h1>Hello</h1><p>World</p>")
markdown = url_to_markdown("https://example.com")
# Optional keyword controls
markdown = url_to_markdown(
    "https://example.com",
    browser="requests",
    config=True,
    auth=False,
)
Development
pip install -e ".[dev]"
pytest -q
ruff check .
File details
Details for the file htmlquill-0.1.0.tar.gz.
File metadata
- Download URL: htmlquill-0.1.0.tar.gz
- Upload date:
- Size: 58.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86c851aa1de57e2b1dcb911988d9737a6eb3daeecfc3617d1614132d74a04bae |
| MD5 | 6194781e7319d40c39878fbb8a5d2f41 |
| BLAKE2b-256 | 11b89a6180150011fd608bedbb3908bcf5989ed340eff41caf0bcd3356c5e6e5 |
File details
Details for the file htmlquill-0.1.0-py3-none-any.whl.
File metadata
- Download URL: htmlquill-0.1.0-py3-none-any.whl
- Upload date:
- Size: 41.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e9879c308278263082e064925852152df7a9c7889dffe297c02c98e02e1d38ae |
| MD5 | ea361d86506dd3aee9d7d0fff94fe8a5 |
| BLAKE2b-256 | 5dd6ee5ed281fafc1eefec68a73950e4977a05a208674d4fbae6f828c8c9868b |