HtmlQuill

Convert HTML or a URL to Markdown.

Installation

pip install htmlquill

Optional Playwright backend:

pip install "htmlquill[browser]"
playwright install chromium

CLI usage

# Auto-save using the first Markdown heading
htmlquill convert https://example.com/article

# Manual output path
htmlquill convert https://example.com/article -o article.md

# Preview generated filename without saving
htmlquill convert https://example.com/article --filename-only

# Print Markdown content without saving
htmlquill convert https://example.com/article --stdout

# Save generated filename to a target directory
htmlquill convert https://example.com/article --output-dir notes

# Limit generated filename stem length
htmlquill convert https://example.com/article --filename-max-length 60

# Inspect effective config
htmlquill config show https://example.com

# Initialize config and inspect paths
htmlquill config init
htmlquill config path

# Run diagnostics
htmlquill doctor

# Count generated Markdown structure
htmlquill analyse example.md

# Preview Markdown in the terminal
htmlquill preview example.md

htmlquill SOURCE remains available as shorthand for htmlquill convert SOURCE. It follows the same auto-save behavior: output is saved to a generated filename unless --stdout is used.

Command overview

  • htmlquill convert SOURCE [options]
  • htmlquill config path|show|init|validate
  • htmlquill auth path|show|init
  • htmlquill doctor [--url URL] [--fetch] [--json] [--strict]
  • htmlquill analyse SOURCE (alias: htmlquill analyze SOURCE)
  • htmlquill preview SOURCE

Convert options

Option                    Description
SOURCE                    URL (https://...), HTML file path, or - for stdin.
-o, --output PATH         Manual output file path. Overrides the generated filename.
--stdout                  Print converted Markdown to stdout and do not save.
--filename-only           Print the resolved output filename and do not save.
--filename-max-length N   Max generated filename stem length, excluding .md. Default: 80.
--output-dir DIR          Directory for generated output files. Default: current directory.
--force                   Overwrite the generated output target instead of adding a numeric suffix.
--timeout                 HTTP timeout override, in seconds.
--user-agent              Custom HTTP User-Agent header.
--browser                 Fetching mode override: auto, requests, playwright, chromium.
--config PATH             Use this config file.
--no-config               Disable config loading.
--auth-file PATH          Use this auth file.
--no-auth                 Disable auth loading.
--profile NAME            Force a named auth profile.
--print-config            Deprecated; use htmlquill config show URL.
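
The auto-save options above (first-heading filename, --filename-max-length, --output-dir, --force) can be pictured with a rough sketch. This is not HtmlQuill's implementation: the slug rules, the "untitled" fallback, and the "-1", "-2" suffix format are assumptions made for illustration, and the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def first_heading(markdown: str) -> Optional[str]:
    """Return the text of the first ATX heading ("# ..."), or None."""
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            return match.group(1)
    return None


def filename_stem(markdown: str, max_length: int = 80) -> str:
    """Build a filename stem from the first heading (assumed slug rules)."""
    title = first_heading(markdown) or "untitled"
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "untitled"
    # The limit applies to the stem only, excluding the .md extension.
    return slug[:max_length].rstrip("-")


def output_path(markdown: str, output_dir: Path,
                max_length: int = 80, force: bool = False) -> Path:
    """Pick the target file; add a numeric suffix unless force is set."""
    stem = filename_stem(markdown, max_length)
    candidate = output_dir / f"{stem}.md"
    if force:
        return candidate  # overwrite the existing target
    counter = 1
    while candidate.exists():
        candidate = output_dir / f"{stem}-{counter}.md"  # assumed suffix format
        counter += 1
    return candidate
```

With these assumptions, a document starting with "# Hello, World!" maps to hello-world.md, and a second conversion into the same directory maps to hello-world-1.md unless force is set.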

Browser mode details

  • auto (default): tries requests first; on an HTTP 403 response or a detected challenge page, it falls back to system Chromium, then to Playwright.
  • requests: plain HTTP via requests.
  • chromium: uses system Chromium via subprocess.
  • playwright: uses Playwright Chromium (optional dependency).
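
The auto mode described above amounts to an ordered fallback chain. The following is a minimal sketch of that idea, not HtmlQuill's code: the fetchers are hypothetical callables returning (status, body), and the two marker strings are borrowed from the example config in this README.

```python
from typing import Callable, Iterable, List, Tuple

# A fetcher takes a URL and returns (HTTP status, body). Hypothetical stand-in
# for the requests, system Chromium, and Playwright backends.
Fetcher = Callable[[str], Tuple[int, str]]

CHALLENGE_MARKERS = (
    "Performing security verification",
    "verifies you are not a bot",
)


def looks_blocked(status: int, body: str,
                  markers: Iterable[str] = CHALLENGE_MARKERS) -> bool:
    """Treat HTTP 403 or any challenge marker in the body as blocked."""
    return status == 403 or any(marker in body for marker in markers)


def fetch_auto(url: str, backends: List[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each backend in order; return (backend name, body) of the first success."""
    for name, fetch in backends:
        status, body = fetch(url)
        if not looks_blocked(status, body):
            return name, body
    raise RuntimeError(f"all backends were blocked for {url}")


# Fake backends, in the order auto mode is described as using them.
backends = [
    ("requests", lambda url: (403, "Forbidden")),
    ("chromium", lambda url: (200, "Performing security verification")),
    ("playwright", lambda url: (200, "<h1>Article</h1>")),
]
print(fetch_auto("https://example.com/article", backends))
```

Here the first two fake backends are rejected (a 403, then a challenge page), so the result comes from the third.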

Configuration files

htmlquill resolves config file paths in this order:

  1. --config PATH
  2. HTMLQUILL_CONFIG
  3. $XDG_CONFIG_HOME/htmlquill/config.toml
  4. ~/.config/htmlquill/config.toml

Example config.toml:

version = 1

[defaults]
adapter = "html"
browser = "auto"
timeout = 30.0
fail_on_challenge = true
fallback_on_challenge = true

[paths]
auth_file = "~/.config/htmlquill/auth.json"

[challenge]
markers = [
  "Performing security verification",
  "verifies you are not a bot",
  "You've been blocked by network security",
  "blocked by network security",
  "If you think you've been blocked by mistake, file a ticket",
]

[sites."medium.com"]
browser = "chromium"
timeout = 60.0
auth = "medium"

Authentication

HtmlQuill supports browser-state auth profiles through auth.json. Use this when a site works in an already-authenticated browser session and you want HtmlQuill to reuse that state.

Auth file resolution order:

  1. --auth-file PATH
  2. HTMLQUILL_AUTH
  3. [paths].auth_file from config
  4. $XDG_CONFIG_HOME/htmlquill/auth.json or ~/.config/htmlquill/auth.json

Example auth.json:

{
  "version": 1,
  "profiles": {
    "medium": {
      "kind": "browser_state",
      "playwright_storage_state": "~/.config/htmlquill/auth/medium.storage-state.json",
      "chromium_user_data_dir": "~/.config/htmlquill/chromium/medium"
    }
  }
}
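
As a rough sketch of how a profile in this file could be read, the snippet below parses the example with the standard json module and expands the ~ paths. The validation steps and the load_profile helper are hypothetical; they are not taken from HtmlQuill.

```python
import json
from pathlib import Path

AUTH_TEXT = """
{
  "version": 1,
  "profiles": {
    "medium": {
      "kind": "browser_state",
      "playwright_storage_state": "~/.config/htmlquill/auth/medium.storage-state.json",
      "chromium_user_data_dir": "~/.config/htmlquill/chromium/medium"
    }
  }
}
"""


def load_profile(auth_text: str, name: str) -> dict:
    """Return the expanded paths for one browser-state profile (a sketch)."""
    data = json.loads(auth_text)
    if data.get("version") != 1:
        raise ValueError("unsupported auth file version")
    profile = data["profiles"][name]  # KeyError if the profile is missing
    if profile.get("kind") != "browser_state":
        raise ValueError(f"unexpected profile kind: {profile.get('kind')!r}")
    return {
        "storage_state": Path(profile["playwright_storage_state"]).expanduser(),
        "user_data_dir": Path(profile["chromium_user_data_dir"]).expanduser(),
    }


profile = load_profile(AUTH_TEXT, "medium")
print(profile["storage_state"].name)
```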

Security notes:

  • Do not commit auth files, storage-state files, or browser profile directories.
  • Recommended permissions: chmod 600 ~/.config/htmlquill/auth.json.
  • Recommended browser profile directory permissions: chmod 700 ~/.config/htmlquill/chromium/medium.
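
The recommended permissions can also be applied and checked from Python on POSIX systems. This is a small sketch using only the standard library; lock_down and is_private are hypothetical helper names.

```python
import os
import stat
from pathlib import Path


def lock_down(auth_file: Path, profile_dir: Path) -> None:
    """Apply owner-only permissions: 600 for the file, 700 for the directory."""
    os.chmod(auth_file, 0o600)
    os.chmod(profile_dir, 0o700)


def is_private(path: Path) -> bool:
    """True when group and others have no permission bits set."""
    return (path.stat().st_mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

A check like is_private could run before a fetch that uses stored browser state, to catch an auth file that has become readable by other users.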

Reddit

HtmlQuill no longer ships a Reddit API/OAuth adapter. Reddit URLs are processed through the normal HTML fetch path, the same as other URLs. If Reddit returns a network-security or login interstitial, use a browser-based fetch profile, retry later, or export/save the page manually. htmlquill auth login reddit is intentionally not available.

Library usage

from htmlquill import html_to_markdown, url_to_markdown

markdown = html_to_markdown("<h1>Hello</h1><p>World</p>")

markdown = url_to_markdown("https://example.com")

# Optional keyword arguments: fetch backend, config loading, auth loading
markdown = url_to_markdown(
    "https://example.com",
    browser="requests",
    config=True,
    auth=False,
)

Development

pip install -e ".[dev]"
pytest -q
ruff check .
