
Convert HTML to markdown


html_to_markdown

This library is a refactored and modernized fork of markdownify, supporting Python 3.9 and above.

Differences from markdownify

  • The refactored codebase uses a strict functional approach - no classes are involved.
  • The code is fully typed, passes MyPy in strict mode, and ships a py.typed file.
  • The convert_to_markdown function accepts a pre-configured BeautifulSoup instance in place of an HTML string.
  • Releases follow standard semver. Version v1.0.0 was branched from markdownify's v0.13.1, after which the version numbers of the two projects are no longer aligned.

Installation

pip install html_to_markdown

Usage

Convert an HTML string to Markdown:

from html_to_markdown import convert_to_markdown

convert_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>')  # > '**Yay** [GitHub](http://github.com)'

Or pass a pre-configured instance of BeautifulSoup:

from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown

soup = BeautifulSoup('<b>Yay</b> <a href="http://github.com">GitHub</a>', 'lxml')  # lxml requires an extra dependency.

convert_to_markdown(soup)  # > '**Yay** [GitHub](http://github.com)'

Options

The convert_to_markdown function accepts the following kwargs:

  • autolinks (bool): Automatically convert valid URLs into Markdown links. Defaults to True.
  • bullets (str): A string of characters to use for bullet points in lists. Defaults to '*+-'.
  • code_language (str): Default language identifier for fenced code blocks. Defaults to an empty string.
  • code_language_callback (Callable[[Any], str] | None): Function to dynamically determine the language for code blocks.
  • convert (Iterable[str] | None): A list of tag names to convert to Markdown. If None, all supported tags are converted.
  • default_title (bool): Use the default title when converting certain elements (e.g., links). Defaults to False.
  • escape_asterisks (bool): Escape asterisks (*) to prevent unintended Markdown formatting. Defaults to True.
  • escape_misc (bool): Escape miscellaneous characters to prevent conflicts in Markdown. Defaults to True.
  • escape_underscores (bool): Escape underscores (_) to prevent unintended italic formatting. Defaults to True.
  • heading_style (Literal["underlined", "atx", "atx_closed"]): The style to use for Markdown headings. Defaults to "underlined".
  • keep_inline_images_in (Iterable[str] | None): Tags in which inline images should be preserved. Defaults to None.
  • newline_style (Literal["spaces", "backslash"]): Style for handling newlines in text content. Defaults to "spaces".
  • strip (Iterable[str] | None): Tags to strip from the output. Defaults to None.
  • strong_em_symbol (Literal["*", "_"]): Symbol to use for strong/emphasized text. Defaults to "*".
  • sub_symbol (str): Custom symbol for subscript text. Defaults to an empty string.
  • sup_symbol (str): Custom symbol for superscript text. Defaults to an empty string.
  • wrap (bool): Wrap text to the specified width. Defaults to False.
  • wrap_width (int): The number of characters at which to wrap text. Defaults to 80.
  • convert_as_inline (bool): Treat the content as inline elements (no block elements like paragraphs). Defaults to False.
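
As an illustration of how several of these kwargs might be combined, here is a minimal sketch. It assumes that code_language_callback receives the BeautifulSoup element of the code block (as in markdownify) and that the source HTML marks the language with a class such as language-python. The names language_from_class, OPTIONS, and to_markdown are hypothetical and not part of the library.

```python
from typing import Any


def language_from_class(el: Any) -> str:
    """Return the language named by a 'language-xxx' class, or '' if none.

    Assumption: `el` offers a dict-like .get("class"), as a BeautifulSoup
    Tag does.
    """
    classes = el.get("class") or []
    if isinstance(classes, str):
        # Some parsers expose the class attribute as one space-separated string.
        classes = classes.split()
    prefix = "language-"
    for name in classes:
        if name.startswith(prefix):
            return name[len(prefix):]
    return ""


# Keyword arguments taken from the option list above.
OPTIONS = {
    "heading_style": "atx",
    "bullets": "-",
    "newline_style": "backslash",
    "code_language_callback": language_from_class,
}


def to_markdown(html: str) -> str:
    # Imported lazily so this sketch can be loaded without the package installed.
    from html_to_markdown import convert_to_markdown

    return convert_to_markdown(html, **OPTIONS)
```

Keeping the options in one dictionary makes it easy to apply the same settings to every document. If the class sits on a nested code element rather than on the element passed to the callback, the lookup would need to be adjusted accordingly.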

CLI

For compatibility with the original markdownify, a CLI is provided. Use html_to_markdown example.html > example.md or pipe input from stdin:

cat example.html | html_to_markdown > example.md

Use html_to_markdown -h to see all available options. They are the same as listed above and take the same arguments.
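
The CLI can also be driven from a script. The sketch below is one possible wrapper and not part of the library: it runs the html_to_markdown command shown above on a file and writes the captured output to a second file. The helper name convert_file and its command parameter are hypothetical; no option flags are passed, since their exact spelling should be checked with html_to_markdown -h.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def convert_file(
    src: Path,
    dst: Path,
    command: Sequence[str] = ("html_to_markdown",),
) -> None:
    """Run `<command> <src>` and write its standard output to `dst`."""
    result = subprocess.run(
        [*command, str(src)],
        capture_output=True,  # collect stdout instead of using shell redirection
        text=True,            # decode the output as text
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    dst.write_text(result.stdout, encoding="utf-8")
```

Calling convert_file(Path("example.html"), Path("example.md")) would then mirror html_to_markdown example.html > example.md, assuming the command is on the PATH.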
