Convert HTML to markdown.

Installation

pip install markdownify

Usage

Convert some HTML to Markdown:

from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>')  # > '**Yay** [GitHub](http://github.com)'

Specify tags to exclude:

from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a'])  # > '**Yay** GitHub'

...or specify the tags you want to include:

from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b'])  # > '**Yay** GitHub'

Options

Markdownify supports the following options:

strip

A list of tags to strip. This option can’t be used with the convert option.

convert

A list of tags to convert. This option can’t be used with the strip option.

autolinks

A boolean indicating whether the “automatic link” style should be used when an <a> tag’s contents match its href. Defaults to True.

default_title

A boolean to enable setting the title of a link to its href, if no title is given. Defaults to False.

heading_style

Defines how headings should be converted. Accepted values are ATX, ATX_CLOSED, SETEXT, and UNDERLINED (which is an alias for SETEXT). Defaults to UNDERLINED.

bullets

An iterable (string, list, or tuple) of bullet styles to be used. If the iterable only contains one item, it will be used regardless of how deeply lists are nested. Otherwise, the bullet will alternate based on nesting level. Defaults to '*+-'.

strong_em_symbol

In Markdown, both * and _ can be used to mark strong or emphasized text. Choose between them with the values ASTERISK (default) or UNDERSCORE.

sub_symbol, sup_symbol

Define the chars that surround <sub> and <sup> text. Defaults to an empty string, because this is non-standard behavior. Could be something like ~ and ^ to result in ~sub~ and ^sup^. If the value starts with < and ends with >, it is treated as an HTML tag and a / is inserted after the < in the string used after the text; this allows specifying <sub> to use raw HTML in the output for subscripts, for example.

newline_style

Defines how line breaks (<br>) are marked in Markdown. The default value, SPACES, uses the usual two trailing spaces followed by a newline, while BACKSLASH converts a line break to a backslash followed by a newline. While the latter convention is non-standard, it is commonly preferred and supported by many interpreters.

code_language

Defines the language that should be assumed for all <pre> sections. Useful if all code on a page is in the same programming language and should be annotated with ```python or similar. Defaults to '' (empty string) and can be any string.

code_language_callback

When the HTML code contains pre tags that in some way provide the code language, for example as a class, this callback can be used to extract the language from the tag and prefix it to the converted pre tag. The callback receives a single argument, a BeautifulSoup object, and returns a string containing the code language, or None. An example that uses the class name as the code language could be:

def callback(el):
    return el['class'][0] if el.has_attr('class') else None

Defaults to None.

escape_asterisks

If set to False, do not escape * to \* in text. Defaults to True.

escape_underscores

If set to False, do not escape _ to \_ in text. Defaults to True.

escape_misc

If set to False, do not escape miscellaneous punctuation characters that sometimes have Markdown significance in text. Defaults to True.

keep_inline_images_in

Images are converted to their alt-text when the images are located inside headlines or table cells. If some inline images should be converted to markdown images instead, this option can be set to a list of parent tags that should be allowed to contain inline images, for example ['td']. Defaults to an empty list.

wrap, wrap_width

If wrap is set to True, all text paragraphs are wrapped at wrap_width characters. wrap defaults to False and wrap_width defaults to 80. Use with newline_style=BACKSLASH to keep line breaks in paragraphs.

Options may be specified as kwargs to the markdownify function, or as a nested Options class in MarkdownConverter subclasses.

Converting BeautifulSoup objects

from markdownify import MarkdownConverter

# Create shorthand method for conversion
def md(soup, **options):
    return MarkdownConverter(**options).convert_soup(soup)

Creating Custom Converters

If you have a special use case that calls for a special conversion, you can always inherit from MarkdownConverter and override the method you want to change. The function that handles an HTML tag named abc is called convert_abc(self, el, text, convert_as_inline) and returns a string containing the converted HTML tag. The MarkdownConverter object will handle the conversion based on the function names:

from markdownify import MarkdownConverter

class ImageBlockConverter(MarkdownConverter):
    """
    Create a custom MarkdownConverter that adds two newlines after an image
    """
    def convert_img(self, el, text, convert_as_inline):
        return super().convert_img(el, text, convert_as_inline) + '\n\n'

# Create shorthand method for conversion
def md(html, **options):
    return ImageBlockConverter(**options).convert(html)

from markdownify import MarkdownConverter

class IgnoreParagraphsConverter(MarkdownConverter):
    """
    Create a custom MarkdownConverter that ignores paragraphs
    """
    def convert_p(self, el, text, convert_as_inline):
        return ''

# Create shorthand method for conversion
def md(html, **options):
    return IgnoreParagraphsConverter(**options).convert(html)

Command Line Interface

Use markdownify example.html > example.md or pipe input from stdin (cat example.html | markdownify > example.md). Call markdownify -h to see all available options. They are the same as listed above and take the same arguments.

Development

To run the tests and the linter, run pip install tox once, then tox.
