A Python library for converting HTML to Markdown.

When crawling document-oriented websites such as news sites, Q&A communities like Quora, or even GitHub, you may want to save an article. A .md file is a good storage format, and some articles are generated from .md files in the first place. ohHTML2Markdown converts an HTML fragment or a complete HTML file into a human-friendly .md file. HTML tags are sometimes used carelessly in the wild, but ohHTML2Markdown does its best to keep the output quality high.

Currently supported HTML tags: h1~h6, p, a, img, del, b, strong, i, em, hr, br, ul, ol, table, blockquote, code, pre, span, title, time, iframe, section, div, html, body, head


ohHTML2Markdown is built mainly on the BeautifulSoup library, specifically its html.parser parser and the .descendants attribute. The parser turns an HTML file into a tree structure. On any node of that tree, .contents gives access to the direct children (the sub-trees one level down), while .descendants returns a generator over every node nested beneath the current (sub)tree.
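As a small illustration of the two BeautifulSoup attributes mentioned above, the following sketch parses a made-up fragment with html.parser and compares direct children with the full set of descendants. The sample HTML and variable names are assumptions for the example only; this is not code from ohHTML2Markdown itself.

```python
from bs4 import BeautifulSoup

# Made-up sample fragment, used only for illustration.
html = "<div><h1>Title</h1><hr><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents lists only the direct children of the tag.
child_names = [child.name for child in div.contents]

# .descendants is a generator; materialize it to count the nodes.
# It walks depth-first and yields both tags and text nodes beneath the tag.
descendants = list(div.descendants)

# A tag with nothing inside it yields no descendants at all.
hr_descendants = list(soup.hr.descendants)

print(child_names)          # direct children of <div>
print(len(descendants))     # every tag and string nested under <div>
print(len(hr_descendants))  # <hr> is empty
```

Because .descendants is a generator, it has no len(); converting it to a list (or testing it with any()) is one way to find out whether a node has anything beneath it.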

By counting the nodes that .descendants yields, we can tell whether a sub-tree has any child nodes (note that this sub-tree may itself be a descendant of another tree higher up). Then, based on the type of the current tag, the library decides whether to keep traversing depth-first or to convert the node into its Markdown equivalent.
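The recurse-or-convert decision can be sketched roughly as follows. To stay dependency-free, this sketch walks an xml.etree.ElementTree tree from the standard library rather than a BeautifulSoup tree, so it only accepts well-formed markup. The function name to_markdown and the two tag tables are hypothetical, and only a handful of tags are handled; ohHTML2Markdown's actual implementation may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag tables for illustration; not ohHTML2Markdown's own.
INLINE_MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*", "del": "~~"}
HEADING_LEVELS = {f"h{n}": n for n in range(1, 7)}


def to_markdown(node):
    """Recursively convert an ElementTree element to a Markdown string."""
    if len(node) == 0:
        # Leaf: no child elements, so its text can be converted directly.
        inner = node.text or ""
    else:
        # Branch: go one level deeper first, keeping the text between children.
        inner = node.text or ""
        for child in node:
            inner += to_markdown(child) + (child.tail or "")

    # With the inner content resolved, apply this tag's Markdown semantics.
    if node.tag in INLINE_MARKERS:
        marker = INLINE_MARKERS[node.tag]
        return f"{marker}{inner}{marker}"
    if node.tag in HEADING_LEVELS:
        return "#" * HEADING_LEVELS[node.tag] + " " + inner.strip() + "\n\n"
    if node.tag == "p":
        return inner.strip() + "\n\n"
    if node.tag == "a":
        return f"[{inner}]({node.get('href', '')})"
    # Containers such as div or section add nothing of their own.
    return inner


root = ET.fromstring(
    '<div><h1>Title</h1>'
    '<p>Some <b>bold</b> text and <a href="https://example.com">a link</a>.</p></div>'
)
markdown = to_markdown(root)
print(markdown)
```

Container tags such as div simply pass their children's output through, which is why the traversal has to go deeper before anything can be emitted for them.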

More about this library: 发布自己的Python包 - ohHTML2Markdown ("Publishing Your Own Python Package - ohHTML2Markdown"). The post is in Chinese.

Installation

pip install ohHTML2Markdown

Usage

import ohHtml2Markdown as h2m

# Read from a string
result = h2m.Parser("<h1>h1</h1>", h2m.Parser.STRING).convert()

# Or read from a file
result = h2m.Parser("test/test.html", h2m.Parser.FILE).convert()

with open("test/", 'w', encoding='utf-8') as file:
    # Assumes convert() returns the Markdown as a string
    file.write(result)

