extract markdown flavored text from html
html5lib_to_markdown
This package offers a way to convert HTML to the Markdown format.
This package is currently targeting a SUBSET of full HTML->Markdown conversion to address internal needs. More functionality will be added as needed. Pull requests are welcome.
There are many packages that convert HTML to Markdown. Why create another package to do this task?
- Licensing. This package is available under the permissive MIT license. There are no GPL restrictions, which affect about a third of the other libraries that perform this task.
- Tests. This package ships with many tests to ensure things keep working as desired. Several existing libraries do not have tests or adequate test coverage.
- Customized Feature: Use HTML for certain elements instead of Markdown syntax. Sometimes we WANT to keep certain tags as HTML, and not turn them into Markdown syntax.
- Customized Feature: Clean up common HTML issues and make pretty Markdown. This library doesn't just create Markdown, but optimized/pretty Markdown. It attempts to optimize away extra newlines and spaces, creating a concise and readable Markdown version.
- Customized Feature: Ignore unwanted HTML tags and attributes.
- Customized Feature: Idempotent when possible. This is more of a goal than a guarantee, but whenever possible, text that has been processed through this library should not change if it is re-processed through this library. In other words, we're aiming for this:
  as_markdown = to_markdown(html) == to_markdown(as_markdown) == to_markdown(to_markdown(html)) == to_markdown(to_markdown(as_markdown))
  This can't be guaranteed in all situations because of how Markdown and HTML work, but it is a goal. This library should not add artifacts.
  At a minimum, our goal is this:
  as_markdown = to_markdown(html) == to_markdown(to_html(as_markdown))
  as_html = to_html(as_markdown) == to_html(to_markdown(as_html))
- Customized feature: A departure from the core Markdown specification was needed for a few elements:
  - Render img tags, not Markdown format
  - Render a tags, not Markdown format
- Customized feature: Python2 and Python3 compatibility. This shouldn't be a feature, but it is. Some excellent packages in this space have already stopped supporting Python2. This package aims to keep Python2 support around a bit longer than the official cutoff date, because legacy systems exist.
- Core Implementation Detail. This package is implemented as an html5lib "tree adapter", which means it can potentially be layered into many html5lib processing routines. Other packages use BeautifulSoup, lxml or HTMLParser. These other projects are all great, but require re-processing if you are already doing things with html5lib.
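
The idempotency goal in the list above can be written down as a small check. This is a minimal sketch and not part of this package: `check_round_trip` is a hypothetical helper, and `toy_to_markdown` / `toy_to_html` are toy stand-ins that only understand `<em>` tags. In practice you would pass in whichever real converters you are using.

```python
import re


def check_round_trip(html, to_markdown, to_html):
    """Evaluate the idempotency goals for one HTML input.

    `to_markdown` and `to_html` are caller-supplied converter callables;
    nothing here depends on a specific library.
    """
    as_markdown = to_markdown(html)
    as_html = to_html(as_markdown)
    return {
        # Re-processing Markdown output should not change it.
        "markdown_stable": to_markdown(as_markdown) == as_markdown,
        # Markdown -> HTML -> Markdown should give the same Markdown.
        "markdown_round_trip": to_markdown(as_html) == as_markdown,
        # HTML -> Markdown -> HTML should give the same HTML.
        "html_round_trip": to_html(to_markdown(as_html)) == as_html,
    }


# Toy stand-ins (hypothetical): they only handle <em>...</em> and *...*.
def toy_to_markdown(text):
    return text.replace("<em>", "*").replace("</em>", "*")


def toy_to_html(text):
    return re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)


results = check_round_trip("<em>hi</em> there", toy_to_markdown, toy_to_html)
```

Returning one boolean per goal, rather than a single pass/fail, makes it easier to see which direction of the round trip introduced an artifact.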
Unsupported Features
Angled links are not currently supported, for example:
<http://example.com>
They are not compatible with the html5lib parser, and trying to support them will require a lot of work.
Pretty Markdown?
What is pretty Markdown?
- There is a max of 2 newlines (1 blank line) between elements.
- blank lines are dropped to the lowest allowable nesting of blockquotes or lists
- whitespace is shown via HTML rendering rules
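
The first and third rules can be illustrated with two regular expressions. This is only a sketch of the rules as stated above, not the package's own implementation; the function names are hypothetical, and the blockquote/list nesting rule is not covered.

```python
import re


def collapse_blank_lines(markdown):
    """Limit runs of newlines to at most two (one blank line)."""
    return re.sub(r"\n{3,}", "\n\n", markdown)


def collapse_inline_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into a single space,
    roughly as a browser does when rendering inline text."""
    return re.sub(r"[ \t\r\n\f]+", " ", text)
```

The whitespace pattern lists the ASCII whitespace characters explicitly instead of using `\s`, so that a non-breaking space is left alone, as it would be in rendered HTML.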
Other Libraries
The interface was inspired by the bleach user-input sanitization library, which relies on html5lib.
If you just need a standard and pure "HTML to Markdown" converter, I recommend the following libraries:
- antimarkdown http://github.com/Crossway/antimarkdown/
  - built on lxml
  - MIT license
- markdownify http://github.com/matthewwithanm/python-markdownify
  - built on BeautifulSoup
  - MIT license
License & Copyright
This package is available under the MIT License. The code and tests are: Copyright 2019 Jonathan Vanasco (jonathan@findmeon.com)
Additional code and tests are isolated in the tests_working directory (and soon to be tests), with copyright attributed to the antimarkdown and markdownify projects. Both are used under the MIT License and credited in the source.
Environment Variables
- MD_DEBUG_TOKENS - will use string representation for tokens (human readable!) instead of optimizing with ints
- MD_DEBUG_STACKS - will print() the tokens during
- MD_DEBUG_STACKS_SIMPLE - will print() the tokens in a simplified form
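
One way to switch on a debug flag from Python is sketched below. It rests on assumptions this README does not confirm: that any non-empty value such as "1" enables the flag, that the variables are read when the package is imported (hence setting them first), and that the import name matches the package name.

```python
import os

# Assumption: any non-empty value such as "1" enables the flag.
os.environ["MD_DEBUG_TOKENS"] = "1"

try:
    # Assumption: the import name matches the package name, and the
    # environment is read at import time, so the flag is set beforehand.
    import html5lib_to_markdown  # noqa: F401
except ImportError:
    html5lib_to_markdown = None  # package not installed in this environment
```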
TODO
See TODO.txt