A Python package to extract the main content from HTML documents
HTML Content Extractor
This Python package provides a function to extract the "main content" from HTML documents.
The main content is chosen by an algorithm that favors the deepest parent element containing the most h1, h2, h3 and p tags.
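To make the selection rule concrete, here is a minimal sketch of one way such a heuristic could work. It is not the package's actual implementation. It assumes that "most tags" means the number of h1, h2, h3 and p elements that are direct children of a parent, and that depth is only used to break ties. The names `Node`, `TreeBuilder` and `find_main_node` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser

# Tags that count toward a parent's score (taken from the description above).
CONTENT_TAGS = {"h1", "h2", "h3", "p"}
# Void elements never have children, so they must not become the open element.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare-bones element node (hypothetical helper for this sketch)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.depth = 0 if parent is None else parent.depth + 1


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def find_main_node(html):
    """Return the node with the most direct content-tag children.

    Ties are broken in favor of the deeper node. Returns None when the
    document contains no h1, h2, h3 or p elements at all.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    best, best_key = None, (0, -1)
    stack = [builder.root]
    while stack:
        node = stack.pop()
        count = sum(1 for child in node.children if child.tag in CONTENT_TAGS)
        key = (count, node.depth)
        if count > 0 and key > best_key:
            best, best_key = node, key
        stack.extend(node.children)
    return best
```

Under these assumptions, a page whose navigation holds a single paragraph and whose article holds a heading and two paragraphs would resolve to the article element, because it has the higher count. Counting all descendants instead of direct children is an equally plausible reading of the description and would give different results on nested layouts.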
Installation
You can install this package via pip:
$ pip install html-content-extractor
Usage
>>> from html_content_extractor import extract_content
>>> html = "<div><h1>An HTML Page</h1><p>This is some HTML content.</p></div>"
>>> content = extract_content(html, format='plaintext')
>>> content
'An HTML Page\n\nThis is some HTML content.'
>>> markdown = extract_content(html, format='markdown')
>>> markdown
'# An HTML Page\n\nThis is some HTML content.'
Hashes for html_content_extractor-0.0.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1f31a04a293a0600ed06753399f3267d488ecc24cad6f06e328189901bef9821
MD5 | 406bf3cc5edee47d885889fa50e8ed22
BLAKE2b-256 | fc29ffbe2ae1024a72a1599e0b050390cfb04518c03c61ca229221279d829d6f
Hashes for html_content_extractor-0.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9ced23f77bdf0e8d6917415a5a03c505ead700f55cc7bf1e801525fa293fdc84
MD5 | cc7818d541743861c0c0d663486406e9
BLAKE2b-256 | 736c8039dec5861af227c12468769cd80082566d4c3f1deb99f586ce360cb654