A Python package to extract the main content from HTML documents
HTML Content Extractor
This Python package provides a function to extract the "main content" from HTML documents.
The main content is chosen by an algorithm that favors the deepest parent element containing the most h1, h2, h3 and p tags.
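To make the heuristic concrete, here is a minimal sketch of one way such a scoring pass could work. It is not the package's actual implementation: the names `ParentScorer` and `pick_main_parent` are hypothetical, and counting only direct children and breaking ties by depth are assumptions about what "deepest parent with the most tags" means.

```python
from html.parser import HTMLParser

CONTENT_TAGS = {"h1", "h2", "h3", "p"}
# Void elements never get a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Node:
    def __init__(self, tag, depth):
        self.tag = tag
        self.depth = depth
        self.score = 0  # number of direct h1/h2/h3/p children


class ParentScorer(HTMLParser):
    """Hypothetical helper: scores each element by its direct content-tag children."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements
        self.nodes = []  # every non-void element seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.stack and tag in CONTENT_TAGS:
            self.stack[-1].score += 1
        node = _Node(tag, depth=len(self.stack))
        self.stack.append(node)
        self.nodes.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; stray end tags are ignored.
        # Note: unlike a browser, this does not auto-close an unclosed <p>.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def pick_main_parent(html):
    """Return (tag, depth, score) of the best-scoring parent, or None."""
    parser = ParentScorer()
    parser.feed(html)
    parser.close()
    candidates = [n for n in parser.nodes if n.score > 0]
    if not candidates:
        return None
    # Highest score wins; on a tie, the deeper element is preferred (assumption).
    best = max(candidates, key=lambda n: (n.score, n.depth))
    return best.tag, best.depth, best.score
```

Under these assumptions, a page whose `<article>` holds a heading and two paragraphs would outrank a `<nav>` holding a single paragraph, because the article has more matching direct children.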
Installation
You can install this package via pip:
$ pip install html-content-extractor
Usage
>>> from html_content_extractor import extract_content
>>> html = "<div><h1>An HTML Page</h1><p>This is some HTML content.</p></div>"
>>> content = extract_content(html, format='plaintext')
>>> print(content)
"An HTML Page\n\nThis is some HTML content."
>>> markdown = extract_content(html, format='markdown')
>>> print(markdown)
"# An HTML Page\n\nThis is some HTML content."
File details
Details for the file html_content_extractor-0.0.3.tar.gz.
File metadata
- Download URL: html_content_extractor-0.0.3.tar.gz
- Upload date:
- Size: 8.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1f31a04a293a0600ed06753399f3267d488ecc24cad6f06e328189901bef9821
MD5 | 406bf3cc5edee47d885889fa50e8ed22
BLAKE2b-256 | fc29ffbe2ae1024a72a1599e0b050390cfb04518c03c61ca229221279d829d6f
File details
Details for the file html_content_extractor-0.0.3-py3-none-any.whl.
File metadata
- Download URL: html_content_extractor-0.0.3-py3-none-any.whl
- Upload date:
- Size: 12.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9ced23f77bdf0e8d6917415a5a03c505ead700f55cc7bf1e801525fa293fdc84
MD5 | cc7818d541743861c0c0d663486406e9
BLAKE2b-256 | 736c8039dec5861af227c12468769cd80082566d4c3f1deb99f586ce360cb654
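The SHA256 digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library; the function name `sha256_matches` is hypothetical, and the file path is assumed to point at a copy of the distribution you downloaded yourself.

```python
import hashlib


def sha256_matches(path, expected_hex, chunk_size=65536):
    """Return True if the file at `path` has the given SHA256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For example, `sha256_matches("html_content_extractor-0.0.3.tar.gz", "1f31a04a293a0600ed06753399f3267d488ecc24cad6f06e328189901bef9821")` should return True for an unmodified copy of the source distribution, assuming the file sits in the current directory.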