Parse an HTML document and extract its main content, similar to the reader mode in web browsers.
HTML Reader Mode
A Python library to extract the main content from an HTML document, similar to the "Reader Mode" feature found in web browsers. It filters out navigation, ads, sidebars, and other non-content elements.
Installation
pip install html-reader-mode
Usage
from html_reader_mode import HTMLReaderMode
html_content = """
<html>
<body>
<div id="header">Header content</div>
<div id="content">
<h1>Article Title</h1>
<p>This is the main content of the article.</p>
</div>
<div id="footer">Footer content</div>
</body>
</html>
"""
reader = HTMLReaderMode()
content = reader.sanitize(html_content)
print(content)
# Output:
# [{'tag': 'h1', 'content': 'Article Title'}, {'tag': 'p', 'content': 'This is the main content of the article.'}]
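The result is a list of dictionaries with 'tag' and 'content' keys, so it can be post-processed with ordinary Python. The following is a minimal sketch of one way to render those blocks as plain text. The helper name blocks_to_text is hypothetical and not part of the library, and the sample list is copied from the output shown above rather than produced by calling the library.

```python
# Sample result copied from the documented output above (not a live call).
blocks = [
    {"tag": "h1", "content": "Article Title"},
    {"tag": "p", "content": "This is the main content of the article."},
]


def blocks_to_text(blocks):
    """Render extracted blocks as plain text (hypothetical helper).

    Headings h1..h6 get a '#' prefix per level; other tags become paragraphs.
    """
    lines = []
    for block in blocks:
        tag = block["tag"]
        text = block["content"].strip()
        if not text:
            continue  # skip blocks with no visible text
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            lines.append("#" * int(tag[1]) + " " + text)
        else:
            lines.append(text)
    # Separate blocks with a blank line.
    return "\n\n".join(lines)


print(blocks_to_text(blocks))
```

Keeping the rendering step separate from extraction means the same block list can feed other outputs (Markdown, a search index, a summarizer) without re-parsing the HTML.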
Features
- Content Extraction: Identifies and extracts the main text blocks.
- Noise Reduction: Removes scripts, styles, and high-link-density blocks (like navigation menus).
- Customizable: Configure block tags, script tags, and filtering thresholds.
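To illustrate what "link density" usually means in this context, here is a minimal sketch that measures the share of a block's text sitting inside <a> elements, using only the standard library's html.parser. It is an assumption about the general technique, not the library's actual implementation: the names LinkDensityParser, link_density, and LINK_DENSITY_THRESHOLD, and the 0.5 cutoff, are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical cutoff: blocks above this ratio are treated as navigation-like.
LINK_DENSITY_THRESHOLD = 0.5


class LinkDensityParser(HTMLParser):
    """Counts non-whitespace text characters, overall and inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0  # > 0 while inside an <a> element
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # script and style bodies are not visible text
        count = len("".join(data.split()))  # ignore whitespace
        self.total_chars += count
        if self._anchor_depth:
            self.link_chars += count


def link_density(fragment):
    """Return linked characters divided by all text characters (0.0 if empty)."""
    parser = LinkDensityParser()
    parser.feed(fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


nav = '<div><a href="/">Home</a> <a href="/about">About</a></div>'
article = '<p>This is the main content, with one <a href="#">link</a>.</p>'

for name, fragment in (("nav", nav), ("article", article)):
    density = link_density(fragment)
    verdict = "drop" if density > LINK_DENSITY_THRESHOLD else "keep"
    print(f"{name}: density={density:.2f} -> {verdict}")
```

In this sketch the navigation fragment consists entirely of link text, so it scores 1.0 and is dropped, while the article paragraph scores well below the cutoff and is kept.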
License
MIT
File details
Details for the file html_reader_mode-0.1.1.tar.gz.
File metadata
- Download URL: html_reader_mode-0.1.1.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4aed5029907097a785b09145b6ffcaed628da8d65d1dd0533408b09043bf49a2 |
| MD5 | d0cc3de2a6f07281e961ecc812781a50 |
| BLAKE2b-256 | 83dfee50a691effbc4f4a243dc2791a2074e864a06d76d267a86b858e70815ef |
File details
Details for the file html_reader_mode-0.1.1-py3-none-any.whl.
File metadata
- Download URL: html_reader_mode-0.1.1-py3-none-any.whl
- Upload date:
- Size: 5.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b1bb5018418eb0c906053bcce2b661df9ef76f67c8a8855db856034c3c450e3f |
| MD5 | 48851d7c8d254fdf72bd3a70c972b3ae |
| BLAKE2b-256 | 9f25fd364bcc8a8e8a10393238d59c806128d57c5abf9101e5bd934f7042ad20 |