
Parse an HTML document and extract its main content, similar to the reader mode in web browsers.


HTML Reader Mode

A Python library to extract the main content from an HTML document, similar to the "Reader Mode" feature found in web browsers. It filters out navigation, ads, sidebars, and other non-content elements.

Installation

pip install html-reader-mode

Usage

from html_reader_mode import HTMLReaderMode

html_content = """
<html>
    <body>
        <div id="header">Header content</div>
        <div id="content">
            <h1>Article Title</h1>
            <p>This is the main content of the article.</p>
        </div>
        <div id="footer">Footer content</div>
    </body>
</html>
"""

reader = HTMLReaderMode()
content = reader.sanitize(html_content)

print(content)
# Output:
# [{'tag': 'h1', 'content': 'Article Title'}, {'tag': 'p', 'content': 'This is the main content of the article.'}]
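
The result shown above is a list of dictionaries, each with a 'tag' and a 'content' key. As a minimal sketch of how that list might be turned into plain text, assuming every entry has exactly that shape (the README shows only this one example, so other keys or tag names are not confirmed), a small helper could look like the following. The name blocks_to_text is hypothetical and is not part of the library.

```python
def blocks_to_text(blocks):
    """Join extracted blocks into plain text.

    Assumes each block is a dict with 'tag' and 'content' keys,
    as in the README's example output. Hypothetical helper, not
    part of html-reader-mode.
    """
    parts = []
    for block in blocks:
        text = block["content"].strip()
        if not text:
            # Skip blocks that carry no visible text.
            continue
        if block["tag"] == "h1":
            # Underline top-level headings so they stand out in plain text.
            parts.append(f"{text}\n{'=' * len(text)}")
        else:
            parts.append(text)
    # Separate blocks with a blank line.
    return "\n\n".join(parts)


# Sample data copied from the example output above.
sample = [
    {"tag": "h1", "content": "Article Title"},
    {"tag": "p", "content": "This is the main content of the article."},
]
plain_text = blocks_to_text(sample)
```

Printing plain_text gives the title, an underline of the same length, a blank line, and then the paragraph.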

Features

  • Content Extraction: Identifies and extracts the main text blocks.
  • Noise Reduction: Removes scripts, styles, and high-link-density blocks (like navigation menus).
  • Customizable: Configure block tags, script tags, and filtering thresholds.
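
To illustrate what "high-link-density" means in the Noise Reduction feature, here is a minimal sketch of one common way to measure it: the share of a block's visible text that sits inside <a> elements. This is an illustration built on the standard library's html.parser, not the library's actual implementation. The names LinkDensityParser and link_density and the 0.5 threshold are assumptions made for the example.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts visible text characters, and how many of them are inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0  # > 0 while inside one or more <a> elements

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the ratio.
        n = len(data.strip())
        self.total_chars += n
        if self._anchor_depth > 0:
            self.link_chars += n


def link_density(html_fragment):
    """Return link text length divided by total text length (0.0 if no text)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


# Hypothetical threshold; the library's real default is not documented here.
LINK_DENSITY_THRESHOLD = 0.5

nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
article = '<p>Read the <a href="/x">docs</a> first.</p>'

nav_is_noise = link_density(nav) > LINK_DENSITY_THRESHOLD
article_is_noise = link_density(article) > LINK_DENSITY_THRESHOLD
```

In this sketch the navigation list consists entirely of link text, so it exceeds the threshold, while the article paragraph, where only one word is linked, stays well below it.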

License

MIT
