Parse an HTML document and extract its main content, similar to the reader mode in web browsers.

HTML Reader Mode

A Python library to extract the main content from an HTML document, similar to the "Reader Mode" feature found in web browsers. It filters out navigation, ads, sidebars, and other non-content elements.

Installation

pip install html-reader-mode

Usage

from html_reader_mode import HTMLReaderMode

html_content = """
<html>
    <body>
        <div id="header">Header content</div>
        <div id="content">
            <h1>Article Title</h1>
            <p>This is the main content of the article.</p>
        </div>
        <div id="footer">Footer content</div>
    </body>
</html>
"""

reader = HTMLReaderMode()
content = reader.sanitize(html_content)

print(content)
# Output:
# [{'tag': 'h1', 'content': 'Article Title'}, {'tag': 'p', 'content': 'This is the main content of the article.'}]
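
The output above is a list of dictionaries with 'tag' and 'content' keys, so it can be post-processed with plain Python. The following is a minimal sketch of one way to turn that list into readable text. The helper name blocks_to_markdown is hypothetical and not part of the library, and the sample data is copied from the output shown above rather than produced by calling the library.

```python
def blocks_to_markdown(blocks):
    """Render a list of {'tag', 'content'} dicts as Markdown-style text.

    Hypothetical helper, not part of html-reader-mode. It assumes the
    block shape shown in the usage example's output.
    """
    lines = []
    for block in blocks:
        tag = block.get("tag", "")
        text = block.get("content", "").strip()
        if not text:
            continue  # skip empty blocks
        # h1..h6 become '#' headings; every other tag is a plain paragraph.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            lines.append("#" * int(tag[1]) + " " + text)
        else:
            lines.append(text)
    # Separate blocks with a blank line.
    return "\n\n".join(lines)


# Sample data copied from the documented output above.
blocks = [
    {"tag": "h1", "content": "Article Title"},
    {"tag": "p", "content": "This is the main content of the article."},
]

print(blocks_to_markdown(blocks))
# Output:
# # Article Title
#
# This is the main content of the article.
```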

Features

  • Content Extraction: Identifies and extracts the main text blocks.
  • Noise Reduction: Removes scripts, styles, and high-link-density blocks (like navigation menus).
  • Customizable: Configure block tags, script tags, and filtering thresholds.
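
To illustrate what "high-link-density" means in the list above, here is a rough sketch that measures the share of a block's text that sits inside links, using only the standard library. It is not the library's implementation: the names LinkDensityParser, link_density, and is_probably_navigation, and the 0.5 threshold, are assumptions chosen for this example.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Count text characters inside and outside <a> elements.

    Illustrative only; not taken from html-reader-mode.
    """

    def __init__(self):
        super().__init__()
        self.link_depth = 0   # > 0 while inside an <a> element
        self.total_chars = 0  # all non-whitespace-trimmed text
        self.link_chars = 0   # text that appears inside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def link_density(html_fragment):
    """Return the fraction of text characters that are inside links."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


def is_probably_navigation(html_fragment, threshold=0.5):
    """Flag a block whose link density exceeds a (hypothetical) threshold."""
    return link_density(html_fragment) > threshold


nav = '<div><a href="/">Home</a> <a href="/about">About</a></div>'
article = '<p>This is the main content, with <a href="/x">one link</a>.</p>'

print(link_density(nav))                # 1.0, every character is link text
print(is_probably_navigation(nav))      # True
print(is_probably_navigation(article))  # False
```

A menu is almost entirely link text, so its density is close to 1, while an article paragraph with an occasional link stays well below the threshold. Where the cutoff should sit depends on the pages being processed.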

License

MIT
