Skip to main content

Clean and normalize HTML.

Project description

PyPI Version Supported Python Versions Build Status Coverage report

Clean and normalize HTML. Preserve embeddings (e.g. Twitter, Instagram, etc)

Quick start

Installation

Install the library with pip:

pip install clear-html

Usage

Example usage with lxml:

from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html

html="""
        <div style="color:blue" id="main_content">
            Some text to be
            <div>cleaned up!</div>
        </div>
     """
node = fromstring(html)
cleaned_node = clean_node(node)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Example usage with Parsel:

from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html

selector = Selector(text="""<html>
                            <body>
                                <h1>Hello!</h1>
                                <div style="color:blue" id="main_content">
                                    Some text to be
                                    <div>cleaned up!</div>
                                </div>
                            </body>
                            </html>""")
selector = selector.css("#main_content")
cleaned_node = clean_node(selector[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Both of the different approaches above would print the following:

<article>

<p>Some text to be</p>

<p>cleaned up!</p>

</article>

Other interesting functions:

  • cleaned_node_to_text: convert the cleaned node to plain text

  • formatted_text.clean_doc: low level method to control more aspects of the cleaning up

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

clear_html-0.4.1.tar.gz (23.9 kB view hashes)

Uploaded Source

Built Distribution

clear_html-0.4.1-py3-none-any.whl (24.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page