A library for cleaning HTML content by removing specified tags and attributes.

clean_html_for_llm

This library provides a function, clean_html, that removes specified tags from HTML content and strips every attribute except the ones you choose to keep. It is particularly useful for preprocessing HTML before querying a large language model (LLM): removing noisy tags and attributes leaves less irrelevant markup for the model to read, which makes the HTML easier for it to understand and helps it generate accurate responses.

Installation

You can install the clean_html_for_llm library using pip:

pip install clean-html-for-llm

Usage

from clean_html_for_llm import clean_html

html_content = '<div id="main" style="color:red">Hello <script>alert("World")</script></div>'
cleaned_html = clean_html(html_content, tags_to_remove=['script'], attributes_to_keep=['id'])
print(cleaned_html)
# Output: '<div id="main">Hello </div>'

The clean_html function takes the following arguments:

  • html_to_clean (str): The HTML content to clean.
  • tags_to_remove (List[str]): List of tags to remove from the HTML content. Default is ['style', 'svg', 'script'].
  • attributes_to_keep (List[str]): List of attributes to keep in the HTML tags. Default is ['id', 'href'].

Both lists are optional. Pass your own tags_to_remove or attributes_to_keep to override the corresponding default.
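To make the behavior described above concrete, here is a minimal sketch of how such a cleaner could work, using only Python's standard html.parser module. It is not the library's actual implementation: the names clean_html_sketch, _Cleaner, and VOID_TAGS are hypothetical, and the sketch assumes reasonably well-formed HTML. It drops each listed tag together with its contents, keeps only the listed attributes on every other tag, and discards comments and doctype declarations.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Optional

# Void elements never have a closing tag, so they must not change the skip depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Cleaner(HTMLParser):
    """Hypothetical helper: re-emits HTML without the unwanted tags and attributes."""

    def __init__(self, tags_to_remove: List[str], attributes_to_keep: List[str]):
        super().__init__(convert_charrefs=False)
        self.remove = set(tags_to_remove)
        self.keep = set(attributes_to_keep)
        self.out: List[str] = []
        self.skip_depth = 0  # > 0 while inside an element that is being removed

    def _render(self, tag, attrs, self_closing=False):
        # Keep only whitelisted attributes; re-escape values the parser unescaped.
        kept = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in self.keep
        )
        return f"<{tag}{kept}{' /' if self_closing else ''}>"

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in self.remove:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and tag not in self.remove:
            self.out.append(self._render(tag, attrs, self_closing=True))

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def clean_html_sketch(
    html_to_clean: str,
    tags_to_remove: Optional[List[str]] = None,
    attributes_to_keep: Optional[List[str]] = None,
) -> str:
    """Illustrative stand-in for clean_html, using the same documented defaults."""
    if tags_to_remove is None:
        tags_to_remove = ["style", "svg", "script"]
    if attributes_to_keep is None:
        attributes_to_keep = ["id", "href"]
    parser = _Cleaner(tags_to_remove, attributes_to_keep)
    parser.feed(html_to_clean)
    parser.close()
    return "".join(parser.out)


html_content = '<div id="main" style="color:red">Hello <script>alert("World")</script></div>'
print(clean_html_sketch(html_content, tags_to_remove=["script"], attributes_to_keep=["id"]))
# prints: <div id="main">Hello </div>
```

The sketch uses None as the default for both list arguments and fills in the documented defaults inside the function, which avoids sharing one mutable list between calls. The real library may handle malformed markup, comments, and entities differently.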

Examples

Example 1:

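from clean_html_for_llm import clean_html

# Defaults apply: 'style', 'svg' and 'script' tags are removed; only 'id' and 'href' attributes are kept.
html_content = '<div id="content" class="main">This is a <span style="font-size: 18px;">paragraph</span>.</div>'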
cleaned_html = clean_html(html_content)
print(cleaned_html)
# Output: '<div id="content">This is a <span>paragraph</span>.</div>'

Example 2:

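from clean_html_for_llm import clean_html

# Remove every <a> tag and keep only the 'class' attribute.
html_content = '<p class="content">Click <a href="https://example.com">here</a> for more information.</p>'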
cleaned_html = clean_html(html_content, tags_to_remove=['a'], attributes_to_keep=['class'])
print(cleaned_html)
# Output: '<p class="content">Click  for more information.</p>'

License

This library is released under the MIT License. See LICENSE for details.
