A library for cleaning HTML content by removing specified tags and attributes.

clean_html_for_llm

This library provides a function, clean_html, that cleans HTML content by removing specified tags and stripping every attribute except the ones you choose to keep. It is particularly useful for preprocessing HTML to remove noisy markup, making it easier for large language models (LLMs) to understand the HTML and generate accurate responses.

Installation

You can install the clean_html_for_llm library using pip:

pip install clean-html-for-llm

Usage

from clean_html_for_llm import clean_html

html_content = '<div id="main" style="color:red">Hello <script>alert("World")</script></div>'
cleaned_html = clean_html(html_content, tags_to_remove=['script'], attributes_to_keep=['id'])
print(cleaned_html)
# Output: '<div id="main">Hello </div>'

The clean_html function takes the following arguments:

  • html_to_clean (str): The HTML content to clean.
  • tags_to_remove (List[str]): List of tags to remove from the HTML content. Default is ['style', 'svg', 'script'].
  • attributes_to_keep (List[str]): List of attributes to keep in the HTML tags. Default is ['id', 'href'].

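Both list arguments are optional; pass your own lists to override the defaults. As the usage example above shows, a removed tag is dropped together with its contents, and any attribute not listed in attributes_to_keep is stripped from the remaining tags.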
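To make the documented behavior concrete, here is a minimal sketch that reproduces it with only the Python standard library. It is not the library's actual implementation: the names clean_html_sketch, _CleanerSketch, and VOID_TAGS are hypothetical, and the approach (a streaming html.parser subclass) is an assumption chosen so the example runs without extra dependencies.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Optional

# Void elements have no closing tag, so dropping one must not start a skipped region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _CleanerSketch(HTMLParser):
    """Hypothetical stand-in for the documented behavior, not the library's code."""

    def __init__(self, tags_to_remove: List[str], attributes_to_keep: List[str]):
        super().__init__(convert_charrefs=False)  # pass entities such as &amp; through verbatim
        self.tags_to_remove = set(tags_to_remove)
        self.attributes_to_keep = set(attributes_to_keep)
        self.parts: List[str] = []
        self.skip_tag: Optional[str] = None  # removed element we are currently inside
        self.skip_depth = 0                  # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in self.tags_to_remove:
            if tag not in VOID_TAGS:
                self.skip_tag, self.skip_depth = tag, 1
            return
        kept = []
        for name, value in attrs:
            if name in self.attributes_to_keep:
                # Attribute values arrive unescaped, so re-escape them on output.
                kept.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        self.parts.append("<" + " ".join([tag] + kept) + ">")

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag not in VOID_TAGS and tag not in self.tags_to_remove:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_tag is None:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_tag is None:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_tag is None:
            self.parts.append(f"&#{name};")


def clean_html_sketch(html_to_clean: str,
                      tags_to_remove: Optional[List[str]] = None,
                      attributes_to_keep: Optional[List[str]] = None) -> str:
    """Drop the listed tags (with their contents) and all attributes not listed to keep."""
    parser = _CleanerSketch(
        tags_to_remove if tags_to_remove is not None else ["style", "svg", "script"],
        attributes_to_keep if attributes_to_keep is not None else ["id", "href"],
    )
    parser.feed(html_to_clean)
    parser.close()
    return "".join(parser.parts)
```

The defaults mirror the ones listed above. This sketch also discards comments and doctype declarations, which the real library may or may not do, so treat it as an illustration of the idea rather than a drop-in replacement.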

Examples

Example 1:

html_content = '<div id="content" class="main">This is a <span style="font-size: 18px;">paragraph</span>.</div>'
cleaned_html = clean_html(html_content)
print(cleaned_html)
# Output: '<div id="content">This is a <span>paragraph</span>.</div>'

Example 2:

html_content = '<p class="content">Click <a href="https://example.com">here</a> for more information.</p>'
cleaned_html = clean_html(html_content, tags_to_remove=['a'], attributes_to_keep=['class'])
print(cleaned_html)
# Output: '<p class="content">Click  for more information.</p>'
# (only the <a> element and its text are removed; the text after it stays)

License

This library is released under the MIT License. See LICENSE for details.
