A library for cleaning HTML content by removing specified tags and attributes.

clean_html_for_llm

This library provides a function, clean_html, that cleans HTML content by removing specified tags (along with their contents) and stripping every attribute except the ones you choose to keep. It is particularly useful for preprocessing HTML before passing it to a large language model (LLM): dropping noisy tags and attributes makes the markup shorter and easier for the model to interpret, which can help it generate more accurate responses.

Installation

You can install the clean_html_for_llm library using pip:

pip install clean-html-for-llm

Usage

from clean_html_for_llm import clean_html

html_content = '<div id="main" style="color:red">Hello <script>alert("World")</script></div>'
cleaned_html = clean_html(html_content, tags_to_remove=['script'], attributes_to_keep=['id'])
print(cleaned_html)
# Output: '<div id="main">Hello </div>'

The clean_html function takes the following arguments:

  • html_to_clean (str): The HTML content to clean.
  • tags_to_remove (List[str]): List of tags to remove from the HTML content. Default is ['style', 'svg', 'script'].
  • attributes_to_keep (List[str]): List of attributes to keep in the HTML tags. Default is ['id', 'href'].

Pass your own list for either parameter to override its default.
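
To make the argument semantics above concrete, here is a minimal sketch of a cleaner with the same three parameters, written against the Python standard library only. It is not the library's actual implementation, which this page does not show. The names clean_html_sketch, _SketchCleaner, and VOID_TAGS are hypothetical, and the sketch assumes lowercase tag names, reasonably well-formed input, and that comments and doctypes can be dropped.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Optional

# Void elements have no closing tag, so they must not change the skip depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _SketchCleaner(HTMLParser):
    """Hypothetical cleaner: drops listed tags with their content and
    keeps only the listed attributes on everything else."""

    def __init__(self, tags_to_remove: List[str], attributes_to_keep: List[str]):
        # convert_charrefs=False so entities such as &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.tags_to_remove = set(tags_to_remove)
        self.attributes_to_keep = set(attributes_to_keep)
        self.parts: List[str] = []
        self.skip_depth = 0  # greater than 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in self.tags_to_remove:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in self.attributes_to_keep
        )
        self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. the synthetic end tag produced for <br/>
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.parts.append("&{};".format(name))

    def handle_charref(self, name):
        if not self.skip_depth:
            self.parts.append("&#{};".format(name))


def clean_html_sketch(html_to_clean: str,
                      tags_to_remove: Optional[List[str]] = None,
                      attributes_to_keep: Optional[List[str]] = None) -> str:
    """Illustrative stand-in for clean_html, using the documented defaults."""
    if tags_to_remove is None:
        tags_to_remove = ["style", "svg", "script"]
    if attributes_to_keep is None:
        attributes_to_keep = ["id", "href"]
    parser = _SketchCleaner(tags_to_remove, attributes_to_keep)
    parser.feed(html_to_clean)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = '<div id="main" style="color:red">Hello <script>alert("World")</script></div>'
    print(clean_html_sketch(sample, tags_to_remove=["script"], attributes_to_keep=["id"]))
    # prints: <div id="main">Hello </div>
```

Note that in this sketch a removed tag takes only its own contents with it. Text that follows the removed element stays in the output.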

Examples

Example 1:

html_content = '<div id="content" class="main">This is a <span style="font-size: 18px;">paragraph</span>.</div>'
cleaned_html = clean_html(html_content)
print(cleaned_html)
# Output: '<div id="content">This is a <span>paragraph</span>.</div>'

Example 2:

html_content = '<p class="content">Click <a href="https://example.com">here</a> for more information.</p>'
cleaned_html = clean_html(html_content, tags_to_remove=['a'], attributes_to_keep=['class'])
print(cleaned_html)
# Output: '<p class="content">Click  for more information.</p>'

License

This library is released under the MIT License. See LICENSE for details.

Source Distribution

clean_html_for_llm-1.2.0.tar.gz (3.3 kB)

Uploaded Source

Built Distribution

clean_html_for_llm-1.2.0-py3-none-any.whl (4.4 kB)

Uploaded Python 3

File details

Details for the file clean_html_for_llm-1.2.0.tar.gz.

File metadata

  • Download URL: clean_html_for_llm-1.2.0.tar.gz
  • Upload date:
  • Size: 3.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.3

File hashes

Hashes for clean_html_for_llm-1.2.0.tar.gz
Algorithm Hash digest
SHA256 e335c4b921dd625c5c09844df120bdfa9339157da11b09558cc5b25065d0503b
MD5 855cde1e5fa5c7b2f9da086da6192475
BLAKE2b-256 d7f81304a1f31f570ca6b1bb8c90fb6c964fa8e0eaca26b7a09499b4043ffb09
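
The SHA256 digest above can be used to check a downloaded copy of the file. The following is a minimal sketch using only hashlib from the standard library. The helper names sha256_of and matches_published_sha256 are hypothetical, and the file path is assumed to point at a local download.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha256(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_sha256("clean_html_for_llm-1.2.0.tar.gz", "e335c4b921dd625c5c09844df120bdfa9339157da11b09558cc5b25065d0503b") should return True for an intact download of the source distribution.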

File details

Details for the file clean_html_for_llm-1.2.0-py3-none-any.whl.

File hashes

Hashes for clean_html_for_llm-1.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 02e79d9737bee62830c00aa2090a32ea30b3a12167798ca4a81195dad95c4c36
MD5 2942718755c7e97bcd23d26ad946111c
BLAKE2b-256 d9c90c6aeac8a67ac0517339b5c185662284884e1840a3b8653e490f5f2583ea
