A library for cleaning HTML content by removing specified tags and attributes.

clean_html_for_llm

This library provides a function, clean_html, that cleans HTML content by removing specified tags and stripping every attribute except those you choose to keep. It is particularly useful for preprocessing HTML to remove noisy markup, making it easier for large language models (LLMs) to interpret the HTML and generate accurate responses.

Installation

You can install the clean_html_for_llm library using pip:

pip install clean-html-for-llm

Usage

from clean_html_for_llm import clean_html

html_content = '<div id="main" style="color:red">Hello <script>alert("World")</script></div>'
cleaned_html = clean_html(html_content, tags_to_remove=['script'], attributes_to_keep=['id'])
print(cleaned_html)
# Output: '<div id="main">Hello </div>'

The clean_html function takes the following arguments:

  • html_to_clean (str): The HTML content to clean.
  • tags_to_remove (List[str]): List of tags to remove from the HTML content. Default is ['style', 'svg', 'script'].
  • attributes_to_keep (List[str]): List of attributes to keep in the HTML tags. Default is ['id', 'href'].

You can customize the tags and attributes to remove or keep based on your requirements.
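
To make the behaviour described above concrete, here is a minimal sketch of the same idea built only on Python's standard-library html.parser module. It is not the library's actual implementation, which this page does not show; the names clean_html_sketch, _SketchCleaner and VOID_TAGS are hypothetical and exist only for illustration. As a simplification, the sketch also drops comments and doctype declarations.

```python
from html import escape
from html.parser import HTMLParser
from typing import List, Optional

# Void elements never have a closing tag, so removing one must not start a skip.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _SketchCleaner(HTMLParser):
    """Hypothetical helper: re-emits HTML, dropping unwanted tags and attributes."""

    def __init__(self, tags_to_remove: List[str], attributes_to_keep: List[str]):
        # convert_charrefs=False lets entities such as &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.tags_to_remove = set(tags_to_remove)
        self.attributes_to_keep = set(attributes_to_keep)
        self.skip_tag: Optional[str] = None  # removed element we are currently inside
        self.skip_depth = 0                  # nesting depth of that same tag name
        self.parts: List[str] = []

    def _open_tag(self, tag, attrs) -> str:
        # Attribute values arrive unescaped from the parser, so escape them again.
        kept = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in self.attributes_to_keep
        )
        return f"<{tag}{kept}>"

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in self.tags_to_remove:
            if tag not in VOID_TAGS:
                self.skip_tag, self.skip_depth = tag, 1
            return
        self.parts.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are re-emitted as a plain start tag.
        if self.skip_tag is None and tag not in self.tags_to_remove:
            self.parts.append(self._open_tag(tag, attrs))

    def handle_endtag(self, tag):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth -= 1
                if self.skip_depth == 0:
                    self.skip_tag = None
            return
        if tag not in self.tags_to_remove:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_tag is None:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_tag is None:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_tag is None:
            self.parts.append(f"&#{name};")


def clean_html_sketch(html_to_clean: str,
                      tags_to_remove: Optional[List[str]] = None,
                      attributes_to_keep: Optional[List[str]] = None) -> str:
    """Illustrative stand-in that mirrors the documented arguments and defaults."""
    if tags_to_remove is None:
        tags_to_remove = ["style", "svg", "script"]
    if attributes_to_keep is None:
        attributes_to_keep = ["id", "href"]
    cleaner = _SketchCleaner(tags_to_remove, attributes_to_keep)
    cleaner.feed(html_to_clean)
    cleaner.close()
    return "".join(cleaner.parts)


sample = '<div id="main" style="color:red">Hello <script>alert("World")</script></div>'
print(clean_html_sketch(sample, tags_to_remove=["script"], attributes_to_keep=["id"]))
# Prints: <div id="main">Hello </div>
```

The sketch removes a tag together with everything inside it and filters attributes on the tags that remain, which is the behaviour the argument list describes. For real use, prefer the clean_html function itself.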

Examples

Example 1:

html_content = '<div id="content" class="main">This is a <span style="font-size: 18px;">paragraph</span>.</div>'
cleaned_html = clean_html(html_content)
print(cleaned_html)
# Output: '<div id="content">This is a <span>paragraph</span>.</div>'

Example 2:

html_content = '<p class="content">Click <a href="https://example.com">here</a> for more information.</p>'
cleaned_html = clean_html(html_content, tags_to_remove=['a'], attributes_to_keep=['class'])
print(cleaned_html)
# Output: '<p class="content">Click  for more information.</p>'

License

This library is released under the MIT License. See LICENSE for details.
