HTML cleaner from lxml project

lxml_html_clean

Motivation

This project was initially a part of lxml. Because the HTML cleaner is blocklist-based by design, many reports about possible security vulnerabilities were filed against lxml, which made the whole project problematic for security-sensitive environments. We therefore decided to extract the problematic part into a separate project.

Important: the HTML Cleaner in lxml_html_clean is not considered appropriate for security-sensitive environments. See e.g. nh3 for an alternative.

This project uses functions from Python's urllib.parse for URL parsing, and these functions do not validate their inputs. For more information on potential security risks, refer to the URL parsing security documentation. A maliciously crafted URL could potentially bypass the allowed hosts check in Cleaner.
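To make the risk concrete, here is a minimal sketch of how host checks built on urllib.parse can go wrong. It is not the check that Cleaner performs; ALLOWED_HOSTS, naive_is_allowed, and stricter_is_allowed are hypothetical names introduced only for illustration, and the host names are placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list, for illustration only.
ALLOWED_HOSTS = {"trusted.example"}


def naive_is_allowed(url: str) -> bool:
    """Deliberately flawed: looks for an allowed host anywhere in the raw URL."""
    return any(host in url for host in ALLOWED_HOSTS)


def stricter_is_allowed(url: str) -> bool:
    """Compare the parsed scheme and hostname instead of the raw string.

    urlsplit() does not validate its input, so this is still only a
    best-effort check, not a security guarantee.
    """
    parts = urlsplit(url)
    return parts.scheme in {"http", "https"} and parts.hostname in ALLOWED_HOSTS


# "trusted.example" is the userinfo part here; the real host is "evil.example".
tricky = "http://trusted.example@evil.example/page"
```

With these definitions, naive_is_allowed(tricky) accepts the URL while stricter_is_allowed(tricky) rejects it, because urlsplit reports evil.example as the hostname. Even the stricter variant inherits any parsing quirks of urllib.parse, which is why an allowlist-based sanitizer is the safer choice for untrusted input.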

Installation

You can install this project directly via pip install lxml_html_clean or as an extra of lxml via pip install lxml[html_clean]. Either way, lxml itself is installed alongside this project.
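After installing, a quick way to confirm that both packages are visible to the current interpreter is sketched below. The helper names is_installed and check_install are hypothetical; the only assumption about this project is that its import name is lxml_html_clean.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def check_install() -> dict:
    """Report whether lxml and lxml_html_clean are importable in this environment."""
    return {name: is_installed(name) for name in ("lxml", "lxml_html_clean")}
```

Calling check_install() returns a dictionary mapping each module name to True or False, which is handy when several virtual environments are in play.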

Security

For discussions regarding security-related issues or any sensitive reports, please contact us privately. You can reach out to lbalhar(at)redhat.com or frenzy.madness(at)gmail.com to ensure your concerns are addressed confidentially and securely.

Documentation

https://lxml-html-clean.readthedocs.io/

License

BSD-3-Clause
