HTML cleaner from lxml project

Project description

lxml_html_clean

Motivation

This project was initially a part of lxml. Because the HTML cleaner is designed around a blocklist, many reports of possible security vulnerabilities were filed against lxml, which made the project problematic for security-sensitive environments. Therefore, we decided to extract the problematic part into a separate project.

Important: the HTML Cleaner in lxml_html_clean is not considered appropriate for security-sensitive environments. See e.g. nh3 for an alternative.

This project parses URLs with functions from Python's urllib.parse, which do not validate their inputs. For more information on potential security risks, refer to the URL parsing security documentation. A maliciously crafted URL could potentially bypass the allowed hosts check in Cleaner.
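To illustrate why host checks built on urllib.parse need care, here is a minimal sketch of a stricter check that an application might run on URLs before or after cleaning. It is not how Cleaner implements its allowed hosts check; the helper name host_is_allowed and the ALLOWED_HOSTS set are hypothetical, and the sketch uses only the standard library.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list for this sketch; not part of lxml_html_clean.
ALLOWED_HOSTS = {"trusted.example"}


def host_is_allowed(url: str) -> bool:
    """Return True only for http(s) URLs whose host exactly matches the allow-list."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects a few malformed inputs; treat those as not allowed.
        return False
    if parts.scheme not in ("http", "https"):
        return False
    # .hostname drops any userinfo and port and is lowercased; it is None if absent.
    host = parts.hostname
    return host is not None and host in ALLOWED_HOSTS


# The allowed name appears in each of these URLs, but only the first one
# actually points at the allowed host.
for candidate in (
    "https://trusted.example/video",
    "https://trusted.example@evil.test/",   # allowed name is only userinfo
    "https://trusted.example.evil.test/",   # allowed name is only a subdomain label
    "javascript:alert(1)",                  # no http(s) scheme
):
    print(candidate, host_is_allowed(candidate))
```

Comparing the parsed hostname for exact equality, rather than searching the raw URL string for an allowed name, avoids the userinfo and subdomain tricks shown above. It still inherits whatever parsing quirks urllib.parse has, so it should not be treated as a complete defense.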

Installation

You can install this project directly via pip install lxml_html_clean or as an extra of lxml via pip install lxml[html_clean]. Both ways install this project together with lxml itself.

Security

For discussions regarding security-related issues or any sensitive reports, please contact us privately. You can reach out to lbalhar(at)redhat.com or frenzy.madness(at)gmail.com to ensure your concerns are addressed confidentially and securely.

Documentation

https://lxml-html-clean.readthedocs.io/

License

BSD-3-Clause

Download files

Source Distribution

lxml_html_clean-0.4.5.tar.gz (24.1 kB view details)

Uploaded Source

Built Distribution

lxml_html_clean-0.4.5-py3-none-any.whl (14.6 kB view details)

Uploaded Python 3

File details

Details for the file lxml_html_clean-0.4.5.tar.gz.

File metadata

  • Download URL: lxml_html_clean-0.4.5.tar.gz
  • Size: 24.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for lxml_html_clean-0.4.5.tar.gz

Algorithm     Hash digest
SHA256        e2a4c7d5beedd17cd7b484d848a0571e54baa239a4f9df5546e3acba7f990560
MD5           a0ed690a6001d8f702a74dd6c23f42dd
BLAKE2b-256   0a63195dfdde380a84df309e3bccf4384b034b745dba43426886f7ae623b4fba

File details

Details for the file lxml_html_clean-0.4.5-py3-none-any.whl.

File hashes

Hashes for lxml_html_clean-0.4.5-py3-none-any.whl

Algorithm     Hash digest
SHA256        c76fcadd1e5bfb9b8bafc2200d51e4e78eb0dad67f56881c21dfb6484c7e7746
MD5           f90eabadc4b157042c676cf89e3575bb
BLAKE2b-256   6abd6e2b76a6c5dee10397db9c929f0c5066766ec1036046f0335b7ca7ca08b8
