HTML sanitizer

This is a whitelist-based and very opinionated HTML sanitizer that can be used for both untrusted and trusted sources. It attempts to clean up the mess made by various rich text editors and/or copy-pasting to make styling of webpages simpler and more consistent.

It had its humble beginnings as feincms.utils.html.cleanse.cleanse_html and feincms-cleanse. While it’s still humble, its name has been changed to HTML sanitizer to underline the fact that it has absolutely no dependency on either Django or FeinCMS.

Goals

HTML sanitizer goes further than e.g. bleach in that it not only ensures that content is safe and tags and attributes conform to a given whitelist, but also applies additional transforms to HTML fragments. A short list of goals follows:

  • Clean up HTML using a very restricted set of allowed tags and attributes.

  • Convert some tags (such as <span style="...">, <b> and <i>) into either <strong> or <em> (but never both).

  • Normalize whitespace by removing repeated line breaks, empty paragraphs and other empty elements.

  • Merge adjacent tags of the same type (such as several <strong> or <h3> directly after each other).

  • Automatically remove redundant list markers inside <li> tags.

  • Clean up some ugliness such as paragraphs inside paragraphs or list elements etc.

Usage

>>> from html_sanitizer import Sanitizer
>>> sanitizer = Sanitizer()  # default configuration
>>> sanitizer.sanitize('<span style="font-weight:bold">some text</span>')
'<strong>some text</strong>'

Settings

  • span elements will always be removed from the tree, but only after inspecting their style attributes (bold spans are converted into strong tags, italic spans into em tags)

  • b and i tags will always be converted into strong and em (if strong and em are allowed at all)
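The span rule above can be pictured as a small decision on the style attribute. The following is only an illustrative sketch, not the library's code: the function name tag_for_span_style and both regular expressions are assumptions made for this example.

```python
import re

# Hypothetical patterns; the library's own matching rules may differ.
BOLD_RE = re.compile(r"font-weight\s*:\s*(bold|[6-9]00)", re.IGNORECASE)
ITALIC_RE = re.compile(r"font-style\s*:\s*italic", re.IGNORECASE)


def tag_for_span_style(style):
    """Return 'strong', 'em' or None for a span's style attribute value."""
    if not style:
        return None
    if BOLD_RE.search(style):
        return "strong"  # bold is checked first, so a span never becomes both
    if ITALIC_RE.search(style):
        return "em"
    return None  # no replacement tag; the span itself is removed
```

Checking bold before italic is one simple way to honor the "never both" goal; a span that is both bold and italic ends up as strong in this sketch.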

The default settings are:

DEFAULT_SETTINGS = {
    'tags': {
        'a', 'h1', 'h2', 'h3', 'strong', 'em', 'p', 'ul', 'ol',
        'li', 'br', 'sub', 'sup', 'hr',
    },
    'attributes': {
        'a': ('href', 'name', 'target', 'title', 'id'),
    },
    'empty': {'hr', 'a', 'br'},
    'separate': {'a', 'p', 'li'},
    'add_nofollow': False,
    'autolink': False,
    'element_filters': [],
    'sanitize_href': html_sanitizer.sanitizer.sanitize_href,
}

The keys’ meaning is as follows:

  • tags: A set() of allowed tags.

  • attributes: A dict() mapping tags to their allowed attributes.

  • empty: Tags which are allowed to be empty. By default, empty tags (containing no text or only whitespace) are dropped.

  • separate: Tags which are not merged if they appear as siblings. By default, tags of the same type are merged.

  • add_nofollow: Whether to add rel="nofollow" to all links.

  • autolink: Enable lxml’s autolinker. May be either a boolean or a dictionary; a dictionary is passed as keyword arguments to autolink.

  • element_filters: Additional filters that are called on all elements in the tree. The tree is processed in reverse depth-first order. Under certain circumstances elements are processed more than once (search the code for backlog.append).

  • sanitize_href: A callable that gets anchor’s href value and returns a sanitized version. The default implementation checks whether links start with a few allowed prefixes, and if not, returns a single hash (#).
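As a rough illustration of the sanitize_href contract described above (take an href, return a safe href), a replacement callable could look like the sketch below. The name strict_sanitize_href and the prefix list are assumptions for this example; the prefixes used by the default implementation should be checked in the library source.

```python
# Assumed prefix list, chosen for illustration only.
ALLOWED_PREFIXES = ("/", "#", "http:", "https:", "mailto:")


def strict_sanitize_href(href):
    """Return href unchanged if it starts with an allowed prefix, else '#'."""
    if href.startswith(ALLOWED_PREFIXES):
        return href
    return "#"


# It would be passed in as a partial setting:
# sanitizer = Sanitizer({'sanitize_href': strict_sanitize_href})
```

A plain prefix check like this is deliberately simple. It rejects schemes such as javascript: but does not validate the rest of the URL.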

Settings can be specified partially when initializing a sanitizer instance, but are still checked for consistency (e.g. it’s not allowed to have tags in empty that are not in tags, that is, tags that are allowed to be empty but at the same time not allowed at all). An example for an even more restricted configuration might be:

>>> from html_sanitizer import Sanitizer
>>> sanitizer = Sanitizer({
...     'tags': ('h1', 'h2', 'p'),
...     'attributes': {},
...     'empty': set(),
...     'separate': set(),
... })
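To see why the restricted example resets empty and separate, it may help to sketch how partial settings could be overlaid on the defaults and then checked. This is a simplified model and not the library's implementation: merge_settings, the ValueError and the check on separate are assumptions. Only the rule that empty must be a subset of tags is stated above.

```python
DEFAULTS = {
    "tags": {"a", "h1", "h2", "h3", "strong", "em", "p", "ul", "ol",
             "li", "br", "sub", "sup", "hr"},
    "empty": {"hr", "a", "br"},
    "separate": {"a", "p", "li"},
}


def merge_settings(overrides):
    """Overlay partial settings on the defaults, then check consistency."""
    settings = {key: set(value) for key, value in DEFAULTS.items()}
    settings.update({key: set(value) for key, value in overrides.items()})
    # Assumption: 'separate' is validated the same way as 'empty'.
    for key in ("empty", "separate"):
        unknown = settings[key] - settings["tags"]
        if unknown:
            raise ValueError(
                "%r contains tags that are not allowed at all: %s"
                % (key, ", ".join(sorted(unknown)))
            )
    return settings
```

In this model, merge_settings({'tags': ('h1', 'h2', 'p')}) fails because the default empty set still names hr, a and br, while passing empty and separate as empty sets succeeds.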

The rationale for such a restricted set of allowed tags (e.g. no images) is documented in the design decisions section of django-content-editor’s documentation.

Django

HTML sanitizer does not depend on Django, but ships with a module which makes configuring sanitizers using Django settings easier. Usage is as follows:

>>> from html_sanitizer.django import get_sanitizer
>>> sanitizer = get_sanitizer([name=...])

Different sanitizers can be configured. The default configuration is aptly named 'default'. Example settings follow:

HTML_SANITIZERS = {
    'default': {
        'tags': ...,
        ...
    },
}

The 'default' configuration is special: If it isn’t explicitly defined, the default configuration above is used instead.
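A settings module with more than one named configuration might look like the sketch below. The configuration name 'headings' and the tag choices are hypothetical and only serve as an example. Each entry uses the same keys as the settings described earlier.

```python
# settings.py (sketch). 'headings' is a hypothetical configuration name.
HTML_SANITIZERS = {
    "default": {
        "tags": {"a", "strong", "em", "p", "br"},
        "attributes": {"a": ("href", "title")},
        "empty": {"br"},
        "separate": {"a", "p"},
    },
    "headings": {
        "tags": {"h1", "h2", "h3"},
        "attributes": {},
        "empty": set(),
        "separate": set(),
    },
}

# Elsewhere in the project:
# from html_sanitizer.django import get_sanitizer
# get_sanitizer()                 # uses the 'default' configuration
# get_sanitizer(name='headings')  # uses the second configuration
```

Both entries keep empty and separate within their own tags, so each should pass the consistency check applied when a sanitizer is initialized.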
