Clean up HTML using BeautifulSoup and filter rules.

collective.soupstrainer

HTML from user input or from scraped pages often needs to be cleaned up. The SoupStrainer class in collective.soupstrainer makes this easy: it uses beautifulsoup4 to parse and clean up HTML. The constructor of the class takes four arguments.

exclusions

This is a list of tuples with two items each. The first item is a list of tag names, the second item is a list of attribute names. If the list of attributes is empty, each tag in the first list is completely removed from the passed-in HTML. If the list of tags is empty, each listed attribute is completely removed from every tag. If both tags and attributes are listed, the attributes are removed only from matching tags.
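
The three rule shapes can be modelled on a single element in plain Python. This is only a sketch of the semantics described above, not the library's implementation: the helper name apply_exclusions and the representation of an element as a tag name plus an attribute dict are assumptions made for illustration, and what happens to the children of a removed tag is not modelled.

```python
def apply_exclusions(tag, attrs, exclusions):
    """Model the exclusion rules on one element (hypothetical helper).

    Returns None when the tag itself is excluded, otherwise a dict with
    the attributes that survive.
    """
    kept = dict(attrs)
    for rule_tags, rule_attrs in exclusions:
        if rule_tags and not rule_attrs:
            # Tags only: every listed tag is removed outright.
            if tag in rule_tags:
                return None
        elif rule_attrs and not rule_tags:
            # Attributes only: strip them from every tag.
            for name in rule_attrs:
                kept.pop(name, None)
        elif tag in rule_tags:
            # Tags and attributes: strip the attributes from listed tags only.
            for name in rule_attrs:
                kept.pop(name, None)
    return kept


# Same shape as the exclusions argument: (tag names, attribute names).
exclusions = [
    (["script", "style"], []),  # remove these tags
    ([], ["onclick"]),          # remove this attribute everywhere
    (["a"], ["target"]),        # remove 'target' only from <a>
]

print(apply_exclusions("script", {}, exclusions))
print(apply_exclusions("a", {"href": "/", "target": "_blank"}, exclusions))
print(apply_exclusions("form", {"target": "_top"}, exclusions))
```

In this model the form element keeps its target attribute because the third rule names only the a tag.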

style_whitelist

This is a white list of CSS styles allowed in ‘style’ attributes. All other styles are removed.
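
As an illustration of what whitelisting styles amounts to, the following sketch filters the value of a ‘style’ attribute by property name. The function filter_style is hypothetical and is not taken from the package; it assumes the whitelist holds lowercase property names and uses a naive split on semicolons, which would mishandle values that contain a semicolon themselves.

```python
def filter_style(style, whitelist):
    """Keep only whitelisted CSS declarations (hypothetical helper)."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        # Skip empty fragments and anything without a 'name: value' shape.
        if sep and name.strip().lower() in whitelist:
            kept.append("%s: %s" % (name.strip(), value.strip()))
    return "; ".join(kept)


allowed = {"color", "text-align"}
print(filter_style("color: red; position: fixed; text-align:center", allowed))
```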

class_blacklist

This is a black list for CSS classes. Each matching class is removed from ‘class’ attributes.
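
The effect on a ‘class’ attribute can be sketched the same way. The helper filter_classes below is hypothetical, and it assumes that "matching" means an exact class name match; the package may match differently.

```python
def filter_classes(class_value, blacklist):
    """Drop blacklisted names from a class attribute (hypothetical helper)."""
    # split() without arguments also collapses repeated whitespace.
    return " ".join(c for c in class_value.split() if c not in blacklist)


print(filter_classes("MsoNormal intro  highlight", {"MsoNormal"}))
```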

parser

This is the parser used by beautifulsoup4 when the strainer is called with a string. It must be a parser that is installed for beautifulsoup4 and defaults to html.parser.

An instance of the SoupStrainer class can be called directly with one argument. The argument can be a string, in which case it is parsed internally by beautifulsoup4 and the result is returned as unicode (or as a string in Python 3). It can also be an HTML tree already parsed by beautifulsoup4, in which case the tree is modified in place and returned.

Changelog

2.2 (2021-03-25)

  • Do not stop after the first replace of a tag which is to be excluded. (#8)

  • Add support for Python 3.8 and 3.9.

2.1 (2019-02-06)

  • Add support for Python 3 and PyPy.

2.0 (2017-10-19)

Backwards incompatible changes

  • Update to beautifulsoup4.

  • Add a parameter parser to SoupStrainer which specifies the parser used by beautifulsoup4.

1.0 (2008-11-14)

  • Initial release
