Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup

The package is on PyPI and can be installed with pip:

pip install scrapy-beautifulsoup


Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary in your project settings:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400,
    }

By default, BeautifulSoup uses the built-in html.parser parser. To change it, set the BEAUTIFULSOUP_PARSER setting, for example:

    BEAUTIFULSOUP_PARSER = "html5lib"

html5lib is an extremely lenient parser. If the target HTML is seriously broken, consider making it your first choice. Note that html5lib has to be installed separately in this case:

pip install html5lib


With the help of an underlying parser of your choice, BeautifulSoup does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to “fix” it.
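
As a rough illustration of what "piping HTML through BeautifulSoup" means, here is a minimal sketch that uses BeautifulSoup directly, outside of Scrapy. It assumes the beautifulsoup4 package is installed. The helper name repair_html is hypothetical and is not part of scrapy-beautifulsoup; the middleware's actual implementation may differ.

```python
from bs4 import BeautifulSoup


def repair_html(markup, parser="html.parser"):
    """Parse possibly broken markup and serialize it back to a string.

    Hypothetical helper for illustration only. `parser` plays the same
    role as the BEAUTIFULSOUP_PARSER setting described above.
    """
    soup = BeautifulSoup(markup, parser)
    # Serializing the parsed tree emits a closing tag for every element,
    # including the ones that were left unclosed in the input.
    return str(soup)


# The <p> and <b> elements are never closed in this input.
broken = "<p>Hello<b>world"
fixed = repair_html(broken)
print(fixed)
```

The serialized result contains the closing tags that were missing from the input. Passing "html5lib" or "lxml" as the parser argument (with the corresponding package installed) can produce a different repaired tree, because each parser recovers from broken markup in its own way.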
