
Project description


scrapy-beautifulsoup

Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup

Installation

The package is on PyPI and can be installed with pip:

pip install scrapy-beautifulsoup

Configuration

Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary in your project settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
}

By default, BeautifulSoup uses Python's built-in html.parser. To change it, set the BEAUTIFULSOUP_PARSER setting:

BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

html5lib is an extremely lenient parser, and if the target HTML is seriously broken you might consider making it your first choice. Note that html5lib has to be installed separately in this case:

pip install html5lib
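
To get a feel for the difference in leniency, here is a small standalone example (not part of this package) comparing how the two parsers handle the same broken fragment; the exact output can vary slightly between library versions:

from bs4 import BeautifulSoup

broken = "<p>unclosed paragraph <b>bold text"

# The stdlib parser closes the dangling tags but keeps the fragment as-is:
print(BeautifulSoup(broken, "html.parser"))
# <p>unclosed paragraph <b>bold text</b></p>

# html5lib rebuilds a full, browser-like document tree around the fragment:
print(BeautifulSoup(broken, "html5lib"))
# <html><head></head><body><p>unclosed paragraph <b>bold text</b></p></body></html>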

Motivation

BeautifulSoup, with the help of an underlying parser of choice, does a pretty good job of handling non-well-formed or broken HTML. In some cases it makes sense to pipe the HTML through BeautifulSoup to “fix” it before the response reaches your spiders.
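
For context, a downloader middleware that does this could look roughly like the sketch below. This is a minimal illustration assuming the BEAUTIFULSOUP_PARSER setting described above, not the package's actual source code:

from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse


class BeautifulSoupSketchMiddleware:
    """Illustrative only: re-serialize HTML responses through BeautifulSoup."""

    def __init__(self, parser):
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Mirror the BEAUTIFULSOUP_PARSER setting, defaulting to the stdlib parser.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Only touch HTML responses; pass binary or non-HTML responses through.
        if not isinstance(response, HtmlResponse):
            return response
        soup = BeautifulSoup(response.text, self.parser)
        # str(soup) yields the cleaned-up markup produced by the chosen parser.
        return response.replace(body=str(soup), encoding="utf-8")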

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-beautifulsoup-0.0.2.tar.gz (2.3 kB)

Uploaded: Source

Built Distribution

scrapy_beautifulsoup-0.0.2-py2.py3-none-any.whl (4.5 kB)

Uploaded: Python 2, Python 3

File details

Details for the file scrapy-beautifulsoup-0.0.2.tar.gz.

File metadata

File hashes

Hashes for scrapy-beautifulsoup-0.0.2.tar.gz
Algorithm Hash digest
SHA256 6cf3158d257bb3d95dc45b8892d35dbf1d356afa4c33d4b1829fb34cdfbbd3be
MD5 fcf611c65047d783ebbadf80d0718b9f
BLAKE2b-256 83533b51bc3dc26e4007241f9bcdb9693501e026192898be39cfe33791db0fff

See the pip documentation for more details on using hashes.
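
As a quick local check, the SHA256 digest above can be verified with Python's hashlib; the file path below assumes the source distribution has been downloaded into the current directory:

import hashlib

expected = "6cf3158d257bb3d95dc45b8892d35dbf1d356afa4c33d4b1829fb34cdfbbd3be"

with open("scrapy-beautifulsoup-0.0.2.tar.gz", "rb") as f:
    actual = hashlib.sha256(f.read()).hexdigest()

print("OK" if actual == expected else "MISMATCH")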

File details

Details for the file scrapy_beautifulsoup-0.0.2-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for scrapy_beautifulsoup-0.0.2-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 354fb34f6d302768cb2e6380464c3310934af1de673714f3d6c46b8d0f88c3a1
MD5 3f522f73be574c5d2088fb9335dc7660
BLAKE2b-256 70067c0f6a2f0a595cfa767fb123635f1b347fc23fe60d5f3b94eabc19582520

See the pip documentation for more details on using hashes.
