Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup
scrapy-beautifulsoup
Installation
The package is on PyPI and can be installed with pip:
pip install scrapy-beautifulsoup
Configuration
Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary setting:
DOWNLOADER_MIDDLEWARES = { 'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400 }
By default, BeautifulSoup uses the built-in html.parser parser. To change it, set the BEAUTIFULSOUP_PARSER setting:
BEAUTIFULSOUP_PARSER = "html5lib" # or BEAUTIFULSOUP_PARSER = "lxml"
html5lib is an extremely lenient parser; if the target HTML is seriously broken, consider making it your first choice. Note that html5lib has to be installed separately in this case:
pip install html5lib
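To illustrate how a downloader middleware of this kind could read the BEAUTIFULSOUP_PARSER setting and rewrite each response, here is a minimal sketch. It is not the package's actual source: the class name SoupRepairMiddleware is hypothetical, and the code assumes the response is a text response (for example an HtmlResponse), so that Scrapy can encode the string passed to response.replace().

```python
from bs4 import BeautifulSoup


class SoupRepairMiddleware:
    """Hypothetical middleware for illustration; not the package's own code."""

    def __init__(self, parser="html.parser"):
        # Name of the BeautifulSoup parser to use for every response.
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when building the middleware; fall back to
        # the built-in parser when BEAUTIFULSOUP_PARSER is not set.
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html.parser"))

    def process_response(self, request, response, spider):
        # Parse the raw body and serialize the repaired tree back to text.
        repaired = str(BeautifulSoup(response.body, self.parser))
        # Return a copy of the response that carries the repaired markup
        # (assumption: a text response, which encodes str bodies itself).
        return response.replace(body=repaired)
```

With a middleware like this enabled, spider callbacks would receive the repaired markup, so selectors run against a tree in which every element is properly closed.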
Motivation
BeautifulSoup, with the help of an underlying parser of choice, does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to “fix” it.
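As a small illustration of what "fixing" means here, the sketch below parses a fragment with unclosed tags and serializes the resulting tree back to a string. The helper name repair_html and the sample markup are made up for this example; the exact output may differ between parsers, since each one recovers from errors in its own way.

```python
from bs4 import BeautifulSoup


def repair_html(markup, parser="html.parser"):
    """Parse possibly broken markup and serialize the resulting tree."""
    return str(BeautifulSoup(markup, parser))


# The <p> and <b> elements are never closed in the input.
broken = "<div><p>Unclosed paragraph <b>bold text</div>"
fixed = repair_html(broken)

# The serialized tree contains a closing tag for every element it opened.
print(fixed)
```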
Hashes for scrapy-beautifulsoup-0.0.2.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 6cf3158d257bb3d95dc45b8892d35dbf1d356afa4c33d4b1829fb34cdfbbd3be
MD5 | fcf611c65047d783ebbadf80d0718b9f
BLAKE2b-256 | 83533b51bc3dc26e4007241f9bcdb9693501e026192898be39cfe33791db0fff
Hashes for scrapy_beautifulsoup-0.0.2-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 354fb34f6d302768cb2e6380464c3310934af1de673714f3d6c46b8d0f88c3a1
MD5 | 3f522f73be574c5d2088fb9335dc7660
BLAKE2b-256 | 70067c0f6a2f0a595cfa767fb123635f1b347fc23fe60d5f3b94eabc19582520