Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup
scrapy-beautifulsoup
Installation
The package is on PyPI and can be installed with pip:
pip install scrapy-beautifulsoup
Configuration
Add the middleware to the DOWNLOADER_MIDDLEWARES dictionary setting:
DOWNLOADER_MIDDLEWARES = { 'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400 }
By default, BeautifulSoup uses the built-in html.parser parser. To change it, set the BEAUTIFULSOUP_PARSER setting:
BEAUTIFULSOUP_PARSER = "html5lib" # or BEAUTIFULSOUP_PARSER = "lxml"
html5lib is an extremely lenient parser; if the target HTML is seriously broken, consider making it your first choice. Note: html5lib has to be installed in this case:
pip install html5lib
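Because the alternative parsers are separate packages, a settings file can fall back to the built-in parser when they are missing. The following is a minimal sketch, not part of the package: `pick_parser` is a hypothetical helper, and it assumes that the importable module names `html5lib` and `lxml` match the parser names BeautifulSoup expects.

```python
# settings.py (sketch): pick the most lenient parser that is installed.
from importlib.util import find_spec


def pick_parser(preferred=("html5lib", "lxml")):
    """Hypothetical helper: return the first installed parser name.

    Falls back to "html.parser", which ships with Python.
    """
    for name in preferred:
        # find_spec returns None when a top-level module is not importable.
        if find_spec(name) is not None:
            return name
    return "html.parser"


BEAUTIFULSOUP_PARSER = pick_parser()
```

With this in place, the same settings file should work whether or not html5lib is available in the environment.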
Motivation
BeautifulSoup, with the help of an underlying parser of choice, does a pretty good job of handling non-well-formed or broken HTML. In some cases, it makes sense to pipe the HTML through BeautifulSoup to “fix” it before the spider processes it.
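The idea of piping HTML through BeautifulSoup can be illustrated outside Scrapy. This is a minimal sketch, assuming the `bs4` package is installed; `repair_html` is a hypothetical name used for illustration and is not claimed to be the middleware's actual implementation.

```python
from bs4 import BeautifulSoup


def repair_html(markup, parser="html.parser"):
    """Parse possibly broken markup and serialize the repaired tree.

    `markup` may be str or bytes; `parser` is any parser name that
    BeautifulSoup accepts, such as "html.parser", "lxml" or "html5lib".
    """
    return str(BeautifulSoup(markup, parser))


if __name__ == "__main__":
    # The <p> and <b> tags are never closed in the input.
    print(repair_html("<p>unclosed <b>bold"))
```

The exact shape of the repaired markup depends on the parser chosen, so output produced with html.parser may differ from what html5lib or lxml would give for the same input.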
Download files
Source Distribution
Built Distribution
File details
Details for the file scrapy-beautifulsoup-0.0.2.tar.gz.
File metadata
- Download URL: scrapy-beautifulsoup-0.0.2.tar.gz
- Upload date:
- Size: 2.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6cf3158d257bb3d95dc45b8892d35dbf1d356afa4c33d4b1829fb34cdfbbd3be
MD5 | fcf611c65047d783ebbadf80d0718b9f
BLAKE2b-256 | 83533b51bc3dc26e4007241f9bcdb9693501e026192898be39cfe33791db0fff
File details
Details for the file scrapy_beautifulsoup-0.0.2-py2.py3-none-any.whl.
File metadata
- Download URL: scrapy_beautifulsoup-0.0.2-py2.py3-none-any.whl
- Upload date:
- Size: 4.5 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 354fb34f6d302768cb2e6380464c3310934af1de673714f3d6c46b8d0f88c3a1
MD5 | 3f522f73be574c5d2088fb9335dc7660
BLAKE2b-256 | 70067c0f6a2f0a595cfa767fb123635f1b347fc23fe60d5f3b94eabc19582520