# Parsel H5

Scrapy integration for the html5ever and lexbor HTML parsers.

This package provides a Scrapy downloader middleware that replaces the default lxml-based HTML parsing with an HTML5-compliant one.
## Why html5ever?

- Better HTML5 compliance: parses HTML the way browsers do
- Handles malformed HTML gracefully: more forgiving with real-world HTML
- As fast as Parsel: Rust-based parser with Python bindings (markupever)
## Why Lexbor?

- Fastest HTML5 parser: C-based parser with Python bindings (selectolax)
- Better HTML5 compliance: parses HTML the way browsers do
- Handles malformed HTML gracefully: more forgiving with real-world HTML
## Installation

```shell
pip install scrapy-h5
```

Or with uv:

```shell
uv add scrapy-h5
```
## Quick start

### 1. Enable the middleware in your Scrapy project

Add to your `settings.py`:

```python
DOWNLOADER_MIDDLEWARES = {
    # Must be above HttpCompressionMiddleware, i.e. closer to the end of
    # response processing (and to the beginning of request processing)
    'scrapy_h5.HtmlFiveResponseMiddleware': 45,
}

# Optional: disable globally (the lexbor backend is enabled by default)
# SCRAPY_H5_BACKEND = None
```
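The middleware can also be enabled for a single spider rather than project-wide, via Scrapy's standard `custom_settings` mechanism. A minimal sketch, reusing the middleware path, priority, and setting name from the example above (assigning the dict to a spider class is shown in the comment):

```python
# Spider-level configuration: assign this dict to a spider's
# custom_settings attribute and Scrapy merges it over settings.py.
H5_SPIDER_SETTINGS = {
    'DOWNLOADER_MIDDLEWARES': {
        # Same priority as the project-wide example above
        'scrapy_h5.HtmlFiveResponseMiddleware': 45,
    },
    # Use the html5ever backend for this spider only
    'SCRAPY_H5_BACKEND': 'html5ever',
}

# In a spider:
# class MySpider(scrapy.Spider):
#     custom_settings = H5_SPIDER_SETTINGS
```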
### 2. Use in your spider

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # CSS selectors work as expected
        titles = response.css('h1::text').getall()

        # Attribute extraction
        links = response.css('a::attr(href)').getall()

        # Chained selectors
        for item in response.css('div.product'):
            yield {
                'name': item.css('h2::text').get(),
                'price': item.css('.price::text').get(),
                'url': item.css('a::attr(href)').get(),
            }
```
### 3. Using with CrawlSpider

```python
from scrapy.spiders import CrawlSpider, Rule

from scrapy_h5 import LinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'mycrawler'
    start_urls = ['https://example.com']

    # Use the HTML5 link extractor with rules
    rules = (
        Rule(LinkExtractor(allow=r'/products/'), callback='parse_product', follow=True),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }
```
## XPath and JMESPath support

XPath and JMESPath selectors are not supported. Use CSS selectors instead.
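Since only CSS selectors are available, common XPath idioms have to be translated. A few rough equivalents (the mapping below is illustrative, not from this package; note that `::text` and `::attr()` are Parsel/Scrapy pseudo-element extensions rather than standard CSS):

```python
# Common XPath expressions and approximate CSS equivalents for response.css().
XPATH_TO_CSS = {
    '//h1/text()': 'h1::text',
    '//a/@href': 'a::attr(href)',
    '//div[@class="product"]': 'div.product',
    '//*[@id="main"]': '#main',
    '//ul/li[1]': 'ul > li:first-child',
}

for xpath, css in XPATH_TO_CSS.items():
    print(f'{xpath:28} -> {css}')
```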
## Per-request control

You can change or disable the HTML5 backend per request using `meta`:

```python
def start_requests(self):
    # HTML5 parsing backend (default)
    yield scrapy.Request(url, callback=self.parse)

    # Disable html5 for this request (use lxml instead)
    yield scrapy.Request(
        url2,
        callback=self.parse_legacy,
        meta={'scrapy_h5_backend': False},
    )

def parse_with_html5(self, response):
    # Force the html5ever backend even if SCRAPY_H5_BACKEND is disabled
    yield scrapy.Request(
        url,
        callback=self.parse,
        meta={'scrapy_h5_backend': 'html5ever'},
    )
```
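One way to keep the per-request choice in a single place is a small helper that returns the `meta` dict for a URL. The routing rule below is a made-up example (the `/legacy/` and `/spa/` path conventions are assumptions, not part of this package):

```python
def backend_meta(url):
    """Return a request meta dict selecting the parser backend.

    Hypothetical policy: legacy pages fall back to lxml, one known
    section forces html5ever, everything else keeps the default
    backend from SCRAPY_H5_BACKEND.
    """
    if '/legacy/' in url:
        return {'scrapy_h5_backend': False}  # lxml fallback
    if '/spa/' in url:
        return {'scrapy_h5_backend': 'html5ever'}
    return {}  # use the global default

# In a spider:
# yield scrapy.Request(url, callback=self.parse, meta=backend_meta(url))
```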
## API reference

### Classes

- `HtmlFiveSelector`: Selector class wrapping html5ever and lexbor elements
- `HtmlFiveSelectorList`: List of selectors with bulk operations
- `HtmlFiveResponse`: Response class with an HTML5-based selector
- `HtmlFiveResponseMiddleware`: Scrapy downloader middleware that replaces `HtmlResponse` with `HtmlFiveResponse`
- `LinkExtractor`: Link extractor using HTML5 parsers (lexbor or html5ever)

**Important:** The `LinkExtractor` only works with `HtmlFiveResponse`. Enable the middleware to automatically convert all HTML responses to `HtmlFiveResponse`.
### Exceptions

- `XPathConversionError`: Raised when an XPath expression cannot be converted to CSS
- `HtmlFiveParseError`: Raised when HTML parsing fails
- `HtmlFiveSelectorError`: Base exception for selector errors
- `HtmlFiveSelectError`: Raised when CSS selection fails
## Settings

| Setting | Default | Description |
|---|---|---|
| `SCRAPY_H5_BACKEND` | `lexbor` | Global HTML5 backend. `lexbor` or `html5ever` enables it; `False` disables it |
## Request meta

| Key | Type | Description |
|---|---|---|
| `scrapy_h5_backend` | `str` or `bool` | Per-request override. `'lexbor'` or `'html5ever'` enables it; `False` disables it |
## License

MIT