
Scrapy HTML5 support


Parsel H5

Scrapy integration for the html5ever and lexbor HTML parsers.

This package provides a Scrapy Downloader Middleware that replaces the default lxml-based HTML parsing with an HTML5-compliant one.

Why html5ever?

  • Better HTML5 compliance: Parses HTML the way browsers do
  • Handles malformed HTML gracefully: More forgiving with real-world HTML
  • As fast as Parsel: Rust-based parser with Python bindings (markupever)

Why Lexbor?

  • Fastest HTML5 parser: C-based parser with Python bindings (selectolax)
  • Better HTML5 compliance: Parses HTML the way browsers do
  • Handles malformed HTML gracefully: More forgiving with real-world HTML

Installation

pip install scrapy-h5

Or with uv:

uv add scrapy-h5

Quick start

1. Enable the middleware in your Scrapy project

Add to your settings.py:

DOWNLOADER_MIDDLEWARES = {
    # Priority must be lower than HttpCompressionMiddleware (590) so that
    # process_response runs after the response body has been decompressed
    'scrapy_h5.HtmlFiveResponseMiddleware': 45,
}

# Optional: disable HTML5 parsing globally (the lexbor backend is used by default)
# SCRAPY_H5_BACKEND = False

2. Use in your spider

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # CSS selectors work as expected
        titles = response.css('h1::text').getall()

        # Attribute extraction
        links = response.css('a::attr(href)').getall()

        # Chained selectors
        for item in response.css('div.product'):
            yield {
                'name': item.css('h2::text').get(),
                'price': item.css('.price::text').get(),
                'url': item.css('a::attr(href)').get(),
            }

3. Using with CrawlSpider

from scrapy.spiders import CrawlSpider, Rule
from scrapy_h5 import LinkExtractor


class MyCrawlSpider(CrawlSpider):
    name = 'mycrawler'
    start_urls = ['https://example.com']

    # Use HTML5 link extractor with rules
    rules = (
        Rule(LinkExtractor(allow=r'/products/'), callback='parse_product', follow=True),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }

XPath and JMESPath support

XPath and JMESPath selectors are not supported. Use CSS selectors instead.
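Since only CSS selectors are available, XPath habits need translating. The mapping below is illustrative, not exhaustive; the CSS forms use Scrapy's ::text and ::attr() pseudo-element extensions, which the examples elsewhere in this document already rely on:

```python
# Illustrative XPath-to-CSS translations for common extraction patterns.
XPATH_TO_CSS = {
    '//h1/text()': 'h1::text',
    '//a/@href': 'a::attr(href)',
    '//div[@class="product"]': 'div.product',
    '//div[@id="main"]//p/text()': 'div#main p::text',
}

for xpath, css in XPATH_TO_CSS.items():
    print(f'{xpath:30} -> {css}')
```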

Per-request control

You can change or disable the HTML5 backend per request using the scrapy_h5_backend meta key:

def start_requests(self):
    # Uses the global HTML5 backend (lexbor by default)
    yield scrapy.Request(url, callback=self.parse)

    # Disable html5 for this request (use lxml instead)
    yield scrapy.Request(
        url2,
        callback=self.parse_legacy,
        meta={'scrapy_h5_backend': False}
    )


def parse_with_html5(self, response):
    # Force html5 even if SCRAPY_H5_BACKEND=False
    yield scrapy.Request(
        url,
        callback=self.parse,
        meta={'scrapy_h5_backend': 'html5ever'}
    )

API reference

Classes

  • HtmlFiveSelector: Selector class wrapping html5ever and lexbor elements
  • HtmlFiveSelectorList: List of selectors with bulk operations
  • HtmlFiveResponse: Response class with html5-based selector
  • HtmlFiveResponseMiddleware: Scrapy Downloader Middleware that replaces HtmlResponse with HtmlFiveResponse
  • LinkExtractor: Link extractor using HTML5 parsers (lexbor or html5ever)

Important: The LinkExtractor only works with HtmlFiveResponse. Enable the middleware to automatically convert all HTML responses to HtmlFiveResponse.

Exceptions

  • XPathConversionError: Raised when an XPath expression cannot be converted to CSS
  • HtmlFiveParseError: Raised when HTML parsing fails
  • HtmlFiveSelectorError: Base exception for selector errors
  • HtmlFiveSelectError: Raised when CSS selection fails
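Because HtmlFiveSelectorError is documented as the base class for selector errors, catching it also covers HtmlFiveSelectError. The sketch below uses hypothetical stand-in classes to illustrate this; in real code you would import the exceptions from scrapy_h5, and only the hierarchy shown here is taken from this document:

```python
# Stand-in classes mirroring the documented hierarchy; in a real project
# these would be imported from scrapy_h5 rather than redefined.
class HtmlFiveSelectorError(Exception):
    """Base exception for selector errors."""

class HtmlFiveSelectError(HtmlFiveSelectorError):
    """Raised when CSS selection fails."""

def select_or_default(run_selection, default=None):
    # Catching the base class handles any selector failure,
    # including the more specific HtmlFiveSelectError.
    try:
        return run_selection()
    except HtmlFiveSelectorError:
        return default

def failing_selection():
    raise HtmlFiveSelectError('bad CSS expression')

print(select_or_default(failing_selection, default='fallback'))  # fallback
```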

Settings

Setting            Default  Description
SCRAPY_H5_BACKEND  lexbor   Global HTML5 backend: 'lexbor' or 'html5ever'
                            selects the parser; False disables HTML5 parsing.

Request meta

Key                Type        Description
scrapy_h5_backend  str | bool  Per-request override: 'lexbor' or 'html5ever'
                               selects the parser; False disables HTML5 parsing.
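The interaction between the global setting and the per-request meta key can be sketched as follows. This is a minimal illustration of the documented precedence (meta overrides the setting, which defaults to lexbor), not the package's actual implementation; the resolve_backend helper and its dict-based settings are assumptions made for the example:

```python
# Hypothetical sketch of the documented precedence: the scrapy_h5_backend
# meta key overrides SCRAPY_H5_BACKEND, which defaults to 'lexbor'.
VALID_BACKENDS = ('lexbor', 'html5ever')

def resolve_backend(meta: dict, settings: dict):
    value = meta.get('scrapy_h5_backend',
                     settings.get('SCRAPY_H5_BACKEND', 'lexbor'))
    if value is False:
        return None  # HTML5 parsing disabled; Scrapy falls back to lxml
    if value in VALID_BACKENDS:
        return value
    raise ValueError(f'unknown backend: {value!r}')

# Meta wins over the global setting:
print(resolve_backend({'scrapy_h5_backend': 'html5ever'},
                      {'SCRAPY_H5_BACKEND': False}))  # html5ever
```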

License

MIT
