
Scrapy HTML5 support


Parsel H5

Scrapy integration for the html5ever and lexbor HTML parsers.

This package provides a Scrapy Downloader Middleware that replaces the default lxml-based HTML parsing with an HTML5-compliant one.

Why html5ever?

  • Better HTML5 compliance: Parses HTML the way browsers do
  • Handles malformed HTML gracefully: More forgiving with real-world HTML
  • As fast as Parsel: Rust-based parser with Python bindings (markupever)

Why Lexbor?

  • Fastest HTML5 parser: C-based parser with Python bindings (selectolax)
  • Better HTML5 compliance: Parses HTML the way browsers do
  • Handles malformed HTML gracefully: More forgiving with real-world HTML

Installation

pip install scrapy-h5

Or with uv:

uv add scrapy-h5

Quick start

1. Enable the middleware in your Scrapy project

Add to your settings.py:

DOWNLOADER_MIDDLEWARES = {
    # Priority must be lower than HttpCompressionMiddleware (590) so that
    # process_response runs after the response body has been decompressed
    'scrapy_h5.HtmlFiveResponseMiddleware': 45,
}

# Optional: disable HTML5 parsing globally (the lexbor backend is used by default)
# SCRAPY_H5_BACKEND = False

2. Use in your spider

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # CSS selectors work as expected
        titles = response.css('h1::text').getall()

        # Attribute extraction
        links = response.css('a::attr(href)').getall()

        # Chained selectors
        for item in response.css('div.product'):
            yield {
                'name': item.css('h2::text').get(),
                'price': item.css('.price::text').get(),
                'url': item.css('a::attr(href)').get(),
            }

3. Using with CrawlSpider

from scrapy.spiders import CrawlSpider, Rule
from scrapy_h5 import LinkExtractor


class MyCrawlSpider(CrawlSpider):
    name = 'mycrawler'
    start_urls = ['https://example.com']

    # Use HTML5 link extractor with rules
    rules = (
        Rule(LinkExtractor(allow=r'/products/'), callback='parse_product', follow=True),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
        }

XPath and JMESPath support

XPath and JMESPath selectors are not supported. Use CSS selectors instead.
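Since only CSS selectors are available, XPath habits need translating. The mapping below is illustrative, not exhaustive; the CSS forms use Scrapy's ::text and ::attr() pseudo-element extensions, which the examples elsewhere in this document already rely on:

```python
# Illustrative XPath-to-CSS translations for common extraction patterns.
XPATH_TO_CSS = {
    '//h1/text()': 'h1::text',
    '//a/@href': 'a::attr(href)',
    '//div[@class="product"]': 'div.product',
    '//div[@id="main"]//p/text()': 'div#main p::text',
}

for xpath, css in XPATH_TO_CSS.items():
    print(f'{xpath:30} -> {css}')
```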

Per-request control

You can change or disable the HTML5 backend per request using the scrapy_h5_backend meta key:

def start_requests(self):
    # Uses the global HTML5 backend (lexbor by default)
    yield scrapy.Request(url, callback=self.parse)

    # Disable html5 for this request (use lxml instead)
    yield scrapy.Request(
        url2,
        callback=self.parse_legacy,
        meta={'scrapy_h5_backend': False}
    )


def parse_with_html5(self, response):
    # Force html5 even if SCRAPY_H5_BACKEND=False
    yield scrapy.Request(
        url,
        callback=self.parse,
        meta={'scrapy_h5_backend': 'html5ever'}
    )

API reference

Classes

  • HtmlFiveSelector: Selector class wrapping html5ever and lexbor elements
  • HtmlFiveSelectorList: List of selectors with bulk operations
  • HtmlFiveResponse: Response class with html5-based selector
  • HtmlFiveResponseMiddleware: Scrapy Downloader Middleware that replaces HtmlResponse with HtmlFiveResponse
  • LinkExtractor: Link extractor using HTML5 parsers (lexbor or html5ever)

Important: The LinkExtractor only works with HtmlFiveResponse. Enable the middleware to automatically convert all HTML responses to HtmlFiveResponse.

Exceptions

  • XPathConversionError: Raised when an XPath expression cannot be converted to CSS
  • HtmlFiveParseError: Raised when HTML parsing fails
  • HtmlFiveSelectorError: Base exception for selector errors
  • HtmlFiveSelectError: Raised when CSS selection fails
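Because HtmlFiveSelectorError is documented as the base class for selector errors, catching it also covers HtmlFiveSelectError. The sketch below uses hypothetical stand-in classes to illustrate this; in real code you would import the exceptions from scrapy_h5, and only the hierarchy shown here is taken from this document:

```python
# Stand-in classes mirroring the documented hierarchy; in a real project
# these would be imported from scrapy_h5 rather than redefined.
class HtmlFiveSelectorError(Exception):
    """Base exception for selector errors."""

class HtmlFiveSelectError(HtmlFiveSelectorError):
    """Raised when CSS selection fails."""

def select_or_default(run_selection, default=None):
    # Catching the base class handles any selector failure,
    # including the more specific HtmlFiveSelectError.
    try:
        return run_selection()
    except HtmlFiveSelectorError:
        return default

def failing_selection():
    raise HtmlFiveSelectError('bad CSS expression')

print(select_or_default(failing_selection, default='fallback'))  # fallback
```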

Settings

Setting            Default  Description
SCRAPY_H5_BACKEND  lexbor   Global HTML5 backend: 'lexbor' or 'html5ever'
                            selects the parser; False disables HTML5 parsing.

Request meta

Key                Type        Description
scrapy_h5_backend  str | bool  Per-request override: 'lexbor' or 'html5ever'
                               selects the parser; False disables HTML5 parsing.
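The interaction between the global setting and the per-request meta key can be sketched as follows. This is a minimal illustration of the documented precedence (meta overrides the setting, which defaults to lexbor), not the package's actual implementation; the resolve_backend helper and its dict-based settings are assumptions made for the example:

```python
# Hypothetical sketch of the documented precedence: the scrapy_h5_backend
# meta key overrides SCRAPY_H5_BACKEND, which defaults to 'lexbor'.
VALID_BACKENDS = ('lexbor', 'html5ever')

def resolve_backend(meta: dict, settings: dict):
    value = meta.get('scrapy_h5_backend',
                     settings.get('SCRAPY_H5_BACKEND', 'lexbor'))
    if value is False:
        return None  # HTML5 parsing disabled; Scrapy falls back to lxml
    if value in VALID_BACKENDS:
        return value
    raise ValueError(f'unknown backend: {value!r}')

# Meta wins over the global setting:
print(resolve_backend({'scrapy_h5_backend': 'html5ever'},
                      {'SCRAPY_H5_BACKEND': False}))  # html5ever
```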

License

MIT
