Page Object pattern for Scrapy

scrapy-poet implements the Page Object pattern for Scrapy.

The license is BSD 3-clause.

Installation

pip install scrapy-poet

scrapy-poet requires Python >= 3.6 and Scrapy 2.1.0+.

Usage

First, enable the middleware in your settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_poet.InjectionMiddleware': 543,
}

After that, you can write spiders that use the Page Object pattern to separate extraction code from the spider:

import scrapy
from web_poet.pages import WebPage


class BookPage(WebPage):
    def to_item(self):
        return {
            'url': self.url,
            'name': self.css("title::text").get(),
        }


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for url in response.css('.image_container a::attr(href)').getall():
            yield response.follow(url, self.parse_book)

    def parse_book(self, response, book_page: BookPage):
        yield book_page.to_item()
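Callbacks like parse_book above, which only forward the page object's item, are common enough that scrapy-poet ships a callback_for shortcut (mentioned in the changelog below). A simplified, dependency-free sketch of the idea, not the actual implementation:

```python
# Simplified sketch of the callback_for idea: given a page object class,
# build a spider callback that just yields the page object's item.
def callback_for(page_cls):
    def parse(self, response, page):
        # In scrapy-poet, `page` is an instance of `page_cls`, built from
        # the response and injected by InjectionMiddleware.
        yield page.to_item()
    return parse


class BookPage:
    """Stand-in page object with a hard-coded item, for illustration."""
    def to_item(self):
        return {"name": "A Light in the Attic"}


parse_book = callback_for(BookPage)
items = list(parse_book(None, None, BookPage()))
assert items == [{"name": "A Light in the Attic"}]
```

With the real helper, assigning ``parse_book = callback_for(BookPage)`` inside the spider class replaces the hand-written parse_book method above.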

TODO: document motivation, the rest of the features, provide more usage examples, explain shortcuts, etc. For now, please check spiders in “example” folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders

Contributing

Use tox to run tests with different Python versions:

tox

The command above also runs type checks; we use mypy.

Changes

0.0.2 (2020-04-28)

The repository has been renamed to scrapy-poet and split in two:

  • web-poet (https://github.com/scrapinghub/web-poet) contains definitions and code useful for writing Page Objects for web data extraction - it is not tied to Scrapy;

  • scrapy-poet (this package) provides Scrapy integration for such Page Objects.

The API of the library changed in a backwards-incompatible way; see the README and examples.

New features:

  • The DummyResponse annotation allows skipping the download of a Scrapy Response.

  • callback_for works with Scrapy disk queues when it is used to create a spider method (but not in its inline form).

  • Page objects may require page objects as dependencies; dependencies are resolved recursively and built as needed.

  • InjectionMiddleware supports async def and asyncio providers.
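The recursive dependency resolution mentioned above can be illustrated with a minimal sketch (hypothetical classes, not scrapy-poet's actual injection code): page objects declare their dependencies as annotated constructor arguments, and the builder walks the annotations recursively, constructing each dependency before the object that needs it.

```python
import inspect

# Hypothetical stand-ins for illustration; not scrapy-poet's real classes.
class Response:
    def __init__(self, url):
        self.url = url

class BreadcrumbsPage:
    def __init__(self, response: Response):
        self.response = response

class BookPage:
    # A page object may itself require another page object.
    def __init__(self, response: Response, breadcrumbs: BreadcrumbsPage):
        self.response = response
        self.breadcrumbs = breadcrumbs

def build(cls, response):
    """Resolve constructor annotations recursively and build `cls`."""
    if cls is Response:
        return response
    kwargs = {
        name: build(param.annotation, response)
        for name, param in inspect.signature(cls).parameters.items()
    }
    return cls(**kwargs)

page = build(BookPage, Response("http://books.toscrape.com/"))
assert page.breadcrumbs.response.url == "http://books.toscrape.com/"
```

Here BookPage needs a BreadcrumbsPage, which in turn needs the Response, so the builder constructs the chain bottom-up, much like the "resolved recursively and built as needed" behavior described above.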

0.0.1 (2019-08-28)

Initial release.

