Page Object pattern for Scrapy
scrapy-poet implements the Page Object pattern for Scrapy.
License is BSD 3-clause.
Installation
pip install scrapy-poet
scrapy-poet requires Python >= 3.6 and Scrapy 2.1.0+.
Usage
First, enable the middleware in your settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_poet.InjectionMiddleware': 543,
}
After that, you can write spiders that use the Page Object pattern to separate extraction code from the spider:
import scrapy
from web_poet.pages import WebPage


class BookPage(WebPage):
    def to_item(self):
        return {
            'url': self.url,
            'name': self.css("title::text").get(),
        }


class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for url in response.css('.image_container a::attr(href)').getall():
            yield response.follow(url, self.parse_book)

    def parse_book(self, response, book_page: BookPage):
        yield book_page.to_item()
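One benefit of keeping extraction code in a page object is that it can be exercised without running a spider. The sketch below illustrates that idea using only the standard library. SimplePage, SimpleBookPage, and TitleParser are hypothetical stand-ins written for this example, not part of web-poet or scrapy-poet, and the URL and HTML are made-up sample data.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class SimplePage:
    """Hypothetical stand-in for a page object base class: holds URL and HTML."""

    def __init__(self, url, html):
        self.url = url
        self.html = html


class SimpleBookPage(SimplePage):
    """Extraction logic lives here, independent of any spider."""

    def to_item(self):
        parser = TitleParser()
        parser.feed(self.html)
        parser.close()
        return {"url": self.url, "name": parser.title}


# The page object is built from sample data; no crawl is needed.
page = SimpleBookPage(
    url="http://example.com/book.html",
    html="<html><head><title>A Light in the Attic</title></head></html>",
)
item = page.to_item()
```

In a real project the same check would be written against the actual page object class, using whatever constructor the installed web-poet version provides.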
TODO: document the motivation and the rest of the features, provide more usage examples, explain shortcuts, etc. For now, please check the spiders in the “example” folder: https://github.com/scrapinghub/scrapy-poet/tree/master/example/example/spiders
Contributing
Source code: https://github.com/scrapinghub/scrapy-poet
Issue tracker: https://github.com/scrapinghub/scrapy-poet/issues
Use tox to run tests with different Python versions:
tox
The command above also runs type checks; we use mypy.
Changes
0.0.2 (2020-04-28)
The repository was renamed to scrapy-poet and split in two:
- web-poet (https://github.com/scrapinghub/web-poet) contains definitions and code useful for writing Page Objects for web data extraction; it is not tied to Scrapy.
- scrapy-poet (this package) provides Scrapy integration for such Page Objects.
The API of the library changed in a backwards-incompatible way; see the README and examples.
New features:
- The DummyResponse annotation allows skipping the download of the Scrapy Response.
- callback_for works with Scrapy disk queues if it is used to create a spider method (but not in its inline form).
- Page objects may require other page objects as dependencies; dependencies are resolved recursively and built as needed.
- InjectionMiddleware supports async def and asyncio providers.
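As a rough illustration of what recursive, annotation-driven dependency resolution can look like, here is a toy sketch. It is not scrapy-poet's actual implementation: Response, BreadcrumbsPage, ProductPage, and build are hypothetical names invented for this example, and the real middleware also deals with providers and Scrapy requests.

```python
from typing import get_type_hints


class Response:
    """Hypothetical stand-in for a downloaded page (not scrapy.http.Response)."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class BreadcrumbsPage:
    """A page object that needs only the response."""

    def __init__(self, response: Response):
        self.response = response


class ProductPage:
    """A page object that depends on another page object."""

    def __init__(self, response: Response, breadcrumbs: BreadcrumbsPage):
        self.response = response
        self.breadcrumbs = breadcrumbs


def build(cls, response, cache=None):
    """Build `cls`, recursively creating whatever its __init__ annotations ask for."""
    if cache is None:
        # The response is the only object available up front.
        cache = {Response: response}
    if cls in cache:
        # Each dependency is built at most once and then reused.
        return cache[cls]
    hints = get_type_hints(cls.__init__)
    hints.pop("return", None)
    kwargs = {name: build(dep, response, cache) for name, dep in hints.items()}
    instance = cls(**kwargs)
    cache[cls] = instance
    return instance


response = Response("http://example.com/", "<html></html>")
product_page = build(ProductPage, response)
```

Here `build` creates BreadcrumbsPage first because ProductPage declares it as a constructor argument, and both objects share the same response instance.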
0.0.1 (2019-08-28)
Initial release.
Source Distribution
File details
Details for the file scrapy-poet-0.0.2.tar.gz.
File metadata
- Download URL: scrapy-poet-0.0.2.tar.gz
- Upload date:
- Size: 51.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/45.1.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b751e9a796a867a7a42ae2cfec685c6054cbc99cebdea0b0efe6313932a77f48 |
| MD5 | 506e9c55ddecd5dc75171a093be538c0 |
| BLAKE2b-256 | a8dcde7640a90539b3c4b4a08912cbb3f147ff106f395abebcecbb1c3b99b692 |