scrapy-twostage

Use S3 as a cache backend in Scrapy projects.

Project description

Have you ever written a web scraper, only to find out after a long time that there’s some extra data on the pages you should’ve been scraping all along?

Or a change on a website means your scraper stops working, and you lose days or weeks of data until you can find the time to fix it?

This library aims to solve both problems by splitting a Scrapy scraper into two asynchronous stages:

Download stage - The website is crawled, and the pages to be scraped are downloaded and saved to disk.
Extract stage - The pages to be scraped are loaded from disk. The desired data is extracted from the pages and exported (e.g. to a file or database).

The crawler logic for the download stage should be kept as simple as possible. It would typically open a known URL and perform very simple actions such as clicking a “next page” button or submitting a search query. This reduces the risk of the downloader breaking if there are minor changes made to the website.

And since all of the raw data is being saved, if you ever decide to change your extractor logic, you can simply re-run the extractor on all of the data that has been downloaded.
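
The two stages can be sketched with the Python standard library alone. This is only an illustration of the idea under stated assumptions, not the scrapy-twostage API (its usage is not documented yet). The names `save_page`, `extract` and `TagTextParser`, the example URLs, and the one-file-per-URL layout on disk are all hypothetical.

```python
import hashlib
import tempfile
from html.parser import HTMLParser
from pathlib import Path


def save_page(store: Path, url: str, body: bytes) -> Path:
    """Download stage (hypothetical helper): save the raw page body to disk.

    The file name is the SHA-256 of the URL, an assumed layout chosen only
    so that each URL maps to one stable file.
    """
    store.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = store / name
    path.write_bytes(body)
    return path


class TagTextParser(HTMLParser):
    """Collect the text inside each occurrence of one tag.

    Assumes the tag is never nested inside itself.
    """

    def __init__(self, tag: str) -> None:
        super().__init__()
        self.tag = tag
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data


def extract(store: Path, tag: str) -> list:
    """Extract stage (hypothetical helper): re-read every saved page.

    No network access happens here; only files written by save_page are used.
    """
    results = []
    for path in sorted(store.glob("*.html")):
        parser = TagTextParser(tag)
        parser.feed(path.read_bytes().decode("utf-8", errors="replace"))
        parser.close()
        results.extend(text.strip() for text in parser.texts)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "pages"
        # Stand-ins for pages fetched by the crawler.
        save_page(store, "http://example.com/1", b"<h1>First</h1><p>10</p>")
        save_page(store, "http://example.com/2", b"<h1>Second</h1><p>20</p>")
        # First version of the extractor: headings only.
        print(extract(store, "h1"))
        # Later change of mind: re-run a new extractor on the same files.
        print(extract(store, "p"))
```

Because the raw pages stay on disk, calling `extract` again with different logic re-processes everything already downloaded without contacting the website.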

Installation

Downloading and installing from PyPI

To install using pip:

$ pip install scrapy-twostage

Or to install using easy_install:

$ easy_install scrapy-twostage

Downloading and installing from source

Download the latest version of scrapy-twostage from http://pypi.python.org/pypi/scrapy-twostage/.

You can install it by doing the following:

$ tar xvfz scrapy-twostage-0.0.0.tar.gz
$ cd scrapy-twostage-0.0.0
# python setup.py install # as root

Using the development version

You can clone the git repository by doing the following:

$ git clone https://github.com/acordiner/scrapy-twostage.git

Using scrapy-twostage

Coming soon…

Bug tracker

If you have any suggestions, bug reports or annoyances please report them at http://github.com/acordiner/scrapy-twostage/issues/

License

This software is licensed under the GPL v2 License. See the LICENSE file in the top distribution directory for the full license text.

Release history

0.0.4 - Mar 20, 2017
0.0.3 - Mar 19, 2017
0.0.2 - Mar 19, 2017 (this version)
0.0.1 - Mar 19, 2017

Download files


Source Distribution

scrapy-twostage-0.0.2.tar.gz (4.3 kB)

Uploaded Mar 19, 2017 Source

File details

Details for the file scrapy-twostage-0.0.2.tar.gz.

File metadata

Download URL: scrapy-twostage-0.0.2.tar.gz
Upload date: Mar 19, 2017
Size: 4.3 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for scrapy-twostage-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`6be65becaa4bd23b56cfbbfc8d3513ca2d31fe32985dca69dfe4d696b10102f5`
MD5	`5a7096db3f1e0ae58ebd144d2a3ae848`
BLAKE2b-256	`7eae7fad8050d300c00da26a51c66976fceaa5ba61eb48b7ebc129e1c21d0839`
