Restrict crawl and scraping scope using matchers.

Project description

How can I scrape items off a site from the last five days?

—Scrapy User

That question started the development of scrapy-mosquitera, a tool to help you restrict crawling and scraping scope using matchers.

Matchers are simple Python functions that report whether an element is valid under given restrictions.

The project's first goal was date matching, but you can write your own matchers for your own crawling and scraping needs.
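
As an illustration of what a custom matcher could look like, here is a minimal sketch of a price matcher. The name price_matches and its min_price and max_price parameters are hypothetical and not part of scrapy-mosquitera; the only idea borrowed from the library is that a matcher takes scraped data plus restrictions and returns True or False.

```python
def price_matches(data, min_price=None, max_price=None):
    """Return True if the scraped price lies within the given bounds.

    Hypothetical matcher for illustration; not shipped with scrapy-mosquitera.
    """
    try:
        # Accept numbers or strings such as "$1,299.00".
        price = float(str(data).strip().lstrip('$').replace(',', ''))
    except ValueError:
        return False  # unparseable input never matches
    if min_price is not None and price < min_price:
        return False
    if max_price is not None and price > max_price:
        return False
    return True
```

A spider could then call it the same way as the date matcher, for example `if price_matches(data=price_text, max_price=50): ...`.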

How it works

When the dates are available in the URLs, you can use the matcher function directly in your code:

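from scrapy import Request
from scrapy_mosquitera.matchers import date_matches

# scrape_date_from_url is a placeholder for your own date-extraction helper.
date = scrape_date_from_url(url)

if date_matches(data=date, after='5 days ago'):
    yield Request(url=url, callback=self.parse_item)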
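
The scrape_date_from_url helper is left undefined in the snippet above. A minimal sketch of one possible implementation follows. It assumes URLs that embed the date as /YYYY/MM/DD path segments, which is an assumption about the target site rather than anything the library requires. It returns a datetime.date; check the library documentation for the input types date_matches accepts.

```python
import re
from datetime import date

# Assumed URL layout: .../2016/05/19/some-title (hypothetical, site-specific).
DATE_IN_URL = re.compile(r'/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)')


def scrape_date_from_url(url):
    """Return the date embedded in the URL path, or None if there is none."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits were present but do not form a real calendar date.
        return None
```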

When the date is only available once you scrape the items, scrapy-mosquitera provides a PaginationMixin to control the crawl according to the scraped dates.
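
The idea behind date-controlled pagination can be sketched in plain Python. This is not the PaginationMixin API, which is covered in the library documentation. The function crawl_pages, the fetch_page callable and the item layout are all hypothetical, and the sketch assumes the listing is sorted newest first, so a page without any matching item means later pages will not match either.

```python
from datetime import date


def crawl_pages(fetch_page, cutoff):
    """Yield items dated on or after `cutoff`.

    Keeps requesting the next page only while the current page
    still contains at least one matching item.
    """
    page_number = 1
    while True:
        items = fetch_page(page_number)  # hypothetical: returns a list of dicts
        matching = [item for item in items if item['date'] >= cutoff]
        yield from matching
        if not matching:
            break  # nothing recent on this page, so stop paginating
        page_number += 1
```

The stopping rule is the design choice that matters here: the crawl ends on the first page with no match instead of walking the whole archive.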

Head over to the rest of the documentation for more details.

Installation

The quick way:

pip install scrapy-mosquitera

Download files

Filename                                      Size     File type  Python version  Upload date
scrapy_mosquitera-0.1.1-py2.py3-none-any.whl  8.7 kB   Wheel      py2.py3         May 19, 2016
scrapy-mosquitera-0.1.1.tar.gz                18.4 kB  Source     None            May 19, 2016