
Restrict crawl and scraping scope using matchers.

Project description


How can I scrape items off a site from the last five days?

—Scrapy User

That question started the development of scrapy-mosquitera, a tool to help you restrict crawling and scraping scope using matchers.

Matchers are simple Python functions that report whether an element is valid under certain restrictions.

The project's first goal was date matching, but you can create your own matchers for your own crawling and scraping needs.
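As an illustration of what a custom matcher might look like, here is a minimal sketch of a price matcher. The name `price_matches` and its `above` and `below` parameters are hypothetical and are not part of scrapy-mosquitera; the only convention borrowed from the project is that a matcher takes a value plus some restrictions and returns whether the value is valid.

```python
def price_matches(data, above=None, below=None):
    """Hypothetical matcher: True if `data` satisfies every given bound.

    `data` is a number or a numeric string such as "19.99".
    Values that cannot be parsed never match.
    """
    try:
        price = float(data)
    except (TypeError, ValueError):
        return False
    # Each bound is exclusive and only checked when it was provided.
    if above is not None and price <= above:
        return False
    if below is not None and price >= below:
        return False
    return True
```

A spider could call such a function the same way as a date matcher, for example `if price_matches(data=price, below=20): ...`.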

How it works

When the dates are available in the URLs, you can use the matcher function directly in your code:

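from scrapy import Request
from scrapy_mosquitera.matchers import date_matches

# scrape_date_from_url stands for your own date-extraction helper
date = scrape_date_from_url(url)

# Follow only the URLs dated within the last five days
if date_matches(data=date, after='5 days ago'):
    yield Request(url=url, callback=self.parse_item)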
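The helper `scrape_date_from_url` in the snippet above is left undefined. One possible implementation is sketched below, assuming the site embeds dates in its URLs as `/YYYY/MM/DD/`; that layout, the regular expression, and the `is_within_last_days` function are assumptions made for illustration. `is_within_last_days` is a standard-library stand-in for an "after N days ago" check so the sketch runs without the package installed; it is not the library's matcher.

```python
import re
from datetime import date, timedelta

# Assumed URL layout: .../YYYY/MM/DD/... (adjust to the target site).
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})(?:/|$)")


def scrape_date_from_url(url):
    """Return the date embedded in `url`, or None if there is none."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but are not a real calendar date.
        return None


def is_within_last_days(day, days, today=None):
    """Stdlib stand-in: True if `day` falls in the last `days` days."""
    if day is None:
        return False
    today = today or date.today()
    return today - timedelta(days=days) <= day <= today
```

Passing `today` explicitly keeps the check deterministic in tests; in a spider it would normally be left at its default.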

When the date only becomes available once you scrape the items, scrapy-mosquitera provides a PaginationMixin to control the crawl according to the scraped dates.
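The idea behind date-driven pagination can be sketched without the library: keep requesting listing pages while they still yield items inside the date window, and stop at the first page that yields none. The sketch below is plain Python and does not show PaginationMixin's actual API; `crawl_pages`, `fetch_page`, and the item layout are hypothetical, and it assumes listings are ordered newest first.

```python
from datetime import date, timedelta


def crawl_pages(fetch_page, after, max_pages=100):
    """Yield items dated on or after `after`, stopping pagination early.

    `fetch_page(n)` is a hypothetical callable returning a list of
    dicts with a "date" key for listing page `n` (empty when exhausted).
    """
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        recent = [item for item in items if item["date"] >= after]
        for item in recent:
            yield item
        if not recent:
            # No item on this page is in range; with newest-first
            # ordering, later pages are older still, so stop here.
            break


# Example with fake in-memory pages instead of real HTTP requests.
fake_pages = {
    1: [{"date": date(2016, 7, 24)}, {"date": date(2016, 7, 22)}],
    2: [{"date": date(2016, 7, 20)}, {"date": date(2016, 7, 10)}],
    3: [{"date": date(2016, 7, 1)}],
    4: [{"date": date(2016, 6, 15)}],
}
requested = []


def fetch_fake_page(page_number):
    requested.append(page_number)  # record which pages were fetched
    return fake_pages.get(page_number, [])


cutoff = date(2016, 7, 24) - timedelta(days=5)
recent_items = list(crawl_pages(fetch_fake_page, after=cutoff))
```

In this example the crawl stops after the third page, so the fourth page is never requested.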

Head over to the rest of the documentation for more details.

Installation

The quick way:

pip install scrapy-mosquitera

Download files


Source Distribution

scrapy-mosquitera-0.1.1.tar.gz (18.4 kB)

Uploaded Source

Built Distribution

scrapy_mosquitera-0.1.1-py2.py3-none-any.whl (8.7 kB)

Uploaded Python 2 Python 3

File details

Details for the file scrapy-mosquitera-0.1.1.tar.gz.

File hashes

Hashes for scrapy-mosquitera-0.1.1.tar.gz

Algorithm      Hash digest
SHA256         2ba3752240999a9111851b0cd0e4d31e3f073cbd241bd7afcc64db420d0b62b7
MD5            e7d52f82e90ad06f0b882db4c1d9db1a
BLAKE2b-256    0b6d4edc4532bc7181299cbee894b460d44b0b26d57ce09fce637077683735ad


File details

Details for the file scrapy_mosquitera-0.1.1-py2.py3-none-any.whl.

File hashes

Hashes for scrapy_mosquitera-0.1.1-py2.py3-none-any.whl

Algorithm      Hash digest
SHA256         92472f527dfb33efcc6733641de622c0537b71ee89a14111fb651c8f6c4d2a70
MD5            d8201af7533690b9db7bb70ceb3b1e8f
BLAKE2b-256    b079d188e5de92c8699480fa464867982c50d0728408e54be275c14524a1aec3

