
Scrapy-link-filter


Spider Middleware that allows a Scrapy Spider to filter requests. Similar functionality already exists in CrawlSpider (through its Rules) and in the RobotsTxtMiddleware, but with a twist: this middleware allows defining rules dynamically, per spider, per job, or per request.

Install

This project requires Python 3.6+ and pip. Using a virtual environment is strongly encouraged.

$ pip install git+https://github.com/croqaz/scrapy-link-filter

Usage

For the middleware to be enabled as a Spider Middleware, it must be added to the project's settings.py:

SPIDER_MIDDLEWARES = {
    # maybe other Spider Middlewares ...
    # can go after DepthMiddleware: 900
    'scrapy_link_filter.middleware.LinkFilterMiddleware': 950,
}

Or, it can be enabled as a Downloader Middleware, in the project settings.py:

DOWNLOADER_MIDDLEWARES = {
    # maybe other Downloader Middlewares ...
    # can go before RobotsTxtMiddleware: 100
    'scrapy_link_filter.middleware.LinkFilterMiddleware': 50,
}

The rules must be defined either in the spider instance, in a spider.extract_rules dict, or per request, in request.meta['extract_rules']. Internally, the extract_rules dict is converted into a LinkExtractor, which is used to match the requests.
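Both placements carry the same dict shape. A minimal sketch of the two options (plain Python for illustration; the class is a stand-in and would subclass scrapy.Spider in a real project):

```python
# Spider-level placement: an `extract_rules` attribute on the spider.
# Stand-in class; a real spider would subclass scrapy.Spider.
class MySpider:
    name = "my_spider"
    extract_rules = {"allow_domains": "example.com", "allow": "/en/items/"}

# Per-request placement: the same dict travels in the request's meta.
request_meta = {"extract_rules": {"deny": ["/logout"]}}
```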

Example of a specific allow filter:

extract_rules = {"allow_domains": "example.com", "allow": "/en/items/"}

Or a specific deny filter:

extract_rules = {
    "deny_domains": ["whatever.com", "ignore.me"],
    "deny": ["/privacy-policy/?$", "/about-?(us)?$"]
}
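The deny values are ordinary Python regular expressions searched against the request URL. As a rough illustration (the `is_denied` helper below is hypothetical; the real matching is delegated to Scrapy's LinkExtractor):

```python
import re

# Hypothetical helper: a URL is denied if any deny pattern matches it.
def is_denied(url: str, deny_patterns) -> bool:
    return any(re.search(pattern, url) for pattern in deny_patterns)

deny = ["/privacy-policy/?$", "/about-?(us)?$"]

print(is_denied("https://example.com/privacy-policy/", deny))  # True
print(is_denied("https://example.com/about-us", deny))         # True
print(is_denied("https://example.com/en/items/1", deny))       # False
```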

The allowed fields are:

  • allow_domains and deny_domains - one or more domains to specifically limit the crawl to, or to specifically reject
  • allow and deny - one or more sub-strings or regular-expression patterns to specifically allow, or to specifically reject
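To make the combined semantics concrete, here is a simplified, stdlib-only sketch of how such a rules dict could be evaluated against a URL. This is an approximation for illustration only; the middleware actually builds a scrapy LinkExtractor from extract_rules, and the `url_allowed` helper below is hypothetical:

```python
import re
from urllib.parse import urlparse

def _as_list(value):
    """Accept a single string or a list, since extract_rules allows both."""
    return [value] if isinstance(value, str) else list(value or [])

def url_allowed(url: str, rules: dict) -> bool:
    """Hypothetical approximation of LinkExtractor-style matching."""
    host = urlparse(url).netloc
    allow_domains = _as_list(rules.get("allow_domains"))
    deny_domains = _as_list(rules.get("deny_domains"))
    allow = _as_list(rules.get("allow"))
    deny = _as_list(rules.get("deny"))

    def in_domain(domains):
        return any(host == d or host.endswith("." + d) for d in domains)

    if allow_domains and not in_domain(allow_domains):
        return False
    if in_domain(deny_domains):
        return False
    if any(re.search(p, url) for p in deny):
        return False
    if allow and not any(re.search(p, url) for p in allow):
        return False
    return True

rules = {"allow_domains": "example.com", "allow": "/en/items/"}
print(url_allowed("https://example.com/en/items/42", rules))  # True
print(url_allowed("https://example.com/fr/items/42", rules))  # False
print(url_allowed("https://other.org/en/items/42", rules))    # False
```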

License

BSD3 © Cristi Constantin.

