Scrapy contrib for Airflow

Installation

pip install airscrapy

Airflow Operator

This operator runs Scrapy inside the Airflow worker process by invoking the Scrapy engine directly, so no separate process is needed.
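
To make the in-process idea concrete, here is a minimal sketch of running a spider inside the current Python process with Scrapy's `CrawlerProcess`. It is an illustration only, not airscrapy's actual implementation, which this page does not show; the helper name `run_spider_in_process` is hypothetical.

```python
from typing import Optional


def run_spider_in_process(spider_cls, settings: Optional[dict] = None) -> None:
    """Run a spider in the current process (hypothetical helper, not airscrapy's API)."""
    # Imported lazily so this module can be imported even where Scrapy is absent.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=settings or {})
    process.crawl(spider_cls)  # schedule the spider
    process.start()            # blocks until the crawl finishes
```

Compared with shelling out to `scrapy crawl example`, no child process is spawned. The trade-off is that the Twisted reactor started by `process.start()` cannot be restarted within the same process.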

Example

If the spider is structured as follows:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        yield {
            'text': response.css('.info').extract_first()
        }

Here’s how you can create a DAG using the operator:

from airflow import DAG
from airscrapy import ScrapyOperator
from myscrapers.spiders.example import ExampleSpider
import os

with DAG(
    dag_id="scrapers",
    # Extra settings, such as credentials or tokens
    params={
        "extra_settings": {
            "CONCURRENT_REQUESTS": 2,
        }
    },
) as dag:
    # Point Scrapy at the shared settings module
    os.environ["SCRAPY_SETTINGS_MODULE"] = "myscrapers.settings"

    task = ScrapyOperator(spider=ExampleSpider)

if __name__ == "__main__":
    dag.test()

The extra_settings parameter supplies values at run time, such as credentials or tokens, that complement those defined in the settings.py file.
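
A minimal sketch of the layering this implies, assuming (this page does not confirm it) that values in extra_settings take precedence over those loaded from settings.py. The helper `merge_settings`, the `API_TOKEN` key, and the `EXAMPLE_API_TOKEN` environment variable are hypothetical names used only for illustration.

```python
import os


def merge_settings(base: dict, extra: dict) -> dict:
    """Return a new dict in which keys from `extra` override keys from `base`."""
    merged = dict(base)
    merged.update(extra)
    return merged


# Stand-in for values that would normally come from myscrapers/settings.py.
base_settings = {"CONCURRENT_REQUESTS": 16, "ROBOTSTXT_OBEY": True}

# Run-time values, e.g. a token read from the environment (hypothetical names).
extra_settings = {
    "CONCURRENT_REQUESTS": 2,
    "API_TOKEN": os.environ.get("EXAMPLE_API_TOKEN", ""),
}

effective = merge_settings(base_settings, extra_settings)
```

Supplying secrets this way keeps them out of settings.py, so they need not be committed alongside the spiders.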

Additionally, ensure you set the SCRAPY_SETTINGS_MODULE environment variable. Without it, Scrapy won't be able to locate the settings.
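
For intuition, the sketch below mimics how a dotted module path stored in an environment variable can be resolved into a settings mapping: import the module, then collect its upper-case attributes. It uses only the standard library and a throwaway in-memory module named `demo_settings` (hypothetical); Scrapy's own loader is more involved, and raising an error when the variable is missing is a choice made for this sketch.

```python
import importlib
import os
import sys
import types

# Register a throwaway module so the example runs without a real project.
demo = types.ModuleType("demo_settings")
demo.BOT_NAME = "myscrapers"
demo.CONCURRENT_REQUESTS = 2
demo.helper = "ignored because the name is not upper-case"
sys.modules["demo_settings"] = demo


def load_settings(env_var: str = "SCRAPY_SETTINGS_MODULE") -> dict:
    """Import the module named by `env_var` and return its upper-case attributes."""
    module_path = os.environ.get(env_var)
    if not module_path:
        # This sketch fails loudly when the variable is missing.
        raise RuntimeError(f"{env_var} is not set; no settings module to load")
    module = importlib.import_module(module_path)
    return {name: getattr(module, name) for name in dir(module) if name.isupper()}


os.environ["SCRAPY_SETTINGS_MODULE"] = "demo_settings"
settings = load_settings()
```

In the DAG above, the value "myscrapers.settings" plays the role that "demo_settings" plays here, which is why the myscrapers package must be importable from the DAG directory.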

The DAG directory is organized as follows:

dags
|- myscrapers
   |- spiders
      |- __init__.py
      |- example.py
   |- __init__.py
   |- items.py
   |- middlewares.py
   |- pipelines.py
   |- settings.py
|- mydag.py
|- scrapy.cfg

This structure enables us to run the DAG in local debugging mode:

python mydag.py
