Scrapy contrib for Airflow

Installation

pip install airscrapy

Airflow Operator

This operator runs Scrapy inside the Airflow worker process by invoking the Scrapy engine directly, so no separate process is needed.
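
airscrapy's internals are not shown in this README, so the following is only a rough sketch of what running a spider in-process can look like using Scrapy's public `CrawlerProcess` API. The function name `run_spider_in_process` and its `extra_settings` argument are hypothetical and are not part of airscrapy.

```python
def run_spider_in_process(spider_cls, extra_settings=None):
    """Run one spider to completion inside the current process (sketch)."""
    # Imported lazily so that merely parsing this file does not import
    # Scrapy and Twisted; they are only needed when the function runs.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Loads the project settings, honouring SCRAPY_SETTINGS_MODULE if set.
    settings = get_project_settings()
    # Overlay run-time values on top of the project settings.
    settings.setdict(extra_settings or {}, priority="cmdline")

    process = CrawlerProcess(settings)
    process.crawl(spider_cls)
    # Blocks until the crawl finishes. The Twisted reactor cannot be
    # restarted, so this should run at most once per process.
    process.start()
```

Because the reactor cannot be restarted, a sketch like this suits one crawl per task process; whether airscrapy handles this differently is not stated here.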

Example

If the spider is structured as follows:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [ "http://example.com" ]

    def parse(self, response):
        yield {
            'text': response.css('.info').extract_first()
        }

Here’s how you can create a DAG using the operator:

from airflow import DAG
from airscrapy import ScrapyOperator
from myscrapers.spiders.example import ExampleSpider
import os

with DAG(
    dag_id="scrapers",
    # Add extra settings like credentials or token
    params={
        "extra_settings": {
            "CONCURRENT_REQUESTS": 2,
        }
    },
) as dag:
    # Point Scrapy at the shared settings module
    os.environ["SCRAPY_SETTINGS_MODULE"] = "myscrapers.settings"

    task = ScrapyOperator(spider=ExampleSpider)

if __name__ == "__main__":
    dag.test()

The extra_settings parameter supplies values at run time, such as credentials or tokens, that complement the settings.py file.

Additionally, ensure you set the SCRAPY_SETTINGS_MODULE environment variable. Without it, Scrapy won't be able to locate the settings.
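
To make the interplay between SCRAPY_SETTINGS_MODULE and extra_settings concrete, here is a minimal standard-library sketch. It assumes the run-time values are simply layered on top of the upper-case names found in the settings module (Scrapy itself only treats upper-case module attributes as settings). The helper `load_settings` and the stand-in module `fake_scrapers_settings` are hypothetical and only illustrate the idea; they are not airscrapy code.

```python
import importlib
import os
import sys
import types


def load_settings(extra_settings=None):
    """Illustrative only: resolve the settings module, then overlay extras."""
    module_path = os.environ.get("SCRAPY_SETTINGS_MODULE")
    if not module_path:
        raise RuntimeError("SCRAPY_SETTINGS_MODULE is not set")
    module = importlib.import_module(module_path)
    # Only upper-case attributes are treated as settings.
    settings = {k: getattr(module, k) for k in dir(module) if k.isupper()}
    # Run-time values are applied last, so they win on conflicts
    # (an assumption of this sketch).
    settings.update(extra_settings or {})
    return settings


# Hypothetical stand-in for myscrapers/settings.py so the sketch runs alone.
fake = types.ModuleType("fake_scrapers_settings")
fake.BOT_NAME = "myscrapers"
fake.CONCURRENT_REQUESTS = 16
sys.modules["fake_scrapers_settings"] = fake
os.environ["SCRAPY_SETTINGS_MODULE"] = "fake_scrapers_settings"

merged = load_settings({"CONCURRENT_REQUESTS": 2})
print(merged["BOT_NAME"], merged["CONCURRENT_REQUESTS"])
```

In a real project the environment variable would name `myscrapers.settings`, as in the DAG above, and the module would be importable because it sits next to the DAG file.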

The DAG directory is organized as follows:

dags
|- myscrapers
   |- spiders
      |- __init__.py
      |- example.py
   |- __init__.py
   |- items.py
   |- middlewares.py
   |- pipelines.py
   |- settings.py
|- mydag.py
|- scrapy.cfg

This structure enables us to run the DAG in local debugging mode:

python mydag.py

Build for publishing

Install dependencies:

pip install build twine

Build the package:

python -m build --outdir dist

Publish to PyPI:

python -m twine upload dist/*
