Scrapy contrib for Airflow
Installation
pip install airscrapy
Airflow Operator
This operator runs Scrapy inside the Airflow worker process by invoking the Scrapy engine directly, so no separate process is needed.
Example
If the spider is structured as follows:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        yield {
            'text': response.css('.info').extract_first()
        }
Here’s how you can create a DAG using the operator:
import os

from airflow import DAG
from airscrapy import ScrapyOperator
from myscrapers.spiders.example import ExampleSpider

with DAG(
    dag_id="scrapers",
    # Extra settings, such as credentials or tokens, go here
    params={
        "extra_settings": {
            "CONCURRENT_REQUESTS": 2,
        }
    },
) as dag:
    # Point Scrapy at the shared settings module
    os.environ["SCRAPY_SETTINGS_MODULE"] = "myscrapers.settings"
    task = ScrapyOperator(spider=ExampleSpider)

if __name__ == "__main__":
    dag.test()
The extra_settings parameter supplies values at run time, such as
credentials or tokens, that complement the settings.py file.
Also make sure the SCRAPY_SETTINGS_MODULE environment variable is set.
Without it, Scrapy cannot locate the project settings.
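One way to picture how extra_settings complements settings.py is as a dictionary merge in which the per-DAG values win. The sketch below uses plain dictionaries and the standard library only. The helpers `merge_settings` and `configure_settings_module` are hypothetical names chosen for illustration, not part of airscrapy, and the operator's actual merge logic may differ.

```python
import os


def configure_settings_module(module_path="myscrapers.settings"):
    """Point Scrapy at the project settings unless already configured."""
    # setdefault keeps any value already exported in the environment.
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", module_path)
    return os.environ["SCRAPY_SETTINGS_MODULE"]


def merge_settings(base, extra):
    """Return a new dict where keys from `extra` override keys from `base`."""
    merged = dict(base)
    merged.update(extra or {})
    return merged


# Stand-in for values that would normally live in settings.py (assumed).
base_settings = {"CONCURRENT_REQUESTS": 16, "ROBOTSTXT_OBEY": True}
# Stand-in for params["extra_settings"] from the DAG definition.
extra_settings = {"CONCURRENT_REQUESTS": 2}

effective = merge_settings(base_settings, extra_settings)
print(effective)
```

Here the DAG-level CONCURRENT_REQUESTS replaces the project-wide value, while settings that appear only in the base dictionary are left untouched.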
The DAG directory is organized as follows:
dags
|- myscrapers
|  |- spiders
|  |  |- __init__.py
|  |  |- example.py
|  |- __init__.py
|  |- items.py
|  |- middlewares.py
|  |- pipelines.py
|  |- settings.py
|- mydag.py
|- scrapy.cfg
This structure enables us to run the DAG in local debugging mode:
python mydag.py
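Before running the DAG locally, it can help to confirm that the directory layout described above is in place, since a missing __init__.py or settings.py usually surfaces as an import error. The following is a minimal sketch using only the standard library. The helper name `missing_project_files` and the choice of which files to check are assumptions made for illustration, not part of airscrapy.

```python
from pathlib import Path

# Files the example layout relies on, relative to the dags directory.
# This list is an assumption based on the tree shown in this README.
EXPECTED_FILES = [
    "scrapy.cfg",
    "mydag.py",
    "myscrapers/__init__.py",
    "myscrapers/settings.py",
    "myscrapers/spiders/__init__.py",
    "myscrapers/spiders/example.py",
]


def missing_project_files(dags_dir):
    """Return the expected files that do not exist under `dags_dir`."""
    root = Path(dags_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_project_files("dags")` returns an empty list when every expected file is present, and otherwise lists the relative paths that still need to be created.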
Build for publishing
Install dependencies:
pip install build twine
Build the package:
python -m build --outdir dist
Publish to PyPI:
python -m twine upload dist/*