Collection of reusable scrapy item pipelines
scrapy-item-pipelines
Various scrapy item pipelines
Installation
pip install scrapy-item-pipelines
PushToKafkaPipeline
Item pipeline that pushes items to Kafka. Items are serialized to JSON and pushed to a configured Kafka topic.
Settings
SL_SCRAPY_ITEM_PIPELINES_SETTINGS = {
    "push_to_kafka_hosts": "localhost:9092",  # Kafka broker hosts, separated by commas.
    "push_to_kafka_default_topic": "",  # Default Kafka topic.
}
Usage
If items should be pushed to a different Kafka topic per item type, the topic can be defined on the item class
via kafka_topic. Likewise, if a message key should be pushed to Kafka, the item field whose value to use can be
named by defining kafka_data_key on the item class. If no kafka_data_key
is defined, messages are pushed without a key.
class DemoItem(scrapy.Item):
kafka_topic = "topic-to-push-items"
kafka_data_key = "another_unique_field"
field_name = scrapy.Field()
another_unique_field = scrapy.Field()
After configuring, add scrapy_item_pipelines.streaming.PushToKafkaPipeline
to the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
...
...
"scrapy_item_pipelines.streaming.PushToKafkaPipeline": 999,
}
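To make the topic and data-key rules above concrete, here is a rough, self-contained sketch of how such a pipeline could resolve the topic and message key per item. This is an illustrative simplification, not the package's actual implementation: the class name is invented, plain dict subclasses stand in for scrapy items, and the Kafka producer is replaced by a list so the logic runs without a broker.

```python
import json


class PushToKafkaPipelineSketch:
    """Illustrative sketch only -- not the package's real implementation.

    Serializes items to JSON and records (topic, key, payload) messages;
    a real pipeline would hand these to a Kafka producer instead.
    """

    def __init__(self, default_topic=""):
        self.default_topic = default_topic
        self.sent = []  # stand-in for a Kafka producer

    def process_item(self, item, spider=None):
        # A per-item `kafka_topic` attribute wins over the configured default.
        topic = getattr(item, "kafka_topic", None) or self.default_topic
        # If `kafka_data_key` names a field, that field's value becomes the
        # message key; otherwise no key is sent.
        key_field = getattr(item, "kafka_data_key", None)
        key = dict(item).get(key_field) if key_field else None
        payload = json.dumps(dict(item)).encode("utf-8")
        self.sent.append((topic, key, payload))
        return item


# Dict-based stand-in for the scrapy.Item class shown above.
class DemoItem(dict):
    kafka_topic = "topic-to-push-items"
    kafka_data_key = "another_unique_field"


pipeline = PushToKafkaPipelineSketch(default_topic="fallback-topic")
pipeline.process_item(DemoItem(field_name="x", another_unique_field="abc-123"))
pipeline.process_item({"plain": True})  # no kafka_topic -> default topic used
```

The first item lands on its class-level topic with "abc-123" as the message key; the plain dict falls back to the default topic with no key.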
FilterDuplicatesPipeline
Item pipeline to filter out duplicate items, identified using keys defined on the item.
Usage
Define an attribute called unique_key on the item. If the unique key is a single field,
unique_key can be a string; if the unique key spans multiple fields, unique_key
should be a tuple of strings. If no unique_key is defined, filtering falls back to the id
field.
To skip duplicate filtering for an item, define unique_key as None.
The pipeline records a stat named duplicate_item_count, the number
of duplicate items dropped.
class DemoItem(scrapy.Item):
field1 = scrapy.Field()
field2 = scrapy.Field()
unique_key = None # duplicates won't be filtered.
class DemoItem(scrapy.Item):
# No unique_key is defined. Filtering will be done using `id` field.
field1 = scrapy.Field()
field2 = scrapy.Field()
id = scrapy.Field()
class DemoItem(scrapy.Item):
field1 = scrapy.Field()
field2 = scrapy.Field()
unique_key = "field1" # Duplicates will be filtered using field1.
class DemoItem(scrapy.Item):
field1 = scrapy.Field()
field2 = scrapy.Field()
unique_key = ("field1", "field2") # Duplicates will be filtered using both field1 and field2
Add scrapy_item_pipelines.misc.FilterDuplicatesPipeline
to the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
...
...
"scrapy_item_pipelines.misc.FilterDuplicatesPipeline": 500,
}
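The unique_key rules above can be sketched with a scrapy-free filter that computes a deduplication key and counts dropped items. The class and helper names here are invented for illustration and do not mirror the package's internals; dict subclasses stand in for scrapy items, and a local DropItem exception stands in for scrapy.exceptions.DropItem.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class FilterDuplicatesSketch:
    """Illustrative sketch only -- not the package's real implementation."""

    def __init__(self):
        self.seen = set()
        self.stats = {"duplicate_item_count": 0}

    def _dedup_key(self, item):
        # Falls back to the `id` field when no unique_key is declared.
        unique_key = getattr(item, "unique_key", "id")
        if unique_key is None:
            return None  # filtering explicitly disabled for this item type
        if isinstance(unique_key, str):
            unique_key = (unique_key,)
        return tuple(dict(item)[field] for field in unique_key)

    def process_item(self, item, spider=None):
        key = self._dedup_key(item)
        if key is None:
            return item
        if key in self.seen:
            self.stats["duplicate_item_count"] += 1
            raise DropItem(f"Duplicate item found: {key!r}")
        self.seen.add(key)
        return item


# Dict-based stand-in for a scrapy.Item with a multi-field unique key.
class Product(dict):
    unique_key = ("field1", "field2")


dedupe = FilterDuplicatesSketch()
dedupe.process_item(Product(field1=1, field2=2))
try:
    dedupe.process_item(Product(field1=1, field2=2))  # duplicate -> dropped
except DropItem:
    pass
dedupe.process_item(Product(field1=1, field2=3))  # different key -> kept
```

The second item shares both key fields with the first and is dropped, incrementing duplicate_item_count; the third differs in field2 and passes through.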