scrapy-item-pipelines

A collection of reusable Scrapy item pipelines.
Installation
pip install scrapy-item-pipelines
PushToKafkaPipeline
Item pipeline that pushes items to Kafka. Each item is serialized to JSON and pushed to the configured Kafka topic.
Settings
SL_SCRAPY_ITEM_PIPELINES_SETTINGS = {
    "push_to_kafka_hosts": "localhost:9092",  # Kafka broker hosts, comma-separated.
    "push_to_kafka_default_topic": "",  # Default Kafka topic.
}
Usage
If items should be pushed to a different Kafka topic per item class, define the topic as a
kafka_topic attribute on the item class. Likewise, if a data key should be pushed to Kafka,
name the item field whose value should be used as the key in a kafka_data_key attribute.
If no kafka_data_key is defined, no data key will be pushed.
class DemoItem(scrapy.Item):
    kafka_topic = "topic-to-push-items"
    kafka_data_key = "another_unique_field"

    field_name = scrapy.Field()
    another_unique_field = scrapy.Field()
After configuring the item classes, add scrapy_item_pipelines.streaming.PushToKafkaPipeline
to the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
    ...
    "scrapy_item_pipelines.streaming.PushToKafkaPipeline": 999,
}
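The pipeline's behavior can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the library's actual implementation: the helper names (resolve_topic, serialize_item) are hypothetical, and the real pipeline would hand the result to a Kafka producer.

```python
import json


def resolve_topic(item_cls, default_topic):
    # Hypothetical helper: prefer the item class's kafka_topic attribute,
    # falling back to the configured push_to_kafka_default_topic.
    return getattr(item_cls, "kafka_topic", None) or default_topic


def serialize_item(item_dict, item_cls):
    # Items are serialized to JSON. If the item class defines
    # kafka_data_key, the named field's value is used as the message key;
    # otherwise no key is attached.
    key_field = getattr(item_cls, "kafka_data_key", None)
    key = item_dict.get(key_field) if key_field else None
    return key, json.dumps(item_dict)


class DemoItem:  # stand-in for a scrapy.Item subclass
    kafka_topic = "topic-to-push-items"
    kafka_data_key = "another_unique_field"


topic = resolve_topic(DemoItem, "fallback-topic")
key, payload = serialize_item(
    {"field_name": "x", "another_unique_field": "abc"}, DemoItem
)
```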
FilterDuplicatesPipeline
Item pipeline to filter out duplicate items, identified using keys defined in the item.
Usage
Define an attribute called unique_key in the item. If the unique key is a single field,
unique_key can be defined as a string; if it is a multi-field key, unique_key
should be a tuple of strings. If no unique_key is defined, filtering falls back to the id
field. To skip duplicate filtering for an item entirely, define unique_key as None.
The pipeline records a stat named duplicate_item_count, the number of duplicate
items dropped.
class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    unique_key = None  # Duplicates won't be filtered.

class DemoItem(scrapy.Item):
    # No unique_key is defined. Filtering will be done using the `id` field.
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    id = scrapy.Field()

class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    unique_key = "field1"  # Duplicates will be filtered using field1.

class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    unique_key = ("field1", "field2")  # Duplicates will be filtered using both field1 and field2.
Add scrapy_item_pipelines.misc.FilterDuplicatesPipeline
to the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
    ...
    "scrapy_item_pipelines.misc.FilterDuplicatesPipeline": 500,
}
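The de-duplication rules described above can be sketched in plain Python. This is an assumed, stdlib-only sketch of the logic (seen-set keyed on the unique_key fields), not the library's actual implementation:

```python
class DuplicateFilter:
    """Sketch of FilterDuplicatesPipeline's core logic (assumed)."""

    def __init__(self):
        self.seen = set()
        self.duplicate_item_count = 0  # exposed as a stat by the real pipeline

    def process(self, item, unique_key="id"):
        # unique_key=None disables filtering for this item.
        if unique_key is None:
            return item
        # A single-field key may be given as a string; normalize to a tuple.
        if isinstance(unique_key, str):
            unique_key = (unique_key,)
        key = tuple(item[k] for k in unique_key)
        if key in self.seen:
            self.duplicate_item_count += 1
            return None  # drop the duplicate
        self.seen.add(key)
        return item


f = DuplicateFilter()
first = f.process({"id": 1, "field1": "a"})      # kept
second = f.process({"id": 1, "field1": "b"})     # dropped: same id
third = f.process({"id": 2, "field1": "a"})      # kept: new id
```

Note that with the default `id` key, the second item is dropped even though field1 differs, since only the key fields participate in the comparison.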