scrapy-item-pipelines

A collection of reusable Scrapy item pipelines.
Installation
pip install scrapy-item-pipelines
PushToKafkaPipeline
Item pipeline that pushes items to Kafka. Each item is serialized to JSON and pushed to the configured Kafka topic.
Settings
SL_SCRAPY_ITEM_PIPELINES_SETTINGS = {
    "push_to_kafka_hosts": "localhost:9092",  # Kafka broker hosts, comma-separated.
    "push_to_kafka_default_topic": "",  # Default Kafka topic.
}
Usage
If items should be pushed to a different Kafka topic per item class, define the topic as a
kafka_topic attribute on the item class. Likewise, if a data key should be pushed to Kafka,
name the item field whose value should be used as the key in a kafka_data_key attribute.
If no kafka_data_key is defined, no data key will be pushed.
class DemoItem(scrapy.Item):
    kafka_topic = "topic-to-push-items"
    kafka_data_key = "another_unique_field"

    field_name = scrapy.Field()
    another_unique_field = scrapy.Field()
After configuring the item classes, add scrapy_item_pipelines.streaming.PushToKafkaPipeline
to the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
    ...
    "scrapy_item_pipelines.streaming.PushToKafkaPipeline": 999,
}
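The pipeline's behavior can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the library's actual implementation: the helper names (resolve_topic, serialize_item) are hypothetical, and the real pipeline would hand the result to a Kafka producer.

```python
import json


def resolve_topic(item_cls, default_topic):
    # Hypothetical helper: prefer the item class's kafka_topic attribute,
    # falling back to the configured push_to_kafka_default_topic.
    return getattr(item_cls, "kafka_topic", None) or default_topic


def serialize_item(item_dict, item_cls):
    # Items are serialized to JSON. If the item class defines
    # kafka_data_key, the named field's value is used as the message key;
    # otherwise no key is attached.
    key_field = getattr(item_cls, "kafka_data_key", None)
    key = item_dict.get(key_field) if key_field else None
    return key, json.dumps(item_dict)


class DemoItem:  # stand-in for a scrapy.Item subclass
    kafka_topic = "topic-to-push-items"
    kafka_data_key = "another_unique_field"


topic = resolve_topic(DemoItem, "fallback-topic")
key, payload = serialize_item(
    {"field_name": "x", "another_unique_field": "abc"}, DemoItem
)
```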
FilterDuplicatesPipeline
Item pipeline to filter out duplicate items, identified using keys defined in the item.
Usage
Define an attribute called unique_key in the item. If the unique key is a single field,
unique_key can be defined as a string; if it is a multi-field key, unique_key
should be a tuple of strings. If no unique_key is defined, filtering falls back to the id
field. To skip duplicate filtering for an item entirely, define unique_key as None.
The pipeline records a stat named duplicate_item_count, the number of duplicate
items dropped.
class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    unique_key = None  # Duplicates won't be filtered.

class DemoItem(scrapy.Item):
    # No unique_key is defined. Filtering will be done using the `id` field.
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    id = scrapy.Field()

class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    unique_key = "field1"  # Duplicates will be filtered using field1.

class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    unique_key = ("field1", "field2")  # Duplicates will be filtered using both field1 and field2.
Add scrapy_item_pipelines.misc.FilterDuplicatesPipeline
to the ITEM_PIPELINES setting.
ITEM_PIPELINES = {
    ...
    "scrapy_item_pipelines.misc.FilterDuplicatesPipeline": 500,
}
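The de-duplication rules described above can be sketched in plain Python. This is an assumed, stdlib-only sketch of the logic (seen-set keyed on the unique_key fields), not the library's actual implementation:

```python
class DuplicateFilter:
    """Sketch of FilterDuplicatesPipeline's core logic (assumed)."""

    def __init__(self):
        self.seen = set()
        self.duplicate_item_count = 0  # exposed as a stat by the real pipeline

    def process(self, item, unique_key="id"):
        # unique_key=None disables filtering for this item.
        if unique_key is None:
            return item
        # A single-field key may be given as a string; normalize to a tuple.
        if isinstance(unique_key, str):
            unique_key = (unique_key,)
        key = tuple(item[k] for k in unique_key)
        if key in self.seen:
            self.duplicate_item_count += 1
            return None  # drop the duplicate
        self.seen.add(key)
        return item


f = DuplicateFilter()
first = f.process({"id": 1, "field1": "a"})      # kept
second = f.process({"id": 1, "field1": "b"})     # dropped: same id
third = f.process({"id": 2, "field1": "a"})      # kept: new id
```

Note that with the default `id` key, the second item is dropped even though field1 differs, since only the key fields participate in the comparison.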