
scrapy-item-pipelines

A collection of reusable Scrapy item pipelines.

Installation

pip install scrapy-item-pipelines

PushToKafkaPipeline

Item pipeline that pushes scraped items to Kafka. Items are converted to JSON and sent to a configured Kafka topic.

Settings

SL_SCRAPY_ITEM_PIPELINES_SETTINGS = {
    "push_to_kafka_hosts": "localhost:9092",  # Kafka broker hosts, separated by commas.
    "push_to_kafka_default_topic": "",  # Default Kafka topic.
}

Usage

If items should be pushed to a different Kafka topic per item class, define the topic as a kafka_topic attribute on the item class. Similarly, if a record key should be sent along with each item, set kafka_data_key to the name of the item field whose value should be used as the key. If no kafka_data_key is defined, items are pushed without a key.

class DemoItem(scrapy.Item):
    kafka_topic = "topic-to-push-items"
    kafka_data_key = "another_unique_field"

    field_name = scrapy.Field()
    another_unique_field = scrapy.Field()
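
For instance, a spider yielding this item might look like the following sketch (the spider name, URL, and field values are illustrative; it assumes DemoItem from above is importable). Given the item class above, each yielded item would be serialized to JSON and sent to the topic-to-push-items topic, keyed by the value of another_unique_field:

import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]  # illustrative URL

    def parse(self, response):
        # Sent to "topic-to-push-items" as JSON, keyed by another_unique_field.
        yield DemoItem(field_name="some value", another_unique_field="abc-123")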

After configuring your item classes, add scrapy_item_pipelines.streaming.PushToKafkaPipeline to the ITEM_PIPELINES setting.

ITEM_PIPELINES = {
    ...
    ...
    "scrapy_item_pipelines.streaming.PushToKafkaPipeline": 999,
}
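
For reference, here is a minimal sketch of what such a pipeline might do internally, assuming the kafka-python client; the class name and internals are illustrative, not the package's actual implementation:

import json

from kafka import KafkaProducer  # assumes the kafka-python client


class PushToKafkaPipelineSketch:
    """Illustrative sketch only; the shipped pipeline's internals may differ."""

    def __init__(self, hosts, default_topic):
        # bootstrap_servers accepts a list of "host:port" strings.
        self.producer = KafkaProducer(
            bootstrap_servers=hosts.split(","),
            value_serializer=lambda value: json.dumps(value).encode("utf-8"),
        )
        self.default_topic = default_topic

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings.get("SL_SCRAPY_ITEM_PIPELINES_SETTINGS", {})
        return cls(
            hosts=settings.get("push_to_kafka_hosts", "localhost:9092"),
            default_topic=settings.get("push_to_kafka_default_topic", ""),
        )

    def process_item(self, item, spider):
        # A per-item-class kafka_topic attribute wins over the default topic.
        topic = getattr(item, "kafka_topic", None) or self.default_topic
        # If kafka_data_key names a field, use that field's value as the record key.
        key_field = getattr(item, "kafka_data_key", None)
        key = str(item[key_field]).encode("utf-8") if key_field else None
        self.producer.send(topic, value=dict(item), key=key)
        return item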

FilterDuplicatesPipeline

Item pipeline that filters out duplicate items, detected using keys defined in the item class.

Usage

Define an attribute called unique_key in the item class. If the unique key is a single field, unique_key can be a string; if it spans multiple fields, it should be a tuple of strings. If no unique_key is defined, filtering falls back to the id field. To skip duplicate filtering for an item class entirely, set unique_key to None.

The pipeline records a stat called duplicate_item_count: the number of duplicate items dropped.

class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()

    unique_key = None  # duplicates won't be filtered.


class DemoItem(scrapy.Item):
    # No unique_key is defined. Filtering will be done using `id` field.
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    id = scrapy.Field()


class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()

    unique_key = "field1"  # Duplicates will be filtered using field1.


class DemoItem(scrapy.Item):
    field1 = scrapy.Field()
    field2 = scrapy.Field()

    unique_key = ("field1", "field2")  # Duplicates will be filtered using both field1 and field2

Add scrapy_item_pipelines.misc.FilterDuplicatesPipeline to the ITEM_PIPELINES setting.

ITEM_PIPELINES = {
    ...
    ...
    "scrapy_item_pipelines.misc.FilterDuplicatesPipeline": 500,
}
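
Since Scrapy runs pipelines in ascending priority order, the values shown here (500 for filtering, 999 for pushing) would drop duplicates before items reach Kafka if both pipelines are enabled. After a crawl, the duplicate_item_count stat can also be read from Scrapy's stats collector, for example when the spider closes (a minimal sketch; the spider name is illustrative):

import scrapy


class StatsAwareSpider(scrapy.Spider):
    name = "stats_aware"

    def closed(self, reason):
        # Read the stat recorded by FilterDuplicatesPipeline at the end of the crawl.
        dropped = self.crawler.stats.get_value("duplicate_item_count", 0)
        self.logger.info("Dropped %d duplicate items", dropped)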
