RSS Tools for Scrapy Framework

Tools to easily generate an RSS feed containing each scraped item, built on the Scrapy framework.

The package works with Python 2.7 and 3.3–3.9.

If you use Python 3.3, you must use Scrapy<1.5.0.

If you use Python 2.7, you must use Scrapy<2.0.

Installation

  • Install scrapy_rss using pip

    pip install scrapy_rss

    or using pip for a specific interpreter, e.g.:

    pip3 install scrapy_rss
  • or using setuptools directly:

    cd path/to/root/of/scrapy_rss
    python setup.py install

    or using setuptools for a specific interpreter, e.g.:

    cd path/to/root/of/scrapy_rss
    python3 setup.py install

How To Use

Configuration

Add the following parameters to the Scrapy project settings (settings.py file) or to the custom_settings attribute of the spider:

  1. Add the item pipeline that exports items to the RSS feed:

    ITEM_PIPELINES = {
        # ...
        'scrapy_rss.pipelines.RssExportPipeline': 900,  # or another priority
        # ...
    }
  2. Add the required feed parameters:

    FEED_FILE

    the absolute or relative path of the file where the resulting RSS feed will be saved, e.g. feed.rss or output/feed.rss;

    FEED_TITLE

    the name of the channel (feed);

    FEED_DESCRIPTION

    a phrase or sentence that describes the channel (feed);

    FEED_LINK

    the URL of the HTML website corresponding to the channel (feed).

    FEED_FILE = 'path/to/feed.rss'
    FEED_TITLE = 'Some title of the channel'
    FEED_LINK = 'http://example.com/rss'
    FEED_DESCRIPTION = 'About channel'
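
    The same parameters can instead be set per spider via the custom_settings attribute mentioned above. A sketch of such a dict (shown standalone rather than inside a spider class, so it runs without Scrapy installed; all values are placeholders):

    ```python
    # The dict a spider would assign to its custom_settings attribute;
    # every value below is a placeholder from the examples above.
    custom_settings = {
        'ITEM_PIPELINES': {'scrapy_rss.pipelines.RssExportPipeline': 900},
        'FEED_FILE': 'feed.rss',
        'FEED_TITLE': 'Some title of the channel',
        'FEED_LINK': 'http://example.com/rss',
        'FEED_DESCRIPTION': 'About channel',
    }
    ```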

Feed (Channel) Elements Customization [optional]

If you want to change other channel parameters (such as language, copyright, managing_editor, webmaster, pubdate, last_build_date, category, generator, docs, ttl), define your own exporter that inherits from the RssItemExporter class, for example:

from scrapy_rss.exporters import RssItemExporter

class MyRssItemExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['generator'] = kwargs.get('generator', 'Special generator')
        kwargs['language'] = kwargs.get('language', 'en-us')
        super(MyRssItemExporter, self).__init__(*args, **kwargs)

Then add the FEED_EXPORTER parameter to the Scrapy project settings or to the custom_settings attribute of the spider:

FEED_EXPORTER = 'myproject.exporters.MyRssItemExporter'

Usage

Basic usage

Declare your item directly as an RssItem():

import scrapy_rss

item1 = scrapy_rss.RssItem()

Or use the predefined item class RssedItem, whose RSS field named rss is an instance of RssItem:

import scrapy
import scrapy_rss

class MyItem(scrapy_rss.RssedItem):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    # ...

item2 = MyItem()

Set and get item fields as attributes. The case-sensitive attributes of RssItem() correspond to RSS elements, and the attributes of RSS elements are case sensitive too. If your editor supports autocompletion, it will suggest attributes for instances of RssedItem and RssItem. You may set any subset of RSS elements (e.g. title only). For example:

from datetime import datetime

item1.title = 'RSS item title'  # set value of <title> element
title = item1.title.title  # get value of <title> element
item1.description = 'description'

item1.guid = 'item identifier'
item1.guid.isPermaLink = True  # set value of the isPermaLink attribute of the <guid> element;
                               # isPermaLink is False by default
is_permalink = item1.guid.isPermaLink  # get value of the isPermaLink attribute of the <guid> element
guid = item1.guid.guid  # get value of element <guid>

item1.category = 'single category'
category = item1.category
item1.category = ['first category', 'second category']
first_category = item1.category[0].category # get value of the element <category> with multiple values
all_categories = [cat.category for cat in item1.category]

# direct attributes setting
item1.enclosure.url = 'http://example.com/file'
item1.enclosure.length = 0
item1.enclosure.type = 'text/plain'

# or dict based attributes setting
item1.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}
item1.guid = {'guid': 'item identifier', 'isPermaLink': True}

item1.pubDate = datetime.now()  # works correctly with Python datetime objects


item2.rss.title = 'Item title'
item2.rss.guid = 'identifier'
item2.rss.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}

All allowed elements are listed in scrapy_rss/items.py. All allowed attributes of each element, with their constraints and default values, are listed in scrapy_rss/elements.py. You can also read the RSS specification for more details.
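
As a side note on pubDate: RSS 2.0 serializes dates in RFC 822 format. scrapy_rss accepts plain datetime objects as shown above, but if you are curious what the serialized value looks like, the standard library can produce it (this snippet is independent of scrapy_rss):

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# RFC 822 date string, as used by the <pubDate> element in RSS 2.0
dt = datetime(2020, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
print(format_datetime(dt))  # Mon, 01 Jun 2020 12:00:00 +0000
```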

RssItem derivation and namespaces

You can extend RssItem to add new XML fields, namespaced or not. Namespaces can be specified in attribute and/or element constructors. A namespace prefix can be given in the attribute/element name using a double underscore as delimiter (prefix__name), or in the attribute/element constructor using the ns_prefix argument. The namespace URI can be specified using the ns_uri argument of the constructor.

from scrapy_rss.meta import ItemElementAttribute, ItemElement
from scrapy_rss.items import RssItem

class Element0(ItemElement):
    # attributes without special namespace
    attr0 = ItemElementAttribute(is_content=True, required=True)
    attr1 = ItemElementAttribute()

class Element1(ItemElement):
    # attribute "prefix2:attr2" with namespace xmlns:prefix2="id2"
    attr2 = ItemElementAttribute(ns_prefix="prefix2", ns_uri="id2")

    # attribute "prefix3:attr3" with namespace xmlns:prefix3="id3"
    prefix3__attr3 = ItemElementAttribute(ns_uri="id3")

    # attribute "prefix4:attr4" with namespace xmlns:prefix4="id4"
    fake_prefix__attr4 = ItemElementAttribute(ns_prefix="prefix4", ns_uri="id4")

    # attribute "attr5" with default namespace xmlns="id5"
    attr5 = ItemElementAttribute(ns_uri="id5")

class MyXMLItem(RssItem):
    # element <elem1> without namespace
    elem1 = Element0()

    # element <elem_prefix2:elem2> with namespace xmlns:elem_prefix2="id2e"
    elem2 = Element0(ns_prefix="elem_prefix2", ns_uri="id2e")

    # element <elem_prefix3:elem3> with namespace xmlns:elem_prefix3="id3e"
    elem_prefix3__elem3 = Element1(ns_uri="id3e")

    # yet another element <elem_prefix4:elem3> with namespace xmlns:elem_prefix4="id4e"
    # (does not conflict with previous one)
    fake_prefix__elem3 = Element0(ns_prefix="elem_prefix4", ns_uri="id4e")

    # element <elem5> with default namespace xmlns="id5e"
    elem5 = Element0(ns_uri="id5e")

Access to elements and their attributes is the same as with simple items:

item = MyXMLItem()
item.title = 'Some title'
item.elem1.attr0 = 'Required content value'
item.elem1 = 'Another way to set content value'
item.elem1.attr1 = 'Some attribute value'
item.elem_prefix3__elem3.prefix3__attr3 = 'Yet another attribute value'
item.elem_prefix3__elem3.fake_prefix__attr4 = ''  # any non-None value is interpreted as assigned
item.fake_prefix__elem3.attr1 = 42
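
As a rough illustration of the double-underscore naming convention (this helper is not part of scrapy_rss; it only mimics how a name such as prefix3__attr3 decomposes into a prefix and a local name):

```python
def split_ns_name(name):
    """Split 'prefix__local' into (prefix, local); a name without
    a double underscore gets no prefix."""
    prefix, sep, local = name.partition('__')
    return (prefix, local) if sep else (None, name)

print(split_ns_name('prefix3__attr3'))  # ('prefix3', 'attr3')
print(split_ns_name('attr5'))           # (None, 'attr5')
```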

Several optional settings are available for namespaced items:

FEED_NAMESPACES

a list of tuples [(prefix, URI), ...] or a dictionary {prefix: URI, ...} of namespaces that must be declared in the root XML element

FEED_ITEM_CLASS or FEED_ITEM_CLS

the main class of feed items (either a class object such as MyXMLItem or a path to the class such as "path.to.MyXMLItem"). Default value: RssItem. It is used to extract all possible namespaces that will be declared in the root XML element.

Feed items do NOT have to be instances of this class or its subclasses.

If these settings are not defined, or only some of the namespaces are defined, then the remaining namespaces will be declared either in the <item> element or in its subelements when these namespaces are not unique. Each <item> element and its subelements always contain only the namespace declarations of non-None attributes (including attributes that are interpreted as element content).
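
For the MyXMLItem sketch above, these settings might look like this (the prefixes and URIs echo the placeholder values from the class definition, and the project path is hypothetical):

```python
# settings.py -- namespace-related settings for the MyXMLItem sketch above
FEED_NAMESPACES = {
    'elem_prefix2': 'id2e',
    'elem_prefix3': 'id3e',
}
FEED_ITEM_CLASS = 'myproject.items.MyXMLItem'  # hypothetical module path
```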

Scrapy Project Examples

The examples directory contains several Scrapy projects that demonstrate scrapy_rss usage. They crawl this website, whose source code is here.

Just go to a Scrapy project directory and run the commands

scrapy crawl first_spider
scrapy crawl second_spider

Afterwards, the files feed.rss and feed2.rss will be created in the same directory.
