
RSS Tools for Scrapy Framework

Project description


Tools for the Scrapy framework to easily generate an RSS feed that contains each scraped item.

The package works with Python 2.7 and with Python 3.3 through 3.9.

If you use Python 3.3, then you have to use Scrapy<1.5.0.

If you use Python 2.7, then you have to use Scrapy<2.0.


Installation

  • Install scrapy_rss using pip

    pip install scrapy_rss

    or using pip for a specific interpreter, e.g.:

    pip3 install scrapy_rss
  • or using setuptools directly:

    cd path/to/root/of/scrapy_rss
    python setup.py install

    or using setuptools for a specific interpreter, e.g.:

    cd path/to/root/of/scrapy_rss
    python3 setup.py install

How To Use

Configuration

Add the following parameters to the Scrapy project settings (settings.py file) or to the custom_settings attribute of the spider:

  1. Add the item pipeline that exports items to the RSS feed:

    ITEM_PIPELINES = {
        # ...
        'scrapy_rss.pipelines.RssExportPipeline': 900,  # or another priority
        # ...
    }
  2. Add the required feed parameters:

    FEED_FILE

    the absolute or relative path of the file where the resulting RSS feed will be saved, e.g. feed.rss or output/feed.rss;

    FEED_TITLE

    the name of the channel (feed);

    FEED_DESCRIPTION

    the phrase or sentence that describes the channel (feed);

    FEED_LINK

    the URL of the HTML website corresponding to the channel (feed).

    FEED_FILE = 'path/to/feed.rss'
    FEED_TITLE = 'Some title of the channel'
    FEED_LINK = 'http://example.com/rss'
    FEED_DESCRIPTION = 'About channel'

Feed (Channel) Elements Customization [optional]

If you want to change other channel parameters (such as language, copyright, managing_editor, webmaster, pubdate, last_build_date, category, generator, docs, ttl), then define your own exporter that inherits from the RssItemExporter class, for example:

from scrapy_rss.exporters import RssItemExporter

class MyRssItemExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['generator'] = kwargs.get('generator', 'Special generator')
        kwargs['language'] = kwargs.get('language', 'en-us')
        super(MyRssItemExporter, self).__init__(*args, **kwargs)

And add the FEED_EXPORTER parameter to the Scrapy project settings or to the custom_settings attribute of the spider:

FEED_EXPORTER = 'myproject.exporters.MyRssItemExporter'

Usage

Basic usage

Declare your item directly as RssItem():

import scrapy_rss

item1 = scrapy_rss.RssItem()

Or use the predefined item class RssedItem, which has an RSS field named rss that is an instance of RssItem:

import scrapy
import scrapy_rss

class MyItem(scrapy_rss.RssedItem):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    # ...

item2 = MyItem()

Set and get item fields as follows. Attributes of RssItem() correspond to RSS elements and are case-sensitive, as are the attributes of the RSS elements themselves. If your editor supports autocompletion, it will suggest attributes for instances of RssedItem and RssItem. Any subset of RSS elements may be set (e.g. title only). For example:

from datetime import datetime

item1.title = 'RSS item title'  # set value of <title> element
title = item1.title.title  # get value of <title> element
item1.description = 'description'

item1.guid = 'item identifier'
item1.guid.isPermaLink = True  # set value of the isPermaLink attribute of the <guid> element;
                               # isPermaLink is False by default
is_permalink = item1.guid.isPermaLink  # get value of the isPermaLink attribute of the <guid> element
guid = item1.guid.guid  # get value of element <guid>

item1.category = 'single category'
category = item1.category
item1.category = ['first category', 'second category']
first_category = item1.category[0].category # get value of the element <category> with multiple values
all_categories = [cat.category for cat in item1.category]

# direct attributes setting
item1.enclosure.url = 'http://example.com/file'
item1.enclosure.length = 0
item1.enclosure.type = 'text/plain'

# or dict based attributes setting
item1.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}
item1.guid = {'guid': 'item identifier', 'isPermaLink': True}

item1.pubDate = datetime.now()  # works correctly with Python datetime objects


item2.rss.title = 'Item title'
item2.rss.guid = 'identifier'
item2.rss.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}

All allowed elements are listed in scrapy_rss/items.py. All allowed attributes of each element, with their constraints and default values, are listed in scrapy_rss/elements.py. You can also read the RSS specification for more details.

RssItem derivation and namespaces

You can extend RssItem to add new XML fields, namespaced or not. Namespaces can be specified in attribute and/or element constructors. A namespace prefix can be specified either in the attribute/element name, using double underscores as a delimiter (prefix__name), or in the attribute/element constructor, using the ns_prefix argument. A namespace URI can be specified using the ns_uri argument of the constructor.

from scrapy_rss.meta import ItemElementAttribute, ItemElement
from scrapy_rss.items import RssItem

class Element0(ItemElement):
    # attributes without special namespace
    attr0 = ItemElementAttribute(is_content=True, required=True)
    attr1 = ItemElementAttribute()

class Element1(ItemElement):
    # attribute "prefix2:attr2" with namespace xmlns:prefix2="id2"
    attr2 = ItemElementAttribute(ns_prefix="prefix2", ns_uri="id2")

    # attribute "prefix3:attr3" with namespace xmlns:prefix3="id3"
    prefix3__attr3 = ItemElementAttribute(ns_uri="id3")

    # attribute "prefix4:attr4" with namespace xmlns:prefix4="id4"
    fake_prefix__attr4 = ItemElementAttribute(ns_prefix="prefix4", ns_uri="id4")

    # attribute "attr5" with default namespace xmlns="id5"
    attr5 = ItemElementAttribute(ns_uri="id5")

class MyXMLItem(RssItem):
    # element <elem1> without namespace
    elem1 = Element0()

    # element <elem_prefix2:elem2> with namespace xmlns:elem_prefix2="id2e"
    elem2 = Element0(ns_prefix="elem_prefix2", ns_uri="id2e")

    # element <elem_prefix3:elem3> with namespace xmlns:elem_prefix3="id3e"
    elem_prefix3__elem3 = Element1(ns_uri="id3e")

    # yet another element <elem_prefix4:elem3> with namespace xmlns:elem_prefix4="id4e"
    # (does not conflict with previous one)
    fake_prefix__elem3 = Element0(ns_prefix="elem_prefix4", ns_uri="id4e")

    # element <elem5> with default namespace xmlns="id5e"
    elem5 = Element0(ns_uri="id5e")

Elements and their attributes are accessed the same way as with simple items:

item = MyXMLItem()
item.title = 'Some title'
item.elem1.attr0 = 'Required content value'
item.elem1 = 'Another way to set content value'
item.elem1.attr1 = 'Some attribute value'
item.elem_prefix3__elem3.prefix3__attr3 = 'Yet another attribute value'
item.elem_prefix3__elem3.fake_prefix__attr4 = '' # non-None value is interpreted as assigned
item.fake_prefix__elem3.attr1 = 42

Several optional settings are allowed for namespaced items:

FEED_NAMESPACES

a list of tuples [(prefix, URI), ...] or a dictionary {prefix: URI, ...} of namespaces that must be declared in the root XML element

FEED_ITEM_CLASS or FEED_ITEM_CLS

the main class of feed items (a class object MyXMLItem, or a path to the class "path.to.MyXMLItem"). Default value: RssItem. It is used to extract all possible namespaces that will be declared in the root XML element.

Feed items do NOT have to be instances of this class or its subclass.

If these settings are not defined, or only some of the namespaces are defined, then the remaining used namespaces will be declared either in the <item> element or in its subelements when these namespaces are not unique. Each <item> element and its subelements always contain only the namespace declarations of non-None attributes (including those that are interpreted as element content).
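For example, assuming the MyXMLItem class above is defined in myproject/items.py (the module path is illustrative), the namespace-related settings might look like this:

```python
# settings.py (fragment)
# Main item class; used to collect namespaces for the root XML element
FEED_ITEM_CLASS = 'myproject.items.MyXMLItem'

# Namespaces to declare in the root element; prefixes/URIs
# match the MyXMLItem example above
FEED_NAMESPACES = {
    'elem_prefix2': 'id2e',
    'elem_prefix3': 'id3e',
}
```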

Scrapy Project Examples

The examples directory contains several Scrapy projects that demonstrate scrapy_rss usage. They crawl this website, whose source code is here.

Just go to the Scrapy project directory and run the commands

scrapy crawl first_spider
scrapy crawl second_spider

Afterwards, the feed.rss and feed2.rss files will be created in the same directory.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-rss-0.2.3.tar.gz (38.0 kB, Source)

Built Distributions

scrapy_rss-0.2.3-py39-none-any.whl (21.4 kB, Python 3.9)
scrapy_rss-0.2.3-py38-none-any.whl (21.4 kB, Python 3.8)
scrapy_rss-0.2.3-py37-none-any.whl (21.4 kB, Python 3.7)
scrapy_rss-0.2.3-py36-none-any.whl (21.4 kB, Python 3.6)
scrapy_rss-0.2.3-py35-none-any.whl (21.4 kB, Python 3.5)
scrapy_rss-0.2.3-py34-none-any.whl (21.4 kB, Python 3.4)
scrapy_rss-0.2.3-py33-none-any.whl (25.0 kB, Python 3.3)
scrapy_rss-0.2.3-py27-none-any.whl (21.4 kB, Python 2.7)

File details

Details for the file scrapy-rss-0.2.3.tar.gz.

File metadata

  • Download URL: scrapy-rss-0.2.3.tar.gz
  • Upload date:
  • Size: 38.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.9.4

File hashes

Hashes for scrapy-rss-0.2.3.tar.gz
Algorithm Hash digest
SHA256 de3df4400f5bd8152f577cfeaab56bc9ab53cf17fae691184e3a3d99f96a6202
MD5 9fa952ba44c4681a30245c00c702d478
BLAKE2b-256 ec99f39f91fce4bc70a83986950b9c2e93affb2a8ffeeb1e27fdb80bad6aff9a

See more details on using hashes here.

