
RSS Tools for Scrapy Framework

Project description


Tools for the Scrapy framework to easily generate an RSS feed that contains each scraped item.

The package works with Python 2.7, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8 and 3.9.

If you use Python 3.3, you must use Scrapy<1.5.0.

If you use Python 2.7, you must use Scrapy<2.0.
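On the legacy interpreters, the version bounds above can be pinned at install time with standard pip requirement specifiers, e.g.:

```shell
# Python 3.3: Scrapy must be older than 1.5.0
pip install "scrapy<1.5.0" scrapy_rss

# Python 2.7: Scrapy must be older than 2.0
pip install "scrapy<2.0" scrapy_rss
```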


Installation

  • Install scrapy_rss using pip

    pip install scrapy_rss

    or using pip for a specific interpreter, e.g.:

    pip3 install scrapy_rss
  • or using setuptools directly:

    cd path/to/root/of/scrapy_rss
    python setup.py install

    or using setuptools for a specific interpreter, e.g.:

    cd path/to/root/of/scrapy_rss
    python3 setup.py install

How To Use

Configuration

Add the following parameters to the Scrapy project settings (settings.py file) or to the custom_settings attribute of the spider:

  1. Add the item pipeline that exports items to the RSS feed:

    ITEM_PIPELINES = {
        # ...
        'scrapy_rss.pipelines.RssExportPipeline': 900,  # or another priority
        # ...
    }
  2. Add the required feed parameters:

    FEED_FILE

    the absolute or relative file path where the resulting RSS feed will be saved, e.g. feed.rss or output/feed.rss;

    FEED_TITLE

    the name of the channel (feed);

    FEED_DESCRIPTION

    a phrase or sentence that describes the channel (feed);

    FEED_LINK

    the URL of the HTML website corresponding to the channel (feed):

    FEED_FILE = 'path/to/feed.rss'
    FEED_TITLE = 'Some title of the channel'
    FEED_LINK = 'http://example.com/rss'
    FEED_DESCRIPTION = 'About channel'

Feed (Channel) Elements Customization [optional]

If you want to change other channel parameters (such as language, copyright, managing_editor, webmaster, pubdate, last_build_date, category, generator, docs, ttl), define your own exporter that inherits from the RssItemExporter class, for example:

from scrapy_rss.exporters import RssItemExporter

class MyRssItemExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['generator'] = kwargs.get('generator', 'Special generator')
        kwargs['language'] = kwargs.get('language', 'en-us')
        super(MyRssItemExporter, self).__init__(*args, **kwargs)

Then add the FEED_EXPORTER parameter to the Scrapy project settings or to the custom_settings attribute of the spider:

FEED_EXPORTER = 'myproject.exporters.MyRssItemExporter'

Usage

Basic usage

Declare your item directly as RssItem():

import scrapy_rss

item1 = scrapy_rss.RssItem()

Or use the predefined item class RssedItem, whose RSS field named rss is an instance of RssItem:

import scrapy
import scrapy_rss

class MyItem(scrapy_rss.RssedItem):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    # ...

item2 = MyItem()

Set and get item fields as attributes. The case-sensitive attributes of RssItem() correspond to RSS elements, and the attributes of RSS elements are case sensitive too. If your editor supports autocompletion, it will suggest attributes for instances of RssedItem and RssItem. Any subset of RSS elements may be set (e.g. title only). For example:

from datetime import datetime

item1.title = 'RSS item title'  # set value of <title> element
title = item1.title.title  # get value of <title> element
item1.description = 'description'

item1.guid = 'item identifier'
item1.guid.isPermaLink = True  # set value of the isPermaLink attribute of the <guid> element;
                               # isPermaLink is False by default
is_permalink = item1.guid.isPermaLink  # get value of the isPermaLink attribute of the <guid> element
guid = item1.guid.guid  # get value of the <guid> element

item1.category = 'single category'
category = item1.category
item1.category = ['first category', 'second category']
first_category = item1.category[0].category # get value of the element <category> with multiple values
all_categories = [cat.category for cat in item1.category]

# direct attributes setting
item1.enclosure.url = 'http://example.com/file'
item1.enclosure.length = 0
item1.enclosure.type = 'text/plain'

# or dict based attributes setting
item1.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}
item1.guid = {'guid': 'item identifier', 'isPermaLink': True}

item1.pubDate = datetime.now()  # works correctly with Python datetime objects


item2.rss.title = 'Item title'
item2.rss.guid = 'identifier'
item2.rss.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}

All allowed elements are listed in scrapy_rss/items.py. All allowed attributes of each element, with their constraints and default values, are listed in scrapy_rss/elements.py. You can also read the RSS specification for more details.
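A minimal spider sketch tying the pieces above together. It assumes the pipeline and feed settings from the Configuration section are in place; the spider name, start URL and CSS selectors are hypothetical placeholders:

```python
import scrapy
import scrapy_rss


class MyItem(scrapy_rss.RssedItem):
    pass


class NewsSpider(scrapy.Spider):
    # hypothetical spider: name, URL and selectors are placeholders
    name = 'news'
    start_urls = ['http://example.com/news']

    def parse(self, response):
        for article in response.css('article'):
            item = MyItem()
            # fill the nested RssItem via the rss attribute
            item.rss.title = article.css('h2::text').get()
            item.rss.link = article.css('a::attr(href)').get()
            item.rss.description = article.css('p::text').get()
            yield item
```

Each yielded item passes through RssExportPipeline, which appends it to the feed file configured by FEED_FILE.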

RssItem derivation and namespaces

You can extend RssItem to add new XML fields, namespaced or not. Namespaces can be specified in attribute and/or element constructors. A namespace prefix can be given either in the attribute/element name using double underscores as a delimiter (prefix__name) or in the constructor using the ns_prefix argument. The namespace URI can be given using the ns_uri argument of the constructor.

from scrapy_rss.meta import ItemElementAttribute, ItemElement
from scrapy_rss.items import RssItem

class Element0(ItemElement):
    # attributes without special namespace
    attr0 = ItemElementAttribute(is_content=True, required=True)
    attr1 = ItemElementAttribute()

class Element1(ItemElement):
    # attribute "prefix2:attr2" with namespace xmlns:prefix2="id2"
    attr2 = ItemElementAttribute(ns_prefix="prefix2", ns_uri="id2")

    # attribute "prefix3:attr3" with namespace xmlns:prefix3="id3"
    prefix3__attr3 = ItemElementAttribute(ns_uri="id3")

    # attribute "prefix4:attr4" with namespace xmlns:prefix4="id4"
    fake_prefix__attr4 = ItemElementAttribute(ns_prefix="prefix4", ns_uri="id4")

    # attribute "attr5" with default namespace xmlns="id5"
    attr5 = ItemElementAttribute(ns_uri="id5")

class MyXMLItem(RssItem):
    # element <elem1> without namespace
    elem1 = Element0()

    # element <elem_prefix2:elem2> with namespace xmlns:elem_prefix2="id2e"
    elem2 = Element0(ns_prefix="elem_prefix2", ns_uri="id2e")

    # element <elem_prefix3:elem3> with namespace xmlns:elem_prefix3="id3e"
    elem_prefix3__elem3 = Element1(ns_uri="id3e")

    # yet another element <elem_prefix4:elem3> with namespace xmlns:elem_prefix4="id4e"
    # (does not conflict with previous one)
    fake_prefix__elem3 = Element0(ns_prefix="elem_prefix4", ns_uri="id4e")

    # element <elem5> with default namespace xmlns="id5e"
    elem5 = Element0(ns_uri="id5e")

Access to elements and their attributes is the same as with simple items:

item = MyXMLItem()
item.title = 'Some title'
item.elem1.attr0 = 'Required content value'
item.elem1 = 'Another way to set content value'
item.elem1.attr1 = 'Some attribute value'
item.elem_prefix3__elem3.prefix3__attr3 = 'Yet another attribute value'
item.elem_prefix3__elem3.fake_prefix__attr4 = ''  # any non-None value is interpreted as assigned
item.fake_prefix__elem3.attr1 = 42

Several optional settings are allowed for namespaced items:

FEED_NAMESPACES

list of tuples [(prefix, URI), ...] or dictionary {prefix: URI, ...} of namespaces that must be defined in the root XML element

FEED_ITEM_CLASS or FEED_ITEM_CLS

the main class of feed items (a class object such as MyXMLItem, or a path to the class such as "path.to.MyXMLItem"). Default value: RssItem. It is used to extract all possible namespaces that will be declared in the root XML element.

Feed items do NOT have to be instances of this class or its subclass.

If these settings are not defined, or only some of the namespaces are defined, then the remaining namespaces are declared either in the <item> element or in its subelements when these namespaces are not unique. Each <item> element and its subelements contain only the namespace declarations of non-None attributes (including attributes that are interpreted as element content).
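As an illustration, a settings fragment declaring the namespaces from the MyXMLItem example above in the root element (the prefixes and URIs match that example; the dotted path to the item class is a hypothetical placeholder for your project layout):

```python
# settings.py (fragment)

# namespaces to declare on the root XML element,
# as a {prefix: URI} dictionary (a list of tuples also works)
FEED_NAMESPACES = {
    'elem_prefix2': 'id2e',
    'elem_prefix3': 'id3e',
    'elem_prefix4': 'id4e',
}

# class used to collect all possible namespaces for the root element;
# feed items do not have to be instances of it
FEED_ITEM_CLASS = 'myproject.items.MyXMLItem'
```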

Scrapy Project Examples

The examples directory contains several Scrapy projects that demonstrate scrapy_rss usage. They crawl a demo website whose source code is also available.

Just go to a Scrapy project directory and run the commands

scrapy crawl first_spider
scrapy crawl second_spider

Afterwards, the feed.rss and feed2.rss files will be created in the same directory.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-rss-0.2.1.tar.gz (37.7 kB view details)

Uploaded Source

Built Distributions

scrapy_rss-0.2.1-py39-none-any.whl (21.5 kB view details)

Uploaded Python 3.9

scrapy_rss-0.2.1-py38-none-any.whl (21.5 kB view details)

Uploaded Python 3.8

scrapy_rss-0.2.1-py37-none-any.whl (21.5 kB view details)

Uploaded Python 3.7

scrapy_rss-0.2.1-py36-none-any.whl (21.5 kB view details)

Uploaded Python 3.6

scrapy_rss-0.2.1-py35-none-any.whl (21.4 kB view details)

Uploaded Python 3.5

scrapy_rss-0.2.1-py34-none-any.whl (21.5 kB view details)

Uploaded Python 3.4

scrapy_rss-0.2.1-py33-none-any.whl (25.0 kB view details)

Uploaded Python 3.3

scrapy_rss-0.2.1-py27-none-any.whl (21.5 kB view details)

Uploaded Python 2.7
