
RSS Tools for Scrapy Framework



Tools for easily generating an RSS feed that contains each scraped item, using the Scrapy framework.

The package works with Python 2.7 and Python 3.3 through 3.10.

If you use Python 3.3, you have to use Scrapy<1.5.0.

If you use Python 2.7, you have to use Scrapy<2.0. (Pinned install examples are shown at the end of the Installation section below.)


Installation

  • Install scrapy_rss using pip

    pip install scrapy_rss

    or using pip for a specific interpreter, e.g.:

    pip3 install scrapy_rss
  • or using setuptools directly:

    cd path/to/root/of/scrapy_rss
    python setup.py install

    or using setuptools for a specific interpreter, e.g.:

    cd path/to/root/of/scrapy_rss
    python3 setup.py install
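If you use Python 2.7 or 3.3, pin Scrapy according to the constraints above when installing, e.g.:

pip install "scrapy<2.0" scrapy_rss    # Python 2.7
pip install "scrapy<1.5.0" scrapy_rss  # Python 3.3

To verify the installation with the target interpreter:

python -c "import scrapy_rss; print(scrapy_rss.RssItem)"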

How To Use

Configuration

Add parameters to the Scrapy project settings (settings.py file) or to the custom_settings attribute of the spider:

  1. Add the item pipeline that exports items to the RSS feed:

    ITEM_PIPELINES = {
        # ...
        'scrapy_rss.pipelines.RssExportPipeline': 900,  # or another priority
        # ...
    }
  2. Add required feed parameters:

    FEED_FILE

    the absolute or relative file path where the resulting RSS feed will be saved, for example feed.rss or output/feed.rss.

    FEED_TITLE

    the name of the channel (feed).

    FEED_DESCRIPTION

    the phrase or sentence that describes the channel (feed).

    FEED_LINK

    the URL to the HTML website corresponding to the channel (feed).

    FEED_FILE = 'path/to/feed.rss'
    FEED_TITLE = 'Some title of the channel'
    FEED_LINK = 'http://example.com/rss'
    FEED_DESCRIPTION = 'About channel'
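These parameters can equally be set per spider through the custom_settings attribute. A minimal sketch (the spider name is a placeholder):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'  # placeholder name
    custom_settings = {
        'ITEM_PIPELINES': {'scrapy_rss.pipelines.RssExportPipeline': 900},
        'FEED_FILE': 'path/to/feed.rss',
        'FEED_TITLE': 'Some title of the channel',
        'FEED_LINK': 'http://example.com/rss',
        'FEED_DESCRIPTION': 'About channel',
    }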

Feed (Channel) Elements Customization [optional]

If you want to change other channel parameters (such as language, copyright, managing_editor, webmaster, pubdate, last_build_date, category, generator, docs, ttl), define your own exporter inherited from the RssItemExporter class, for example:

from scrapy_rss.exporters import RssItemExporter

class MyRssItemExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        kwargs['generator'] = kwargs.get('generator', 'Special generator')
        kwargs['language'] = kwargs.get('language', 'en-us')
        super(MyRssItemExporter, self).__init__(*args, **kwargs)

Then add the FEED_EXPORTER parameter to the Scrapy project settings or to the custom_settings attribute of the spider:

FEED_EXPORTER = 'myproject.exporters.MyRssItemExporter'
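The other channel parameters listed above (copyright, ttl, and so on) are passed the same way, as keyword arguments of the exporter. A sketch with purely illustrative defaults:

from scrapy_rss.exporters import RssItemExporter

class VerboseRssItemExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        # illustrative defaults; callers can override them with explicit kwargs
        kwargs.setdefault('language', 'en-us')
        kwargs.setdefault('copyright', 'Copyright notice of the channel')
        kwargs.setdefault('ttl', 60)  # number of minutes the channel can be cached
        super(VerboseRssItemExporter, self).__init__(*args, **kwargs)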

Usage

Basic usage

Declare your item directly as RssItem():

import scrapy_rss

item1 = scrapy_rss.RssItem()

Or use the predefined item class RssedItem, whose rss field is an instance of RssItem:

import scrapy
import scrapy_rss

class MyItem(scrapy_rss.RssedItem):
    field1 = scrapy.Field()
    field2 = scrapy.Field()
    # ...

item2 = MyItem()

Set and get item fields. The case-sensitive attributes of RssItem correspond to RSS elements, and the attributes of RSS elements are case-sensitive too. If your editor supports autocompletion, it suggests attributes for instances of RssedItem and RssItem. You may set any subset of RSS elements (e.g. title only). For example:

from datetime import datetime

item1.title = 'RSS item title'  # set value of <title> element
title = item1.title.title  # get value of <title> element
item1.description = 'description'

item1.guid = 'item identifier'
item1.guid.isPermaLink = True  # set value of attribute isPermaLink of <guid> element,
                               # isPermaLink is False by default
is_permalink = item1.guid.isPermaLink  # get value of attribute isPermaLink of <guid> element
guid = item1.guid.guid  # get value of element <guid>

item1.category = 'single category'
category = item1.category
item1.category = ['first category', 'second category']
first_category = item1.category[0].category # get value of the element <category> with multiple values
all_categories = [cat.category for cat in item1.category]

# direct attributes setting
item1.enclosure.url = 'http://example.com/file'
item1.enclosure.length = 0
item1.enclosure.type = 'text/plain'

# or dict based attributes setting
item1.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}
item1.guid = {'guid': 'item identifier', 'isPermaLink': True}

item1.pubDate = datetime.now()  # works correctly with Python datetimes


item2.rss.title = 'Item title'
item2.rss.guid = 'identifier'
item2.rss.enclosure = {'url': 'http://example.com/file', 'length': 0, 'type': 'text/plain'}

All allowed elements are listed in scrapy_rss/items.py. All allowed attributes of each element, with their constraints and default values, are listed in scrapy_rss/elements.py. See the RSS specification for more details.
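Putting it together, a minimal spider that feeds the export pipeline could look like the following sketch (the spider name, start URL, and CSS selectors are hypothetical):

import scrapy
import scrapy_rss

class NewsSpider(scrapy.Spider):
    name = 'news_spider'                      # hypothetical name
    start_urls = ['http://example.com/news']  # hypothetical URL

    def parse(self, response):
        for article in response.css('article'):  # hypothetical selector
            item = scrapy_rss.RssItem()
            item.title = article.css('h2::text').extract_first()
            item.link = response.urljoin(article.css('a::attr(href)').extract_first())
            item.description = article.css('p::text').extract_first()
            yield item  # RssExportPipeline writes each yielded item into FEED_FILE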

RssItem derivation and namespaces

You can extend RssItem to add new XML fields, either namespaced or not. Namespaces can be specified in attribute and/or element constructors. A namespace prefix can be given either in the attribute/element name, using double underscores as the delimiter (prefix__name), or via the ns_prefix argument of the attribute/element constructor. A namespace URI can be specified using the ns_uri argument of the constructor.

from scrapy_rss.meta import ItemElementAttribute, ItemElement
from scrapy_rss.items import RssItem

class Element0(ItemElement):
    # attributes without special namespace
    attr0 = ItemElementAttribute(is_content=True, required=True)
    attr1 = ItemElementAttribute()

class Element1(ItemElement):
    # attribute "prefix2:attr2" with namespace xmlns:prefix2="id2"
    attr2 = ItemElementAttribute(ns_prefix="prefix2", ns_uri="id2")

    # attribute "prefix3:attr3" with namespace xmlns:prefix3="id3"
    prefix3__attr3 = ItemElementAttribute(ns_uri="id3")

    # attribute "prefix4:attr4" with namespace xmlns:prefix4="id4"
    fake_prefix__attr4 = ItemElementAttribute(ns_prefix="prefix4", ns_uri="id4")

    # attribute "attr5" with default namespace xmlns="id5"
    attr5 = ItemElementAttribute(ns_uri="id5")

class MyXMLItem(RssItem):
    # element <elem1> without namespace
    elem1 = Element0()

    # element <elem_prefix2:elem2> with namespace xmlns:elem_prefix2="id2e"
    elem2 = Element0(ns_prefix="elem_prefix2", ns_uri="id2e")

    # element <elem_prefix3:elem3> with namespace xmlns:elem_prefix3="id3e"
    elem_prefix3__elem3 = Element1(ns_uri="id3e")

    # yet another element <elem_prefix4:elem3> with namespace xmlns:elem_prefix4="id4e"
    # (does not conflict with previous one)
    fake_prefix__elem3 = Element0(ns_prefix="elem_prefix4", ns_uri="id4e")

    # element <elem5> with default namespace xmlns="id5e"
    elem5 = Element0(ns_uri="id5e")

Elements and their attributes are accessed in the same way as with simple items:

item = MyXMLItem()
item.title = 'Some title'
item.elem1.attr0 = 'Required content value'
item.elem1 = 'Another way to set content value'
item.elem1.attr1 = 'Some attribute value'
item.elem_prefix3__elem3.prefix3__attr3 = 'Yet another attribute value'
item.elem_prefix3__elem3.fake_prefix__attr4 = '' # non-None value is interpreted as assigned
item.fake_prefix__elem3.attr1 = 42

Several optional settings are allowed for namespaced items:

FEED_NAMESPACES

a list of tuples [(prefix, URI), ...] or a dictionary {prefix: URI, ...} of namespaces that must be defined in the root XML element

FEED_ITEM_CLASS or FEED_ITEM_CLS

the main class of feed items (either a class object such as MyXMLItem, or a path to the class such as "path.to.MyXMLItem"). Default value: RssItem. It is used to extract all possible namespaces that will be declared in the root XML element.

Feed items do NOT have to be instances of this class or its subclass.

If these settings are not defined, or only some of the namespaces are defined, then the remaining namespaces that are used will be declared either in the <item> element or in its subelements when these namespaces are not unique. Each <item> element and its subelements always contain only the namespace declarations of non-None attributes (including attributes that are interpreted as element content).
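For instance, with the MyXMLItem class above these settings might look as follows (the module path is a placeholder):

FEED_NAMESPACES = {'elem_prefix2': 'id2e', 'elem_prefix3': 'id3e'}
FEED_ITEM_CLASS = 'myproject.items.MyXMLItem'  # or pass the class object itself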

Scrapy Project Examples

The examples directory contains several Scrapy projects that demonstrate scrapy_rss usage by crawling a demo website.

Just go to one of the Scrapy project directories and run the commands

scrapy crawl first_spider
scrapy crawl second_spider

Afterwards, the files feed.rss and feed2.rss will be created in the same directory.
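The result is an ordinary RSS document, roughly of the following shape (exact formatting, element order, and autogenerated fields may differ):

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Some title of the channel</title>
    <link>http://example.com/rss</link>
    <description>About channel</description>
    <item>
      <title>RSS item title</title>
      <description>description</description>
    </item>
  </channel>
</rss>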
