
Scrapy pipeline and items to create and store RSS feeds for podcasts.


scrapy-podcast-rss

This package provides a Scrapy pipeline and items to generate a podcast RSS feed from scraped information. The feed can be saved locally or in an S3 bucket; you can then point your podcast player at the file's URL and listen to its content.

Installation

Install scrapy-podcast-rss using pip:

$ pip install scrapy-podcast-rss

Configuration

  1. Define OUTPUT_URI in your settings.py file; this determines where your feed will be stored. For example:
    OUTPUT_URI = './my-podcast.xml'  # Local file.
    OUTPUT_URI = 's3://my-bucket/my-podcast.xml'  # S3 bucket (see the note on S3 storage below).
    
  2. Add PodcastPipeline in ITEM_PIPELINES in your settings.py file:
    ITEM_PIPELINES = {
        'scrapy_podcast_rss.pipelines.PodcastPipeline': 300,
    }
    

Usage

scrapy-podcast-rss defines two special items:

  • PodcastDataItem: Stores information about the podcast.
  • PodcastEpisodeItem: Stores information about each episode of the podcast.

Before your spider closes, you must yield exactly one PodcastDataItem, plus one PodcastEpisodeItem for each episode you want to export.

Here is the information you can currently store in each item (the field names must match exactly):

  • PodcastDataItem:
    • title: Title of the podcast.
    • description: Description of the podcast.
    • url: URL referencing the podcast.
    • image_url: URL of the podcast's main image.
  • PodcastEpisodeItem:
    • title: Title of the episode.
    • description: Description of the episode.
    • publication_date: Date of publication (datetime object with timezone).
    • audio_url: URL of the audio.
    • guid: Unique identifier of the episode.
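
As noted above, publication_date must be a timezone-aware datetime. A minimal sketch of building one with only the standard library (the pytz dependency used in the spider example below is not strictly required):

```python
import datetime

# The pipeline needs a timezone-aware publication date;
# datetime.timezone.utc attaches UTC without any third-party package.
pub_date = datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc)
print(pub_date.isoformat())  # 2020-01-01T00:00:00+00:00
```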

Example

You can find a minimal example of a spider using this package here: scrapy-podcast-rss-example.
You can also find an example of the package being used in an AWS Lambda function here: scrapy-podcast-rss-serverless.

Spider example

import datetime
import scrapy
import pytz
from scrapy_podcast_rss import PodcastEpisodeItem, PodcastDataItem


class SimpleSpider(scrapy.Spider):
    name = "simple_spider"
    start_urls = ['http://example.com/']
    custom_settings = {
        'OUTPUT_URI': './my-podcast.xml',
        'ITEM_PIPELINES': {'scrapy_podcast_rss.pipelines.PodcastPipeline': 300, }
    }

    def parse(self, response):
        podcast_data_item = PodcastDataItem()
        podcast_data_item['title'] = "Podcast title"
        podcast_data_item['description'] = "Description of the podcast."
        podcast_data_item['url'] = "Podcast's URL"
        podcast_data_item['image_url'] = "https://live.staticflickr.com/4211/35400224382_9edcb984e5_c.jpg"  # Sample image
        yield podcast_data_item

        episode_item = PodcastEpisodeItem()
        episode_item['title'] = "Episode title"
        episode_item['description'] = "Episode description"
        pub_date_tz = datetime.datetime.strptime("01/01/2020", "%m/%d/%Y").replace(tzinfo=pytz.UTC)
        episode_item['publication_date'] = pub_date_tz  # Publication date NEEDS to have a TIME ZONE.
        episode_item['guid'] = "Episode guid"  # Simulated identifier.
        episode_item['audio_url'] = "https://ia801803.us.archive.org/13/items/MOZARTSerenadeEineKleineNachtmusikK." \
                                    "525-NEWTRANSFER01.I.Allegro/01.I.Allegro.mp3"  # Sample audio URL.
        yield episode_item
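
Assuming the spider above is saved as simple_spider.py, it can be run directly with Scrapy's runspider command; the feed is then written to the configured OUTPUT_URI:

```shell
$ scrapy runspider simple_spider.py
```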

Note on using S3 as storage

To use S3 storage locations, install scrapy-podcast-rss with the s3_storage extra:

$ pip install scrapy-podcast-rss[s3_storage]

This simply adds boto3 to the dependencies.
Once installed, you will need your AWS credentials configured (see the boto3 quickstart guide).
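
For example, boto3 picks up credentials from environment variables, among other sources such as the shared ~/.aws/credentials file (the values below are placeholders):

```shell
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_DEFAULT_REGION="us-east-1"
```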
