Skip to main content

RSS feed parser: Takes a URL and configuration dict and returns an iterable object containing feed `<items>`

Project description

# rss_parse
## About rss_parse:
rss_parse is a module for Python 3.4.2 or newer. It takes an RSS feed URL and a dictionary object that contains xpaths to the relevent data as input, fetchs the RSS feed data, parses it, and returns it as an iterable object where each element contains the following details from each `<item>` in the RSS feed: title, body, url, publication date, and image resource URL.

## Sample Usage:
### Using a standard Python dictionary as a configuration object.
```python
from rss_parse import RSSParser

rss_url = 'http://www.jpl.nasa.gov/multimedia/rss/news.xml'
xpath_configuration = { 'xpathParse': {
'stripHTML': True,
'item': '/rss/channel/item',
'namespace': {'re': 'http://exslt.org/regular-expressions'},
'title': './/title/text()',
'url': './/link/text()',
'body': './/description/text()',
'date': './/pubDate/text()',
'image': '((re:match(.//description/text(), '
'\'www.jpl.nasa.gov/images/[^\\">]+\', '
"'g')/text()) | /rss/channel/image/url/text())[1]"
}}

parsed_feed = RSSParser(rss_url, xpath_configuration)
print(parsed_feed[0].title)
```

rss_parse.RSSParser uses XPaths to identify the various parts of a news article in an RSS feed. XPaths are an entire separate topic not covered in this documentation. However, you can generally think of them as being like a directory structure where the first item in the path encapsulates the subsequent items. So given the XML <foo><bar><baz1></baz1><baz2>Hi!</baz2></bar></foo>, the XPath /foo/bar/baz2 would point us at the data in the baz2 item and /foo/bar/baz2/text() would give us just the text Hi!
> **NOTE:** Except for the XPath for the `item` key, all XPaths are relative to the `<item>` tag.


####In top-down ordering, we see the following:
#### Key: `xpathParse:`
Value: The value is a dictionary containing the following key:value pairs.

#### Key: `stripHTML:`
Value: This will either be `true` or `false` depending on if the RSS feed has undesired HTML content in the main body (description/summary) text. Generally it's a good idea to simply set this to `true`. However, some RSS feeds, such as Google News, add links to recommended stories. Stripping HTML in those cases can make the summary text confusing to read. A future version of xkcd_news will have an additional option to fine-tune what content should be stripped from the feed.

#### Key: `item:`
Value: This is a fully specified XPath to news items (headlines/articles) in the feed. Generally, this will never need to be changed. The exception might be for Atom feeds wich use a slightly different specification that is similar to RSS.

#### Key: `namespace:`
Value: Namespaces are a part of XML and deserve their own section that won't be covered here. In rss_parse, they're generally used to help specify the XPath to an image associated with a specific news item in the RSS feed. If you are unsure what to use here, simply leave the value as an empty dictionary (e.g. `{}`)

#### Key: `title:`
Value: This value is a relative XPath where the specific item in the XPath `/rss/channel/item` is handled for you. This is the effectively the headline of the news article. It is unlikely you will need to change this.

#### Key: `url:`
Value: This is the relative XPath that specifies a link to the full news article. It is unlikely you will need to change this.

#### Key: `body:`
Value: This is the relative XPath that specifies the summary/description text of the news article. It is unlikely you will need to change this.

#### Key: `date:`
Value: This is the relative XPath that specifies the publication date of the news article. It is unlikely you will need to change this. This date value determines the order of the final output.

#### Key: `image:`
Value: An image is not part of the default RSS specification. The result is that this value will likely need to be changed for any given RSS feed. In the example, we use the `re` namespace to use a regular expression to parse the image URL from the `body` content. See the xkcd_news project for additional examples.

#### The RSSParser() Output:
The output from creating the RSSParser can be treated as a list. Each item in that list contains the values retreived by the associated XPaths (as described above). To build on the above example, we could do the following with the parsed_feed variable.

```python
for item in parsed_feed:
print(item.url) # the URL to the specific <item> in the RSS feed. (e.g. a link to a news story)
print(item.title) # the title of the <item> (e.g. the headline of a news article)
print(item.body) # the main body text of the <item> (e.g. the summary text of a news article)
print(item.date) # the date the <item> was added or updated in the RSS feed (e.g. the publication date of a news article)
print(item.image) # the URL to an image associated with <item>. This is sometimes None. (e.g. the logo of a news service)
```

### Other Configuration formats:
> *NOTE:* You must convert these into a Python dictionary before passing them to RSSParser(). The below is for formatting reference.

#### YAML:
```yaml
xpathParse:
stripHTML: true
item: '/rss/channel/item'
namespace:
re: http://exslt.org/regular-expressions
title: .//title/text()
url: .//link/text()
body: .//description/text()
date: .//pubDate/text()
image: ((re:match(.//description/text(), 'www.jpl.nasa.gov/images/[^\">]+', 'g')/text()) | /rss/channel/image/url/text())[1]
```

#### JSON:
```json
{
"xpathParse": {
"item": "/rss/channel/item",
"url": ".//link/text()",
"body": ".//description/text()",
"date": ".//pubDate/text()",
"stripHTML": true,
"namespace": {
"re": "http://exslt.org/regular-expressions"
},
"title": ".//title/text()",
"image": "((re:match(.//description/text(), 'www.jpl.nasa.gov/images/[^\\\">]+', 'g')/text()) | /rss/channel/image/url/text())[1]"
}
}
```

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
rss_parse-0.1.0b1-py3.4.egg (7.4 kB) Copy SHA256 hash SHA256 Egg 3.4 Sep 3, 2016

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page