RSS feed parser: Takes a URL and a configuration dict and returns an iterable object containing the feed's `<item>` elements

# rss_parse
## About rss_parse:
rss_parse is a module for Python 3.4.2 or newer. It takes an RSS feed URL and a dictionary of XPaths to the relevant data as input, fetches the RSS feed, parses it, and returns an iterable object in which each element contains the following details from one `<item>` in the RSS feed: title, body, URL, publication date, and image resource URL.

## Sample Usage:
### Using a standard Python dictionary as a configuration object.
```python
from rss_parse import RSSParser

rss_url = 'http://www.jpl.nasa.gov/multimedia/rss/news.xml'
xpath_configuration = {'xpathParse': {
    'stripHTML': True,
    'item': '/rss/channel/item',
    'namespace': {'re': 'http://exslt.org/regular-expressions'},
    'title': './/title/text()',
    'url': './/link/text()',
    'body': './/description/text()',
    'date': './/pubDate/text()',
    'image': '((re:match(.//description/text(), '
             '\'www.jpl.nasa.gov/images/[^\\">]+\', '
             "'g')/text()) | /rss/channel/image/url/text())[1]"
}}

parsed_feed = RSSParser(rss_url, xpath_configuration)
print(parsed_feed[0].title)
```

rss_parse.RSSParser uses XPaths to identify the various parts of a news article in an RSS feed. XPaths are an entirely separate topic not covered in this documentation. However, you can generally think of them as a directory structure in which the first item in the path encloses the subsequent items. So given the XML `<foo><bar><baz1></baz1><baz2>Hi!</baz2></bar></foo>`, the XPath `/foo/bar/baz2` would point us at the data in the `baz2` item and `/foo/bar/baz2/text()` would give us just the text `Hi!`.
> **NOTE:** Except for the XPath for the `item` key, all XPaths are relative to the `<item>` tag.
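
To make the directory analogy concrete, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. It is only an illustration and is not necessarily how rss_parse evaluates XPaths. ElementTree supports a limited XPath subset: paths are written relative to the root element, and the `text()` step is not available, so the sketch reads the `.text` attribute instead.

```python
import xml.etree.ElementTree as ET

xml_data = '<foo><bar><baz1></baz1><baz2>Hi!</baz2></bar></foo>'
root = ET.fromstring(xml_data)  # root is the <foo> element

# Equivalent of /foo/bar/baz2, written relative to <foo>
baz2 = root.find('./bar/baz2')

# Equivalent of /foo/bar/baz2/text()
baz2_text = baz2.text
print(baz2.tag, baz2_text)
```

The same idea applies to the relative XPaths in the configuration: each one starts from an `<item>` element rather than from the document root.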


#### In top-down ordering, we see the following:
#### Key: `xpathParse:`
Value: The value is a dictionary containing the following key:value pairs.

#### Key: `stripHTML:`
Value: This is either `True` or `False` (`true` or `false` in YAML and JSON), depending on whether the RSS feed has undesired HTML content in the main body (description/summary) text. Generally it's a good idea to simply set this to `True`. However, some RSS feeds, such as Google News, add links to recommended stories. Stripping HTML in those cases can make the summary text confusing to read. A future version of xkcd_news will have an additional option to fine-tune what content should be stripped from the feed.
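
As a rough illustration of what stripping HTML from a summary involves, the sketch below uses the standard library's `html.parser`. The `strip_html` helper and the sample summary are hypothetical and are not rss_parse's actual implementation. They only show why removing tags from a feed that embeds links to related stories can leave run-together text.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(fragment):
    # Hypothetical helper, for illustration only.
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return ''.join(extractor.chunks)


summary = '<p>Rover lands safely.</p><a href="http://example.com/more">Related story</a>'
plain_summary = strip_html(summary)
print(plain_summary)
```

Here the link text ends up glued to the end of the summary sentence, which is the kind of confusing result described above.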

#### Key: `item:`
Value: This is a fully specified XPath to the news items (headlines/articles) in the feed. Generally, this will not need to be changed. The exception might be Atom feeds, which use a slightly different specification that is similar to RSS.

#### Key: `namespace:`
Value: Namespaces are a part of XML and deserve their own section that won't be covered here. In rss_parse, they're generally used to help specify the XPath to an image associated with a specific news item in the RSS feed. If you are unsure what to use here, simply leave the value as an empty dictionary (i.e. `{}`).
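
For readers unfamiliar with namespaces, the sketch below shows how a prefix-to-URI mapping (the same shape as the `namespace` value) lets an XPath address a prefixed element. It uses the standard library's `xml.etree.ElementTree` and the Media RSS namespace purely as an illustration. The feed snippet is made up, and rss_parse may resolve namespaces differently.

```python
import xml.etree.ElementTree as ET

# Made-up feed fragment with a namespaced <media:thumbnail> element.
feed = (
    '<rss xmlns:media="http://search.yahoo.com/mrss/"><channel><item>'
    '<title>Example</title>'
    '<media:thumbnail url="http://example.com/thumb.jpg"/>'
    '</item></channel></rss>'
)

# Prefix-to-URI mapping, analogous to the `namespace` configuration value.
namespace = {'media': 'http://search.yahoo.com/mrss/'}

item = ET.fromstring(feed).find('./channel/item')
thumbnail = item.find('./media:thumbnail', namespace)
thumbnail_url = thumbnail.get('url')
print(thumbnail_url)
```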

#### Key: `title:`
Value: This is a relative XPath; the enclosing `/rss/channel/item` portion of the path is handled for you. This is effectively the headline of the news article. It is unlikely you will need to change this.

#### Key: `url:`
Value: This is the relative XPath that specifies a link to the full news article. It is unlikely you will need to change this.

#### Key: `body:`
Value: This is the relative XPath that specifies the summary/description text of the news article. It is unlikely you will need to change this.

#### Key: `date:`
Value: This is the relative XPath that specifies the publication date of the news article. It is unlikely you will need to change this. This date value determines the order of the final output.
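
To show how a publication date can drive ordering, here is a minimal sketch that sorts RSS-style date strings with the standard library. The sample dates are invented, and newest-first ordering is an assumption: this documentation does not state which direction rss_parse sorts in or how it parses dates.

```python
from email.utils import parsedate_to_datetime

# Invented <pubDate> values in the RFC 822 style commonly used by RSS feeds.
pub_dates = [
    'Mon, 02 Mar 2015 10:00:00 GMT',
    'Wed, 04 Mar 2015 08:30:00 GMT',
    'Tue, 03 Mar 2015 17:45:00 GMT',
]

# Assumption: newest items first.
newest_first = sorted(pub_dates, key=parsedate_to_datetime, reverse=True)
print(newest_first[0])
```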

#### Key: `image:`
Value: An image is not part of the default RSS specification. The result is that this value will likely need to be changed for any given RSS feed. In the example, we use the `re` namespace to use a regular expression to parse the image URL from the `body` content. See the xkcd_news project for additional examples.
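
The regular expression in the sample configuration can be tried out in plain Python before embedding it in an XPath. The sketch below uses the `re` module with the same pattern (dots escaped) and falls back to a channel-level image when nothing matches, which appears to be the intent of the union in the sample XPath. The `find_image` helper, the sample description, and the fallback URL are all hypothetical.

```python
import re

# Same pattern as in the sample configuration, with the dots escaped.
IMAGE_PATTERN = re.compile(r'www\.jpl\.nasa\.gov/images/[^\\">]+')


def find_image(description, channel_image=None):
    # Hypothetical helper, for illustration only.
    match = IMAGE_PATTERN.search(description)
    return match.group(0) if match else channel_image


# Made-up description text containing an image reference.
description = '<img src="http://www.jpl.nasa.gov/images/example/photo.jpg"> Story text'
image_url = find_image(description, 'http://example.com/logo.png')
fallback_url = find_image('No image here', 'http://example.com/logo.png')
print(image_url, fallback_url)
```

Note that the match stops at the closing quote, so the result does not include the `http://` scheme, just as the pattern in the sample configuration does not.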

#### The RSSParser() Output:
The object returned by creating an RSSParser can be treated as a list. Each item in that list contains the values retrieved by the associated XPaths (as described above). To build on the above example, we could do the following with the `parsed_feed` variable.

```python
for item in parsed_feed:
    print(item.url)    # the URL to the specific <item> in the RSS feed (e.g. a link to a news story)
    print(item.title)  # the title of the <item> (e.g. the headline of a news article)
    print(item.body)   # the main body text of the <item> (e.g. the summary text of a news article)
    print(item.date)   # the date the <item> was added or updated in the RSS feed (e.g. the publication date of a news article)
    print(item.image)  # the URL to an image associated with <item> (e.g. the logo of a news service); this is sometimes None
```

### Other Configuration formats:
> **NOTE:** You must convert these into a Python dictionary before passing them to RSSParser(). The examples below are for formatting reference.

#### YAML:
```yaml
xpathParse:
  stripHTML: true
  item: '/rss/channel/item'
  namespace:
    re: http://exslt.org/regular-expressions
  title: .//title/text()
  url: .//link/text()
  body: .//description/text()
  date: .//pubDate/text()
  image: ((re:match(.//description/text(), 'www.jpl.nasa.gov/images/[^\">]+', 'g')/text()) | /rss/channel/image/url/text())[1]
```

#### JSON:
```json
{
  "xpathParse": {
    "item": "/rss/channel/item",
    "url": ".//link/text()",
    "body": ".//description/text()",
    "date": ".//pubDate/text()",
    "stripHTML": true,
    "namespace": {
      "re": "http://exslt.org/regular-expressions"
    },
    "title": ".//title/text()",
    "image": "((re:match(.//description/text(), 'www.jpl.nasa.gov/images/[^\\\">]+', 'g')/text()) | /rss/channel/image/url/text())[1]"
  }
}
```
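
Converting a JSON configuration into the required Python dictionary can be done with the standard library, as in this minimal sketch. The configuration text is a simplified variant of the example above (empty `namespace`, channel-level image only), chosen for brevity. A YAML configuration would need a third-party parser such as PyYAML, which is not shown here.

```python
import json

# Simplified configuration, for illustration only.
config_text = '''
{
  "xpathParse": {
    "stripHTML": true,
    "item": "/rss/channel/item",
    "namespace": {},
    "title": ".//title/text()",
    "url": ".//link/text()",
    "body": ".//description/text()",
    "date": ".//pubDate/text()",
    "image": "/rss/channel/image/url/text()"
  }
}
'''

# json.loads turns JSON true/false into Python True/False.
xpath_configuration = json.loads(config_text)
print(xpath_configuration['xpathParse']['stripHTML'])

# The resulting dict can then be passed on as in the sample usage:
# parsed_feed = RSSParser(rss_url, xpath_configuration)
```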
