
Search sites for RSS and JSON feeds



Feedsearch is a Python library for searching websites for RSS, Atom, and JSON feeds.

It was originally based on Feedfinder2, written by Dan Foreman-Mackey, which is in turn based on feedfinder, originally written by Mark Pilgrim and subsequently maintained by Aaron Swartz until his untimely death.

The main differences from Feedfinder2 are that Feedsearch supports JSON feeds and can optionally fetch Feed and Site metadata.

Usage

Feedsearch exposes a single function, search:

>>> from feedsearch import search
>>> feeds = search('xkcd.com')
>>> feeds
[FeedInfo('https://xkcd.com/atom.xml'), FeedInfo('https://xkcd.com/rss.xml')]
>>> feeds[0].url
'https://xkcd.com/atom.xml'

To get Feed and Site metadata:

>>> feeds = search('propublica.org', info=True)
>>> feeds
[FeedInfo('http://feeds.propublica.org/propublica/main')]
>>> from pprint import pprint
>>> pprint(vars(feeds[0]))
{'bozo': 0,
 'content_type': 'text/xml; charset=UTF-8',
 'description': 'Latest Articles and Investigations from ProPublica, an '
                'independent, non-profit newsroom that produces investigative '
                'journalism in the public interest.',
 'favicon': 'https://assets.propublica.org/prod/v3/images/favicon.ico',
 'favicon_data_uri': '',
 'hubs': ['http://feedpress.superfeedr.com/'],
 'is_push': True,
 'score': 4,
 'self_url': 'http://feeds.propublica.org/propublica/main',
 'site_name': 'ProPublica',
 'site_url': 'https://www.propublica.org/',
 'title': 'Articles and Investigations - ProPublica',
 'url': 'http://feeds.propublica.org/propublica/main',
 'version': 'rss20'}

search always returns a list of FeedInfo objects, each of which always has a url property. Feeds are sorted by their score value from highest to lowest; a higher score is intended to indicate a feed that is more relevant to the URL originally provided.
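Because the result is always a list that is already ordered by score, picking the most relevant feed can be reduced to taking the first element. The sketch below is only an illustration: best_feed is a hypothetical helper name, the SimpleNamespace objects are stand-ins for FeedInfo results so the snippet runs without network access, and the empty-list case is an assumption about what a search that finds nothing returns.

```python
from types import SimpleNamespace


def best_feed(feeds):
    """Return the URL of the first (highest-scored) feed, or None for an empty list."""
    return feeds[0].url if feeds else None


# Stand-ins for FeedInfo objects, already ordered by score as search() returns them.
feeds = [
    SimpleNamespace(url="https://example.com/atom.xml", score=4),
    SimpleNamespace(url="https://example.com/rss.xml", score=2),
]

print(best_feed(feeds))
print(best_feed([]))
```

With the real library, the same helper would be applied to the return value of search, for example best_feed(search('xkcd.com')).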

If you only want the raw URLs, either use a list comprehension on the result or set the as_urls parameter to True:

>>> feeds = search('http://jsonfeed.org')
>>> feeds
[FeedInfo('https://jsonfeed.org/xml/rss.xml'), FeedInfo('https://jsonfeed.org/feed.json')]
>>> urls = [f.url for f in feeds]
>>> urls
['https://jsonfeed.org/xml/rss.xml', 'https://jsonfeed.org/feed.json']

>>> feeds = search('http://jsonfeed.org', as_urls=True)
>>> feeds
['https://jsonfeed.org/xml/rss.xml', 'https://jsonfeed.org/feed.json']

In addition to the URL, the search function takes the following optional keyword arguments:

  • info: bool: Get Feed and Site Metadata. Defaults False.

  • check_all: bool: Check all <link> and <a> tags on page. Defaults False.

  • user_agent: str: User-Agent Header string. Defaults to Package name.

  • timeout: int or tuple: Timeout for each request in the search (not a timeout for the search method itself). Defaults to 30 seconds.

  • max_redirects: int: Maximum number of redirects for each request. Defaults to 30.

  • parser: str: BeautifulSoup parser for HTML parsing. Defaults to 'html.parser'.

  • exceptions: bool: If False, will gracefully handle Requests exceptions and attempt to keep searching. If True, will leave Requests exceptions uncaught to be handled by the caller. Defaults False.

  • favicon_data_uri: bool: Convert Favicon to Data Uri. Defaults False.

  • as_urls: bool: Return found Feeds as a list of URL strings instead of FeedInfo objects. Defaults False.
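The keyword arguments above can be collected in one place and passed to search together. The following is a minimal sketch, not part of the library: build_search_options and find_feeds are hypothetical helper names, the baseline values are arbitrary choices for illustration, and the tuple form of timeout is assumed to follow the Requests (connect, read) convention.

```python
# Keyword names taken from the argument list above.
ALLOWED_OPTIONS = frozenset({
    "info", "check_all", "user_agent", "timeout", "max_redirects",
    "parser", "exceptions", "favicon_data_uri", "as_urls",
})


def build_search_options(**overrides):
    """Merge caller overrides into a baseline set of search() keyword arguments."""
    unknown = set(overrides) - ALLOWED_OPTIONS
    if unknown:
        raise TypeError(f"unsupported search options: {sorted(unknown)}")
    options = {
        "info": True,          # fetch Feed and Site metadata
        "check_all": False,    # do not scan every <link> and <a> tag
        "timeout": (5, 30),    # assumed (connect, read) seconds, per request
        "max_redirects": 5,    # illustrative value, lower than the default
        "exceptions": False,   # let the search continue past Requests errors
    }
    options.update(overrides)
    return options


def find_feeds(url, **overrides):
    """Call feedsearch.search with the merged options."""
    # Imported lazily so this module loads even where feedsearch is not installed.
    from feedsearch import search
    return search(url, **build_search_options(**overrides))


print(build_search_options(check_all=True))
```

A call such as find_feeds('xkcd.com', as_urls=True) would then run the search with the baseline options plus the override. Rejecting unknown names early keeps a misspelled option from being silently passed through.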

FeedInfo Values

FeedInfo objects may have the following values if info is True:

  • bozo: int: Set to 1 when feed is not well formed. Defaults 0.

  • content_type: str: Content-Type value of the returned feed.

  • description: str: Feed description.

  • favicon: str: Url of site Favicon.

  • favicon_data_uri: str: Data Uri of site Favicon.

  • hubs: List[str]: List of WebSub hubs of feed if available.

  • is_push: bool: True if feed contains valid WebSub data.

  • score: int: Computed relevance of feed url value to provided URL. May be safely ignored.

  • self_url: str: rel="self" value returned from feed links. In some cases may be different from feed url.

  • site_name: str: Name of feed’s website.

  • site_url: str: URL of feed’s website.

  • title: str: Feed Title.

  • url: str: URL location of feed.

  • version: str: Feed version, for example 'rss20' for an XML feed, or the JSON Feed version.
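Since these values are only populated when info is True, code that reads them may want to fall back to defaults. The sketch below shows one way to select well-formed feeds that advertise WebSub hubs. websub_feeds is a hypothetical helper name, and the SimpleNamespace objects are stand-ins for FeedInfo results so the snippet runs without network access.

```python
from types import SimpleNamespace


def websub_feeds(feeds):
    """Return (url, hubs) pairs for well-formed feeds that advertise WebSub hubs."""
    selected = []
    for feed in feeds:
        # getattr defaults cover results fetched without info=True.
        well_formed = getattr(feed, "bozo", 0) == 0
        is_push = getattr(feed, "is_push", False)
        hubs = getattr(feed, "hubs", None) or []
        if well_formed and is_push and hubs:
            selected.append((feed.url, list(hubs)))
    return selected


# Stand-ins for FeedInfo objects with illustrative values.
feeds = [
    SimpleNamespace(url="https://example.com/main", bozo=0, is_push=True,
                    hubs=["https://hub.example.com/"]),
    SimpleNamespace(url="https://example.com/broken", bozo=1, is_push=True,
                    hubs=["https://hub.example.com/"]),
    SimpleNamespace(url="https://example.com/plain"),  # no metadata fetched
]

print(websub_feeds(feeds))
```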


