
Search sites for RSS, Atom, and JSON feeds

Project description


Feedsearch is a Python library for searching websites for RSS, Atom, and JSON feeds.

It was originally based on Feedfinder2 written by Dan Foreman-Mackey, which in turn is based on feedfinder - originally written by Mark Pilgrim and subsequently maintained by Aaron Swartz until his untimely death.

Feedsearch now differs substantially from Feedfinder2: it supports JSON feeds, can optionally fetch Feed and Site metadata, and can optionally search the content of internally linked pages and default CMS feed locations.

Please Note: Development of this library is no longer ongoing except in the case of fixing reported bugs. Further development of Feedsearch functionality has now moved to Feedsearch Crawler.

Usage

Feedsearch is called with the single function search:

>>> from feedsearch import search
>>> feeds = search('xkcd.com')
>>> feeds
[FeedInfo('https://xkcd.com/atom.xml'), FeedInfo('https://xkcd.com/rss.xml')]
>>> feeds[0].url
'https://xkcd.com/atom.xml'

To get Feed and Site metadata:

>>> from pprint import pprint
>>> feeds = search('propublica.org', info=True)
>>> feeds
[FeedInfo('http://feeds.propublica.org/propublica/main')]
>>> pprint(vars(feeds[0]))
{'bozo': 0,
 'content_type': 'text/xml; charset=UTF-8',
 'description': 'Latest Articles and Investigations from ProPublica, an '
                'independent, non-profit newsroom that produces investigative '
                'journalism in the public interest.',
 'favicon': 'https://assets.propublica.org/prod/v3/images/favicon.ico',
 'favicon_data_uri': '',
 'hubs': ['http://feedpress.superfeedr.com/'],
 'is_push': True,
 'score': 4,
 'self_url': 'http://feeds.propublica.org/propublica/main',
 'site_name': 'ProPublica',
 'site_url': 'https://www.propublica.org/',
 'title': 'Articles and Investigations - ProPublica',
 'url': 'http://feeds.propublica.org/propublica/main',
 'version': 'rss20'}

The search function always returns a list of FeedInfo objects, each of which always has a url property. Feeds are sorted by score from highest to lowest; a higher score is intended to indicate a feed that is more relevant to the URL originally provided.
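The ordering can be illustrated without touching the network. The following is a minimal sketch: the SimpleNamespace objects are hypothetical stand-ins for FeedInfo results (not the library's class), the example.com URLs are made up, and best_feed is an illustrative helper that is not part of Feedsearch.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for FeedInfo objects; real results come from search().
feeds = [
    SimpleNamespace(url="https://example.com/comments/feed", score=2),
    SimpleNamespace(url="https://example.com/feed", score=4),
]


def best_feed(results):
    """Return the highest-scoring result, or None for an empty list."""
    # search() is documented to return results already sorted by score,
    # so this re-sort only matters if the list was reordered afterwards.
    ranked = sorted(results, key=lambda f: getattr(f, "score", 0), reverse=True)
    return ranked[0] if ranked else None


top = best_feed(feeds)
print(top.url)
```

Because the result is a plain list, an empty list (no feeds found) has to be handled by the caller; the helper returns None in that case.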

If you only want the raw URLs, use a list comprehension on the result, or set the as_urls parameter to True:

>>> feeds = search('http://jsonfeed.org')
>>> feeds
[FeedInfo('https://jsonfeed.org/xml/rss.xml'), FeedInfo('https://jsonfeed.org/feed.json')]
>>> urls = [f.url for f in feeds]
>>> urls
['https://jsonfeed.org/xml/rss.xml', 'https://jsonfeed.org/feed.json']

>>> feeds = search('http://jsonfeed.org', as_urls=True)
>>> feeds
['https://jsonfeed.org/xml/rss.xml', 'https://jsonfeed.org/feed.json']

In addition to the URL, the search function takes the following optional keyword arguments:

  • info: bool: Get Feed and Site Metadata. Defaults False.

  • check_all: bool: Follow internal <a> links and check the linked pages for feeds, and also check default CMS feed locations. Only follows links one level down. Defaults False. May be very slow.

  • user_agent: str: User-Agent Header string. Defaults to Package name.

  • timeout: float or tuple(float, float): Timeout for each request in the search (not a timeout for the search method itself). Defaults to 3 seconds. See Requests timeout documentation for more info.

  • max_redirects: int: Maximum number of redirects for each request. Defaults to 30.

  • parser: str: BeautifulSoup parser for HTML parsing. Defaults to 'html.parser'.

  • exceptions: bool: If False, will gracefully handle Requests exceptions and attempt to keep searching. If True, will leave Requests exceptions uncaught to be handled by the caller. Defaults False.

  • verify: bool or str: Verify SSL Certificates. See Requests SSL documentation for more info.

  • favicon_data_uri: bool: Convert Favicon to Data URI. Defaults False.

  • as_urls: bool: Return found Feeds as a list of URL strings instead of FeedInfo objects.

  • cms: bool: Check default CMS feed location if no feeds already found and site is using a known CMS. Defaults True.

  • discovery_only: bool: Only search for RSS discovery tags (e.g. <link rel="alternate" href=…>). Defaults False. Overridden by check_all if check_all is True.
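The keyword arguments above can be combined in a single call. The sketch below is one possible way to package them, under a few assumptions: build_search_options and find_feeds are hypothetical helper names, the timeout values are arbitrary, and the search function is passed in as an argument so the example runs without the library or a network connection. The two-value timeout follows the Requests convention of (connect timeout, read timeout).

```python
def build_search_options(thorough=False, request_timeout=(3.05, 10.0)):
    """Assemble keyword arguments for feedsearch.search()."""
    options = {
        "info": True,                # fetch Feed and Site metadata
        "timeout": request_timeout,  # per request: (connect, read) seconds
        "exceptions": False,         # keep searching if a request fails
        "as_urls": False,            # return FeedInfo objects, not strings
    }
    if thorough:
        # Slow path: follow internal links and check default CMS feeds.
        options["check_all"] = True
    else:
        # Fast path: look at discovery <link> tags only.
        options["discovery_only"] = True
    return options


def find_feeds(url, search_func, thorough=False):
    """Call the given search function with the assembled options."""
    return search_func(url, **build_search_options(thorough=thorough))


print(build_search_options(thorough=True))
```

With the library installed, the real function can be supplied as the second argument, for example find_feeds('xkcd.com', search) after from feedsearch import search.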

FeedInfo Values

FeedInfo objects may have the following values if info is True:

  • bozo: int: Set to 1 when feed data is not well formed or may not be a feed. Defaults 0.

  • content_type: str: Content-Type value of the returned feed.

  • description: str: Feed description.

  • favicon: str: URL of site Favicon.

  • favicon_data_uri: str: Data URI of site Favicon.

  • hubs: List[str]: List of WebSub hubs of feed if available.

  • is_push: bool: True if feed contains valid WebSub data.

  • score: int: Computed relevance of feed url value to provided URL. May be safely ignored.

  • self_url: str: rel="self" value returned from feed links. In some cases may be different from feed url.

  • site_name: str: Name of feed’s website.

  • site_url: str: URL of feed’s website.

  • title: str: Feed Title.

  • url: str: URL location of feed.

  • version: str: Feed version, e.g. 'rss20' for XML feeds, or the JSON Feed version.
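Since these values may be absent (for example when info is False), code that reads them is safer using getattr with a default. The following sketch filters results down to well-formed feeds that advertise WebSub hubs. The SimpleNamespace objects are hypothetical stand-ins for FeedInfo, the example.com URLs are made up, and usable_push_feeds is an illustrative helper, not a Feedsearch function.

```python
from types import SimpleNamespace


def usable_push_feeds(results):
    """Return (url, hubs) pairs for well-formed feeds with WebSub hubs."""
    selected = []
    for feed in results:
        # bozo is 1 when the feed data is malformed or may not be a feed.
        if getattr(feed, "bozo", 0):
            continue
        # Metadata attributes may be missing, so read them defensively.
        if getattr(feed, "is_push", False) and getattr(feed, "hubs", None):
            selected.append((feed.url, list(feed.hubs)))
    return selected


# Hypothetical stand-ins for FeedInfo objects.
sample = [
    SimpleNamespace(
        url="http://feeds.propublica.org/propublica/main",
        bozo=0,
        is_push=True,
        hubs=["http://feedpress.superfeedr.com/"],
    ),
    SimpleNamespace(url="https://example.com/broken.xml", bozo=1,
                    is_push=False, hubs=[]),
    SimpleNamespace(url="https://example.com/feed"),  # no metadata at all
]

push_feeds = usable_push_feeds(sample)
print(push_feeds)
```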

Search Order

Feedsearch searches for feeds in the following order:

  1. If the URL points directly to a feed, then return that feed.

  2. If discovery_only is True, search only <link rel="alternate"> tags. Return unless check_all is True.

  3. Search all <link> tags. Return if feeds are found and check_all is False.

  4. If cms or check_all is True, search for default CMS feeds if the site is using a known CMS. Return if feeds are found and check_all is False.

  5. Search all <a> tags. Return if check_all is False.

  6. This point will only be reached if check_all is True.

  7. Fetch the content of all internally pointing <a> tags whose URL paths indicate they may contain feeds (e.g. /feed, /rss, /atom). All <link> tags and <a> tags of the content are searched, although not recursively. Return if feeds are found. This step may be very slow, so be sure you want check_all enabled.

  8. If step 7 failed to find feeds, then as a last resort Feedsearch makes a few guesses at potential feed URLs.
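The control flow of the steps above can be sketched as a small function. This is one reading of the documented order, not the library's actual implementation: search_in_order, the step names, and the recorder helper are all hypothetical, and each step is a zero-argument callable returning a list of feed URLs so the flow can be followed without network access. In particular, carrying earlier results forward when check_all is True is an assumption.

```python
def search_in_order(steps, discovery_only=False, check_all=False, cms=True):
    """Walk the documented search order using caller-supplied step functions."""
    direct = steps["direct"]()            # 1. URL is itself a feed
    if direct:
        return direct
    if discovery_only and not check_all:  # 2. discovery tags only
        return steps["link_alternate"]()
    found = list(steps["link"]())         # 3. all <link> tags
    if found and not check_all:
        return found
    if cms or check_all:                  # 4. default CMS feed locations
        found += steps["cms"]()
        if found and not check_all:
            return found
    found += steps["a"]()                 # 5. all <a> tags
    if not check_all:
        return found
    internal = steps["internal"]()        # 6-7. fetch internal pages
    if internal:
        return found + internal
    return found + steps["guess"]()       # 8. last-resort guesses


calls = []


def recorder(name, result):
    """Build a fake step that logs its name and returns a fixed result."""
    def run():
        calls.append(name)
        return list(result)
    return run


steps = {
    "direct": recorder("direct", []),
    "link_alternate": recorder("link_alternate", []),
    "link": recorder("link", []),
    "cms": recorder("cms", ["https://example.com/feed/"]),
    "a": recorder("a", []),
    "internal": recorder("internal", []),
    "guess": recorder("guess", []),
}

result = search_in_order(steps)
print(result, calls)
```

With the default flags the walk stops at the CMS step as soon as a feed is found, which is why leaving check_all disabled is usually much faster.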


Download files


Source Distribution

feedsearch-1.0.12.tar.gz (18.5 kB view details)

Uploaded Source

Built Distribution

feedsearch-1.0.12-py35.py36.py37-none-any.whl (18.9 kB view details)

Uploaded Python 3.5 Python 3.6 Python 3.7

