
Get the normalized latest news from (almost) any website

Project description

Newscatcher

Programmatically collect normalized news from (almost) any website. By newscatcherapi.com.

Motivation

While working on newscatcherapi -- a JSON API to query news articles -- I came up with the idea of making a simple Python package that makes it easy to grab live news data.

When I was a junior data scientist working on my own side projects, it was difficult for me to work with external data sources. I knew Python quite well, but in most cases that was not enough to build proper data pipelines, which required gathering data on my own.

Even though I do not recommend using this package for any production system, I believe it should be enough to test your assumptions and build some MVPs.

Installation

pip install newscatcher

Tech/framework used

The package itself is nothing more than a SQLite database of RSS feed endpoints for each website, plus a basic wrapper around feedparser.
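To make that one-sentence architecture concrete, here is a minimal sketch of the same design using only the standard library. The table schema and the sample rows are my own stand-ins for illustration -- the package's actual schema and feed data are internal to it:

```python
import sqlite3

# Stand-in for the package's bundled database: a table mapping a
# website's base URL to its RSS endpoint. Schema and rows are
# illustrative, not the package's real internals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rss (website TEXT PRIMARY KEY, feed_url TEXT)")
conn.executemany(
    "INSERT INTO rss VALUES (?, ?)",
    [
        ("nytimes.com", "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml"),
        ("theverge.com", "https://www.theverge.com/rss/index.xml"),
    ],
)

def feed_for(website):
    """Look up the RSS endpoint for a base URL; None if unknown."""
    row = conn.execute(
        "SELECT feed_url FROM rss WHERE website = ?", (website,)
    ).fetchone()
    return row[0] if row else None

print(feed_for("theverge.com"))  # → https://www.theverge.com/rss/index.xml
```

In the real package, the URL returned by such a lookup would then be handed to feedparser, which does the actual fetching and parsing.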

Code Example/Documentation

Let's review every possible use of the package.

At its core is a class called Newscatcher. This class is all you need to get the latest news.

After installing the package, import the class:

from newscatcher import Newscatcher

Now you just need to pass the URL of the desired news source into the class. Use the base form of the website's URL: no www., no https://, and no trailing / at the end.

For example: nytimes.com, news.ycombinator.com, or theverge.com.
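If you are unsure whether a URL is already in base form, the rule above is easy to apply mechanically. The helper below is my own, not part of the package:

```python
from urllib.parse import urlparse

def normalize(url):
    """Reduce a URL to the base form Newscatcher expects:
    no scheme, no leading 'www.', no trailing slash."""
    # urlparse only fills in netloc when a scheme is present,
    # so prepend one for bare domains like 'nytimes.com'.
    if "//" not in url:
        url = "https://" + url
    netloc = urlparse(url).netloc
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]
    return netloc

print(normalize("https://www.nytimes.com/"))  # → nytimes.com
```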

news_source = Newscatcher('blackfaldslife.com')

If you have done it right and the source you chose is present in our database, you will get an object with 3 attributes and 1 method:

  • news_source.website -- the same string that you passed to the class.
  • news_source.news -- a list of feedparser dictionaries with the latest news presented on the website.
  • news_source.headlines -- a list of the latest headlines presented on the website.
  • news_source.print_headlines() -- prints the headlines of all the latest articles.

Each element of the news list is a JSON object with all relevant and available information about an article. If you want to know more about the attributes you can extract from this JSON, check the official feedparser documentation at this link: feedparser_attributes. You can find everything that begins with entries[i]. But be aware that not every attribute is provided by every news website.
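Because not every attribute is guaranteed to be present, it is safer to access entries defensively with .get() rather than attribute lookup. feedparser entries support dictionary-style access, so a plain dict is a fair stand-in here; the sample entry below is fabricated for illustration, not fetched from a real feed:

```python
# Fabricated sample mimicking a feedparser entry; real entries may
# carry many more keys (summary, published, tags, ...), or fewer.
entry = {
    "title": "Example headline",
    "link": "https://example.com/article",
    # note: no 'author' key -- many feeds omit optional fields
}

# .get() with a default never raises, even when the feed omits a field.
title = entry.get("title", "(no title)")
author = entry.get("author", "(unknown author)")

print(title)   # → Example headline
print(author)  # → (unknown author)
```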

If for some reason you do not like classes, you can always import the 2 main methods and use them separately.

from newscatcher import get_news
news = get_news('wired.co.uk')

or

from newscatcher import get_headlines
headlines = get_headlines('wired.co.uk')

Licence

MIT

