Get the normalized latest news from (almost) any website

Newscatcher

Programmatically collect normalized news from (almost) any website. By newscatcherapi.com.

Motivation

While working on newscatcherapi, a JSON API for querying news articles, I came up with the idea of a simple Python package that makes it easy to grab live news data.

When I was a junior data scientist working on my own side projects, it was difficult for me to work with external data sources. I knew Python quite well, but in most cases that was not enough to build proper data pipelines that required gathering data on my own.

Even though I do not recommend using this package in production systems, I believe it should be enough to test your assumptions and build some MVPs.

Installation

pip install newscatcher

Tech/framework used

The package itself is nothing more than a SQLite database with RSS feed endpoints for each website and a basic wrapper around feedparser.
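To make that design concrete, here is a minimal sketch of the same idea: look up a website's RSS endpoint in a SQLite table, then turn feed XML into a list of entries. The table name rss_feeds, its columns, the helper names and the sample feed are all hypothetical, not the package's actual schema or code. To keep the sketch runnable offline with only the standard library, it parses an inline RSS string with xml.etree.ElementTree as a stand-in for feedparser.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one row per website with its RSS endpoint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rss_feeds (website TEXT PRIMARY KEY, rss_url TEXT)")
conn.execute(
    "INSERT INTO rss_feeds VALUES (?, ?)",
    ("example.com", "https://example.com/feed.xml"),
)


def find_feed_url(website):
    """Return the stored RSS endpoint for a website, or None if unknown."""
    row = conn.execute(
        "SELECT rss_url FROM rss_feeds WHERE website = ?", (website,)
    ).fetchone()
    return row[0] if row else None


# Inline sample feed so the sketch needs no network access.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example</title>
<item><title>First story</title><link>https://example.com/1</link></item>
<item><title>Second story</title><link>https://example.com/2</link></item>
</channel></rss>"""


def parse_items(rss_text):
    """Turn RSS XML into a list of dicts, one per <item> element."""
    root = ET.fromstring(rss_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


feed_url = find_feed_url("example.com")
items = parse_items(SAMPLE_RSS)
```

In a real pipeline the XML would be fetched from feed_url and handed to feedparser, which copes with many more feed formats and quirks than this stand-in parser.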

Code Example/Documentation

Let's review all the ways to use the package.

At its core, it has a class called Newscatcher. This class is all you need to get the latest news.

After installing the package, import the class:

from newscatcher import Newscatcher

Now pass the URL of the desired news source to the class. Use the base form of the URL: no https://, no www., and no / at the end.

For example: “nytimes.com”, “news.ycombinator.com” or “theverge.com”.

news_source = Newscatcher('blackfaldslife.com')
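If your URLs arrive in full form, a small helper can reduce them to the base form described above before you pass them to the class. The function to_base_url is a hypothetical helper written for this illustration; it is not part of the package, and it does not strip paths or query strings.

```python
def to_base_url(url):
    """Strip the scheme, a leading 'www.' and trailing slashes from a URL."""
    base = url.strip().lower()
    for scheme in ("https://", "http://"):
        if base.startswith(scheme):
            base = base[len(scheme):]
            break
    if base.startswith("www."):
        base = base[len("www."):]
    return base.rstrip("/")


cleaned = [
    to_base_url(u)
    for u in ("https://www.nytimes.com/", "news.ycombinator.com", "http://theverge.com")
]
```

The result of to_base_url can then be passed straight to Newscatcher.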

If you have done it right and the source you chose is present in our database, you will get an object with 3 attributes and 1 method:

  • news_source.website -- the same string that you passed to the class.
  • news_source.news -- a list of feedparser dictionaries with the latest news from the website.
  • news_source.headlines -- a list of the latest headlines from the website.
  • news_source.print_headlines() -- prints the headlines of all the latest articles.

Each element of the news list is a JSON-like dictionary with all relevant and available information about an article. To learn more about the attributes you can extract from it, check the official feedparser documentation: feedparser_attributes. Everything that begins with entries[i] applies. Be aware, however, that not every news website provides every attribute.
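Because some attributes may be missing, it is safer to read them with .get() than with direct indexing. The sketch below uses plain dictionaries as stand-ins for feedparser entries (which are dictionary-like and also support .get()); the helper summarize_entry and the sample data are made up for illustration. The field names title, link, published and summary are standard feedparser entry attributes.

```python
def summarize_entry(entry):
    """Pick a few common fields, using None for any the feed did not supply."""
    return {
        "title": entry.get("title"),
        "link": entry.get("link"),
        "published": entry.get("published"),
        "summary": entry.get("summary"),
    }


# Made-up entries: the second one lacks 'published' and 'summary'.
sample_entries = [
    {
        "title": "Full entry",
        "link": "https://example.com/a",
        "published": "Mon, 06 Jan 2020 10:00:00 GMT",
        "summary": "A short summary.",
    },
    {"title": "Sparse entry", "link": "https://example.com/b"},
]

rows = [summarize_entry(e) for e in sample_entries]
```

With the real package, the same helper could be applied to each element of news_source.news.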

If for some reason you do not like classes, you can import the 2 main functions and use them separately.

from newscatcher import get_news
news = get_news('wired.co.uk')

or

from newscatcher import get_headlines
news = get_headlines('wired.co.uk')

Licence

MIT
