Get the normalized latest news from (almost) any website

Newscatcher

Programmatically collect normalized news from (almost) any website.

Filter by topic, country, or language.

By newscatcherapi.com. This package is fully self-sufficient: you can use it as is, with no dependency on external services or APIs.

Motivation

While working on newscatcherapi, a JSON API for querying news articles, I came up with the idea of making a simple Python package that makes it easy to grab live news data.

When I was a junior data scientist working on my own side projects, it was difficult for me to work with external data sources. I knew Python quite well, but in most cases that was not enough to build proper data pipelines that required gathering data on my own. I hope that this package will help you with your next project.

Even though I do not recommend using this package in production systems, I believe it is enough to test your assumptions and build some MVPs.

Installation

pip install newscatcher --upgrade

Quick Start

from newscatcher import Newscatcher

Get the latest news from the main news feed of nytimes.com (we support thousands of news websites, try it yourself!):

nc = Newscatcher(website = 'nytimes.com')
results = nc.get_news()

# results.keys()
# 'url', 'topic', 'language', 'country', 'articles'

# Get the articles
articles = results['articles']

first_article_summary = articles[0]['summary']
first_article_title = articles[0]['title']

Get the latest news from nytimes.com politics feed

nc = Newscatcher(website = 'nytimes.com', topic = 'politics')

results = nc.get_news()
articles = results['articles']

There is a limited set of topics that you might find:

'tech', 'news', 'business', 'science', 'finance', 'food', 'politics', 'economics', 'travel', 'entertainment', 'music', 'sport', 'world'

However, not all topics are supported by every newspaper.

How to check which topics are supported by which newspaper:

from newscatcher import describe_url

describe = describe_url('nytimes.com')

print(describe['topics'])
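The check above can be folded into a small helper that falls back to the main feed when a topic is not listed. This is only a sketch: `pick_topic` and `get_news_for` are hypothetical names, not part of the package, and the handling of an empty description is a defensive assumption about what describe_url() may return for an unknown website.

```python
from typing import Optional


def pick_topic(description: Optional[dict], wanted: str) -> Optional[str]:
    """Return `wanted` if the site lists it among its topics, else None."""
    if not description:
        # Assumption: an unknown website may yield no description at all
        return None
    topics = description.get("topics") or []
    return wanted if wanted in topics else None


def get_news_for(site: str, wanted: str):
    """Fetch the `wanted` feed when supported, otherwise the main feed."""
    # Imported here so pick_topic() stays usable without the package installed
    from newscatcher import Newscatcher, describe_url

    topic = pick_topic(describe_url(site), wanted)
    # topic=None asks for the main feed
    return Newscatcher(website=site, topic=topic).get_news()
```

Calling `get_news_for('nytimes.com', 'politics')` would then request the politics feed only if describe_url() reports it.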

Get the list of all news feeds by topic/language/country

To find the full list of supported news websites, use the urls() function:

from newscatcher import urls

# URLs by TOPIC
politic_urls = urls(topic = 'politics')

# URLs by COUNTRY
american_urls = urls(country = 'US')

# URLs by LANGUAGE
english_urls = urls(language = 'en')

# Combine any of topic, country, and language
american_english_politics_urls = urls(country = 'US', topic = 'politics', language = 'en')

# Note: some websites do not explicitly declare their language;
# as a result, they are excluded from language-based queries
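To see which websites a language filter drops, one option is to compare two result lists. The helper below is a sketch with a hypothetical name (`dropped_by_language_filter`); it assumes urls() returns a list of website strings, and the dropped sites may either declare a different language or declare none.

```python
from typing import Iterable, List


def dropped_by_language_filter(
    without_language: Iterable[str], with_language: Iterable[str]
) -> List[str]:
    """Sites present in the broader query but absent once a language is added."""
    kept = set(with_language)
    # Preserve the order of the broader query
    return [site for site in without_language if site not in kept]
```

For example, pass it `urls(country = 'US')` and `urls(country = 'US', language = 'en')` to list the US websites that do not appear in the English-only query.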

Documentation

`Newscatcher` Class

from newscatcher import Newscatcher

Newscatcher(website, topic = None)

Use the base form of the website URL: without www. or https:// at the start, and without / at the end.

For example: “nytimes.com”, “news.ycombinator.com” or “theverge.com”.
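If your URLs come from elsewhere, they can be reduced to this base form before being passed to Newscatcher. The helper below is a minimal sketch using only the standard library; `to_base_form` is a hypothetical name and is not part of the package.

```python
from urllib.parse import urlparse


def to_base_form(url: str) -> str:
    """Reduce a URL to its bare host, e.g. 'https://www.nytimes.com/' -> 'nytimes.com'."""
    url = url.strip()
    # urlparse only fills in the host when a '//' separator is present
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).netloc.lower()
    # Drop a leading 'www.' but keep other subdomains such as 'news.'
    if host.startswith("www."):
        host = host[len("www."):]
    return host
```

Subdomains other than www. are kept, so “news.ycombinator.com” passes through unchanged.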

Newscatcher.get_news() - Get the latest news from the website of interest.

Allowed topics: tech, news, business, science, finance, food, politics, economics, travel, entertainment, music, sport, world

If no topic is provided, the main feed is returned.

Returns a dictionary of 5 elements:

url - URL of the website
topic - topic of the returned feed
language - language of returned feed
country - country of returned feed
articles - articles of the feed (a feedparser object)

Newscatcher.get_headlines() - Returns only the headlines

Newscatcher.print_headlines(n) - Print top n headlines
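As an illustration of working with the returned dictionary, the sketch below pulls (title, summary) pairs out of a get_news()-style result. `brief_articles` is a hypothetical helper, not part of the package; it assumes each article can be read like a dictionary with 'title' and 'summary' fields, as in the Quick Start, and it treats an empty result defensively.

```python
from typing import List, Optional, Tuple


def brief_articles(results: Optional[dict], limit: int = 5) -> List[Tuple[str, str]]:
    """Return up to `limit` (title, summary) pairs from a get_news()-style dictionary."""
    if not results:
        # Defensive assumption: no feed may have been returned
        return []
    pairs = []
    for article in results.get("articles", [])[:limit]:
        # Missing fields fall back to an empty string
        pairs.append((article.get("title", ""), article.get("summary", "")))
    return pairs
```

With `nc = Newscatcher(website = 'nytimes.com')`, calling `brief_articles(nc.get_news(), 3)` would give the first three pairs.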

`describe_url()` & `urls()`

These functions exist to help you navigate this package.

from newscatcher import describe_url

describe_url(website) - Get the main info on the website.

Returns a dictionary of 5 elements:

url - URL of the website
topics - list of all supported topics
language - language of website
country - country of the website
main_topic - main topic of a website

from newscatcher import urls

urls(topic = None, language = None, country = None) - Get a list of all supported news websites given any combination of topic, language, country

Returns a list of websites that match your combination of topic, language, country

Supported topics: tech, news, business, science, finance, food, politics, economics, travel, entertainment, music, sport, world

Supported countries: US, GB, DE, FR, IN, RU, ES, BR, IT, CA, AU, NL, PL, NZ, PT, RO, UA, JP, AR, IR, IE, PH, IS, ZA, AT, CL, HR, BG, HU, KR, SZ, AE, EG, VE, CO, SE, CZ, ZH, MT, AZ, GR, BE, LU, IL, LT, NI, MY, TR, BM, NO, ME, SA, RS, BA

Supported languages: EL, IT, ZH, EN, RU, CS, RO, FR, JA, DE, PT, ES, AR, HE, UK, PL, NL, TR, VI, KO, TH, ID, HR, DA, BG, NO, SK, FA, ET, SV, BN, GU, MK, PA, HU, SL, FI, LT, MR, HI
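To get a feel for coverage, you could count the matching websites per topic. The sketch below takes the lookup function as a parameter so it can be tried without the package; `count_sites_by_topic` is a hypothetical name, and treating an empty or missing result as zero is a defensive assumption about what urls() returns when nothing matches.

```python
from typing import Callable, Dict, Iterable


def count_sites_by_topic(
    urls_fn: Callable[..., object], topics: Iterable[str], **filters
) -> Dict[str, int]:
    """Count websites per topic, forwarding language/country filters to urls_fn."""
    counts = {}
    for topic in topics:
        # An empty or missing result counts as zero matching websites
        sites = urls_fn(topic=topic, **filters) or []
        counts[topic] = len(sites)
    return counts
```

After `from newscatcher import urls`, a call such as `count_sites_by_topic(urls, ['tech', 'science'], country = 'US')` would return a dictionary of counts keyed by topic.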

Tech/framework used

The package itself is nothing more than a SQLite database with RSS feed endpoints for each website and a basic wrapper around feedparser.
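The idea can be sketched in a few lines. This is not the package's actual code or schema: the `feeds` table, its columns, the example.com URLs, and the function names are all made up for illustration. Only `sqlite3` and `feedparser.parse` are real library calls.

```python
import sqlite3
from typing import Optional


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical feed table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE feeds (site TEXT, topic TEXT, rss_url TEXT, is_main INTEGER)"
    )
    db.executemany(
        "INSERT INTO feeds VALUES (?, ?, ?, ?)",
        [
            ("example.com", "news", "https://example.com/rss/main.xml", 1),
            ("example.com", "politics", "https://example.com/rss/politics.xml", 0),
        ],
    )
    return db


def find_feed(db: sqlite3.Connection, site: str, topic: Optional[str] = None) -> Optional[str]:
    """Look up the RSS endpoint for a site; no topic means the main feed."""
    if topic is None:
        sql, params = "SELECT rss_url FROM feeds WHERE site = ? AND is_main = 1", (site,)
    else:
        sql, params = "SELECT rss_url FROM feeds WHERE site = ? AND topic = ?", (site, topic)
    row = db.execute(sql, params).fetchone()
    return row[0] if row else None


def fetch_entries(rss_url: str):
    """Download and parse a feed; requires the third-party feedparser package."""
    import feedparser

    return feedparser.parse(rss_url).entries
```

A lookup such as `find_feed(build_demo_db(), 'example.com', 'politics')` returns the stored endpoint, which `fetch_entries` would then hand to feedparser.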

About Us

We are the Newscatcher API team. We are glad that you liked our package.

If you want to search for any news data, consider using our API.

Artem Bugara - co-founder of Newscatcher, made v.0.1.0

Maksym Sugonyaka - co-founder of Newscatcher, made v.0.1.0

Becket Trotter - Python Developer, made v.0.2.0

Licence

MIT

Release history

0.2.0 (this version): May 20, 2020
0.1.0: Feb 24, 2020

Download files

Source Distribution

newscatcher-0.2.0.tar.gz (140.9 kB), uploaded May 20, 2020

Built Distribution

newscatcher-0.2.0-py3-none-any.whl (138.6 kB), uploaded May 20, 2020, Python 3

File details

Details for the file newscatcher-0.2.0.tar.gz.

File metadata

Download URL: newscatcher-0.2.0.tar.gz
Upload date: May 20, 2020
Size: 140.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.0.3 CPython/3.7.4 Darwin/19.4.0

File hashes

Hashes for newscatcher-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`a83f56b2b9883780f196984043134aec3d61fb61e2c56dba6f307b80c503fc9b`
MD5	`965dd3e8545e414cf72e231496956211`
BLAKE2b-256	`7b2af6b9bcc35c305a6ca8371a0ddb4ec2ac97d9248a19b16f068688daac2063`

File details

Details for the file newscatcher-0.2.0-py3-none-any.whl.

File metadata

Download URL: newscatcher-0.2.0-py3-none-any.whl
Upload date: May 20, 2020
Size: 138.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.0.3 CPython/3.7.4 Darwin/19.4.0

File hashes

Hashes for newscatcher-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6051b6b709717232ccd8f74ff96eccda2b603eb0366ee812b915f6dfd58fa300`
MD5	`9243a566f80db1c4df8cbdefd259efa8`
BLAKE2b-256	`83ba37b16ef7c53a3723224123e749d324a18d5066411b9e132cc90585eaadd1`
