Google News Crawler
A utility to fetch news articles from Google News.
GNC retrieves the latest items from the Google News feeds and stores them in Elasticsearch or on disk.
Written by Isaac Sijaranamual, copyright 2013 University of Amsterdam/ILPS, licensed under the Apache License, Version 2.0.
Installation
Google News Crawler can be installed with pip as usual:
pip install google_news_crawler
Usage
Retrieve news items belonging to the ‘science/technology’ topic for the Botswana region from Google News, storing the articles in an Elasticsearch instance:
google_news_crawler --datastore=ES --feed="http://news.google.com/news?cf=all&ned=en_bw&output=rss&topic=t&sort=newest"
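The feed URL in the command above is an ordinary query string, so it can be assembled programmatically when several regions or topics need to be crawled. The following is a minimal sketch, not part of GNC itself: the helper name `build_feed_url` is hypothetical, and reading `ned` as the regional edition and `topic` as the topic code is an assumption inferred from the example URL.

```python
from urllib.parse import urlencode

# Base address taken from the example feed URL above.
BASE_URL = "http://news.google.com/news"


def build_feed_url(edition: str, topic: str) -> str:
    """Build a feed URL shaped like the one in the example command.

    Assumption: `ned` selects the regional edition (e.g. "en_bw") and
    `topic` selects the topic (e.g. "t"); both are inferred from the
    example URL, not from any official documentation.
    """
    params = {
        "cf": "all",
        "ned": edition,
        "output": "rss",
        "topic": topic,
        "sort": "newest",
    }
    # urlencode keeps the dict's insertion order, so the result matches
    # the parameter order of the example URL.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_feed_url("en_bw", "t"))
```

The resulting string can be passed to the --feed option exactly as in the command above.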
You would typically want to run a command like the one above in a crontab to periodically fetch all the items:
# m h dom mon dow command
01-59/10 * * * * google_news_crawler --log-config=/path/to/gnc/logging.yaml --datastore=ES --feed="http://news.google.com/news?cf=all&ned=en_bw&output=rss&topic=t&sort=newest"
The complete list of usage options can be obtained with the --help argument:
google_news_crawler --help
Nota Bene
The store-to-disk backend is still available, but warc has been dropped as a dependency because of a license incompatibility: warc is licensed under the GPL (version 2).
TODO
general
make user-agent configurable
expand documentation
Elasticsearch backend
set up proper index mapping for the documents
make all ES-related settings configurable
update metadata for retrieved documents instead of skipping them entirely
Download files
Source Distribution
Hashes for google_news_crawler-0.3.6.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 3977903956ecbd5332516c87b8d2910ac40142a8c2613ae0d70618470258269a
MD5 | 027960c1a565f72ec96df3c4f65f97e5
BLAKE2b-256 | ed8f9e95d7d057f8ac13c5a30238034d2061e1b16ce1d6d2362699e1985d097c
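The digests listed above can be used to check a downloaded archive before installing it. The following is a minimal sketch using only the Python standard library. The helper names `sha256_of` and `matches_expected` are hypothetical, and it assumes the archive was saved in the current directory under its original file name.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for google_news_crawler-0.3.6.tar.gz.
EXPECTED_SHA256 = "3977903956ecbd5332516c87b8d2910ac40142a8c2613ae0d70618470258269a"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the archive sits in the current directory.
    archive = Path("google_news_crawler-0.3.6.tar.gz")
    if archive.exists():
        print("digest matches" if matches_expected(archive) else "digest MISMATCH")
    else:
        print("archive not found:", archive)
```

Reading the file in chunks keeps memory use constant regardless of archive size.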