Google News Crawler

Project description

A utility to fetch news articles from Google News.

GNC retrieves the latest items from the Google News feeds and stores them in ElasticSearch or on disk.

Written by Isaac Sijaranamual, copyright 2013 University of Amsterdam/ILPS, licensed under the Apache License, Version 2.0.

Installation

Google News Crawler can be installed with pip as usual:

pip install google_news_crawler

Usage

Retrieve news items belonging to the ‘science/technology’ topic for the region Botswana from Google News, storing the articles in an ElasticSearch instance:

google_news_crawler --datastore=ES --feed="http://news.google.com/news?cf=all&ned=en_bw&output=rss&topic=t&sort=newest"
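The feed URL above is an ordinary query string, so feeds for other editions or topics can be generated rather than typed by hand. The following is a minimal sketch, not part of GNC: the helper name build_feed_url is hypothetical, and reading ned as the regional edition and topic as the topic code is inferred from the example above, where ned=en_bw and topic=t select science/technology news for Botswana. Other parameter values are not documented here.

```python
from urllib.parse import urlencode

BASE_URL = "http://news.google.com/news"


def build_feed_url(edition, topic):
    """Return an RSS feed URL for one edition and topic code (hypothetical helper)."""
    # Parameter names and fixed values are copied from the example command;
    # a list of pairs keeps them in the same order as in that URL.
    params = [
        ("cf", "all"),
        ("ned", edition),
        ("output", "rss"),
        ("topic", topic),
        ("sort", "newest"),
    ]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Reproduces the feed used in the example command.
    print(build_feed_url("en_bw", "t"))
```

The resulting string can be passed unchanged to the --feed option.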

You would typically run such a google_news_crawler command from a crontab to fetch new items periodically. The entry below runs it every ten minutes, at minutes 1, 11, 21, 31, 41 and 51 of each hour:

# m h  dom mon dow   command
01-59/10 * * * * google_news_crawler --log-config=/path/to/gnc/logging.yaml --datastore=ES --feed="http://news.google.com/news?cf=all&ned=en_bw&output=rss&topic=t&sort=newest"
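To check which minutes a schedule field such as 01-59/10 selects, the range-with-step notation can be expanded by hand. The sketch below is illustrative only: expand_cron_field is a hypothetical helper, not part of GNC or of cron, and it handles just lists, ranges, steps and the * wildcard.

```python
def expand_cron_field(field, lo=0, hi=59):
    """Expand one cron field such as "01-59/10" into a sorted list of values."""
    values = set()
    for part in field.split(","):
        # Split off an optional "/step" suffix; the default step is 1.
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        # The end of a cron range is inclusive.
        values.update(range(start, end + 1, step))
    return sorted(values)


if __name__ == "__main__":
    print(expand_cron_field("01-59/10"))
```

Starting the range at minute 1 rather than 0 keeps the job off the top of the hour.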

The complete list of usage options can be obtained with the --help argument:

google_news_crawler --help

Nota Bene

The store-to-disk backend is still available, but the warc package it relies on has been dropped as a dependency because of a license incompatibility: warc is licensed under the GPL (version 2).

TODO

  • general
    • make user-agent configurable
    • expand documentation
  • Elasticsearch backend
    • set up proper index mapping for the documents
    • make all ES-related settings configurable
    • update metadata for retrieved documents instead of skipping them entirely

Release history

  • 0.3.9
  • 0.3.8
  • 0.3.7
  • 0.3.6
  • 0.3.5 (this version)
  • 0.3.4
  • 0.3.3
  • 0.3.2
  • 0.3.1
  • 0.3.0

Download files

google_news_crawler-0.3.5.tar.gz (19.2 kB), source distribution, uploaded Mar 15, 2014
