Google News Crawler

Project Description

A utility to fetch news articles from Google News.

Google News Crawler (GNC) retrieves the latest items from Google News feeds and stores them in Elasticsearch or on disk.

Written by Isaac Sijaranamual at the University of Amsterdam/ILPS.

Installation

Google News Crawler can be installed with pip as usual:

pip install google_news_crawler

Usage

Retrieve news items on the ‘science/technology’ topic for the Botswana region from Google News and store the articles in an Elasticsearch instance:

google_news_crawler --datastore=ES --feed="http://news.google.com/news?cf=all&ned=en_bw&output=rss&topic=t&sort=newest"
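The feed URL in the command above is an ordinary query string, so it can be assembled programmatically when several regions or topics need to be crawled. The following is a minimal sketch in Python 3, not part of GNC itself. The helper name build_feed_url is hypothetical, and the reading of ned as the regional edition and topic as the topic code is inferred from the example above; only the values en_bw and t come from this document.

```python
from urllib.parse import urlencode

# Base address taken from the usage example in this document.
BASE_URL = "http://news.google.com/news"


def build_feed_url(edition, topic):
    """Build a Google News RSS feed URL (hypothetical helper, not part of GNC).

    Assumption: 'ned' selects the regional edition and 'topic' the topic code,
    as suggested by the example URL in this document.
    """
    params = {
        "cf": "all",
        "ned": edition,
        "output": "rss",
        "topic": topic,
        "sort": "newest",
    }
    # dicts keep insertion order (Python 3.7+), so the parameter order
    # matches the example URL.
    return BASE_URL + "?" + urlencode(params)


# Reproduces the feed URL used in the example command.
print(build_feed_url("en_bw", "t"))
```

The resulting string can be passed to the --feed option exactly as in the command above.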

You would typically run google_news_crawler from a crontab so that new items are fetched periodically. The entry below runs it every ten minutes, at minutes 1, 11, 21, 31, 41 and 51 of each hour:

# m h  dom mon dow   command
01-59/10 * * * * google_news_crawler --log-config=/path/to/gnc/logging.yaml --datastore=ES --feed="http://news.google.com/news?cf=all&ned=en_bw&output=rss&topic=t&sort=newest"
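The minute field 01-59/10 in the crontab entry above combines a range with a step. As a rough illustration of which minutes it selects, here is a small Python sketch; expand_minute_field is a hypothetical helper that handles only the START-END/STEP form, not the full crontab syntax.

```python
def expand_minute_field(field):
    """Expand a cron minute field of the form 'START-END/STEP'.

    Hypothetical helper for illustration only; lists, '*' and plain
    numbers are not handled.
    """
    span, step = field.split("/")
    start, end = span.split("-")
    # cron ranges are inclusive, hence the +1 on the upper bound.
    return list(range(int(start), int(end) + 1, int(step)))


print(expand_minute_field("01-59/10"))  # prints [1, 11, 21, 31, 41, 51]
```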

The complete list of usage options can be obtained with the --help argument:

google_news_crawler --help

Nota Bene

The store-to-disk backend is still available, but the warc library it relies on has been dropped as a dependency because of a license incompatibility: warc is licensed under the GPL (version 2).

TODO

  • general
    • make user-agent configurable
    • expand documentation
  • Elasticsearch backend
    • make all ES-related settings configurable
    • update metadata for existing documents instead of skipping them entirely
    • improve index mapping for the documents

License

Copyright 2013-2014 Isaac Sijaranamual, University of Amsterdam/ILPS

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this Work or Derivative Works except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Release History

0.3.9 (this version)
0.3.8
0.3.7
0.3.6
0.3.5
0.3.4
0.3.3
0.3.2
0.3.1
0.3.0

Download Files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

File Name                                   Size     Python Version  File Type  Upload Date
google_news_crawler-0.3.9-py2-none-any.whl  16.7 kB  py2             Wheel      Oct 9, 2016
google_news_crawler-0.3.9.tar.gz            23.3 kB  -               Source     Oct 9, 2016
