Google News Crawler

A utility to fetch news articles from Google News.

GNC retrieves the latest items from the Google News feeds and stores them in Elasticsearch or on disk.
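To give a rough idea of what retrieving items from a feed involves, here is a minimal Python 3 sketch that extracts the title, link and publication date of each item in an RSS document. It is not GNC's actual implementation: the `parse_items` helper and the sample feed are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for what an RSS endpoint might return.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/a</link>
      <pubDate>Mon, 06 Jan 2014 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/b</link>
      <pubDate>Mon, 06 Jan 2014 11:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text):
    """Return one dict per <item> element found in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry["published"], entry["title"], entry["link"])
```

A real crawler would additionally download the linked article and hand it to a storage backend; those steps are left out here.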

Written by Isaac Sijaranamual at the University of Amsterdam/ILPS.

Installation

Google News Crawler can be installed with pip as usual:

pip install google_news_crawler

Usage

Retrieve news items on the ‘science/technology’ topic for the Botswana region from Google News and store the articles in an Elasticsearch instance:

google_news_crawler --datastore=ES --feed="http://news.google.com/news?cf=all&ned=en_bw&output=rss&topic=t&sort=newest"
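The feed URL is an ordinary query string, so it can be assembled programmatically when several regions or topics need to be crawled. The Python 3 sketch below is a standalone helper, not part of GNC: `build_feed_url` is a hypothetical name, and the only parameter values known from the example above are `ned=en_bw` (Botswana) and `topic=t` (science/technology).

```python
from urllib.parse import urlencode


def build_feed_url(edition, topic, base="http://news.google.com/news"):
    """Assemble a feed URL with the same query parameters as the example above.

    `edition` fills the `ned` parameter and `topic` fills `topic`; the
    remaining parameters are copied unchanged from the example.
    """
    params = [
        ("cf", "all"),
        ("ned", edition),
        ("output", "rss"),
        ("topic", topic),
        ("sort", "newest"),
    ]
    return base + "?" + urlencode(params)


feed_url = build_feed_url("en_bw", "t")
print(feed_url)
```

The resulting string can be passed to the `--feed` option. Remember to quote it in the shell, since it contains `&` characters.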

You would typically run a command like the one above from a crontab so that new items are fetched periodically:

# m h  dom mon dow   command
01-59/10 * * * * google_news_crawler --log-config=/path/to/gnc/logging.yaml --datastore=ES --feed="http://news.google.com/news?cf=all&ned=en_bw&output=rss&topic=t&sort=newest"
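The minute field `01-59/10` in the entry above means "every tenth minute in the range 1 to 59". The small Python 3 sketch below expands such a field to make the schedule explicit. `expand_cron_field` is a hypothetical helper that only understands `*`, `a-b` and a plain number, with an optional `/step` on the range forms; it is not a full cron parser.

```python
def expand_cron_field(field, lo=0, hi=59):
    """Expand a simple cron field such as '01-59/10' or '*/15' into a list of ints.

    Comma-separated lists and names (e.g. 'mon') are not handled.
    """
    rng, _, step = field.partition("/")
    step = int(step) if step else 1
    if rng == "*":
        start, end = lo, hi
    elif "-" in rng:
        first, last = rng.split("-")
        start, end = int(first), int(last)
    else:
        start = end = int(rng)
    # The end of a cron range is inclusive, hence end + 1.
    return list(range(start, end + 1, step))


minutes = expand_cron_field("01-59/10")
print(minutes)  # [1, 11, 21, 31, 41, 51]
```

In other words, the crawler runs six times an hour, at minutes 1, 11, 21, 31, 41 and 51.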

The complete list of usage options can be obtained with the --help argument:

google_news_crawler --help

Nota Bene

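The store-to-disk backend is still available, but the warc library it relies on is no longer declared as a dependency: warc is licensed under the GPL (version 2), which is incompatible with this project's license. To use that backend, warc has to be installed separately.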

TODO

  • general

    • make user-agent configurable

    • expand documentation

  • Elasticsearch backend

    • make all ES related settings configurable

    • update metadata for existing documents instead of skipping them entirely

    • improve index mapping for the documents

License

Copyright 2013-2014 Isaac Sijaranamual, University of Amsterdam/ILPS

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this Work or Derivative Works except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
