Scrapes Google News article data

Project description

googlenewsscraper

Getting Started

Installation

pip install GoogleNewsScraper

Reference

Importing

import importlib

GoogleNewsScraper = importlib.import_module('google-news-scraper').GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver, automation_options, chrome_driver_arguments)

Constructor Parameters

Name     Type         Required
driver   web driver   no

Possible values:

  • 'chrome': Uses this package's own Chrome driver (the default)
  • A path to another driver (Firefox, for instance) stored on the user's system

Name                 Type     Required
automation_options   object   yes

Possible keys:

  • keywords : The search terms to enter into Google News.
  • date_range : Filters how recent data should be. Can be any of the following (defaults to 'Past 24 hours'):
    • Past hours
    • Past 24 hours
    • Past week
    • Past month
    • Past year
    • Archives
  • pages : Number of pages to scrape (defaults to 'max').
  • pagination_pause_per_page : Number of seconds to wait before each new page is scraped (defaults to 2). This may have to be increased if Google prevents you from scraping all pages.

Name                      Type   Required
chrome_driver_arguments   list   no

Ignore this parameter unless you are using the default 'chrome' driver.

Possible values:

  • '--headless'
  • '--ignore-certificate-errors'
  • '--incognito'
  • '--no-sandbox'
  • '--disable-setuid-sandbox'
  • '--disable-dev-shm-usage'

See https://chromedriver.chromium.org/capabilities for all possible arguments.
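The constructor parameters above can be combined as in the following minimal sketch. build_automation_options and make_scraper are hypothetical helpers, not part of the package. The sketch also assumes that automation_options is a plain dictionary and that keywords is a single string, neither of which the reference states explicitly. The date_range values are copied from the list above.

```python
# Allowed date_range values, copied verbatim from the list above.
DATE_RANGES = (
    'Past hours', 'Past 24 hours', 'Past week',
    'Past month', 'Past year', 'Archives',
)


def build_automation_options(keywords, date_range='Past 24 hours',
                             pages='max', pagination_pause_per_page=2):
    """Hypothetical helper: assemble the automation_options dictionary."""
    if date_range not in DATE_RANGES:
        raise ValueError('unsupported date_range: %r' % (date_range,))
    return {
        'keywords': keywords,  # assumed to be a single search string
        'date_range': date_range,
        'pages': pages,
        'pagination_pause_per_page': pagination_pause_per_page,
    }


def make_scraper(keywords, chrome_driver_arguments=('--headless',)):
    """Hypothetical helper: build a scraper that uses the 'chrome' driver."""
    # Imported lazily so build_automation_options works without the package.
    import importlib
    module = importlib.import_module('google-news-scraper')
    return module.GoogleNewsScraper(
        'chrome',
        build_automation_options(keywords),
        list(chrome_driver_arguments),
    )
```

make_scraper only works with the package installed and Chrome available; build_automation_options can be used on its own to check the options before a scraper is created.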


Methods

This method is publicly accessible, but it is intended for internal use by the class.

locate_html_element(self, driver, element, selector, wait_seconds)

Name     Type         Required
driver   web driver   yes

Possible values:

  • A web driver (Chrome, Firefox, etc.)

Name      Type     Required
element   string   yes

Possible values:

  • Id selector of an HTML element
  • Class selector of an HTML element

Name       Type            Required
selector   Module import   yes

First, install Selenium:

pip install selenium

Then import By:

from selenium.webdriver.common.by import By

Possible values:

  • By.ID
  • By.CLASS_NAME
  • By.CSS_SELECTOR
  • By.LINK_TEXT
  • By.NAME
  • By.PARTIAL_LINK_TEXT
  • By.TAG_NAME
  • By.XPATH

Name           Type     Required
wait_seconds   number   no

Default: 30

Description:

  • Waits up to the given number of seconds to locate an HTML element
  • If the element already exists on the page, it is located immediately
  • If the element does not exist yet (for instance, if it only appears once a request completes), wait_seconds may have to be increased, depending on how long the element takes to appear

Please note: 30 seconds is plenty; this time rarely has to be increased.
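The waiting behavior described above can be illustrated with a small polling loop. This is a generic sketch of the idea, not the package's implementation (which presumably relies on Selenium's own wait utilities). wait_for, locate, and poll_interval are hypothetical names introduced for the example.

```python
import time


def wait_for(locate, wait_seconds=30, poll_interval=0.5):
    """Call locate() until it returns a non-None value or time runs out."""
    deadline = time.monotonic() + wait_seconds
    while True:
        found = locate()
        if found is not None:
            # An element that is already present is returned on the
            # first call, without any sleeping.
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'element not found within %s seconds' % wait_seconds)
        time.sleep(poll_interval)


# Simulated page: the element only "appears" on the third lookup.
attempts = {'count': 0}


def fake_locate():
    attempts['count'] += 1
    return 'element' if attempts['count'] >= 3 else None


result = wait_for(fake_locate, wait_seconds=5, poll_interval=0.01)
```

The loop stops as soon as the lookup succeeds, so a generous wait_seconds only costs time when the element never appears.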


GoogleNewsScraper.scrape()
  • Begins the scraping process and returns a two-dimensional list
  • Each inner list represents a single page and contains multiple objects
  • Each object represents one article

Each article object contains the following data:

  • 'description': The preview description of the news article
  • 'title': The title of the news article
  • 'source': The source of the news article (New York Times, for instance)
  • 'image_url': The URL of the news article's preview image
  • 'article_link': A link to the news article
  • 'time_published_ago': A datetime string indicating when the article was published
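Because scrape() returns one list per page, it is often convenient to flatten the result before working with individual articles. The sketch below assumes each article object is a plain dictionary with the keys listed above. flatten_pages and placeholder_article are hypothetical helpers, and the sample values are made up rather than real scraper output.

```python
def flatten_pages(pages):
    """Turn the per-page lists into one flat list of articles."""
    return [article for page in pages for article in page]


def placeholder_article(n):
    """Build a made-up article with the six documented keys."""
    return {
        'description': 'Example description %d' % n,
        'title': 'Example title %d' % n,
        'source': 'Example Source',
        'image_url': 'https://example.com/image%d.png' % n,
        'article_link': 'https://example.com/article%d' % n,
        'time_published_ago': 'placeholder',
    }


# Two pages: the first with two articles, the second with one.
sample_pages = [
    [placeholder_article(1), placeholder_article(2)],
    [placeholder_article(3)],
]

articles = flatten_pages(sample_pages)
titles = [article['title'] for article in articles]
```

With real data, sample_pages would be replaced by the return value of scrape().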

Download files

Source Distribution

GoogleNewsScraper-0.0.3.tar.gz (8.1 MB)

Uploaded Source

Built Distributions

GoogleNewsScraper-0.0.3-py3-none-any.whl (8.1 MB)

Uploaded Python 3

GoogleNewsScraper-0.0.3-py2-none-any.whl (8.1 MB)

Uploaded Python 2

File details

Details for the file GoogleNewsScraper-0.0.3.tar.gz.

File metadata

  • Download URL: GoogleNewsScraper-0.0.3.tar.gz
  • Upload date:
  • Size: 8.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.1

File hashes

Hashes for GoogleNewsScraper-0.0.3.tar.gz
Algorithm Hash digest
SHA256 7b6ed0da397a8d1ab8191712c46f505bdf027581408e039f7cc2bc6e83bfa2bf
MD5 484e6b2bb021fffeb29bfd08d5b7d527
BLAKE2b-256 f9660058fd99e557cc32ddeb360ec6d4bb5879c7edcfc817cd93a46a38617e89

File details

Details for the file GoogleNewsScraper-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: GoogleNewsScraper-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 8.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.1

File hashes

Hashes for GoogleNewsScraper-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 b2453f3084df57d151f5bae91ce8f5e3bd77b8a3c2a0705a5fa15d7b24b2a37f
MD5 8d419cc604c3e929331de761741238e7
BLAKE2b-256 33c4f623f92b82a5c6d180875e769646383fd1899b0456773a01ea0a626dd765

File details

Details for the file GoogleNewsScraper-0.0.3-py2-none-any.whl.

File metadata

  • Download URL: GoogleNewsScraper-0.0.3-py2-none-any.whl
  • Upload date:
  • Size: 8.1 MB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.2

File hashes

Hashes for GoogleNewsScraper-0.0.3-py2-none-any.whl
Algorithm Hash digest
SHA256 3f6bf6b1a5b4adf91b07bf083608800e0712d51f6e43df4c3a8bc728f2cff981
MD5 9e3e51b914ef2a2c28eb1cc9fc05a957
BLAKE2b-256 857dd02e3853bfabbe2633f8a8bea641684f4607374ffb48054f0a9aa0d04ead
