Skip to main content

Scrapes Google News article data

Project description

googlenewsscraper

Getting Started

Installation

pip install GoogleNewsScraper

Reference

Importing

from GoogleNewsScraper import GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver)

Constructor Parameters

Name Type Required
driver web driver no

Possible values:

  • 'chrome': The driver will default to use this package's chrome driver
  • A path to some driver (FireFox, for instance) stored on the user's system

Methods

This method is both public and private, though it really should only be used by the class

locate_html_element(self, driver, element, selector, wait_seconds)
Name Type Required Description
driver web driver yes A web driver (Chrome, FireFox, etc)
element string yes Id or class selector of an HTML element
selector Module import yes see below
wait_seconds int no Waits a certain number of seconds in order to locate an HTML element

To configure the 'selector' param:

First install selenium

pip install selenium

Then import By

from selenium.webdriver.common.by import By

Possible values:

  • By.ID
  • By.CLASS_NAME
  • By.CSS_SELECTOR
  • By.LINK_TEXT
  • By.NAME
  • By.PARTIAL_LINK_TEXT
  • By.TAG_NAME
  • By.XPATH

GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb) -> list or None
Name Type Required Description
search_text str yes A series of word(s) that will be inputted into the Google search engine
date_range str no Filters article by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives
pages str or int no Number of pages that should be scraped (defaults to 'max')
pagination_pause_per_page int no Waits a certain amount of seconds before a new page is scraped (defaults to 2). Time may have to be increased if Google prevents you from scraping all pages.
cb function no Will return all article data on a single page for every page scraped (defaults to False)
  • Example using 'cb' paramater:
def handle_page_data(page_data: list):
  # Do something with page_data

GoogleNewsScraper(...args).search(...args, handle_page_data)

NOTE:

  • If no argument is provided for 'cb,' the scrape method will return a two-dimensional list
  • Each list will contain an object of news article data for every news article on that page

Example of what type of data that a single article-object will contain:

  • 'id': A unique id for every article data object
  • 'description': The preview description of the news article
  • 'title': The title of the news article
  • 'source': The source of news article (New York Times, for instance)
  • 'image_url': The url of the preview news article image
  • 'url': A link to the news article
  • 'date_time': A datetime string that represents the date of when the article was published

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

GoogleNewsScraper-0.0.9.tar.gz (8.2 MB view details)

Uploaded Source

Built Distribution

GoogleNewsScraper-0.0.9-py3-none-any.whl (8.2 MB view details)

Uploaded Python 3

File details

Details for the file GoogleNewsScraper-0.0.9.tar.gz.

File metadata

  • Download URL: GoogleNewsScraper-0.0.9.tar.gz
  • Upload date:
  • Size: 8.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.1

File hashes

Hashes for GoogleNewsScraper-0.0.9.tar.gz
Algorithm Hash digest
SHA256 841e0fb9f7d522fde56fefc8d67ad40ad54d084cec0aa04f3a0a14ce3bc000b1
MD5 97a5e34158bb1316628523e238209d8c
BLAKE2b-256 1d9b43687e0d76a92769194beb2ea0190b44f1949214712da520af46b8eeff70

See more details on using hashes here.

File details

Details for the file GoogleNewsScraper-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: GoogleNewsScraper-0.0.9-py3-none-any.whl
  • Upload date:
  • Size: 8.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.1

File hashes

Hashes for GoogleNewsScraper-0.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 cb788cfe1c013e4e6e1c5e4e3197915abe4f1f9c237565e85480da6091c25943
MD5 cf61aa8436f6f4aef50feddb0dbb5c8f
BLAKE2b-256 59d6e97443588b2180f89853ebfa4197a3656c5c9613bd57c507dc67c87dbf11

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page