Scrapes Google News article data

googlenewsscraper

Getting Started

Installation

pip install GoogleNewsScraper

Reference

Importing

from GoogleNewsScraper import GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver)

Constructor Parameters

Name   | Type       | Required
driver | web driver | no

Possible values:

  • 'chrome': Uses the Chrome driver bundled with this package (the default)
  • A path to another driver (Firefox, for instance) stored on the user's system
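
The two options above can be sketched as follows. This is a minimal illustration, not part of the package: the helper name build_scraper and its driver_path parameter are hypothetical, and it assumes only the constructor signature GoogleNewsScraper(driver) documented here.

```python
def build_scraper(driver_path=None):
    """Create a scraper with the bundled Chrome driver or a user-supplied driver path."""
    # Imported inside the function so the sketch can be defined before the package is installed.
    from GoogleNewsScraper import GoogleNewsScraper

    if driver_path is None:
        # 'chrome' selects the Chrome driver that ships with the package.
        return GoogleNewsScraper('chrome')
    # Otherwise pass the path to a driver stored on this system (hypothetical path supplied by the caller).
    return GoogleNewsScraper(driver_path)
```

Calling build_scraper() would use the bundled Chrome driver, while build_scraper('/path/to/driver') would use the driver at that path.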

Methods

This method is publicly accessible, but it is intended for internal use by the class

locate_html_element(self, driver, element, selector, wait_seconds)
Name         | Type        | Required | Description
driver       | web driver  | yes      | A web driver (Chrome, Firefox, etc.)
element      | string      | yes      | The value used to find the HTML element (an id or class name, for instance), interpreted according to 'selector'
selector     | By constant | yes      | The selenium locator strategy (see below)
wait_seconds | int         | no       | Number of seconds to wait for the HTML element to be located

To configure the 'selector' param:

First install selenium

pip install selenium

Then import By

from selenium.webdriver.common.by import By

Possible values:

  • By.ID
  • By.CLASS_NAME
  • By.CSS_SELECTOR
  • By.LINK_TEXT
  • By.NAME
  • By.PARTIAL_LINK_TEXT
  • By.TAG_NAME
  • By.XPATH
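
As a rough sketch of how the 'selector' value is passed along, the helper below forwards its arguments to locate_html_element and falls back to By.ID when no selector is given. The helper name locate and its default values are hypothetical; what locate_html_element returns is not documented here, so the sketch simply returns whatever the method gives back.

```python
def locate(scraper, driver, element, selector=None, wait_seconds=10):
    """Forward to scraper.locate_html_element, defaulting the selector to By.ID."""
    if selector is None:
        # Imported lazily; requires `pip install selenium`.
        from selenium.webdriver.common.by import By
        selector = By.ID
    # Argument order follows the documented signature:
    # locate_html_element(driver, element, selector, wait_seconds)
    return scraper.locate_html_element(driver, element, selector, wait_seconds)
```

For example, locate(scraper, driver, 'headline', By.CLASS_NAME) would look the element up by class name instead of by id ('headline' is a made-up class name).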

GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb) -> list or None
Name                      | Type       | Required | Description
search_text               | str        | yes      | One or more words to enter into the Google search engine
date_range                | str        | no       | Filters articles by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives
pages                     | str or int | no       | Number of pages that should be scraped (defaults to 'max')
pagination_pause_per_page | int        | no       | Number of seconds to wait before each new page is scraped (defaults to 2). Increase this if Google prevents you from scraping all pages.
cb                        | function   | no       | Callback that is called once for every page scraped and receives all article data on that page (defaults to False)
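  • Example using the 'cb' parameter:
from GoogleNewsScraper import GoogleNewsScraper

def handle_page_data(page_data: list):
  # Do something with page_data (called once per scraped page)
  print(len(page_data))

GoogleNewsScraper('chrome').search('some search text', cb=handle_page_data)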
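
A callback can also accumulate results across pages. The sketch below is only an illustration: the helper name make_collector and the sample pages are hypothetical, the pages are simulated rather than scraped, and it assumes that each article object behaves like a dict and that 'cb' is called once per page with a list of article objects.

```python
def make_collector():
    """Return (callback, articles); the callback appends each page's articles to the list."""
    articles = []

    def handle_page_data(page_data: list):
        # Called once per scraped page; page_data is that page's list of article objects.
        articles.extend(page_data)

    return handle_page_data, articles


handle_page_data, articles = make_collector()

# Simulated pages (hypothetical data) standing in for what the scraper would pass to cb.
for page in ([{"title": "A"}, {"title": "B"}], [{"title": "C"}]):
    handle_page_data(page)

print(len(articles))  # 3
```

In real use, handle_page_data would be passed as GoogleNewsScraper('chrome').search('some search text', cb=handle_page_data) instead of being called by hand.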

NOTE:

  • If no argument is provided for 'cb', the search method will return a two-dimensional list
  • Each inner list corresponds to one scraped page and contains one article-data object for every news article on that page

Example of the data that every article-object will contain:

  • 'id': A unique id for every article data object
  • 'description': The preview description of the news article
  • 'title': The title of the news article
  • 'source': The source of the news article (New York Times, for instance)
  • 'image_url': The url of the preview news article image
  • 'url': A link to the news article
  • 'date_time': A datetime string representing when the article was published
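
To illustrate working with the two-dimensional return value, the sketch below flattens the per-page lists and groups titles by source. The sample data is made up and only some of the fields are shown; it assumes each article object behaves like a dict with the keys listed above, which is not confirmed by the package itself.

```python
from itertools import chain

# Hypothetical sample shaped like the documented return value: one inner list per page.
pages = [
    [
        {"id": "1", "title": "First", "source": "Example Times", "url": "https://example.com/a"},
        {"id": "2", "title": "Second", "source": "Example Post", "url": "https://example.com/b"},
    ],
    [
        {"id": "3", "title": "Third", "source": "Example Times", "url": "https://example.com/c"},
    ],
]

# Flatten the per-page lists into a single list of articles.
all_articles = list(chain.from_iterable(pages))

# Group article titles by their source.
titles_by_source = {}
for article in all_articles:
    titles_by_source.setdefault(article["source"], []).append(article["title"])

print(len(all_articles))
print(titles_by_source)
```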

Download files

Source Distribution

GoogleNewsScraper-1.0.0.tar.gz (8.2 MB)

Uploaded Source

Built Distribution

GoogleNewsScraper-1.0.0-py3-none-any.whl (8.2 MB)

Uploaded Python 3

File details

Details for the file GoogleNewsScraper-1.0.0.tar.gz.

File metadata

  • Download URL: GoogleNewsScraper-1.0.0.tar.gz
  • Size: 8.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.6

File hashes

Hashes for GoogleNewsScraper-1.0.0.tar.gz
Algorithm   | Hash digest
SHA256      | c978fb732c01efc95f95fa32342d3b95b9350b58c8d057232e0b7281faddd469
MD5         | 2a5b1f897511574d916b6c012374a750
BLAKE2b-256 | df931f1db20e69a8f8d6c4d4d8a2bc72797eebf92cf83e2ee91bccb5c021d434

File details

Details for the file GoogleNewsScraper-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: GoogleNewsScraper-1.0.0-py3-none-any.whl
  • Size: 8.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.6

File hashes

Hashes for GoogleNewsScraper-1.0.0-py3-none-any.whl
Algorithm   | Hash digest
SHA256      | 0666d7d969ec14d6305913ba440e7cb0e28bb8cb1eb49f6be3c23e9b2279b556
MD5         | b0e1702fffb753ce39e734c2dae22b64
BLAKE2b-256 | 0517f1a0f7de7067b02f7fff04838dd6308caf0aa1cac081cb6f4fd2d8c68a93
