Scrapes Google News article data

Project description

googlenewsscraper

Getting Started

Installation

pip install GoogleNewsScraper

Reference

Importing

from GoogleNewsScraper import GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver)

Constructor Parameters

Name   | Type       | Required
driver | web driver | no

Possible values:

  • 'chrome': The driver will default to use this package's chrome driver
  • A path to some driver (FireFox, for instance) stored on the user's system
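As a minimal sketch of the two options above, the helper below passes 'chrome' when no path is given and otherwise forwards the driver path to the constructor. Note that `make_scraper` is a hypothetical name, not part of the package, and the path in the usage comment is a placeholder. The import happens inside the function so the snippet can be loaded before the package is installed.

```python
from typing import Optional


def make_scraper(driver_path: Optional[str] = None):
    """Build a scraper with the bundled Chrome driver or a custom driver path.

    `make_scraper` is a hypothetical helper, not part of GoogleNewsScraper.
    """
    # Imported lazily so this sketch loads even before the package is installed.
    from GoogleNewsScraper import GoogleNewsScraper

    if driver_path is None:
        # 'chrome' selects the Chrome driver shipped with the package.
        return GoogleNewsScraper('chrome')
    # Otherwise pass a path to a driver stored on this machine.
    return GoogleNewsScraper(driver_path)


# Usage (the path is a placeholder, not a real location):
# scraper = make_scraper()
# scraper = make_scraper('/path/to/geckodriver')
```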

Methods

This method is publicly accessible, though it is intended for internal use by the class

locate_html_element(self, driver, element, selector, wait_seconds)
Name         | Type        | Required | Description
driver       | web driver  | yes      | A web driver (Chrome, FireFox, etc)
element      | string      | yes      | Id or class selector of an HTML element
selector     | By constant | yes      | Locator strategy imported from selenium (see below)
wait_seconds | int         | no       | Number of seconds to wait for the HTML element to be located

To configure the 'selector' param:

First install selenium

pip install selenium

Then import By

from selenium.webdriver.common.by import By

Possible values:

  • By.ID
  • By.CLASS_NAME
  • By.CSS_SELECTOR
  • By.LINK_TEXT
  • By.NAME
  • By.PARTIAL_LINK_TEXT
  • By.TAG_NAME
  • By.XPATH
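To illustrate how 'element', 'selector' and 'wait_seconds' could fit together, here is a rough pure-Python sketch of a polling wait. It is not the package's implementation: `wait_for_element` and `poll_interval` are hypothetical names, and the only assumption about the driver is that it exposes `find_element(by, value)` as Selenium drivers do. The By constants are plain strings (By.ID is 'id'), so the usage comment passes the string directly.

```python
import time


def wait_for_element(driver, element, selector, wait_seconds=5, poll_interval=0.5):
    """Retry driver.find_element(selector, element) until it succeeds or time runs out.

    Hypothetical helper for illustration; not part of GoogleNewsScraper.
    """
    deadline = time.monotonic() + wait_seconds
    while True:
        try:
            # Selenium drivers take the strategy first, then the selector value.
            return driver.find_element(selector, element)
        except Exception:
            # With Selenium this would typically be NoSuchElementException.
            if time.monotonic() >= deadline:
                raise  # give up and surface the last lookup error
            time.sleep(poll_interval)


# Usage with a real Selenium driver (not executed here):
# from selenium.webdriver.common.by import By
# search_box = wait_for_element(driver, 'q', By.NAME, wait_seconds=10)
```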

GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb)
Name                      | Type       | Required | Description
search_text               | str        | yes      | One or more words to enter into the Google search engine
date_range                | str        | no       | Filters articles by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives
pages                     | str or int | no       | Number of pages that should be scraped (defaults to 'max')
pagination_pause_per_page | int        | no       | Number of seconds to wait before a new page is scraped (defaults to 2). Increase this if Google prevents you from scraping all pages.
cb                        | function   | no       | Callback that receives all article data from a single page, called once for every page scraped (defaults to False)
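  • Example using the 'cb' parameter:
def handle_page_data(page_data: list):
    # Do something with page_data, for example report how many articles the page held
    print(len(page_data))

# 'chrome' and 'python' are example arguments
GoogleNewsScraper('chrome').search('python', cb=handle_page_data)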
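A callback often needs to keep state between pages. One possible pattern, sketched here with a closure, collects each page as it arrives. The name `make_page_collector` is hypothetical, and the callback is invoked by hand with made-up sample pages so the snippet runs without a browser; in real use the scraper would call it.

```python
def make_page_collector():
    """Return a callback plus the list it appends each page's data to."""
    pages = []

    def handle_page_data(page_data: list):
        # Called once per scraped page with that page's article data.
        pages.append(page_data)
        print(f"page {len(pages)}: {len(page_data)} articles")

    return handle_page_data, pages


handle_page_data, collected = make_page_collector()

# Simulated calls with placeholder data; the scraper would make these itself:
handle_page_data([{'title': 'Example A'}, {'title': 'Example B'}])
handle_page_data([{'title': 'Example C'}])

# Real use, assuming the parameter is named 'cb' as in the table above:
# GoogleNewsScraper('chrome').search('python', cb=handle_page_data)
```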

NOTE:

  • If no argument is provided for 'cb', the search method will return a two-dimensional list, with one inner list per scraped page
  • Each inner list will contain one object of news article data for every news article on that page
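Since the result is grouped by page, a common first step is to flatten it into a single list of articles. A minimal sketch, using made-up sample data in place of a real search result (`flatten_pages` is a hypothetical helper, and the articles are assumed to behave like dicts):

```python
from itertools import chain


def flatten_pages(pages):
    """Turn [[article, ...], [article, ...]] into one flat list of articles."""
    return list(chain.from_iterable(pages))


# Placeholder data shaped like the two-dimensional return value:
sample_pages = [
    [{'id': 1, 'title': 'First example'}, {'id': 2, 'title': 'Second example'}],
    [{'id': 3, 'title': 'Third example'}],
]

articles = flatten_pages(sample_pages)
print(len(articles))
```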

Each article object will contain the following data:

  • 'id': A unique id for every article data object
  • 'description': The preview description of the news article
  • 'title': The title of the news article
  • 'source': The source of the news article (New York Times, for instance)
  • 'image_url': The url of the preview news article image
  • 'url': A link to the news article
  • 'date_time': A datetime string representing when the article was published
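The field names above map naturally onto CSV columns. The sketch below writes articles with the standard-library csv module, assuming each article object behaves like a dict keyed by these names. `articles_to_csv` is a hypothetical helper, and the sample values (including the 'date_time' string, whose real format is not documented here) are placeholders.

```python
import csv
import io

# Column order follows the field list above.
ARTICLE_FIELDS = ['id', 'description', 'title', 'source', 'image_url', 'url', 'date_time']


def articles_to_csv(articles, out):
    """Write a header row and one row per article to the file-like object `out`."""
    # Unknown keys are ignored; missing keys are written as empty cells.
    writer = csv.DictWriter(out, fieldnames=ARTICLE_FIELDS, extrasaction='ignore')
    writer.writeheader()
    for article in articles:
        writer.writerow(article)


# Placeholder article, not real scraped data:
sample = [{
    'id': 1,
    'description': 'Example description',
    'title': 'Example title',
    'source': 'Example source',
    'image_url': 'https://example.com/a.png',
    'url': 'https://example.com/a',
    'date_time': 'placeholder',
}]

buffer = io.StringIO()
articles_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the file with `newline=''` as the csv module documentation recommends and pass the file object as `out`.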

Download files

Source Distribution

GoogleNewsScraper-0.0.8.tar.gz (8.1 MB)

Uploaded Source

Built Distribution

GoogleNewsScraper-0.0.8-py3-none-any.whl (8.1 MB)

Uploaded Python 3
