Scrapes Google News article data

googlenewsscraper

Getting Started

Installation

pip install GoogleNewsScraper

Reference

Importing

from GoogleNewsScraper import GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver, driver_options)

Constructor Parameters

Name     Type         Required
driver   web driver   no

Possible values:

  • 'chrome': Uses the Chrome driver bundled with this package (the default)
  • A path to another driver (Firefox, for instance) stored on the user's system

Name             Type   Required
driver_options   list   no

Ignore this parameter unless you are using the default 'chrome' driver

Possible values (ONLY for Chrome driver):

  • '--headless'
  • '--ignore-certificate-errors'
  • '--incognito'
  • '--no-sandbox'
  • '--disable-setuid-sandbox'
  • '--disable-dev-shm-usage'

For all possible arguments (ONLY for Chrome driver), see: https://chromedriver.chromium.org/capabilities
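
As a minimal sketch of how the two constructor parameters fit together, the snippet below builds a driver_options list from the flags listed above and passes it to the default 'chrome' driver. The helpers build_driver_options and make_scraper are hypothetical names and not part of the package; the package import is deferred so the option-building part can be run without the package installed.

```python
def build_driver_options(headless=True, incognito=False):
    """Return a list of Chrome flags chosen from the documented values (hypothetical helper)."""
    options = ['--ignore-certificate-errors', '--no-sandbox']
    if headless:
        options.append('--headless')
    if incognito:
        options.append('--incognito')
    return options


def make_scraper(driver='chrome', driver_options=None):
    """Create a scraper, assuming the constructor signature GoogleNewsScraper(driver, driver_options)."""
    # Imported lazily: this line needs the package (and a Chrome install) to be available
    from GoogleNewsScraper import GoogleNewsScraper
    return GoogleNewsScraper(driver, driver_options or build_driver_options())


# Example (requires the package to be installed):
# scraper = make_scraper(driver_options=build_driver_options(incognito=True))
```

If a path to another driver is passed instead of 'chrome', driver_options should be left out, since the flags above apply only to the Chrome driver.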


Methods

This method is publicly accessible, but it is intended for internal use by the class

locate_html_element(self, driver, element, selector, wait_seconds)

Name     Type         Required
driver   web driver   yes

Possible values:

  • A web driver (Chrome, Firefox, etc.)

Name      Type     Required
element   string   yes

Possible values:

  • Id selector of an HTML element
  • Class selector of an HTML element

Name       Type            Required
selector   Module import   yes

First install selenium

pip install selenium

Then import By

from selenium.webdriver.common.by import By

Possible values:

  • By.ID
  • By.CLASS_NAME
  • By.CSS_SELECTOR
  • By.LINK_TEXT
  • By.NAME
  • By.PARTIAL_LINK_TEXT
  • By.TAG_NAME
  • By.XPATH

Name           Type     Required
wait_seconds   number   no

default: 30

Description:

  • Waits up to wait_seconds seconds to locate an HTML element
  • If the element already exists on the page, it is located immediately
  • If the element does not yet exist (for instance, if it only appears once a request is made), wait_seconds may have to be increased depending on how long the element takes to appear

Please note: 30 seconds is usually plenty; this value rarely has to be increased
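
The waiting behaviour described above can be illustrated with a small polling loop. This is only a sketch of the general idea, not the package's implementation (which presumably relies on Selenium's own wait utilities); wait_for and the locate callable are hypothetical names used for illustration.

```python
import time


def wait_for(locate, wait_seconds=30, poll_interval=0.5):
    """Call locate() repeatedly until it returns a non-None value or time runs out."""
    deadline = time.monotonic() + wait_seconds
    while True:
        element = locate()
        if element is not None:
            # Found: return right away instead of waiting out the full timeout
            return element
        if time.monotonic() >= deadline:
            raise TimeoutError(f'element not found within {wait_seconds} seconds')
        time.sleep(poll_interval)
```

An element that is already present is returned on the first call, so a generous wait_seconds only costs time when the element never appears.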


GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb)

Name          Type   Required
search_text   str    yes

Description: One or more words to enter into the Google search engine


Name         Type   Required
date_range   str    no

Description: Filters how recent data should be (defaults to 'Past 24 hours')

Possible values:

  • Past hours
  • Past 24 hours
  • Past week
  • Past month
  • Past year
  • Archives
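
Because date_range is a plain string, a typo would presumably only surface once the scrape is under way. A minimal sketch of checking the value up front follows; check_date_range is a hypothetical helper, not part of the package, and the allowed values are copied verbatim from the list above.

```python
# Values copied from the documented list of possible date_range values
DATE_RANGES = (
    'Past hours',
    'Past 24 hours',
    'Past week',
    'Past month',
    'Past year',
    'Archives',
)


def check_date_range(date_range='Past 24 hours'):
    """Return date_range unchanged if it is one of the documented values."""
    if date_range not in DATE_RANGES:
        raise ValueError(f'unknown date_range {date_range!r}; expected one of {DATE_RANGES}')
    return date_range
```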

Name    Type         Required
pages   str or int   no

Description: Number of pages to scrape (defaults to 'max').


Name                        Type   Required
pagination_pause_per_page   int    no

Description: Number of seconds to wait before each new page is scraped (defaults to 2). This may have to be increased if Google prevents you from scraping all pages.


Name   Type       Required
cb     function   no

Description:

  • A callback that receives all article data from a single page, called once for every page scraped (defaults to False)

  • Example:

def handle_page_data(page_data: list):
    # Do something with page_data, for example print each article on the page
    for article in page_data:
        print(article)

GoogleNewsScraper(...args).search(...args, handle_page_data)

NOTE:

  • If no argument is provided for 'cb', the search method will return a two-dimensional list
  • Each inner list corresponds to one scraped page and contains one object for every news article on that page

A single article object contains the following data:

  • 'description': The preview description of the news article
  • 'title': The title of the news article
  • 'source': The source of the news article (New York Times, for instance)
  • 'image_url': The url of the preview news article image
  • 'article_link': A link to the news article
  • 'time_published_ago': A datetime string that represents when the article was published
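
To illustrate the return shape described above (a list of pages, each a list of article objects), the sketch below flattens the result into one list and groups titles by source. It assumes the article objects behave like dictionaries with the keys listed above; the sample data is made up for illustration, and flatten_pages and titles_by_source are hypothetical helpers.

```python
from collections import defaultdict


def flatten_pages(pages):
    """Turn [[article, ...], [article, ...]] into one flat list of articles."""
    return [article for page in pages for article in page]


def titles_by_source(articles):
    """Group article titles under their 'source' value."""
    grouped = defaultdict(list)
    for article in articles:
        grouped[article['source']].append(article['title'])
    return dict(grouped)


# Made-up sample shaped like the documented return value (two pages)
sample_pages = [
    [
        {'title': 'A', 'source': 'New York Times'},
        {'title': 'B', 'source': 'Example Daily'},
    ],
    [
        {'title': 'C', 'source': 'New York Times'},
    ],
]

articles = flatten_pages(sample_pages)
print(titles_by_source(articles))
```

The same helpers can be applied to the value returned by search when no 'cb' argument is given.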

