Scrapes Google News article data

googlenewsscraper

Getting Started

Installation

$ pip install GoogleNewsScraper

Reference

Importing

from GoogleNewsScraper import GoogleNewsScraper

Instantiating Scraper

GoogleNewsScraper(driver)

Constructor Parameters

Name   | Type       | Required
driver | web driver | no

Possible values:

  • 'chrome': Uses the Chrome driver bundled with this package
  • A path to another driver (Firefox, for instance) stored on the user's system
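
The two possible values above can be chosen with a small helper. This is a minimal sketch, not part of the package: `driver_argument` is a hypothetical name, and it assumes a driver path should point to an existing file.

```python
from pathlib import Path


def driver_argument(driver_path=None):
    """Return a value for the `driver` constructor parameter.

    Hypothetical helper: falls back to 'chrome' (the package's bundled
    driver) when no path is given, otherwise checks that the path exists.
    """
    if driver_path is None:
        return 'chrome'
    path = Path(driver_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No web driver found at {path}")
    return str(path)


# Intended use (requires the package to be installed):
# scraper = GoogleNewsScraper(driver_argument())
```

Checking the path up front gives a clear error before a browser session is started.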

Methods

This method is publicly accessible, but it is intended for internal use by the class.

locate_html_element(self, driver, element, selector, wait_seconds)
Name         | Type        | Required | Description
driver       | web driver  | yes      | A web driver (Chrome, Firefox, etc.)
element      | string      | yes      | Id or class selector of an HTML element
selector     | By constant | yes      | Locator strategy imported from selenium (see below)
wait_seconds | int         | no       | Maximum number of seconds to wait while locating the HTML element

To configure the 'selector' param:

First install selenium

$ pip install selenium

Then import By

from selenium.webdriver.common.by import By

Possible values:

  • By.ID
  • By.CLASS_NAME
  • By.CSS_SELECTOR
  • By.LINK_TEXT
  • By.NAME
  • By.PARTIAL_LINK_TEXT
  • By.TAG_NAME
  • By.XPATH
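
To illustrate how `element`, `selector`, and `wait_seconds` fit together, here is a hand-rolled polling sketch. It is not the package's implementation, which is not shown here. `wait_for_element` and `poll_interval` are hypothetical names, and the broad `except` is a simplification. In real code, Selenium's own `WebDriverWait` would usually be the better choice.

```python
import time


def wait_for_element(driver, element, selector, wait_seconds=5, poll_interval=0.25):
    """Poll driver.find_element(selector, element) until it succeeds.

    `selector` is a By constant (By.ID is the plain string "id") and
    `element` is the value to look up with that strategy.
    """
    deadline = time.monotonic() + wait_seconds
    while True:
        try:
            return driver.find_element(selector, element)
        except Exception:
            # Simplification: Selenium raises NoSuchElementException here;
            # catching it specifically would be stricter.
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll_interval)
```

With a real driver this would be called as `wait_for_element(driver, 'main', By.ID, wait_seconds=10)`.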

GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb) -> list or None
Name                      | Type       | Required | Description
search_text               | str        | yes      | One or more words to enter into the Google search engine
date_range                | str        | no       | Filters articles by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives
pages                     | str or int | no       | Number of pages to scrape (defaults to 'max')
pagination_pause_per_page | int        | no       | Number of seconds to wait before scraping each new page (defaults to 2). Increase this value if Google prevents you from scraping all pages.
cb                        | function   | no       | Callback that receives the article data for each page as it is scraped (defaults to False)
  • Example using the 'cb' parameter:
def handle_page_data(page_data: list):
    # Do something with page_data (called once per scraped page)
    print(len(page_data))

GoogleNewsScraper('chrome').search('some search text', cb=handle_page_data)
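
A callback is useful for saving results page by page, so that an interrupted scrape does not lose everything. The following is a minimal sketch: `make_page_writer` is a hypothetical helper, and it assumes each article object is a dict-like value that `json` can serialize (anything else is converted with `str`).

```python
import json


def make_page_writer(path):
    """Return a callback that appends each page's articles to `path`, one JSON object per line."""
    def handle_page_data(page_data: list):
        with open(path, 'a', encoding='utf-8') as f:
            for article in page_data:
                # default=str is a fallback for values json cannot encode
                f.write(json.dumps(article, default=str) + '\n')
    return handle_page_data


# Intended use (requires the package to be installed):
# GoogleNewsScraper('chrome').search('some search text', cb=make_page_writer('articles.jsonl'))
```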

NOTE:

  • If no argument is provided for 'cb', the search method returns a two-dimensional list
  • Each inner list represents one scraped page and contains one article-data object for every news article on that page
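
When the page boundaries do not matter, the two-dimensional result can be flattened into a single list of articles. A minimal sketch, where `flatten_pages` is a hypothetical helper name and the `None` case follows the documented `list or None` return type:

```python
from itertools import chain


def flatten_pages(pages):
    """Flatten [[article, ...], [article, ...]] into one list of articles."""
    if pages is None:  # search() is documented as returning a list or None
        return []
    return list(chain.from_iterable(pages))


# Intended use (requires the package to be installed):
# articles = flatten_pages(GoogleNewsScraper('chrome').search('some search text'))
```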

Example of the data that every article-object will contain:

  • 'id': A unique id for every article data object
  • 'description': The preview description of the news article
  • 'title': The title of the news article
  • 'source': The source of the news article (New York Times, for instance)
  • 'image_url': The url of the preview news article image
  • 'url': A link to the news article
  • 'date_time': A datetime string representing when the article was published
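
Because scraped pages can change, it may be worth checking that each article object carries the fields listed above before using it. A minimal sketch, assuming article objects support key lookup like a dict; `EXPECTED_FIELDS` and `missing_fields` are hypothetical names.

```python
EXPECTED_FIELDS = ('id', 'description', 'title', 'source', 'image_url', 'url', 'date_time')


def missing_fields(article):
    """Return the documented field names that are absent from `article`."""
    return [field for field in EXPECTED_FIELDS if field not in article]
```

An empty list means the article has every documented field.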
