Scrapes Google News article data
googlenewsscraper
Getting Started
Installation
pip install GoogleNewsScraper
Reference
Importing
import importlib
GoogleNewsScraper = importlib.import_module('google-news-scraper').GoogleNewsScraper
Instantiating Scraper
GoogleNewsScraper(driver, automation_options, chrome_driver_arguments)
Constructor Parameters
Name | Type | Required |
---|---|---|
driver | web driver | no |
Possible values:
- 'chrome': The driver will default to this package's Chrome driver
- A path to some driver (Firefox, for instance) stored on the user's system
Name | Type | Required |
---|---|---|
automation_options | object | yes |
Possible values:
keywords
: A series of words that will be entered into the Google News search.
date_range
: Filters how recent the results should be (defaults to 'Past 24 hours'). Can be any of the following:
- Past hours
- Past 24 hours
- Past week
- Past month
- Past year
- Archives
pages
: Number of pages that should be scraped (defaults to 'max').
pagination_pause_per_page
: Number of seconds to wait before each new page is scraped (defaults to 2). This may have to be increased if Google prevents you from scraping all pages.
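As a minimal sketch, the options listed above could be collected in a plain dictionary. The helper `build_automation_options` is hypothetical and not part of the package. The key names and defaults are taken from the list above. That the package accepts a dict with exactly these keys, and that `keywords` may be passed as a single string, are assumptions.

```python
# Hypothetical helper (not part of GoogleNewsScraper): gathers the
# documented options and applies the documented defaults.
VALID_DATE_RANGES = (
    'Past hours',
    'Past 24 hours',
    'Past week',
    'Past month',
    'Past year',
    'Archives',
)


def build_automation_options(keywords, date_range='Past 24 hours',
                             pages='max', pagination_pause_per_page=2):
    """Return a dict that uses the option names listed above."""
    if date_range not in VALID_DATE_RANGES:
        raise ValueError(f'Unsupported date_range: {date_range!r}')
    return {
        'keywords': keywords,
        'date_range': date_range,
        'pages': pages,
        'pagination_pause_per_page': pagination_pause_per_page,
    }


# Example: three pages of results from the past week, default pause.
options = build_automation_options('renewable energy',
                                   date_range='Past week', pages=3)
```

Validating `date_range` up front keeps a typo from surfacing only after the browser has already started.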
Name | Type | Required |
---|---|---|
chrome_driver_arguments | list | no |
Ignore this parameter if you are not using the default 'chrome' driver.
Possible values:
'--headless'
'--ignore-certificate-errors'
'--incognito'
'--no-sandbox'
'--disable-setuid-sandbox'
'--disable-dev-shm-usage'
Click this link to view all possible arguments: https://chromedriver.chromium.org/capabilities
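Putting the three constructor parameters together might look like the sketch below. It assumes the package is installed, that the arguments are passed positionally in the order shown in the signature above, and that `automation_options` can be a dict. The function name `create_scraper` and the keyword string are illustrative only.

```python
import importlib

# Flags chosen from the list above; adjust to your environment.
chrome_driver_arguments = [
    '--headless',
    '--no-sandbox',
    '--disable-dev-shm-usage',
]

# Assumption: the options object is a dict with the documented keys.
automation_options = {
    'keywords': 'renewable energy',
    'date_range': 'Past week',
    'pages': 2,
    'pagination_pause_per_page': 2,
}


def create_scraper():
    """Build a scraper using the default 'chrome' driver.

    The import happens inside the function, so nothing is loaded
    (and no browser is started) until the function is called.
    """
    module = importlib.import_module('google-news-scraper')
    return module.GoogleNewsScraper(
        'chrome', automation_options, chrome_driver_arguments
    )
```

Calling `create_scraper()` requires the package and a working Chrome installation.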
Methods
This method is publicly accessible, though it is really only intended to be used internally by the class
locate_html_element(self, driver, element, selector, wait_seconds)
Name | Type | Required |
---|---|---|
driver | web driver | yes |
Possible values:
- A web driver (Chrome, Firefox, etc.)
Name | Type | Required |
---|---|---|
element | string | yes |
Possible values:
- Id selector of an HTML element
- Class selector of an HTML element
Name | Type | Required |
---|---|---|
selector | Module import | yes |
First install selenium
pip install selenium
Then import By
from selenium.webdriver.common.by import By
Possible values:
By.ID
By.CLASS_NAME
By.CSS_SELECTOR
By.LINK_TEXT
By.NAME
By.PARTIAL_LINK_TEXT
By.TAG_NAME
By.XPATH
Name | Type | Required |
---|---|---|
wait_seconds | number | no |
default: 30
Description:
- The maximum number of seconds to wait while locating an HTML element
- If an element already exists on the page, it will be located immediately
- If an element does not yet exist (if it will appear once a request is made, for instance), wait_seconds may have to be increased depending on how long it takes for the element to appear
Please note: 30 seconds is plenty; this time would rarely have to be increased
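The waiting behaviour described above can be illustrated with a small polling loop that uses only the standard library. This is not the package's implementation, which presumably relies on Selenium's own wait utilities. The names `wait_for`, `poll_interval`, and `fake_lookup` are hypothetical.

```python
import time


def wait_for(locate, wait_seconds=30, poll_interval=0.5):
    """Call locate() until it returns a non-None value or time runs out."""
    deadline = time.monotonic() + wait_seconds
    while True:
        found = locate()
        if found is not None:
            # Already present: returned without any further waiting.
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f'Element not located within {wait_seconds} seconds'
            )
        time.sleep(poll_interval)


# Stand-in for a page where the element only appears on the third lookup.
attempts = {'count': 0}


def fake_lookup():
    attempts['count'] += 1
    return '<div id="result">' if attempts['count'] >= 3 else None


element = wait_for(fake_lookup, wait_seconds=5, poll_interval=0.01)
```

An element that is already present is returned on the first call, so a generous `wait_seconds` only costs time when the element never appears.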
GoogleNewsScraper.scrape()
- Begins the scraping process and returns a two-dimensional list
- Each inner list represents a single page and contains multiple objects
- Each object represents one article
Example of the data a single article object will contain:
'description'
: The preview description of the news article
'title'
: The title of the news article
'source'
: The source of the news article (New York Times, for instance)
'image_url'
: The URL of the news article's preview image
'article_link'
: A link to the news article
'time_published_ago'
: A datetime string that represents when the article was published
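Because the result is grouped by page, a common first step is to flatten it into a single list of articles. The sketch below assumes each article object is a dict with the keys listed above. The sample data is made up to mirror that shape and is not real scraper output, and `flatten_articles` is a hypothetical helper.

```python
# Placeholder data shaped like the documented return value of scrape():
# one inner list per page, one dict per article. All values are invented.
pages = [
    [
        {'title': 'Example headline A', 'source': 'Example Times',
         'article_link': 'https://example.com/a'},
        {'title': 'Example headline B', 'source': 'Example Post',
         'article_link': 'https://example.com/b'},
    ],
    [
        {'title': 'Example headline C', 'source': 'Example Times',
         'article_link': 'https://example.com/c'},
    ],
]


def flatten_articles(pages):
    """Turn the page-by-page result into one flat list of articles."""
    return [article for page in pages for article in page]


articles = flatten_articles(pages)
titles = [article['title'] for article in articles]
```

With the real scraper, `pages` would instead be the value returned by `scrape()`.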
File details
Details for the file GoogleNewsScraper-0.0.3.tar.gz.
File metadata
- Download URL: GoogleNewsScraper-0.0.3.tar.gz
- Upload date:
- Size: 8.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.1
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 7b6ed0da397a8d1ab8191712c46f505bdf027581408e039f7cc2bc6e83bfa2bf |
MD5 | 484e6b2bb021fffeb29bfd08d5b7d527 |
BLAKE2b-256 | f9660058fd99e557cc32ddeb360ec6d4bb5879c7edcfc817cd93a46a38617e89 |
File details
Details for the file GoogleNewsScraper-0.0.3-py3-none-any.whl.
File metadata
- Download URL: GoogleNewsScraper-0.0.3-py3-none-any.whl
- Upload date:
- Size: 8.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.3 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.0 CPython/3.9.1
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | b2453f3084df57d151f5bae91ce8f5e3bd77b8a3c2a0705a5fa15d7b24b2a37f |
MD5 | 8d419cc604c3e929331de761741238e7 |
BLAKE2b-256 | 33c4f623f92b82a5c6d180875e769646383fd1899b0456773a01ea0a626dd765 |
File details
Details for the file GoogleNewsScraper-0.0.3-py2-none-any.whl.
File metadata
- Download URL: GoogleNewsScraper-0.0.3-py2-none-any.whl
- Upload date:
- Size: 8.1 MB
- Tags: Python 2
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.6.1 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.2
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 3f6bf6b1a5b4adf91b07bf083608800e0712d51f6e43df4c3a8bc728f2cff981 |
MD5 | 9e3e51b914ef2a2c28eb1cc9fc05a957 |
BLAKE2b-256 | 857dd02e3853bfabbe2633f8a8bea641684f4607374ffb48054f0a9aa0d04ead |