Scrapes Google News article data
googlenewsscraper
Getting Started
Installation
pip install GoogleNewsScraper
Reference
Importing
import importlib
GoogleNewsScraper = importlib.import_module('google-news-scraper').GoogleNewsScraper
Instantiating Scraper
GoogleNewsScraper(driver, automation_options, chrome_driver_arguments)
Constructor Parameters
Name | Type | Required |
---|---|---|
driver | web driver | no |
Possible values:
- 'chrome': The driver will default to this package's Chrome driver
- A path to some driver (Firefox, for instance) stored on the user's system
Name | Type | Required |
---|---|---|
automation_options | object | yes |
Possible values:
- keywords: A series of words that will be entered into Google News.
- date_range: Filters how recent the data should be (defaults to 'Past 24 hours'). Can be any of the following:
  - Past hours
  - Past 24 hours
  - Past week
  - Past month
  - Past year
  - Archives
- pages: Number of pages that should be scraped (defaults to 'max').
- pagination_pause_per_page: Number of seconds to wait before a new page is scraped (defaults to 2). This may have to be increased if Google prevents you from scraping all pages.
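The options above can be collected into a single object before the scraper is instantiated. The following is a minimal sketch, assuming automation_options is a plain Python dict keyed by the names listed above. build_automation_options is a hypothetical helper, not part of the package, and since the expected type of keywords is not stated here, a string is assumed.

```python
# Hypothetical helper (not part of GoogleNewsScraper): builds an
# automation_options dict using the defaults documented above.
DATE_RANGES = (
    'Past hours',
    'Past 24 hours',
    'Past week',
    'Past month',
    'Past year',
    'Archives',
)


def build_automation_options(keywords, date_range='Past 24 hours',
                             pages='max', pagination_pause_per_page=2):
    # Reject date ranges that are not in the documented list.
    if date_range not in DATE_RANGES:
        raise ValueError(f'unsupported date_range: {date_range!r}')
    return {
        'keywords': keywords,  # assumed to be a plain string
        'date_range': date_range,
        'pages': pages,
        'pagination_pause_per_page': pagination_pause_per_page,
    }


options = build_automation_options('renewable energy',
                                   date_range='Past week', pages=3)
print(options)
```

Validating date_range up front is a design choice of this sketch; the package itself may handle unknown values differently.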
Name | Type | Required |
---|---|---|
chrome_driver_arguments | list | no |
Ignore this parameter if you are not choosing to use the default 'chrome' driver
Possible values:
'--headless'
'--ignore-certificate-errors'
'--incognito'
'--no-sandbox'
'--disable-setuid-sandbox'
'--disable-dev-shm-usage'
Click this link to view all possible arguments: https://chromedriver.chromium.org/capabilities
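Putting the three constructor parameters together, instantiation might look like the sketch below. It assumes the arguments are passed positionally in the order shown in the signature above and that automation_options is a dict. make_scraper is a hypothetical wrapper, and the particular choice of Chrome arguments is only an illustration.

```python
import importlib

# A subset of the Chrome arguments listed above (illustrative choice).
CHROME_DRIVER_ARGUMENTS = ['--headless', '--incognito', '--no-sandbox']


def make_scraper(automation_options, chrome_driver_arguments=None):
    """Hypothetical wrapper: create a scraper with the default 'chrome' driver."""
    if chrome_driver_arguments is None:
        chrome_driver_arguments = list(CHROME_DRIVER_ARGUMENTS)
    # Imported here so this block can be loaded even when the package
    # is not installed; the import only runs when the function is called.
    module = importlib.import_module('google-news-scraper')
    return module.GoogleNewsScraper(
        'chrome', automation_options, chrome_driver_arguments)
```

A call such as make_scraper({'keywords': 'renewable energy'}) would then return a scraper instance, provided the package and a compatible Chrome installation are available.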
Methods
This method is publicly accessible, though it is intended for internal use by the class
locate_html_element(self, driver, element, selector, wait_seconds)
Name | Type | Required |
---|---|---|
driver | web driver | yes |
Possible values:
- A web driver (Chrome, Firefox, etc.)
Name | Type | Required |
---|---|---|
element | string | yes |
Possible values:
- Id selector of an HTML element
- Class selector of an HTML element
Name | Type | Required |
---|---|---|
selector | Module import | yes |
First install selenium
pip install selenium
Then import By
from selenium.webdriver.common.by import By
Possible values:
By.ID
By.CLASS_NAME
By.CSS_SELECTOR
By.LINK_TEXT
By.NAME
By.PARTIAL_LINK_TEXT
By.TAG_NAME
By.XPATH
Name | Type | Required |
---|---|---|
wait_seconds | number | no |
default: 30
Description:
- Waits up to a certain number of seconds to locate an HTML element
- If the element already exists on the page, it is located immediately
- If the element does not exist yet (for instance, if it only appears once a request completes), wait_seconds may have to be increased depending on how long the element takes to appear
Please note: 30 seconds is plenty; this time would rarely have to be increased
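The waiting behaviour described above can be modelled in plain Python. This is a sketch of the semantics only, not the package's implementation (which presumably relies on Selenium): wait_for and its locate callback are hypothetical names, and the polling interval is an assumption.

```python
import time


def wait_for(locate, wait_seconds=30, poll_interval=0.1):
    """Call locate() until it returns something other than None.

    Returns immediately if the element is already there; otherwise keeps
    polling until wait_seconds have elapsed, then raises TimeoutError.
    """
    deadline = time.monotonic() + wait_seconds
    while True:
        found = locate()
        if found is not None:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f'element not located within {wait_seconds} seconds')
        time.sleep(poll_interval)


# Simulated element that only "appears" on the third lookup.
attempts = {'count': 0}


def fake_lookup():
    attempts['count'] += 1
    return 'element' if attempts['count'] >= 3 else None


print(wait_for(fake_lookup, wait_seconds=2, poll_interval=0.01))
```

An element that is present on the first lookup costs no waiting at all, which is why a generous default such as 30 seconds does not slow down the common case.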
GoogleNewsScraper.scrape()
- Begins the scraping process and returns a two-dimensional list
- Each inner list represents a single page and contains multiple objects
- Each object represents one article
Example of the data a single article object will contain:
- 'description': The preview description of the news article
- 'title': The title of the news article
- 'source': The source of the news article (New York Times, for instance)
- 'image_url': The URL of the news article's preview image
- 'article_link': A link to the news article
- 'time_published_ago': A datetime string representing when the article was published
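Because scrape() returns one list per page, it is often convenient to flatten the result before working with individual articles. The sketch below assumes each article object is a dict with the keys listed above. flatten_pages is a hypothetical helper, and sample_result is made-up placeholder data standing in for a real scrape() return value.

```python
def flatten_pages(pages):
    """Turn a list of pages (each a list of articles) into one flat list."""
    return [article for page in pages for article in page]


# Placeholder data in the documented shape; not real scraper output.
sample_result = [
    [  # page 1
        {'title': 'Placeholder headline A', 'source': 'Example Source',
         'article_link': 'https://example.com/a'},
        {'title': 'Placeholder headline B', 'source': 'Example Source',
         'article_link': 'https://example.com/b'},
    ],
    [  # page 2
        {'title': 'Placeholder headline C', 'source': 'Another Source',
         'article_link': 'https://example.com/c'},
    ],
]

articles = flatten_pages(sample_result)
titles = [article['title'] for article in articles]
print(len(articles), titles)
```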