A library for web scraping.
keyscraper Package Documentation
This library provides various functions that simplify webpage scraping.
There are three modules in this package.
- utils - basic utilities
- staticscraper - used to scrape raw html data
- dynamicscraper - used to scrape html data rendered by JavaScript
To install this package, run the following in a command prompt:
pip install keyscraper
[1] Basic Utilities
(1-A) TimeName - Generating a file name based on the current time:
TimeName(mode = "default")
argument | optional | default | available |
---|---|---|---|
mode | yes | TimeName.MODE_KEYWIND | TimeName.MODE_KEYWIND, TimeName.MODE_DATETIME, "default" |
self.get_name(basename = "", extension = "", f_datetime = None)
argument | optional | default | available |
---|---|---|---|
basename | yes | "" | [ string type ] |
extension | yes | "" | [ string type ] |
f_datetime | no | | [ string type ] |
There are two available modes: "keywind" and "datetime". By default, "keywind" is used.
In mode "keywind", the date is formatted as D-{month}{day}{year} where {month} consists of a single character, {day} is a 2-digit number ranging from 01 to 31 and {year} is a 4-digit number such as 2000.
Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec. |
---|---|---|---|---|---|---|---|---|---|---|---|
i | f | m | a | M | j | J | A | s | o | n | d |
For example, on December 7th of 2000, D-d072000 will be the resulting date string.
In mode "keywind", the time is formatted as T-{hour}{minute}{second} where {hour} consists of a 2-digit number ranging from 00 to 23, both {minute} and {second} are a 2-digit number ranging from 00 to 59.
For example, at 05:43:07 PM., the resulting time string will be T-174307.
For example, at 01:23:45 AM. on April 26th, 1986, the resulting string will be {basename}_D-a261986_T-012345{extension}.
In mode "datetime", the programmer must pass a strftime string. The complete documentation to datetime formatting is linked here.
(1-A-1) Example of using TimeName (mode: keywind).
from keyscraper.utils import TimeName
mode = TimeName.MODE_KEYWIND # or TimeName.MODE_DATETIME
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension)
print(timename) # e.g. "images_D-d072000_T-012345.jpg"
(1-A-2) Example of using TimeName (mode: datetime).
from keyscraper.utils import TimeName
mode = TimeName.MODE_DATETIME # or TimeName.MODE_KEYWIND
format_string = "_%y%m%d-%H%M%S"
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension, format_string)
print(timename) # e.g. "images_001207-012345.jpg"
(1-B) FileName - Dividing a filename into folder, file and extension:
FileName(filename, mode = "default")
argument | optional | default | available |
---|---|---|---|
filename | no | | [ string type ] |
mode | yes | FileName.MODE_FORWARDSLASH | FileName.MODE_FORWARDSLASH, FileName.MODE_BACKWARDSLASH |
self.__getitem__(key = "all")
argument | optional | default | available |
---|---|---|---|
key | yes | "all" | "all", "folder", "name", "extension" |
(1-B-1) Example of using FileName
from keyscraper.utils import FileName
mode = FileName.MODE_FORWARDSLASH
filename = "C:/Users/VIN/Desktop/utils.py"
name_object = FileName(filename, mode)
full_name = name_object["all"]
file_name = name_object["name"]
folder_name = name_object["folder"]
extension = name_object["extension"]
print(full_name) # "C:/Users/VIN/Desktop/utils.py"
print(folder_name) # "C:/Users/VIN/Desktop/"
print(file_name) # "utils"
print(extension) # ".py"
(1-C) FileRetrieve - Downloading a file from a direct URL:
FileRetrieve(directlink, filename = None, buffer = 4096, progress_bar = False, overwrite = None)
argument | optional | default | available |
---|---|---|---|
directlink | no | | [ string type ] |
filename | yes | None | None, [ string type ] |
buffer | yes | 4096 | [ integer (>0) type ] |
progress_bar | yes | False | True, False |
overwrite | yes | None | None, True, False |
If overwrite is None, the programmer will be asked to enter (Y/N) on each download.
self.simple_retrieve()
Calling this function will download the file from the target URL and save it to disk with the provided filename.
(1-C-1) Example of using FileRetrieve
from keyscraper.utils import FileRetrieve
url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progress_bar = True
overwrite = True
downloader = FileRetrieve(url, filename = filename, progress_bar = progress_bar, overwrite = overwrite)
downloader.simple_retrieve()
(1-D) ImageGrabber - Downloading an image from a direct URL:
ImageGrabber(filename, progressBar = False, url_timeout = None)
argument | optional | default | available |
---|---|---|---|
filename | no | | [ string type ] |
progressBar | yes | False | True, False |
url_timeout | yes | None | None, [ integer (>0) type ] |
The URL request will remain open for at most url_timeout seconds; if url_timeout is None, a default of 600 seconds is used.
self.retrieve(directlink, overwrite = None, timeout = None)
argument | optional | default | available |
---|---|---|---|
directlink | no | | [ string type ] |
overwrite | yes | None | None, True, False |
timeout | yes | None | None, [ integer (>0) type ] |
If the image hasn't finished downloading in timeout seconds, the process will terminate.
If overwrite is None, the programmer will be asked to enter (Y/N) on each download.
(1-D-1) Example of using ImageGrabber
from keyscraper.utils import ImageGrabber
url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progressBar = True
url_timeout = 60
downloader = ImageGrabber(filename, progressBar = progressBar, url_timeout = url_timeout)
downloader.retrieve(url, overwrite = True, timeout = 15)
[2] Static Scraper
(2-A) SSFormat - Defining the node attributes to scrape:
SSFormat(element_type, **kwargs)
argument | optional | default | available |
---|---|---|---|
element_type | no | | [ string type ] |
search_type | yes | None | None, [ string type ] |
search_clue | yes | None | None, [ string type ] |
multiple | yes | False | True, False |
extract | yes | None | None, [ function (1-arg) type ] |
format | yes | None | None, [ function (1-arg) type ] |
nickname | yes | None | None, [ string type ] |
filter | yes | None | None, [ function (1-arg) type ] |
keep | yes | True | True, False |
self.__getitem__(key)
argument | optional | default | available |
---|---|---|---|
key | no | | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep" |
self.get_value(key)
argument | optional | default | available |
---|---|---|---|
key | no | | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep" |
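(2-A-1) Example of using SSFormat. This is a minimal sketch based on the tables above; the format function and the column nickname are illustrative.
from keyscraper.staticscraper import SSFormat
def to_float(price): # format function: strip the pound sign and convert to a number
    return float(price.lstrip("£"))
price = SSFormat(element_type = "p", search_type = "class_", search_clue = "price_color", extract = "text", format = to_float, nickname = "price")
print(price["nickname"]) # "price"
print(price.get_value("multiple")) # False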
(2-B) SSInfo - Defining information needed for scraping:
SSInfo(f_site, f_page, f_item, f_attr)
argument | optional | default | available |
---|---|---|---|
f_site | no | | [ string type ] |
f_page | no | | [ string type ] |
f_item | no | | [ SSFormat type ] |
f_attr | no | | [ list-SSFormat type ] |
self.__getitem__(key)
argument | optional | default | available |
---|---|---|---|
key | no | | "f_site", "f_page", "f_item", "f_attr" |
self.format_page(page)
argument | optional | default | available |
---|---|---|---|
page | no | | [ integer/string type ] |
If f_page is not an empty string, page is substituted into the curly-brace placeholder in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function returns "page-5.html". Conversely, if f_page = "", the function returns "".
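(2-B-1) Example of using SSInfo. This is a minimal sketch reusing the site and page format from example (2-C-1) below.
from keyscraper.staticscraper import SSFormat, SSInfo
f_item = SSFormat(element_type = "li", multiple = True)
f_attr = [ SSFormat(element_type = "a", extract = "href", nickname = "link") ]
info = SSInfo("http://books.toscrape.com/", "catalogue/page-{}.html", f_item, f_attr)
print(info["f_site"]) # "http://books.toscrape.com/"
print(info.format_page(5)) # "catalogue/page-5.html"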
(2-C) StaticScraper - Scraping a static webpage:
StaticScraper(info, filename = None, mode = "default", timesleep = 0, **kwargs)
argument | optional | default | available |
---|---|---|---|
info | no | | [ SSInfo type ] |
filename | yes | None | None, [ string type ] |
mode | yes | StaticScraper.MODE_FILE | StaticScraper.MODE_FILE, StaticScraper.MODE_READ |
timesleep | yes | 0 | [ integer/float (>=0) type ] |
buffer | yes | 100 | [ integer (>0) type ] |
self.scrape(start = 1, pages = 1)
argument | optional | default | available |
---|---|---|---|
start | yes | 1 | [ integer (>0) type ] |
pages | yes | 1 | [ integer (>0) type ] |
(2-C-1) Example of using StaticScraper
from keyscraper.staticscraper import SSFormat, SSInfo, StaticScraper
f_site = "http://books.toscrape.com/"
f_page = "catalogue/page-{}.html"
f_item = SSFormat(element_type = "li", search_type = "class_", search_clue = "col-xs-6 col-sm-4 col-md-3 col-lg-3", multiple = True)
price = SSFormat(element_type = "p", search_type = "class_", search_clue = "price_color", extract = "text", nickname = "price")
url = SSFormat(element_type = "a", extract = "href", nickname = "link")
f_attr = [ price, url ]
info = SSInfo(f_site, f_page, f_item, f_attr)
scraper = StaticScraper(info)
scraper.scrape(start = 1, pages = 15)
[3] Dynamic Scraper
(3-A) DSFormat - Defining the node attributes to scrape:
DSFormat(xpath, **kwargs)
argument | optional | default | available |
---|---|---|---|
xpath | no | | [ string type ] |
relative | yes | False | True, False |
multiple | yes | False | True, False |
extract | yes | None | None, [ function (1-arg) type ] |
format | yes | None | None, [ function (1-arg) type ] |
filter | yes | None | None, [ function (1-arg) type ] |
retry | yes | None | None, [ function (1-arg) type ] |
callback | yes | None | None, [ function (1-arg) type ] |
nickname | yes | None | None, [ string type ] |
keep | yes | True | True, False |
click | yes | False | True, False |
In the dynamic scraper, the path to each item/attribute must be provided as an XPath expression.
If the xpath of an attribute is relative to the item (parent), relative must be set to True.
To scrape multiple items, multiple must be set to True.
If we want to extract the href attribute from the a tag, we should set extract to "href".
If we want to format a particular attribute before saving it to file, we should define a function and pass it to the argument format. The following is an example:
from keyscraper.dynamicscraper import DSFormat
def strip_spaces(attribute):
    return attribute.strip(" ")
DSFormat(format = strip_spaces)
If we want to filter out items whose attributes don't satisfy a certain condition, we should define a function and pass it to the argument filter. The following is an example:
from keyscraper.dynamicscraper import DSFormat
def filter_prices(price):
    price = float(price)
    return (price <= 50) # True to keep the item
DSFormat(filter = filter_prices)
In cases where we must wait for a specific item to render, we should define a function and pass it to the argument retry. While this function returns True, the scraper keeps waiting and re-reading the attribute; once it returns False, the value is saved. The following is an example:
from keyscraper.dynamicscraper import DSFormat
def retry(attribute):
    return (attribute[:4] == "data") # keep trying until False
DSFormat(retry = retry)
In MODE_READ, we may want to add the scraped data to a list; therefore, we should define a function and pass it to the argument callback. The following is an example:
from keyscraper.dynamicscraper import DSFormat
scraped = []
def callback(attribute):
    global scraped
    scraped.append(attribute)
    return attribute
DSFormat(callback = callback)
In the csv file, we can assign a custom column name to each attribute by passing a string to the argument nickname.
In cases where some attributes aren't needed later on, we can set keep to False so that the column is dropped when saving to the csv file.
If the item/attribute must be clicked before the desired data is available, click should be set to True. The sketch below combines these arguments.
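This is a minimal sketch; the xpaths are illustrative, in the style of example (3-D-1) below.
from keyscraper.dynamicscraper import DSFormat
# clicked before extraction; saved to the csv under the column name "title"
title = DSFormat(xpath = "//a[contains(@class, 's-item__link')]", relative = True, extract = "innerHTML", nickname = "title", click = True)
# scraped (e.g. for use in a filter) but dropped from the csv
condition = DSFormat(xpath = "//span[contains(@class, 'SECONDARY_INFO')]", relative = True, extract = "innerHTML", keep = False)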
self.__getitem__(key)
argument | optional | default | available |
---|---|---|---|
key | no | | "xpath", "relative", "multiple", "extract", "format", "filter", "retry", "callback", "nickname", "keep", "click" |
(3-B) DSInfo - Defining information needed for scraping:
DSInfo(f_site, f_page, f_item, f_attr)
argument | optional | default | available |
---|---|---|---|
f_site | no | | [ string type ] |
f_page | no | | [ string type ] |
f_item | no | | [ DSFormat type ] |
f_attr | no | | [ list-DSFormat type ] |
self.__getitem__(key)
argument | optional | default | available |
---|---|---|---|
key | no | | "f_site", "f_page", "f_item", "f_attr" |
self.format_page(page)
argument | optional | default | available |
---|---|---|---|
page | no | | [ integer/string type ] |
If f_page is not an empty string, page is substituted into the curly-brace placeholder in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function returns "page-5.html". Conversely, if f_page = "", the function returns "".
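(3-B-1) Example of using DSInfo. This is a minimal sketch reusing the site and page format from example (3-D-1) below.
from keyscraper.dynamicscraper import DSFormat, DSInfo
f_item = DSFormat(xpath = "//li[contains(@class, 's-item')]", multiple = True)
f_attr = [ DSFormat(xpath = "//a[contains(@class, 's-item__link')]", relative = True, extract = "href", nickname = "url") ]
info = DSInfo("https://www.ebay.com/sch/", "i.html?_nkw=cpu&_pgn={}", f_item, f_attr)
print(info.format_page(3)) # "i.html?_nkw=cpu&_pgn=3"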
(3-C) DriverOptions - Defining the driver:
DriverOptions(mode = "default", path = None, window = True)
argument | optional | default | available |
---|---|---|---|
mode | yes | DriverOptions.MODE_CHROME | "default", DriverOptions.MODE_CHROME, DriverOptions.MODE_FIREFOX |
path | yes | None | None, [ string type ] |
window | yes | True | True, False |
In order to use the dynamic scraper, a (browser) driver must be provided. As of February 6th, 2022, Google Chrome and Mozilla Firefox are supported.
The (file) path to the driver executable may be provided; by default, the program searches for the driver in the folder it is run from and in the folders listed in the PATH environment variable.
To download a driver for Google Chrome, visit https://chromedriver.chromium.org/downloads.
To download a driver for Mozilla Firefox, visit https://github.com/mozilla/geckodriver/releases.
To hide the browser, set window to False.
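(3-C-1) Example of using DriverOptions. This is a minimal sketch; it assumes the Chrome driver executable sits next to the script.
from keyscraper.dynamicscraper import DriverOptions
# use Google Chrome, load the driver from the current folder and hide the browser window
driveroptions = DriverOptions(mode = DriverOptions.MODE_CHROME, path = "./chromedriver.exe", window = False)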
(3-D) DynamicScraper - Scraping a dynamic webpage:
DynamicScraper(info, driveroptions, mode = "default", filename = None, timesleep = 0, buttonPath = None, itemWait = 1, **kwargs)
argument | optional | default | available |
---|---|---|---|
info | no | | [ DSInfo type ] |
driveroptions | no | | [ DriverOptions type ] |
mode | yes | DynamicScraper.MODE_READ | "default", DynamicScraper.MODE_FILE, DynamicScraper.MODE_READ |
filename | yes | None | None, [ string type ] |
timesleep | yes | 0 | [ integer (>=0) type ] |
buttonPath | yes | None | None, [ string type ] |
itemWait | yes | 1 | [ integer/float (>=0) type ] |
buffer | yes | 100 | [ integer (>0) type ] |
There are two modes available for the dynamic scraper: MODE_FILE saves the scrape results to a csv file, while MODE_READ simply scrapes the webpage, with the data accessible through callback functions.
In MODE_FILE, a filename should be provided; if none is given, a time-based name is generated for the csv file.
To slow down scraping, a number can be passed to the timesleep argument. The scraping of two consecutive pages will then be separated by at least timesleep seconds.
In cases where a load-more button exists on a single page, the x-path to that button can be provided to the argument buttonPath.
If each item must be clicked to render its content, a number can be passed to the argument itemWait. Two consecutive item clicks will be separated by at least itemWait seconds.
In MODE_FILE, if we want to save the scrape results once every 10 items, we should set buffer to 10.
self.scrape(start = 1, pages = 1, perPage = None)
argument | optional | default | available |
---|---|---|---|
start | yes | 1 | [ integer (>0) type ] |
pages | yes | 1 | [ integer (>0) type ] |
perPage | yes | None | None, [ integer (>0) type ] |
The dynamic scraper will scrape pages pages, starting from page start.
In cases where there are too many items on each page, we can set perPage (for instance, to 50) to scrape only the first 50 items on each page.
(3-D-1) Example of using DynamicScraper
from keyscraper.dynamicscraper import DSFormat, DSInfo, DriverOptions, DynamicScraper
f_site = "https://www.ebay.com/sch/"
f_page = "i.html?_nkw=cpu&_pgn={}"
f_item = DSFormat(xpath = "(//li[contains(@class, 's-item s-item__pl-on-bottom s-item--watch-at-corner')])", multiple = True)
price = DSFormat(xpath = "//span[contains(@class, 's-item__price')]", relative = True, extract = "innerHTML", nickname = "price")
url = DSFormat(xpath = "//a[contains(@class, 's-item__link')]", relative = True, extract = "href", nickname = "url")
f_attr = [ price, url ]
driveroptions = DriverOptions(path = "./chromedriver.exe")
info = DSInfo(f_site, f_page, f_item, f_attr)
scraper = DynamicScraper(info, driveroptions, mode = DynamicScraper.MODE_FILE)
scraper.scrape(start = 1, pages = 2, perPage = 5)
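(3-D-2) Example of using DynamicScraper (mode: MODE_READ). This sketch collects the prices into a list through a callback instead of writing a csv file; the site definitions are reused from (3-D-1).
from keyscraper.dynamicscraper import DSFormat, DSInfo, DriverOptions, DynamicScraper
prices = []
def collect(price): # callback: store each scraped price
    prices.append(price)
    return price
f_site = "https://www.ebay.com/sch/"
f_page = "i.html?_nkw=cpu&_pgn={}"
f_item = DSFormat(xpath = "(//li[contains(@class, 's-item')])", multiple = True)
price = DSFormat(xpath = "//span[contains(@class, 's-item__price')]", relative = True, extract = "innerHTML", nickname = "price", callback = collect)
info = DSInfo(f_site, f_page, f_item, [ price ])
driveroptions = DriverOptions(path = "./chromedriver.exe")
scraper = DynamicScraper(info, driveroptions, mode = DynamicScraper.MODE_READ)
scraper.scrape(start = 1, pages = 1, perPage = 5)
print(prices) # e.g. [ "$89.99", "$120.50", ... ]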