
A library for web scraping.


keyscraper Package Documentation


This library provides various functions that simplify webpage scraping.
There are three modules in this package.
  1. utils - basic utilities
  2. staticscraper - used to scrape raw html data
  3. dynamicscraper - used to scrape html data rendered by JavaScript
To install this package, type the following in a command prompt:
pip install keyscraper

[1] Basic Utilities

(1-A) TimeName - Generating a file name containing the current time:

TimeName(mode = "default")
argument | optional | default | available
mode | yes | TimeName.MODE_KEYWIND | TimeName.MODE_KEYWIND, TimeName.MODE_DATETIME, "default"

self.get_name(basename = "", extension = "", f_datetime = None)
argument | optional | default | available
basename | yes | "" | [ string type ]
extension | yes | "" | [ string type ]
f_datetime | yes | None | [ string type ] (required in mode "datetime")

There are two available modes: "keywind" and "datetime". By default, "keywind" is used.

In mode "keywind", the date is formatted as D-{month}{day}{year} where {month} consists of a single character, {day} is a 2-digit number ranging from 01 to 31 and {year} is a 4-digit number such as 2000.

Month:  Jan. Feb. Mar. Apr. May  Jun. Jul. Aug. Sep. Oct. Nov. Dec.
Letter: i    f    m    a    M    j    J    A    s    o    n    d

For example, on December 7th of 2000, D-d072000 will be the resulting date string.

In mode "keywind", the time is formatted as T-{hour}{minute}{second} where {hour} consists of a 2-digit number ranging from 00 to 23, both {minute} and {second} are a 2-digit number ranging from 00 to 59.

For example, at 05:43:07 PM, the resulting time string will be T-174307.

For example, at 01:23:45 AM on April 26th, 1986, the resulting string will be {basename}_D-a261986_T-012345{extension}.

In mode "datetime", the programmer must pass a strftime string. The complete documentation to datetime formatting is linked here.

(1-A-1) Example of using TimeName (mode: keywind).
from keyscraper.utils import TimeName
mode = TimeName.MODE_KEYWIND # or TimeName.MODE_DATETIME
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension)
print(timename) # "images_D-d072000_T-012345.jpg"
(1-A-2) Example of using TimeName (mode: datetime).
from keyscraper.utils import TimeName
mode = TimeName.MODE_DATETIME # or TimeName.MODE_KEYWIND
format_string = "_%y%m%d-%H%M%S"
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension, format_string)
print(timename) # "images_001207-012345.jpg"

(1-B) FileName - Dividing a filename into folder, file and extension:

FileName(filename, mode = "default")
argument | optional | default | available
filename | no | - | [ string type ]
mode | yes | FileName.MODE_FORWARDSLASH | FileName.MODE_FORWARDSLASH, FileName.MODE_BACKWARDSLASH

self.__getitem__(key = "all")
argument | optional | default | available
key | yes | "all" | "all", "folder", "name", "extension"
(1-B-1) Example of using FileName
from keyscraper.utils import FileName
mode = FileName.MODE_FORWARDSLASH
filename = "C:/Users/VIN/Desktop/utils.py"
name_object = FileName(filename)
full_name = name_object["all"]
file_name = name_object["name"]
folder_name = name_object["folder"]
extension = name_object["extension"]
print(full_name) # "C:/Users/VIN/Desktop/utils.py"
print(folder_name) # "C:/Users/VIN/Desktop/"
print(file_name) # "utils"
print(extension) # ".py"
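
(1-B-2) Example of using FileName (mode: backward slash). A minimal sketch for Windows-style paths, assuming MODE_BACKWARDSLASH splits on "\" the same way the default mode splits on "/":
from keyscraper.utils import FileName
mode = FileName.MODE_BACKWARDSLASH
filename = "C:\\Users\\VIN\\Desktop\\utils.py" # Windows-style path with backslashes
name_object = FileName(filename, mode)
print(name_object["folder"]) # expected: "C:\Users\VIN\Desktop\"
print(name_object["name"]) # expected: "utils"
print(name_object["extension"]) # expected: ".py"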

(1-C) FileRetrieve - Downloading a file from a direct URL:

FileRetrieve(directlink, filename = None, buffer = 4096, progress_bar = False, overwrite = None)
argument | optional | default | available
directlink | no | - | [ string type ]
filename | yes | None | None, [ string type ]
buffer | yes | 4096 | [ integer (>0) type ]
progress_bar | yes | False | True, False
overwrite | yes | None | None, True, False

If overwrite is None, the programmer will be asked to enter (Y/N) on each download.

self.simple_retrieve()

Calling this function will download the file from the target URL and save it to disk with the provided filename.

(1-C-1) Example of using FileRetrieve
from keyscraper.utils import FileRetrieve
url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progress_bar = True
overwrite = True
downloader = FileRetrieve(url, filename = filename, progress_bar = progress_bar, overwrite = overwrite)
downloader.simple_retrieve()

(1-D) ImageGrabber - Downloading an image from a direct URL:

ImageGrabber(filename, progressBar = False, url_timeout = None)
argument | optional | default | available
filename | no | - | [ string type ]
progressBar | yes | False | True, False
url_timeout | yes | 600 | [ integer (>0) type ]

The URL request will remain open for at most url_timeout seconds.

self.retrieve(directlink, overwrite = None, timeout = None)
argument | optional | default | available
directlink | no | - | [ string type ]
overwrite | yes | None | None, True, False
timeout | yes | None | None, [ integer (>0) type ]

If the image hasn't finished downloading within timeout seconds, the download will be terminated.

If overwrite is None, the programmer will be asked to enter (Y/N) on each download.

(1-D-1) Example of using ImageGrabber
from keyscraper.utils import ImageGrabber
url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progressBar = True
url_timeout = 60
downloader = ImageGrabber(filename, progressBar = progressBar, url_timeout = url_timeout)
downloader.retrieve(url, overwrite = True, timeout = 15)

[2] Static Scraper

(2-A) SSFormat - Defining the node attributes to scrape:

SSFormat(element_type, **kwargs)
argument | optional | default | available
element_type | no | - | [ string type ]
search_type | yes | None | None, [ string type ]
search_clue | yes | None | None, [ string type ]
multiple | yes | False | True, False
extract | yes | None | None, [ string type ]
format | yes | None | None, [ function (1-arg) type ]
nickname | yes | None | None, [ string type ]
filter | yes | None | None, [ function (1-arg) type ]
keep | yes | True | True, False

self.__getitem__(key)
argument | optional | default | available
key | no | - | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep"

self.get_value(key)
argument | optional | default | available
key | no | - | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep"

(2-B) SSInfo - Defining information needed for scraping:

SSInfo(f_site, f_page, f_item, f_attr)
argument | optional | default | available
f_site | no | - | [ string type ]
f_page | no | - | [ string type ]
f_item | no | - | [ SSFormat type ]
f_attr | no | - | [ list-SSFormat type ]

self.__getitem__(key)
argument | optional | default | available
key | no | - | "f_site", "f_page", "f_item", "f_attr"

self.format_page(page)
argument | optional | default | available
page | no | - | [ integer/string type ]

If f_page is not an empty string, page is substituted into the curly braces in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function will return "page-5.html". On the other hand, if f_page = "", the function will return "".
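
A minimal sketch of format_page, using the site and page format from example (2-C-1) below (the empty f_attr list is only for illustration):
from keyscraper.staticscraper import SSFormat, SSInfo
f_item = SSFormat(element_type = "li", multiple = True)
info = SSInfo("http://books.toscrape.com/", "catalogue/page-{}.html", f_item, [])
print(info.format_page(5)) # "catalogue/page-5.html"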

(2-C) StaticScraper - Scraping a static webpage:

StaticScraper(info, filename = None, mode = "default", timesleep = 0, **kwargs)
argument | optional | default | available
info | no | - | [ SSInfo type ]
filename | yes | None | None, [ string type ]
mode | yes | StaticScraper.MODE_FILE | StaticScraper.MODE_FILE, StaticScraper.MODE_READ
timesleep | yes | 0 | [ integer/float (>=0) type ]
buffer | yes | 100 | [ integer (>0) type ]

self.scrape(start = 1, pages = 1)
argument | optional | default | available
start | yes | 1 | [ integer (>0) type ]
pages | yes | 1 | [ integer (>0) type ]
(2-C-1) Example of using StaticScraper
from keyscraper.staticscraper import SSFormat, SSInfo, StaticScraper
f_site = "http://books.toscrape.com/"
f_page = "catalogue/page-{}.html"
f_item = SSFormat(element_type = "li", search_type = "class_", search_clue = "col-xs-6 col-sm-4 col-md-3 col-lg-3", multiple = True)
price = SSFormat(element_type = "p", search_type = "class_", search_clue = "price_color", extract = "text", nickname = "price")
url = SSFormat(element_type = "a", extract = "href", nickname = "link")
f_attr = [ price, url ]
info = SSInfo(f_site, f_page, f_item, f_attr)
scraper = StaticScraper(info)
scraper.scrape(start = 1, pages = 15)

[3] Dynamic Scraper

(3-A) DSFormat - Defining the node attributes to scrape:

DSFormat(xpath, **kwargs)
argument | optional | default | available
xpath | no | - | [ string type ]
relative | yes | False | True, False
multiple | yes | False | True, False
extract | yes | None | None, [ string type ]
format | yes | None | None, [ function (1-arg) type ]
filter | yes | None | None, [ function (1-arg) type ]
retry | yes | None | None, [ function (1-arg) type ]
callback | yes | None | None, [ function (1-arg) type ]
nickname | yes | None | None, [ string type ]
keep | yes | True | True, False
click | yes | False | True, False

In the dynamic scraper, the path to each item/attribute must be provided as an XPath.

If the xpath of an attribute is relative to the item (parent), relative must be set to True.

To scrape multiple items, multiple must be set to True.

If we want to extract the href attribute from the a tag, we should set extract to "href".

If we want to format a particular attribute before saving it to file, we should define a function and pass it to the argument format. The following is an example:

from keyscraper.dynamicscraper import DSFormat
def strip_spaces(attribute):
    return attribute.strip(" ")
DSFormat(format = strip_spaces)

If we want to filter out items whose attributes don't satisfy a certain condition, we should define a function and pass it to the argument filter. The following is an example:

from keyscraper.dynamicscraper import DSFormat
def filter_prices(price):
    price = float(price)
    return (price <= 50) # True to keep the item
DSFormat(filter = filter_prices)

In cases where we must wait for a specific item to render, we should define a function and pass it to the argument retry. While this function returns True, the scraper keeps waiting and re-reads the attribute; once it returns False, the attribute is saved. The following is an example:

from keyscraper.dynamicscraper import DSFormat
def retry(attribute):
    return (attribute[:4] == "data") # keep retrying while this returns True
DSFormat(retry = retry)

In MODE_READ, we may want to add the scraped data to a list; therefore, we should define a function and pass it to the argument callback. The following is an example:

from keyscraper.dynamicscraper import DSFormat
scraped = []
def callback(attribute):
    global scraped
    scraped.append(attribute)
    return attribute
DSFormat(callback = callback)

In the csv file, we can assign a custom column name to each attribute by passing a string to the argument nickname.

In cases where some attributes aren't needed further on, we can set keep to False so the column is dropped when saving to the csv file.

If the item/attribute must be clicked before the desired data is available, click should be set to True. A combined sketch follows the table below.

self.__getitem__(key)
argument | optional | default | available
key | no | - | "xpath", "relative", "multiple", "extract", "format", "filter", "retry", "callback", "nickname", "keep", "click"

(3-B) DSInfo - Defining information needed for scraping:

DSInfo(f_site, f_page, f_item, f_attr)
argument | optional | default | available
f_site | no | - | [ string type ]
f_page | no | - | [ string type ]
f_item | no | - | [ DSFormat type ]
f_attr | no | - | [ list-DSFormat type ]

self.__getitem__(key)
argument | optional | default | available
key | no | - | "f_site", "f_page", "f_item", "f_attr"

self.format_page(page)
argument | optional | default | available
page | no | - | [ integer/string type ]

If f_page is not an empty string, page is substituted into the curly braces in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function will return "page-5.html". On the other hand, if f_page = "", the function will return "".

(3-C) DriverOptions - Defining driver:

DriverOptions(mode = "default", path = None, window = True)
argument | optional | default | available
mode | yes | DriverOptions.MODE_CHROME | "default", DriverOptions.MODE_CHROME, DriverOptions.MODE_FIREFOX
path | yes | None | None, [ string type ]
window | yes | True | True, False

In order to use the dynamic scraper, a (browser) driver must be provided. As of February 6th, 2022, Google Chrome and Mozilla Firefox are supported.

The (file) path to the driver (executable) can be provided through path. By default, the program searches for the driver in the folder it is run from and in the folders listed in the PATH environment variable.

To download a driver for Google Chrome, visit https://chromedriver.chromium.org/downloads.

To download a driver for Mozilla Firefox, visit https://github.com/mozilla/geckodriver/releases.

To hide the browser, set window to False.
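
As an illustration, a minimal sketch of a hidden-window Firefox setup (the driver path is hypothetical and must point to a real geckodriver executable):
from keyscraper.dynamicscraper import DriverOptions
driveroptions = DriverOptions(
    mode = DriverOptions.MODE_FIREFOX,
    path = "./geckodriver.exe", # hypothetical path to the driver executable
    window = False # hide the browser window
)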

(3-D) DynamicScraper - Scraping:

DynamicScraper(info, driveroptions, mode = "default", filename = None, timesleep = 0, buttonPath = None, itemWait = 1, **kwargs)
argument | optional | default | available
info | no | - | [ DSInfo type ]
driveroptions | no | - | [ DriverOptions type ]
mode | yes | DynamicScraper.MODE_READ | "default", DynamicScraper.MODE_FILE, DynamicScraper.MODE_READ
filename | yes | None | None, [ string type ]
timesleep | yes | 0 | [ integer (>=0) type ]
buttonPath | yes | None | None, [ string type ]
itemWait | yes | 1 | [ integer/float (>=0) type ]
buffer | yes | 100 | [ integer (>0) type ]

There are two modes available for the dynamic scraper: MODE_FILE saves the scraped results to a csv file, while MODE_READ simply scrapes the webpage so the data can be accessed in callback.

In MODE_FILE, a filename should be provided; if none is given, a time-based name will be generated for the csv file.

To slow down scraping, an integer can be passed to the timesleep argument; the scraping of two consecutive pages will then be separated by at least timesleep seconds.

In cases where a load-more button exists on a single page, the x-path to that button can be provided to the argument buttonPath.

If each item must be clicked to render its content, a number can be passed to the argument itemWait. Two consecutive item clicks will be separated by at least itemWait seconds.

In MODE_FILE, if we want to save the scraped results once every 10 items, we should set buffer to 10.

self.scrape(start = 1, pages = 1, perPage = None)
argument | optional | default | available
start | yes | 1 | [ integer (>0) type ]
pages | yes | 1 | [ integer (>0) type ]
perPage | yes | None | None, [ integer (>0) type ]

The dynamic scraper will scrape pages pages, starting from page start.

In cases where there are too many items on each page, we can set perPage to 50 to scrape just 50 items per page.

(3-D-1) Example of using DynamicScraper
from keyscraper.dynamicscraper import DSFormat, DSInfo, DriverOptions, DynamicScraper
f_site = "https://www.ebay.com/sch/"
f_page = "i.html?_nkw=cpu&_pgn={}"
f_item = DSFormat(xpath = "(//li[contains(@class, 's-item s-item__pl-on-bottom s-item--watch-at-corner')])", multiple = True)
price = DSFormat(xpath = "//span[contains(@class, 's-item__price')]", relative = True, extract = "innerHTML", nickname = "price")
url = DSFormat(xpath = "//a[contains(@class, 's-item__link')]", relative = True, extract = "href", nickname = "url")
f_attr = [ price, url ]
driveroptions = DriverOptions(path = "./chromedriver.exe")
info = DSInfo(f_site, f_page, f_item, f_attr)
scraper = DynamicScraper(info, driveroptions, mode = DynamicScraper.MODE_FILE)
scraper.scrape(start = 1, pages = 2, perPage = 5)
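
(3-D-2) Example of using DynamicScraper (mode: read). A sketch of the same scrape that collects prices into a list through callback instead of writing a csv file; the timesleep and perPage values are illustrative:
from keyscraper.dynamicscraper import DSFormat, DSInfo, DriverOptions, DynamicScraper
prices = []
def collect(attribute):
    prices.append(attribute) # store each scraped price
    return attribute
f_site = "https://www.ebay.com/sch/"
f_page = "i.html?_nkw=cpu&_pgn={}"
f_item = DSFormat(xpath = "(//li[contains(@class, 's-item s-item__pl-on-bottom s-item--watch-at-corner')])", multiple = True)
price = DSFormat(xpath = "//span[contains(@class, 's-item__price')]", relative = True, extract = "innerHTML", nickname = "price", callback = collect)
f_attr = [ price ]
driveroptions = DriverOptions(path = "./chromedriver.exe")
info = DSInfo(f_site, f_page, f_item, f_attr)
scraper = DynamicScraper(info, driveroptions, mode = DynamicScraper.MODE_READ, timesleep = 2)
scraper.scrape(start = 1, pages = 1, perPage = 5)
print(prices) # the price strings collected during the scrape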
