A library for web scraping.

keyscraper Package Documentation


This library provides functions that simplify webpage scraping.
There are three modules in this package:
  1. utils - basic utilities
  2. staticscraper - used to scrape raw HTML data
  3. dynamicscraper - used to scrape HTML data rendered by JavaScript
To install this package, run "pip install keyscraper" in a command prompt.

[1] Basic Utilities

(1-A) TimeName - Generating a file name composed of the current time:

TimeName(mode = "default")

| argument | optional | default | available |
|---|---|---|---|
| mode | yes | "keywind" | "keywind", "datetime", "default" |

self.get_name(basename = "", extension = "", f_datetime = None)

| argument | optional | default | available |
|---|---|---|---|
| basename | yes | "" | [ string type ] |
| extension | yes | "" | [ string type ] |
| f_datetime | no | | [ string type ] |

There are two available modes: "keywind" and "datetime". The value "default" selects "keywind", which is also used when no mode is given.

In mode "keywind", the date is formatted as D-{month}{day}{year} where {month} consists of a single character, {day} is a 2-digit number ranging from 01 to 31 and {year} is a 4-digit number such as 2000.

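| Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| i | f | m | a | M | j | J | A | s | o | n | d |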

For example, on December 7th of 2000, D-d072000 will be the resulting date string.

In mode "keywind", the time is formatted as T-{hour}{minute}{second} where {hour} is a 2-digit number ranging from 00 to 23, and {minute} and {second} are 2-digit numbers ranging from 00 to 59.

For example, at 05:43:07 PM, the resulting time string will be T-174307.

Putting both parts together, at 01:23:45 AM on April 26th, 1986, the resulting file name will be {basename}_D-a261986_T-012345{extension}.
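The "keywind" format described above can be reproduced with the standard library alone. The following is a minimal sketch, not keyscraper's own code: the function name keywind_name and the constant MONTH_LETTERS are made up for illustration, and only the month letters and the layout of the string are taken from this documentation.

```python
from datetime import datetime

# Month letters as listed above, January through December.
MONTH_LETTERS = "ifmaMjJAsond"


def keywind_name(moment, basename="", extension=""):
    """Return '{basename}_D-{month}{day}{year}_T-{hour}{minute}{second}{extension}'."""
    month = MONTH_LETTERS[moment.month - 1]
    date_part = f"D-{month}{moment.day:02d}{moment.year:04d}"
    time_part = f"T-{moment.hour:02d}{moment.minute:02d}{moment.second:02d}"
    return f"{basename}_{date_part}_{time_part}{extension}"


# 01:23:45 AM on April 26th, 1986
print(keywind_name(datetime(1986, 4, 26, 1, 23, 45), "images", ".jpg"))
# images_D-a261986_T-012345.jpg
```

Passing the moment explicitly, rather than reading the clock inside the function, keeps the sketch easy to check against the examples in this section.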

In mode "datetime", the programmer must pass a strftime format string as f_datetime. The available format codes are described in the Python datetime documentation.
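As a quick illustration of what a strftime format string produces, the sketch below formats a fixed moment with Python's datetime module. It does not call keyscraper; the variable names are arbitrary, and the final line only mirrors the "{basename}_{time}{extension}" shape shown in the example that follows.

```python
from datetime import datetime

# December 7th, 2000 at 01:23:45
moment = datetime(2000, 12, 7, 1, 23, 45)

# %y = 2-digit year, %m = month, %d = day, %H/%M/%S = 24-hour time
stamp = moment.strftime("%y%m%d-%H%M%S")
print(stamp)                   # 001207-012345
print(f"images_{stamp}.jpg")   # images_001207-012345.jpg
```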

(1-A-1) Example of using TimeName (mode: keywind).

    from keyscraper.utils import TimeName

    mode = "keywind" # "keywind" or "datetime"
    name = "images"
    extension = ".jpg"
    timename = TimeName(mode).get_name(name, extension)
    print(timename) # e.g. "images_D-d072000_T-012345.jpg"

(1-A-2) Example of using TimeName (mode: datetime).

    from keyscraper.utils import TimeName

    mode = "datetime" # "keywind" or "datetime"
    format_string = "%y%m%d-%H%M%S"
    name = "images"
    extension = ".jpg"
    timename = TimeName(mode).get_name(name, extension, format_string)
    print(timename) # e.g. "images_001207-012345.jpg"

(1-B) FileName - Dividing a filename into folder, file and extension:

FileName(filename, mode = "default")

| argument | optional | default | available |
|---|---|---|---|
| filename | no | | [ string type ] |
| mode | yes | FileName.MODE_FORWARDSLASH | FileName.MODE_FORWARDSLASH, FileName.MODE_BACKWARDSLASH |

self.__getitem__(key = "all")

| argument | optional | default | available |
|---|---|---|---|
| key | no | "all" | "all", "folder", "name", "extension" |

(1-B-1) Example of using FileName

    from keyscraper.utils import FileName

    mode = FileName.MODE_FORWARDSLASH
    filename = "C:/Users/VIN/Desktop/utils.py"
    name_object = FileName(filename, mode = mode)
    # Index the object to get each part of the path.
    full_name = name_object["all"]
    file_name = name_object["name"]
    folder_name = name_object["folder"]
    extension = name_object["extension"]
    print(full_name, file_name, folder_name, extension)
    # "C:/Users/VIN/Desktop/utils.py utils C:/Users/VIN/Desktop/ .py"
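To make the folder / name / extension split concrete, here is a small standard-library sketch that yields the same three parts as the example output above. It is an illustration only and says nothing about how FileName is implemented; the function split_filename and its separator parameter are hypothetical names.

```python
def split_filename(filename, separator="/"):
    """Split a path into (folder, name, extension).

    The folder keeps its trailing separator and the extension keeps its dot.
    """
    folder, sep, rest = filename.rpartition(separator)
    folder += sep
    name, dot, ext = rest.rpartition(".")
    if not dot or not name:
        # No dot at all, or a leading-dot name such as ".bashrc".
        return folder, rest, ""
    return folder, name, dot + ext


print(split_filename("C:/Users/VIN/Desktop/utils.py"))
# ('C:/Users/VIN/Desktop/', 'utils', '.py')
```

Passing "\\" as the separator handles backslash-separated paths in the same way.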

(1-C) FileRetrieve - Downloading a file from a direct URL:

FileRetrieve(directlink, filename = None, buffer = 4096, progress_bar = False, overwrite = None)

| argument | optional | default | available |
|---|---|---|---|
| directlink | no | | [ string type ] |
| filename | yes | None | [ string type ] |
| buffer | yes | 4096 | [ integer (>0) type ] |
| progress_bar | yes | False | True, False |
| overwrite | yes | None | None, True, False |

If overwrite is None, the programmer will be asked to enter (Y/N) on each download.

self.simple_retrieve()

Calling this function will download the file from the target URL and save it to disk with the provided filename.

(1-C-1) Example of using FileRetrieve

    from keyscraper.utils import FileRetrieve

    url = "http://www.lenna.org/len_top.jpg"
    filename = "lenna.jpg"
    progress_bar = True
    overwrite = True # replace an existing file without asking
    downloader = FileRetrieve(url, filename = filename, progress_bar = progress_bar, overwrite = overwrite)
    downloader.simple_retrieve()
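The buffer argument suggests that the file is transferred in pieces of a fixed size. The sketch below shows that general technique with the standard library, assuming buffer means "bytes per read"; that meaning is a guess, not something this documentation states. The functions copy_in_chunks and download are hypothetical helpers, not part of keyscraper.

```python
import io
import urllib.request


def copy_in_chunks(source, target, buffer=4096):
    """Copy a binary stream in reads of at most `buffer` bytes; return the byte count."""
    if buffer <= 0:
        raise ValueError("buffer must be a positive integer")
    total = 0
    while True:
        chunk = source.read(buffer)
        if not chunk:  # an empty read means the stream is exhausted
            break
        target.write(chunk)
        total += len(chunk)
    return total


def download(url, filename, buffer=4096):
    """Stream `url` to `filename` without holding the whole file in memory."""
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        return copy_in_chunks(response, out, buffer)


# Offline demonstration with in-memory streams instead of a real URL.
source = io.BytesIO(b"x" * 10000)
target = io.BytesIO()
print(copy_in_chunks(source, target, buffer=4096))  # 10000
```

Reading in fixed-size chunks keeps memory use bounded regardless of file size, and gives a natural point at which a progress bar could be updated.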

(1-D) ImageGrabber - Downloading an image from a direct URL:

ImageGrabber(filename, progressBar = False, url_timeout = None)

| argument | optional | default | available |
|---|---|---|---|
| filename | no | | [ string type ] |
| progressBar | yes | False | True, False |
| url_timeout | yes | 600 | [ integer (>0) type ] |

The URL request will be open for a maximum of url_timeout seconds.

self.retrieve(directlink, overwrite = None, timeout = None)

| argument | optional | default | available |
|---|---|---|---|
| directlink | no | | [ string type ] |
| overwrite | yes | None | None, True, False |
| timeout | yes | None | None, [ integer (>0) type ] |

If the image hasn't finished downloading in timeout seconds, the process will terminate.

If overwrite is None, the programmer will be asked to enter (Y/N) on each download.
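The three overwrite values (None, True, False) can be read as a small decision rule. The sketch below is one plausible interpretation, not the library's code: it assumes the prompt only matters when the target file already exists, which this documentation does not confirm. The function should_write and its ask parameter are hypothetical.

```python
import os


def should_write(filename, overwrite=None, ask=input):
    """Decide whether a download may be written to `filename`.

    overwrite=True  -> always write
    overwrite=False -> never replace an existing file
    overwrite=None  -> ask (Y/N) when the file already exists
    """
    if not os.path.exists(filename):
        return True
    if overwrite is None:
        answer = ask(f"{filename} already exists. Overwrite? (Y/N) ")
        return answer.strip().lower() == "y"
    return bool(overwrite)


# os.curdir always exists, so it stands in for "a file that is already there".
print(should_write(os.curdir, overwrite=False))  # False
```

Taking the prompt function as a parameter lets the rule be exercised without waiting for keyboard input.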

(1-D-1) Example of using ImageGrabber

    from keyscraper.utils import ImageGrabber

    url = "http://www.lenna.org/len_top.jpg"
    filename = "lenna.jpg"
    progressBar = True
    url_timeout = 60 # seconds the URL request may stay open
    downloader = ImageGrabber(filename, progressBar = progressBar, url_timeout = url_timeout)
    # Give up if the image has not finished downloading after 15 seconds.
    downloader.retrieve(url, overwrite = True, timeout = 15)

[2] Static Scraper

(2-A) SSFormat - Defining the node attributes to scrape:

SSFormat(element_type, **kwargs)

| argument | optional | default | available |
|---|---|---|---|
| element_type | no | | [ string type ] |
| search_type | yes | None | None, [ string type ] |
| search_clue | yes | None | None, [ string type ] |
| multiple | yes | False | True, False |
| extract | yes | None | None, [ function (1-arg) type ] |
| format | yes | None | None, [ function (1-arg) type ] |
| nickname | yes | None | None, [ string type ] |
| filter | yes | None | None, [ function (1-arg) type ] |
| keep | yes | True | True, False |

self.__getitem__(key)

| argument | optional | default | available |
|---|---|---|---|
| key | no | | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep" |

self.get_value(key)

| argument | optional | default | available |
|---|---|---|---|
| key | no | | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep" |

(2-B) SSInfo - Defining information needed for scraping:

SSInfo(f_site, f_page, f_item, f_attr)

| argument | optional | default | available |
|---|---|---|---|
| f_site | no | | [ string type ] |
| f_page | no | | [ string type ] |
| f_item | no | | [ SSFormat type ] |
| f_attr | no | | [ list-SSFormat type ] |

self.__getitem__(key)

| argument | optional | default | available |
|---|---|---|---|
| key | no | | "f_site", "f_page", "f_item", "f_attr" |

self.format_page(page)

| argument | optional | default | available |
|---|---|---|---|
| page | no | | [ integer/string type ] |

If f_page is not an empty string, page is substituted for the curly braces in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function will return "page-5.html". Conversely, if f_page = "", the function will return "".
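The behaviour described above matches Python's built-in str.format. The sketch below is a stand-alone approximation, not the method's actual source; the free function format_page is a stand-in for the method, and joining f_site with the formatted page to obtain a full URL is an assumption made for illustration.

```python
def format_page(f_page, page):
    """Substitute `page` for the curly braces in `f_page`; an empty template stays empty."""
    if f_page == "":
        return ""
    return f_page.format(page)


print(format_page("page-{}.html", 5))  # page-5.html
print(repr(format_page("", 5)))        # ''

# Assumed usage: prefix each formatted page with the site address.
f_site = "http://books.toscrape.com/catalogue/"
for page in range(1, 4):
    print(f_site + format_page("page-{}.html", page))
```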

(2-C) StaticScraper - Scraping a static webpage:

StaticScraper(info, filename = None, mode = "default", timesleep = 0, **kwargs)

| argument | optional | default | available |
|---|---|---|---|
| info | no | | [ SSInfo type ] |
| filename | yes | None | None, [ string type ] |
| mode | yes | StaticScraper.MODE_FILE | StaticScraper.MODE_FILE, StaticScraper.MODE_READ |
| timesleep | yes | 0 | [ integer/float (>=0) type ] |
| buffer | yes | 100 | [ integer (>0) type ] |

self.scrape(start = 1, pages = 1)

| argument | optional | default | available |
|---|---|---|---|
| start | yes | 1 | [ integer (>0) type ] |
| pages | yes | 1 | [ integer (>0) type ] |

(2-C-1) Example of using StaticScraper

    from keyscraper.staticscraper import SSFormat, SSInfo, StaticScraper

    f_site = "http://books.toscrape.com/catalogue/"
    f_page = "page-{}.html"
    # Each book on a page is an <li> element with this class.
    f_item = SSFormat(element_type = "li", search_type = "class_", search_clue = "col-xs-6 col-sm-4 col-md-3 col-lg-3", multiple = True)
    # Attributes to extract from each item.
    f_price = SSFormat(element_type = "p", search_type = "class_", search_clue = "price_color", extract = "text", nickname = "price")
    f_url = SSFormat(element_type = "a", extract = "href", nickname = "link")
    f_attr = [ f_price, f_url ]
    info = SSInfo(f_site, f_page, f_item, f_attr)
    scraper = StaticScraper(info)
    scraper.scrape(start = 1, pages = 15)
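To show the kind of extraction the example above describes (one record per li item, holding the text of the p element with class price_color and the href of the first a element), here is a standard-library sketch. It does not use keyscraper and is not how StaticScraper works internally; the class PriceLinkParser, the function extract_records, and the sample HTML are all invented for illustration.

```python
from html.parser import HTMLParser

ITEM_CLASS = "col-xs-6 col-sm-4 col-md-3 col-lg-3"

# Made-up markup with the same shape as the elements targeted above.
SAMPLE = """
<ul>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <a href="book-1/index.html">Book One</a><p class="price_color">10.00</p>
  </li>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <a href="book-2/index.html">Book Two</a><p class="price_color">22.50</p>
  </li>
</ul>
"""


class PriceLinkParser(HTMLParser):
    """Collect one {"price", "link"} record per matching <li> item."""

    def __init__(self):
        super().__init__()
        self.records = []       # finished items
        self._current = None    # record being filled, None outside an item
        self._in_price = False  # True while inside <p class="price_color">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == ITEM_CLASS:
            self._current = {"price": None, "link": None}
        elif self._current is not None:
            if tag == "a" and self._current["link"] is None:
                self._current["link"] = attrs.get("href")
            elif tag == "p" and attrs.get("class") == "price_color":
                self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self._current["price"] = data.strip()

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_price = False
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def extract_records(html):
    """Parse `html` and return the list of price/link records."""
    parser = PriceLinkParser()
    parser.feed(html)
    return parser.records


for record in extract_records(SAMPLE):
    print(record)
```

The sketch assumes items are not nested inside one another, which holds for the sample markup but is not guaranteed for arbitrary pages.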
