keyscraper Package Documentation
This library provides various functions that simplify webpage scraping.
The package contains three modules:
- utils - basic utilities
- staticscraper - used to scrape raw html data
- dynamicscraper - used to scrape html data rendered by JavaScript
To install this package, run `pip install keyscraper` from the command prompt.
[1] Basic Utilities
(1-A) TimeName - Generating a file name composed of the current time:
TimeName(mode = "default")
| argument | optional | default | available |
| --- | --- | --- | --- |
| mode | yes | "keywind" | "keywind", "datetime", "default" |
self.get_name(basename = "", extension = "", f_datetime = None)
| argument | optional | default | available |
| --- | --- | --- | --- |
| basename | yes | "" | [ string type ] |
| extension | yes | "" | [ string type ] |
| f_datetime | no | | [ string type ] |
There are two available modes: "keywind" and "datetime". By default, "keywind" is used.
In mode "keywind", the date is formatted as D-{month}{day}{year}, where {month} is a single letter (see the table below), {day} is a two-digit number ranging from 01 to 31, and {year} is a four-digit number such as 2000.
| Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i | f | m | a | M | j | J | A | s | o | n | d |
For example, on December 7th of 2000, D-d072000 will be the resulting date string.
In mode "keywind", the time is formatted as T-{hour}{minute}{second}, where {hour} is a two-digit number ranging from 00 to 23, and {minute} and {second} are two-digit numbers ranging from 00 to 59.
For example, at 05:43:07 PM, the resulting time string will be T-174307.
Combining both parts, at 01:23:45 AM on April 26th, 1986, the resulting string will be {basename}_D-a261986_T-012345{extension}.
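The "keywind" scheme described above can be reproduced with standard strftime calls. The sketch below is purely illustrative, not the library's implementation; `keywind_name` and `MONTH_CODES` are hypothetical names:

```python
from datetime import datetime

# Single-letter month codes used by the "keywind" scheme
# (Jan..Dec -> i f m a M j J A s o n d).
MONTH_CODES = "ifmaMjJAsond"

def keywind_name(basename, extension, when):
    """Build {basename}_D-{month}{day}{year}_T-{hour}{minute}{second}{extension}."""
    month = MONTH_CODES[when.month - 1]
    date_part = "D-" + month + when.strftime("%d%Y")
    time_part = "T-" + when.strftime("%H%M%S")
    return f"{basename}_{date_part}_{time_part}{extension}"

print(keywind_name("images", ".jpg", datetime(1986, 4, 26, 1, 23, 45)))
# images_D-a261986_T-012345.jpg
```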
In mode "datetime", the caller must pass a strftime format string; see the Python datetime documentation (https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) for the complete list of format codes.
(1-A-1) Example of using TimeName (mode: keywind).
```python
from keyscraper.utils import TimeName

mode = "keywind" # "keywind" or "datetime"
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension)
print(timename) # e.g. "images_D-d072000_T-012345.jpg"
```
(1-A-2) Example of using TimeName (mode: datetime).
```python
from keyscraper.utils import TimeName

mode = "datetime" # "keywind" or "datetime"
format_string = "%y%m%d-%H%M%S"
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension, format_string)
print(timename) # e.g. "images_001207-012345.jpg"
```
(1-B) FileName - Dividing a filename into folder, file and extension:
FileName(filename, mode = "default")
| argument | optional | default | available |
| --- | --- | --- | --- |
| filename | no | | [ string type ] |
| mode | yes | FileName.MODE_FORWARDSLASH | FileName.MODE_FORWARDSLASH, FileName.MODE_BACKWARDSLASH |
self.__getitem__(key = "all")
| argument | optional | default | available |
| --- | --- | --- | --- |
| key | yes | "all" | "all", "folder", "name", "extension" |
(1-B-1) Example of using FileName
```python
from keyscraper.utils import FileName

mode = FileName.MODE_FORWARDSLASH
filename = "C:/Users/VIN/Desktop/utils.py"
name_object = FileName(filename, mode)
full_name = name_object["all"]
file_name = name_object["name"]
folder_name = name_object["folder"]
extension = name_object["extension"]
print(full_name, file_name, folder_name, extension)
# "C:/Users/VIN/Desktop/utils.py utils C:/Users/VIN/Desktop/ .py"
```
(1-C) FileRetrieve - Downloading a file from a direct URL:
FileRetrieve(directlink, filename = None, buffer = 4096, progress_bar = False, overwrite = None)
| argument | optional | default | available |
| --- | --- | --- | --- |
| directlink | no | | [ string type ] |
| filename | yes | | [ string type ] |
| buffer | yes | 4096 | [ integer (>0) type ] |
| progress_bar | yes | False | True, False |
| overwrite | yes | None | None, True, False |
If overwrite is None, the user will be prompted to enter (Y/N) before each download.
self.simple_retrieve()
Calling this function will download the file from the target URL and save it to disk with the provided filename.
(1-C-1) Example of using FileRetrieve
```python
from keyscraper.utils import FileRetrieve

url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progress_bar = True
overwrite = True
downloader = FileRetrieve(url, filename = filename, progress_bar = progress_bar, overwrite = overwrite)
downloader.simple_retrieve()
```
(1-D) ImageGrabber - Downloading an image from a direct URL:
ImageGrabber(filename, progressBar = False, url_timeout = None)
| argument | optional | default | available |
| --- | --- | --- | --- |
| filename | no | | [ string type ] |
| progressBar | yes | False | True, False |
| url_timeout | yes | 600 | [ integer (>0) type ] |
The URL request will be open for a maximum of url_timeout seconds.
self.retrieve(directlink, overwrite = None, timeout = None)
| argument | optional | default | available |
| --- | --- | --- | --- |
| directlink | no | | [ string type ] |
| overwrite | yes | None | None, True, False |
| timeout | yes | None | None, [ integer (>0) type ] |
If the image hasn't finished downloading within timeout seconds, the download is terminated.
If overwrite is None, the user will be prompted to enter (Y/N) before each download.
(1-D-1) Example of using ImageGrabber
```python
from keyscraper.utils import ImageGrabber

url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progressBar = True
url_timeout = 60
downloader = ImageGrabber(filename, progressBar = progressBar, url_timeout = url_timeout)
downloader.retrieve(url, overwrite = True, timeout = 15)
```
[2] Static Scraper
(2-A) SSFormat - Defining the node attributes to scrape:
SSFormat(element_type, **kwargs)
| argument | optional | default | available |
| --- | --- | --- | --- |
| element_type | no | | [ string type ] |
| search_type | yes | None | None, [ string type ] |
| search_clue | yes | None | None, [ string type ] |
| multiple | yes | False | True, False |
| extract | yes | None | None, [ function (1-arg) type ] |
| format | yes | None | None, [ function (1-arg) type ] |
| nickname | yes | None | None, [ string type ] |
| filter | yes | None | None, [ function (1-arg) type ] |
| keep | yes | True | True, False |
self.__getitem__(key)
| argument | optional | default | available |
| --- | --- | --- | --- |
| key | no | | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep" |
self.get_value(key)
| argument | optional | default | available |
| --- | --- | --- | --- |
| key | no | | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep" |
(2-B) SSInfo - Defining information needed for scraping:
SSInfo(f_site, f_page, f_item, f_attr)
| argument | optional | default | available |
| --- | --- | --- | --- |
| f_site | no | | [ string type ] |
| f_page | no | | [ string type ] |
| f_item | no | | [ SSFormat type ] |
| f_attr | no | | [ list-SSFormat type ] |
self.__getitem__(key)
| argument | optional | default | available |
| --- | --- | --- | --- |
| key | no | | "f_site", "f_page", "f_item", "f_attr" |
self.format_page(page)
| argument | optional | default | available |
| --- | --- | --- | --- |
| page | no | | [ integer/string type ] |
If f_page is not an empty string, page is substituted into the curly braces in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function will return "page-5.html". On the contrary, if f_page = "", the function will return "".
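The substitution rule above can be sketched in a few lines (an illustrative stand-in for the format_page behavior, not the library's code):

```python
def format_page(f_page, page):
    # An empty template yields an empty string; otherwise the page
    # value is substituted into the "{}" placeholder.
    return f_page.format(page) if f_page else ""

print(format_page("page-{}.html", 5)) # page-5.html
print(repr(format_page("", 5)))       # ''
```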
(2-C) StaticScraper - Scraping a static webpage:
StaticScraper(info, filename = None, mode = "default", timesleep = 0, **kwargs)
| argument | optional | default | available |
| --- | --- | --- | --- |
| info | no | | [ SSInfo type ] |
| filename | yes | None | None, [ string type ] |
| mode | yes | StaticScraper.MODE_FILE | StaticScraper.MODE_FILE, StaticScraper.MODE_READ |
| timesleep | yes | 0 | [ integer/float (>=0) type ] |
| buffer | yes | 100 | [ integer (>0) type ] |
self.scrape(start = 1, pages = 1)
| argument | optional | default | available |
| --- | --- | --- | --- |
| start | yes | 1 | [ integer (>0) type ] |
| pages | yes | 1 | [ integer (>0) type ] |
(2-C-1) Example of using StaticScraper
```python
from keyscraper.staticscraper import SSFormat, SSInfo, StaticScraper

f_site = "http://books.toscrape.com/catalogue/"
f_page = "page-{}.html"
f_item = SSFormat(element_type = "li", search_type = "class_", search_clue = "col-xs-6 col-sm-4 col-md-3 col-lg-3", multiple = True)
f_price = SSFormat(element_type = "p", search_type = "class_", search_clue = "price_color", extract = "text", nickname = "price")
f_url = SSFormat(element_type = "a", extract = "href", nickname = "link")
f_attr = [ f_price, f_url ]
info = SSInfo(f_site, f_page, f_item, f_attr)
scraper = StaticScraper(info)
scraper.scrape(start = 1, pages = 15)
```