keyscraper Package Documentation
This library provides various functions that simplify webpage scraping.
The package contains three modules:
- utils - basic utilities
- staticscraper - used to scrape raw html data
- dynamicscraper - used to scrape html data rendered by JavaScript
To install this package, run `pip install keyscraper` from the command prompt.
[1] Basic Utilities
(1-A) TimeName - Generating a file name composed of the current time:
TimeName(mode = "default")
| argument | optional | default | available |
| --- | --- | --- | --- |
| mode | yes | "keywind" | "keywind", "datetime", "default" |
self.get_name(basename = "", extension = "", f_datetime = None)
| argument | optional | default | available |
| --- | --- | --- | --- |
| basename | yes | "" | [ string type ] |
| extension | yes | "" | [ string type ] |
| f_datetime | no | | [ string type ] |
There are two available modes: "keywind" and "datetime". By default, "keywind" is used.
In mode "keywind", the date is formatted as D-{month}{day}{year}, where {month} is a single letter (see the table below), {day} is a two-digit number ranging from 01 to 31, and {year} is a four-digit number such as 2000.
| Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i | f | m | a | M | j | J | A | s | o | n | d |
For example, on December 7th of 2000, D-d072000 will be the resulting date string.
In mode "keywind", the time is formatted as T-{hour}{minute}{second}, where {hour} is a two-digit number ranging from 00 to 23, and {minute} and {second} are two-digit numbers ranging from 00 to 59.
For example, at 05:43:07 PM, the resulting time string will be T-174307.
Combining both parts, at 01:23:45 AM on April 26th, 1986, the resulting string will be {basename}_D-a261986_T-012345{extension}.
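The "keywind" scheme described above can be reproduced with standard strftime calls. The sketch below is purely illustrative, not the library's implementation; `keywind_name` and `MONTH_CODES` are hypothetical names:

```python
from datetime import datetime

# Single-letter month codes used by the "keywind" scheme
# (Jan..Dec -> i f m a M j J A s o n d).
MONTH_CODES = "ifmaMjJAsond"

def keywind_name(basename, extension, when):
    """Build {basename}_D-{month}{day}{year}_T-{hour}{minute}{second}{extension}."""
    month = MONTH_CODES[when.month - 1]
    date_part = "D-" + month + when.strftime("%d%Y")
    time_part = "T-" + when.strftime("%H%M%S")
    return f"{basename}_{date_part}_{time_part}{extension}"

print(keywind_name("images", ".jpg", datetime(1986, 4, 26, 1, 23, 45)))
# images_D-a261986_T-012345.jpg
```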
In mode "datetime", the caller must pass a strftime format string; see the Python datetime documentation (https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) for the complete list of format codes.
(1-A-1) Example of using TimeName (mode: keywind).
```python
from keyscraper.utils import TimeName

mode = "keywind" # "keywind" or "datetime"
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension)
print(timename) # e.g. "images_D-d072000_T-012345.jpg"
```
(1-A-2) Example of using TimeName (mode: datetime).
```python
from keyscraper.utils import TimeName

mode = "datetime" # "keywind" or "datetime"
format_string = "%y%m%d-%H%M%S"
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension, format_string)
print(timename) # e.g. "images_001207-012345.jpg"
```
(1-B) FileName - Dividing a filename into folder, file and extension:
FileName(filename, mode = "default")
| argument | optional | default | available |
| --- | --- | --- | --- |
| filename | no | | [ string type ] |
| mode | yes | FileName.MODE_FORWARDSLASH | FileName.MODE_FORWARDSLASH, FileName.MODE_BACKWARDSLASH |
self.__getitem__(key = "all")
| argument | optional | default | available |
| --- | --- | --- | --- |
| key | yes | "all" | "all", "folder", "name", "extension" |
(1-B-1) Example of using FileName
```python
from keyscraper.utils import FileName

mode = FileName.MODE_FORWARDSLASH
filename = "C:/Users/VIN/Desktop/utils.py"
name_object = FileName(filename, mode)
full_name = name_object["all"]
file_name = name_object["name"]
folder_name = name_object["folder"]
extension = name_object["extension"]
print(full_name, file_name, folder_name, extension)
# "C:/Users/VIN/Desktop/utils.py utils C:/Users/VIN/Desktop/ .py"
```
(1-C) FileRetrieve - Downloading a file from a direct URL:
FileRetrieve(directlink, filename = None, buffer = 4096, progress_bar = False, overwrite = None)
| argument | optional | default | available |
| --- | --- | --- | --- |
| directlink | no | | [ string type ] |
| filename | yes | | [ string type ] |
| buffer | yes | 4096 | [ integer (>0) type ] |
| progress_bar | yes | False | True, False |
| overwrite | yes | None | None, True, False |
If overwrite is None, the user will be prompted to enter (Y/N) before each download.
self.simple_retrieve()
Calling this function will download the file from the target URL and save it to disk with the provided filename.
(1-C-1) Example of using FileRetrieve
```python
from keyscraper.utils import FileRetrieve

url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progress_bar = True
overwrite = True
downloader = FileRetrieve(url, filename = filename, progress_bar = progress_bar, overwrite = overwrite)
downloader.simple_retrieve()
```
(1-D) ImageGrabber - Downloading an image from a direct URL:
ImageGrabber(filename, progressBar = False, url_timeout = None)
| argument | optional | default | available |
| --- | --- | --- | --- |
| filename | no | | [ string type ] |
| progressBar | yes | False | True, False |
| url_timeout | yes | 600 | [ integer (>0) type ] |
The URL request will be open for a maximum of url_timeout seconds.
self.retrieve(directlink, overwrite = None, timeout = None)
| argument | optional | default | available |
| --- | --- | --- | --- |
| directlink | no | | [ string type ] |
| overwrite | yes | None | None, True, False |
| timeout | yes | None | None, [ integer (>0) type ] |
If the image hasn't finished downloading within timeout seconds, the download is terminated.
If overwrite is None, the user will be prompted to enter (Y/N) before each download.
(1-D-1) Example of using ImageGrabber
```python
from keyscraper.utils import ImageGrabber

url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progressBar = True
url_timeout = 60
downloader = ImageGrabber(filename, progressBar = progressBar, url_timeout = url_timeout)
downloader.retrieve(url, overwrite = True, timeout = 15)
```
[2] Static Scraper
(2-A) SSFormat - Defining the node attributes to scrape:
SSFormat(element_type, **kwargs)
| argument | optional | default | available |
| --- | --- | --- | --- |
| element_type | no | | [ string type ] |
| search_type | yes | None | None, [ string type ] |
| search_clue | yes | None | None, [ string type ] |
| multiple | yes | False | True, False |
| extract | yes | None | None, [ function (1-arg) type ] |
| format | yes | None | None, [ function (1-arg) type ] |
| nickname | yes | None | None, [ string type ] |
| filter | yes | None | None, [ function (1-arg) type ] |
| keep | yes | True | True, False |
self.__getitem__(key)
| argument | optional | default | available |
| --- | --- | --- | --- |
| key | no | | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep" |
self.get_value(key)
| argument | optional | default | available |
| --- | --- | --- | --- |
| key | no | | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep" |
(2-B) SSInfo - Defining information needed for scraping:
SSInfo(f_site, f_page, f_item, f_attr)
| argument | optional | default | available |
| --- | --- | --- | --- |
| f_site | no | | [ string type ] |
| f_page | no | | [ string type ] |
| f_item | no | | [ SSFormat type ] |
| f_attr | no | | [ list-SSFormat type ] |
self.__getitem__(key)
| argument | optional | default | available |
| --- | --- | --- | --- |
| key | no | | "f_site", "f_page", "f_item", "f_attr" |
self.format_page(page)
| argument | optional | default | available |
| --- | --- | --- | --- |
| page | no | | [ integer/string type ] |
If f_page is not an empty string, page is substituted into the curly braces in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function will return "page-5.html". On the contrary, if f_page = "", the function will return "".
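The substitution rule above can be sketched in a few lines (an illustrative stand-in for the format_page behavior, not the library's code):

```python
def format_page(f_page, page):
    # An empty template yields an empty string; otherwise the page
    # value is substituted into the "{}" placeholder.
    return f_page.format(page) if f_page else ""

print(format_page("page-{}.html", 5)) # page-5.html
print(repr(format_page("", 5)))       # ''
```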
(2-C) StaticScraper - Scraping a static webpage:
StaticScraper(info, filename = None, mode = "default", timesleep = 0, **kwargs)
| argument | optional | default | available |
| --- | --- | --- | --- |
| info | no | | [ SSInfo type ] |
| filename | yes | None | None, [ string type ] |
| mode | yes | StaticScraper.MODE_FILE | StaticScraper.MODE_FILE, StaticScraper.MODE_READ |
| timesleep | yes | 0 | [ integer/float (>=0) type ] |
| buffer | yes | 100 | [ integer (>0) type ] |
self.scrape(start = 1, pages = 1)
| argument | optional | default | available |
| --- | --- | --- | --- |
| start | yes | 1 | [ integer (>0) type ] |
| pages | yes | 1 | [ integer (>0) type ] |
(2-C-1) Example of using StaticScraper
```python
from keyscraper.staticscraper import SSFormat, SSInfo, StaticScraper

f_site = "http://books.toscrape.com/catalogue/"
f_page = "page-{}.html"
f_item = SSFormat(element_type = "li", search_type = "class_", search_clue = "col-xs-6 col-sm-4 col-md-3 col-lg-3", multiple = True)
f_price = SSFormat(element_type = "p", search_type = "class_", search_clue = "price_color", extract = "text", nickname = "price")
f_url = SSFormat(element_type = "a", extract = "href", nickname = "link")
f_attr = [ f_price, f_url ]
info = SSInfo(f_site, f_page, f_item, f_attr)
scraper = StaticScraper(info)
scraper.scrape(start = 1, pages = 15)
```