
A library for web scraping.


keyscraper Package Documentation


This library provides various functions that simplify webpage scraping.
There are three modules in this package.
  1. utils - basic utilities
  2. staticscraper - used to scrape raw html data
  3. dynamicscraper - used to scrape html data rendered by JavaScript
To install this package, type the following in a command prompt:
pip install keyscraper

[1] Basic Utilities

(1-A) TimeName - Generating a file name containing the current time:

TimeName(mode = "default")
argument | optional | default | available
mode | yes | TimeName.MODE_KEYWIND | TimeName.MODE_KEYWIND, TimeName.MODE_DATETIME, "default"

self.get_name(basename = "", extension = "", f_datetime = None)
argument | optional | default | available
basename | yes | "" | [ string type ]
extension | yes | "" | [ string type ]
f_datetime | yes | None | [ string type ] (required in mode "datetime")

There are two available modes: "keywind" and "datetime". By default, "keywind" is used.

In mode "keywind", the date is formatted as D-{month}{day}{year} where {month} consists of a single character, {day} is a 2-digit number ranging from 01 to 31 and {year} is a 4-digit number such as 2000.

Month:  Jan. Feb. Mar. Apr. May  Jun. Jul. Aug. Sep. Oct. Nov. Dec.
Letter: i    f    m    a    M    j    J    A    s    o    n    d

For example, on December 7th of 2000, D-d072000 will be the resulting date string.

In mode "keywind", the time is formatted as T-{hour}{minute}{second} where {hour} consists of a 2-digit number ranging from 00 to 23, both {minute} and {second} are a 2-digit number ranging from 00 to 59.

For example, at 05:43:07 PM, the resulting time string will be T-174307.

For example, at 01:23:45 AM on April 26th, 1986, the resulting string will be {basename}_D-a261986_T-012345{extension}.

In mode "datetime", the programmer must pass a strftime string. The complete documentation to datetime formatting is linked here.

(1-A-1) Example of using TimeName (mode: keywind).
from keyscraper.utils import TimeName
mode = TimeName.MODE_KEYWIND # or TimeName.MODE_DATETIME
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension)
print(timename) # "images_D-d072000_T-012345.jpg"
(1-A-2) Example of using TimeName (mode: datetime).
from keyscraper.utils import TimeName
mode = TimeName.MODE_DATETIME # or TimeName.MODE_KEYWIND
format_string = "_%y%m%d-%H%M%S"
name = "images"
extension = ".jpg"
timename = TimeName(mode).get_name(name, extension, format_string)
print(timename) # "images_001207-012345.jpg"

(1-B) FileName - Dividing a filename into folder, file and extension:

FileName(filename, mode = "default")
argument | optional | default | available
filename | no | - | [ string type ]
mode | yes | FileName.MODE_FORWARDSLASH | FileName.MODE_FORWARDSLASH, FileName.MODE_BACKWARDSLASH

self.__getitem__(key = "all")
argument | optional | default | available
key | yes | "all" | "all", "folder", "name", "extension"
(1-B-1) Example of using FileName
from keyscraper.utils import FileName
mode = FileName.MODE_FORWARDSLASH
filename = "C:/Users/VIN/Desktop/utils.py"
name_object = FileName(filename)
full_name = name_object["all"]
file_name = name_object["name"]
folder_name = name_object["folder"]
extension = name_object["extension"]
print(full_name) # "C:/Users/VIN/Desktop/utils.py"
print(folder_name) # "C:/Users/VIN/Desktop/"
print(file_name) # "utils"
print(extension) # ".py"
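
(1-B-2) Example of using FileName (mode: backward slash). A minimal sketch for Windows-style paths, assuming MODE_BACKWARDSLASH splits on "\" the same way the default mode splits on "/":
from keyscraper.utils import FileName
mode = FileName.MODE_BACKWARDSLASH
filename = "C:\\Users\\VIN\\Desktop\\utils.py" # Windows-style path with backslashes
name_object = FileName(filename, mode)
print(name_object["folder"]) # expected: "C:\Users\VIN\Desktop\"
print(name_object["name"]) # expected: "utils"
print(name_object["extension"]) # expected: ".py"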

(1-C) FileRetrieve - Downloading a file from a direct URL:

FileRetrieve(directlink, filename = None, buffer = 4096, progress_bar = False, overwrite = None)
argument | optional | default | available
directlink | no | - | [ string type ]
filename | yes | None | None, [ string type ]
buffer | yes | 4096 | [ integer (>0) type ]
progress_bar | yes | False | True, False
overwrite | yes | None | None, True, False

If overwrite is None, the programmer will be asked to enter (Y/N) on each download.

self.simple_retrieve()

Calling this function will download the file from the target URL and save it to disk with the provided filename.

(1-C-1) Example of using FileRetrieve
from keyscraper.utils import FileRetrieve
url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progress_bar = True
overwrite = True
downloader = FileRetrieve(url, filename = filename, progress_bar = progress_bar, overwrite = overwrite)
downloader.simple_retrieve()

(1-D) ImageGrabber - Downloading an image from a direct URL:

ImageGrabber(filename, progressBar = False, url_timeout = None)
argument | optional | default | available
filename | no | - | [ string type ]
progressBar | yes | False | True, False
url_timeout | yes | 600 | [ integer (>0) type ]

The URL request will remain open for at most url_timeout seconds.

self.retrieve(directlink, overwrite = None, timeout = None)
argument | optional | default | available
directlink | no | - | [ string type ]
overwrite | yes | None | None, True, False
timeout | yes | None | None, [ integer (>0) type ]

If the image hasn't finished downloading within timeout seconds, the download will be terminated.

If overwrite is None, the programmer will be asked to enter (Y/N) on each download.

(1-D-1) Example of using ImageGrabber
from keyscraper.utils import ImageGrabber
url = "http://www.lenna.org/len_top.jpg"
filename = "lenna.jpg"
progressBar = True
url_timeout = 60
downloader = ImageGrabber(filename, progressBar = progressBar, url_timeout = url_timeout)
downloader.retrieve(url, overwrite = True, timeout = 15)

[2] Static Scraper

(2-A) SSFormat - Defining the node attributes to scrape:

SSFormat(element_type, **kwargs)
argument | optional | default | available
element_type | no | - | [ string type ]
search_type | yes | None | None, [ string type ]
search_clue | yes | None | None, [ string type ]
multiple | yes | False | True, False
extract | yes | None | None, [ string type ]
format | yes | None | None, [ function (1-arg) type ]
nickname | yes | None | None, [ string type ]
filter | yes | None | None, [ function (1-arg) type ]
keep | yes | True | True, False

self.__getitem__(key)
argument | optional | default | available
key | no | - | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep"

self.get_value(key)
argument | optional | default | available
key | no | - | "element_type", "search_type", "search_clue", "multiple", "extract", "format", "nickname", "filter", "keep"

(2-B) SSInfo - Defining information needed for scraping:

SSInfo(f_site, f_page, f_item, f_attr)
argument | optional | default | available
f_site | no | - | [ string type ]
f_page | no | - | [ string type ]
f_item | no | - | [ SSFormat type ]
f_attr | no | - | [ list-SSFormat type ]

self.__getitem__(key)
argument | optional | default | available
key | no | - | "f_site", "f_page", "f_item", "f_attr"

self.format_page(page)
argument | optional | default | available
page | no | - | [ integer/string type ]

If f_page is not an empty string, page is substituted into the curly braces in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function will return "page-5.html". On the other hand, if f_page = "", the function will return "".
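
A minimal sketch of format_page, using the site and page format from example (2-C-1) below (the empty f_attr list is only for illustration):
from keyscraper.staticscraper import SSFormat, SSInfo
f_item = SSFormat(element_type = "li", multiple = True)
info = SSInfo("http://books.toscrape.com/", "catalogue/page-{}.html", f_item, [])
print(info.format_page(5)) # "catalogue/page-5.html"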

(2-C) StaticScraper - Scraping a static webpage:

StaticScraper(info, filename = None, mode = "default", timesleep = 0, **kwargs)
argument | optional | default | available
info | no | - | [ SSInfo type ]
filename | yes | None | None, [ string type ]
mode | yes | StaticScraper.MODE_FILE | StaticScraper.MODE_FILE, StaticScraper.MODE_READ
timesleep | yes | 0 | [ integer/float (>=0) type ]
buffer | yes | 100 | [ integer (>0) type ]

self.scrape(start = 1, pages = 1)
argument | optional | default | available
start | yes | 1 | [ integer (>0) type ]
pages | yes | 1 | [ integer (>0) type ]
(2-C-1) Example of using StaticScraper
from keyscraper.staticscraper import SSFormat, SSInfo, StaticScraper
f_site = "http://books.toscrape.com/"
f_page = "catalogue/page-{}.html"
f_item = SSFormat(element_type = "li", search_type = "class_", search_clue = "col-xs-6 col-sm-4 col-md-3 col-lg-3", multiple = True)
price = SSFormat(element_type = "p", search_type = "class_", search_clue = "price_color", extract = "text", nickname = "price")
url = SSFormat(element_type = "a", extract = "href", nickname = "link")
f_attr = [ price, url ]
info = SSInfo(f_site, f_page, f_item, f_attr)
scraper = StaticScraper(info)
scraper.scrape(start = 1, pages = 15)

[3] Dynamic Scraper

(3-A) DSFormat - Defining the node attributes to scrape:

DSFormat(xpath, **kwargs)
argument | optional | default | available
xpath | no | - | [ string type ]
relative | yes | False | True, False
multiple | yes | False | True, False
extract | yes | None | None, [ string type ]
format | yes | None | None, [ function (1-arg) type ]
filter | yes | None | None, [ function (1-arg) type ]
retry | yes | None | None, [ function (1-arg) type ]
callback | yes | None | None, [ function (1-arg) type ]
nickname | yes | None | None, [ string type ]
keep | yes | True | True, False
click | yes | False | True, False

In the dynamic scraper, the path to each item/attribute must be provided as an XPath.

If the xpath of an attribute is relative to the item (parent), relative must be set to True.

To scrape multiple items, multiple must be set to True.

If we want to extract the href attribute from the a tag, we should set extract to "href".

If we want to format a particular attribute before saving it to file, we should define a function and pass it to the argument format. The following is an example:

from keyscraper.dynamicscraper import DSFormat
def strip_spaces(attribute):
    return attribute.strip(" ")
DSFormat(format = strip_spaces)

If we want to filter out items whose attributes don't satisfy a certain condition, we should define a function and pass it to the argument filter. The following is an example:

from keyscraper.dynamicscraper import DSFormat
def filter_prices(price):
    price = float(price)
    return (price <= 50) # True to keep the item
DSFormat(filter = filter_prices)

In cases where we must wait for a specific item to render, we should define a function and pass it to the argument retry. While this function returns True, the scraper keeps waiting and re-reads the attribute; once it returns False, the attribute is saved. The following is an example:

from keyscraper.dynamicscraper import DSFormat
def retry(attribute):
    return (attribute[:4] == "data") # keep retrying while this returns True
DSFormat(retry = retry)

In MODE_READ, we may want to add the scraped data to a list; therefore, we should define a function and pass it to the argument callback. The following is an example:

from keyscraper.dynamicscraper import DSFormat
scraped = []
def callback(attribute):
    global scraped
    scraped.append(attribute)
    return attribute
DSFormat(callback = callback)

In the csv file, we can assign a custom column name to each attribute by passing a string to the argument nickname.

In cases where some attributes aren't needed further on, we can set keep to False so the column is dropped when saving to the csv file.

If the item/attribute must be clicked before the desired data is available, click should be set to True. A combined sketch follows the table below.

self.__getitem__(key)
argument | optional | default | available
key | no | - | "xpath", "relative", "multiple", "extract", "format", "filter", "retry", "callback", "nickname", "keep", "click"

(3-B) DSInfo - Defining information needed for scraping:

DSInfo(f_site, f_page, f_item, f_attr)
argument | optional | default | available
f_site | no | - | [ string type ]
f_page | no | - | [ string type ]
f_item | no | - | [ DSFormat type ]
f_attr | no | - | [ list-DSFormat type ]

self.__getitem__(key)
argument | optional | default | available
key | no | - | "f_site", "f_page", "f_item", "f_attr"

self.format_page(page)
argument | optional | default | available
page | no | - | [ integer/string type ]

If f_page is not an empty string, page is substituted into the curly braces in f_page. For instance, if f_page = "page-{}.html" and page = 5, this function will return "page-5.html". On the other hand, if f_page = "", the function will return "".

(3-C) DriverOptions - Defining driver:

DriverOptions(mode = "default", path = None, window = True)
argument | optional | default | available
mode | yes | DriverOptions.MODE_CHROME | "default", DriverOptions.MODE_CHROME, DriverOptions.MODE_FIREFOX
path | yes | None | None, [ string type ]
window | yes | True | True, False

In order to use the dynamic scraper, a (browser) driver must be provided. As of February 6th, 2022, Google Chrome and Mozilla Firefox are supported.

The (file) path to the driver (executable) can be provided through path. By default, the program searches for the driver in the folder it is run from and in the folders listed in the PATH environment variable.

To download a driver for Google Chrome, visit https://chromedriver.chromium.org/downloads.

To download a driver for Mozilla Firefox, visit https://github.com/mozilla/geckodriver/releases.

To hide the browser, set window to False.
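
As an illustration, a minimal sketch of a hidden-window Firefox setup (the driver path is hypothetical and must point to a real geckodriver executable):
from keyscraper.dynamicscraper import DriverOptions
driveroptions = DriverOptions(
    mode = DriverOptions.MODE_FIREFOX,
    path = "./geckodriver.exe", # hypothetical path to the driver executable
    window = False # hide the browser window
)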

(3-D) DynamicScraper - Scraping:

DynamicScraper(info, driveroptions, mode = "default", filename = None, timesleep = 0, buttonPath = None, itemWait = 1, **kwargs)
argument | optional | default | available
info | no | - | [ DSInfo type ]
driveroptions | no | - | [ DriverOptions type ]
mode | yes | DynamicScraper.MODE_READ | "default", DynamicScraper.MODE_FILE, DynamicScraper.MODE_READ
filename | yes | None | None, [ string type ]
timesleep | yes | 0 | [ integer (>=0) type ]
buttonPath | yes | None | None, [ string type ]
itemWait | yes | 1 | [ integer/float (>=0) type ]
buffer | yes | 100 | [ integer (>0) type ]

There are two modes available for the dynamic scraper: MODE_FILE saves the scraped results to a csv file, while MODE_READ simply scrapes the webpage so the data can be accessed in callback.

In MODE_FILE, a filename should be provided; if none is given, a time-based name will be generated for the csv file.

To slow down scraping, an integer can be passed to the timesleep argument; the scraping of two consecutive pages will then be separated by at least timesleep seconds.

In cases where a load-more button exists on a single page, the x-path to that button can be provided to the argument buttonPath.

If each item must be clicked to render its content, a number can be passed to the argument itemWait. Two consecutive item clicks will be separated by at least itemWait seconds.

In MODE_FILE, if we want to save the scraped results once every 10 items, we should set buffer to 10.

self.scrape(start = 1, pages = 1, perPage = None)
argument | optional | default | available
start | yes | 1 | [ integer (>0) type ]
pages | yes | 1 | [ integer (>0) type ]
perPage | yes | None | None, [ integer (>0) type ]

The dynamic scraper will scrape pages pages, starting from page start.

In cases where there are too many items on each page, we can set perPage to 50 to scrape just 50 items per page.

(3-D-1) Example of using DynamicScraper
from keyscraper.dynamicscraper import DSFormat, DSInfo, DriverOptions, DynamicScraper
f_site = "https://www.ebay.com/sch/"
f_page = "i.html?_nkw=cpu&_pgn={}"
f_item = DSFormat(xpath = "(//li[contains(@class, 's-item s-item__pl-on-bottom s-item--watch-at-corner')])", multiple = True)
price = DSFormat(xpath = "//span[contains(@class, 's-item__price')]", relative = True, extract = "innerHTML", nickname = "price")
url = DSFormat(xpath = "//a[contains(@class, 's-item__link')]", relative = True, extract = "href", nickname = "url")
f_attr = [ price, url ]
driveroptions = DriverOptions(path = "./chromedriver.exe")
info = DSInfo(f_site, f_page, f_item, f_attr)
scraper = DynamicScraper(info, driveroptions, mode = DynamicScraper.MODE_FILE)
scraper.scrape(start = 1, pages = 2, perPage = 5)
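
(3-D-2) Example of using DynamicScraper (mode: read). A sketch of the same scrape that collects prices into a list through callback instead of writing a csv file; the timesleep and perPage values are illustrative:
from keyscraper.dynamicscraper import DSFormat, DSInfo, DriverOptions, DynamicScraper
prices = []
def collect(attribute):
    prices.append(attribute) # store each scraped price
    return attribute
f_site = "https://www.ebay.com/sch/"
f_page = "i.html?_nkw=cpu&_pgn={}"
f_item = DSFormat(xpath = "(//li[contains(@class, 's-item s-item__pl-on-bottom s-item--watch-at-corner')])", multiple = True)
price = DSFormat(xpath = "//span[contains(@class, 's-item__price')]", relative = True, extract = "innerHTML", nickname = "price", callback = collect)
f_attr = [ price ]
driveroptions = DriverOptions(path = "./chromedriver.exe")
info = DSInfo(f_site, f_page, f_item, f_attr)
scraper = DynamicScraper(info, driveroptions, mode = DynamicScraper.MODE_READ, timesleep = 2)
scraper.scrape(start = 1, pages = 1, perPage = 5)
print(prices) # the price strings collected during the scrape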
