Collection of Scraping tools

Development Status
- 3 - Alpha
Intended Audience
- Developers
Operating System
- OS Independent
Programming Language
- Python :: 3.7
- Python :: Implementation :: CPython
Topic
- Software Development :: Libraries :: Python Modules

Project description


ezscrape

ezscrape provides a boilerplate for simple scraping tasks.

It provides generic access to scraping functionality without directly exposing the user to the underlying libraries (e.g. requests, selenium). The scraper is chosen based on the specified config parameters, preferring the most flexible / least resource-intensive one where possible.

The exceptions of the underlying modules are handled and converted into the status of the result object.
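
As a rough illustration of this idea (not ezscrape's actual implementation), the sketch below wraps a fetch callable and converts any exception into a status plus an error message. The names `Status`, `Outcome` and `run_safely` are hypothetical and only mirror the statuses described later in this document.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class Status(Enum):
    """Hypothetical stand-in for ezscrape's ScrapeStatus."""
    SUCCESS = auto()
    TIMEOUT = auto()
    ERROR = auto()


@dataclass
class Outcome:
    """Hypothetical result object: a status plus optional HTML or message."""
    status: Status
    html: Optional[str] = None
    error_msg: Optional[str] = None


def run_safely(fetch: Callable[[], str]) -> Outcome:
    """Call fetch() and convert any exception into a status, never raising."""
    try:
        return Outcome(Status.SUCCESS, html=fetch())
    except TimeoutError as exc:
        return Outcome(Status.TIMEOUT, error_msg=str(exc))
    except Exception as exc:  # any other failure becomes a generic error
        return Outcome(Status.ERROR, error_msg=str(exc))


def slow_fetch() -> str:
    """Simulate an underlying library that times out."""
    raise TimeoutError("request timed out")


outcome = run_safely(slow_fetch)
print(outcome.status.name, outcome.error_msg)
```

The caller only ever inspects `status` and `error_msg`, which is the same pattern the usage examples in this document follow.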

Setup Requirements

Setup Chrome and Webdriver

For some websites selenium is used for scraping (e.g. because requests doesn't support JavaScript).

For that purpose the following environment variables need to be set; otherwise an exception is raised when the code uses selenium.

- CHROME_EXEC_PATH: the Chrome executable
- CHROME_WEBDRIVER_PATH: the Chrome WebDriver executable
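
To catch a missing variable before a scrape fails, a small pre-flight check can help. This is a minimal sketch using only the standard library; the helper name `missing_chrome_vars` is hypothetical and not part of ezscrape.

```python
import os

# The two variable names come from the setup section above.
REQUIRED_VARS = ("CHROME_EXEC_PATH", "CHROME_WEBDRIVER_PATH")


def missing_chrome_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_chrome_vars()
if missing:
    print("Not set:", ", ".join(missing))
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in tests.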

Usage

The basic concept of a simple scrape is:
- Create the scrape config with a URL
- Optionally, set additional parameters on the scrape config
- Scrape with the given config
- Check the return object to see whether the scrape was successful
- Get the HTML from the return object

Scrape a simple HTML Page


import ezscrape.scraping.scraper as scraper
from ezscrape.scraping.core import ScrapeConfig
from ezscrape.scraping.core import ScrapeStatus

result = scraper.scrape_url(ScrapeConfig('http://www.website.com'))

if result.status == ScrapeStatus.SUCCESS:
    html = result.first_page.html
else:
    print(result.error_msg)

Scrape a Page with Multiple Pages


import ezscrape.scraping.scraper as scraper
from ezscrape.scraping.core import ScrapeConfig
from ezscrape.scraping.core import WaitForXpathElem
from ezscrape.scraping.core import ScrapeStatus

config = ScrapeConfig('http://www.website.com')
# Set the "next" button element that is clicked to load each following page
config.next_button = WaitForXpathElem(R'''//a[@title='id']''')

result = scraper.scrape_url(config)

for page in result:
    if page.status == ScrapeStatus.SUCCESS:
        html = page.html
    else:
        print(result.error_msg)

Scrape a Page and wait until an Element is Loaded


import ezscrape.scraping.scraper as scraper
from ezscrape.scraping.core import ScrapeConfig
from ezscrape.scraping.core import WaitForXpathElem
from ezscrape.scraping.core import ScrapeStatus


config = ScrapeConfig('http://www.website.com')
# Add a condition to wait until the element with title='id' is loaded
config.wait_for_elem_list.append(WaitForXpathElem(R'''//a[@title='id']'''))

result = scraper.scrape_url(config)

if result.status == ScrapeStatus.SUCCESS:
    html = result.first_page.html
else:
    print(result.error_msg)

Scrape Config

ezscrape.scraping.core.ScrapeConfig

The url is specified when creating the object.

from ezscrape.scraping.core import ScrapeConfig

config = ScrapeConfig('http://some-url.com')

Additional parameters can be specified

Option	Purpose	Type	Default	Example Use Case
ScrapeConfig.url	The URL used for the request	str	N/A	Required for all Requests
ScrapeConfig.request_timeout	The timeout of the request in seconds	long	15	Wait longer before timing out in a slow network environment
ScrapeConfig.page_load_wait	Time to wait for a page to load completely before timing out	int	5.0	Specify a longer time if the page loads dynamic elements slowly
ScrapeConfig.proxy_http	HTTP proxy to use	str	N/A	Send the request through an HTTP proxy (the proxy needs to support the target protocol, i.e. HTTP/HTTPS)
ScrapeConfig.proxy_https	HTTPS proxy to use	str	N/A	Send the request through an HTTPS proxy (the proxy needs to support the target protocol, i.e. HTTP/HTTPS)
ScrapeConfig.useragent	Custom useragent to use	str	Internally chosen	User wants to scrape with a custom useragent
ScrapeConfig.max_pages	Maximum number of pages to collect if "next_button" is specified	int	15	User only wants to return 3 pages max even if more pages are available
ScrapeConfig.next_button	A button element that needs to be loaded and clicked to reach multiple pages	ezscrape.scraping.core.WaitForPageElem or one of the subtypes e.g. ezscrape.scraping.core.WaitForXpathElem	N/A	User wants to return multiple pages if the next page links are generated with JavaScript
ScrapeConfig.wait_for_elem_list	A list of elements that need to be loaded on the page before returning the scrape result	List of ezscrape.scraping.core.WaitForPageElem or one of the subtypes e.g. ezscrape.scraping.core.WaitForXpathElem	N/A	User is interested in multiple elements of a JavaScript/Ajax page and needs to wait for all of them to load completely
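
When several options are set at once, it can be handy to reject misspelled names early. The sketch below assumes the options in the table are plain attributes that can be assigned on the config object; `apply_options` is a hypothetical helper, and `SimpleNamespace` stands in for a real `ScrapeConfig` so the example runs without ezscrape installed.

```python
from types import SimpleNamespace

# Option names taken from the table above (url is set at construction time).
KNOWN_OPTIONS = {
    "request_timeout", "page_load_wait", "proxy_http", "proxy_https",
    "useragent", "max_pages", "next_button", "wait_for_elem_list",
}


def apply_options(config, **options):
    """Set each option on config, rejecting names not listed in the table."""
    unknown = sorted(set(options) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError(f"Unknown option(s): {', '.join(unknown)}")
    for name, value in options.items():
        setattr(config, name, value)
    return config


# SimpleNamespace is only a stand-in for ScrapeConfig in this sketch.
config = apply_options(
    SimpleNamespace(url="http://some-url.com"),
    request_timeout=30,
    max_pages=3,
)
print(config.request_timeout, config.max_pages)
```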

Scrape Status

The following statuses are supported in ezscrape.scraping.core.ScrapeStatus

Status	Meaning
ScrapeStatus.SUCCESS	Scrape successful
ScrapeStatus.TIMEOUT	A timeout error occurred
ScrapeStatus.PROXY_ERROR	A proxy error occurred
ScrapeStatus.ERROR	A generic error occurred

For non-success cases, additional error details are given in the ScrapeResult object.
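
Because failures are reported as statuses rather than exceptions, a caller can decide per status how to react. The following sketch retries only on a timeout. `Status` is a hypothetical stand-in for `ScrapeStatus`, and `scrape_with_retry` and `fake_scrape` are illustrative names, not part of ezscrape.

```python
from enum import Enum, auto
from types import SimpleNamespace


class Status(Enum):
    """Hypothetical stand-in mirroring the statuses in the table above."""
    SUCCESS = auto()
    TIMEOUT = auto()
    PROXY_ERROR = auto()
    ERROR = auto()


def scrape_with_retry(scrape, config, max_attempts=3):
    """Call scrape(config) again only while the result status is TIMEOUT."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result = None
    for _ in range(max_attempts):
        result = scrape(config)
        if result.status is not Status.TIMEOUT:
            break  # success or a non-retryable error: stop here
    return result


attempts = []


def fake_scrape(config):
    """Simulated scraper that times out once and then succeeds."""
    attempts.append(config)
    status = Status.TIMEOUT if len(attempts) < 2 else Status.SUCCESS
    return SimpleNamespace(status=status, error_msg=None)


result = scrape_with_retry(fake_scrape, "http://www.website.com")
print(result.status.name, len(attempts))
```

With the real library, the comparison would presumably be made against `ScrapeStatus.TIMEOUT` and the callable would be `scraper.scrape_url`.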

Scrape Result

The following attributes are available in ezscrape.scraping.core.ScrapeResult

Attribute	Purpose	Type
ScrapeResult.url	The url Scraped	str
ScrapeResult.caller_ip	The caller IP. This is not set in all cases, but where it is set it should be reliable (e.g. if scraped through a proxy, the proxy IP should be shown)	str
ScrapeResult.status	The overall status of the scrape	ezscrape.scraping.core.ScrapeStatus
ScrapeResult.error_msg	The error message if the result is not SUCCESS	str
ScrapeResult.request_time_ms	The combined scrape time of all pages scraped	float
ScrapeResult.first_page	The ScrapePage scraped (the first one if there are multiple pages)	ezscrape.scraping.core.ScrapePage

Scrape Page

The following attributes are available in ezscrape.scraping.core.ScrapePage

Attribute	Purpose	Type
html	The HTML content scraped	str
request_time_ms	The scrape duration for this page	float
status	The scrape status for this page. ScrapePage doesn't have its own error message; for details check ScrapeResult.error_msg	ezscrape.scraping.core.ScrapeStatus
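
A short sketch of how these page attributes might be consumed: it keeps the HTML of successful pages and sums the per-page durations. The `SimpleNamespace` objects and string statuses are stand-ins for real `ScrapePage` objects and `ScrapeStatus` members, and `successful_html` is a hypothetical helper.

```python
from types import SimpleNamespace


def successful_html(pages, success):
    """Return the html of every page whose status equals success."""
    return [page.html for page in pages if page.status == success]


# Stand-in pages; real ones would come from iterating a scrape result.
pages = [
    SimpleNamespace(html="<p>1</p>", status="SUCCESS", request_time_ms=120.0),
    SimpleNamespace(html=None, status="TIMEOUT", request_time_ms=15000.0),
]

html_list = successful_html(pages, "SUCCESS")
total_ms = sum(page.request_time_ms for page in pages)
print(len(html_list), total_ms)
```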

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

GPLv3

Release history

- 0.4 (this version): Apr 13, 2020
- 0.3: Jul 26, 2019
- 0.2: Jul 24, 2019

Download files

Source Distribution

ezscrape-0.4.tar.gz (26.2 kB)

Uploaded Apr 13, 2020 Source

Built Distribution

ezscrape-0.4-py2.py3-none-any.whl (26.1 kB)

Uploaded Apr 13, 2020 Python 2, Python 3

File details

Details for the file ezscrape-0.4.tar.gz.

File metadata

Download URL: ezscrape-0.4.tar.gz
Upload date: Apr 13, 2020
Size: 26.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.0

File hashes

Hashes for ezscrape-0.4.tar.gz
Algorithm	Hash digest
SHA256	`ee60cf1ee4f28315d729b7554e41c311ebf66a39a39d45610f1726374ebcdea4`
MD5	`66c136f503941758d163913be31136b8`
BLAKE2b-256	`a1e259cbfb6e533db865cfa39f6ec3a8dadfa0e99ab07003f67ebce46313d9a3`

File details

Details for the file ezscrape-0.4-py2.py3-none-any.whl.

File metadata

Download URL: ezscrape-0.4-py2.py3-none-any.whl
Upload date: Apr 13, 2020
Size: 26.1 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.0

File hashes

Hashes for ezscrape-0.4-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`2397c23dc08e72f0cf7e67b21c1156ae64ddd569b83127d6b5d0b43400c97ed2`
MD5	`b583e87b086527fc275a4aa38cf64ca3`
BLAKE2b-256	`cdbb5c21c91cda33b5cde96cdb7f7b2103fa27a7d9af68c9e32487b3b3303591`
