Collection of Scraping tools
Project description
ezscrape
ezscrape provides a boilerplate for simple scraping tasks.
It provides generic access to scraping functionality without exposing the user directly to the underlying libraries used (e.g. requests, selenium). The scraper is chosen based on the specified config parameters, preferring the most flexible / least resource-intensive option where possible.
The exceptions of the underlying modules are handled and converted into the status of the result object.
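This exception-to-status conversion can be sketched in plain Python. The snippet below is an illustrative stand-in, not ezscrape's actual code: it defines its own small status enum (named after the documented `ScrapeStatus` values) and maps transport exceptions from `urllib` to statuses instead of letting them propagate.

```python
import enum
import urllib.error
import urllib.request


class ScrapeStatus(enum.Enum):
    """Illustrative mirror of the documented statuses; not ezscrape's class."""
    SUCCESS = 'Success'
    TIMEOUT = 'Timeout'
    ERROR = 'Error'


def scrape(url: str, timeout: float = 15) -> tuple:
    """Return (status, html_or_error) instead of raising library exceptions."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return (ScrapeStatus.SUCCESS, resp.read().decode())
    except TimeoutError:
        return (ScrapeStatus.TIMEOUT, 'request timed out')
    except urllib.error.URLError as exc:
        # Any other transport error becomes a generic ERROR status
        return (ScrapeStatus.ERROR, str(exc))
```

The caller then branches on the returned status rather than wrapping every scrape in its own try/except.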
Setup Requirements
Setup Chrome and Webdriver
For some websites selenium is used for scraping (e.g. requests doesn't support Javascript).
For that purpose the following environment variables need to be set, otherwise an exception is raised when the code uses selenium.
- CHROME_EXEC_PATH - path to the Chrome executable
- CHROME_WEBDRIVER_PATH - path to the Chrome Webdriver executable
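On a typical Linux setup the variables could be exported like this (the paths below are examples; adjust them to wherever Chrome and chromedriver are installed on your system):

```shell
# Example paths - adjust to your installation
export CHROME_EXEC_PATH=/usr/bin/google-chrome
export CHROME_WEBDRIVER_PATH=/usr/local/bin/chromedriver
```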
Usage
The basic concept of a simple scrape is:
- Create the ScrapeConfig with a URL
- Optional - set additional parameters on the scrape config
- Scrape with the given config
- Check the return object whether the scrape was successful
- Get the HTML from the return object
Scrape a simple HTML Page
import ezscrape.scraping.scraper as scraper
from ezscrape.scraping.core import ScrapeConfig
from ezscrape.scraping.core import ScrapeStatus
result = scraper.scrape_url(ScrapeConfig('http://www.website.com'))
if result.status == ScrapeStatus.SUCCESS:
    html = result.first_page.html
else:
    print(result.error_msg)
Scrape a Page with Multiple Pages
import ezscrape.scraping.scraper as scraper
from ezscrape.scraping.core import ScrapeConfig
from ezscrape.scraping.core import WaitForXpathElem
from ezscrape.scraping.core import ScrapeStatus
config = ScrapeConfig('http://www.website.com')
# Set the "next page" button element (here matched via "title='id'")
# that is clicked to load further pages
config.next_button = WaitForXpathElem(R'''//a[@title='id']''')
result = scraper.scrape_url(config)
for page in result:
    if page.status == ScrapeStatus.SUCCESS:
        html = page.html
    else:
        print(result.error_msg)
Scrape a Page and wait until an Element is Loaded
import ezscrape.scraping.scraper as scraper
from ezscrape.scraping.core import ScrapeConfig
from ezscrape.scraping.core import WaitForXpathElem
from ezscrape.scraping.core import ScrapeStatus
config = ScrapeConfig('http://www.website.com')
# Add condition to wait until the element with "title='id'" is loaded
config.wait_for_elem_list.append(WaitForXpathElem(R'''//a[@title='id']'''))
result = scraper.scrape_url(config)
if result.status == ScrapeStatus.SUCCESS:
    html = result.first_page.html
else:
    print(result.error_msg)
Scrape Config
ezscrape.scraping.core.ScrapeConfig
The URL is specified when creating the object.
from ezscrape.scraping.core import ScrapeConfig
config = ScrapeConfig('http://some-url.com')
Additional parameters can be specified:
Option | Purpose | Type | Default | Example Use Case |
---|---|---|---|---|
ScrapeConfig.url | The URL used for the request | str | N/A | Required for all requests |
ScrapeConfig.request_timeout | The timeout in seconds of the request | int | 15 | Wait longer before timeout in a slow network environment |
ScrapeConfig.page_load_wait | Time to wait until a page is loaded completely before it times out | float | 5.0 | Specify a longer time if the page loads dynamic elements slowly |
ScrapeConfig.proxy_http | HTTP proxy to use | str | N/A | Send the request through an HTTP proxy (the proxy needs to support the target protocol, i.e. HTTP/HTTPS) |
ScrapeConfig.proxy_https | HTTPS proxy to use | str | N/A | Send the request through an HTTPS proxy (the proxy needs to support the target protocol, i.e. HTTP/HTTPS) |
ScrapeConfig.useragent | Custom useragent to use | str | Internally chosen | User wants to scrape with a custom useragent |
ScrapeConfig.max_pages | Maximum pages to collect if "next_button" is specified | int | 15 | User only wants at most 3 pages even if more pages are available |
ScrapeConfig.next_button | A button element that needs to be loaded and clicked to load multiple pages | ezscrape.scraping.core.WaitForPageElem or one of its subtypes, e.g. ezscrape.scraping.core.WaitForXpathElem | N/A | User wants to return multiple pages whose next-page links are generated with Javascript |
ScrapeConfig.wait_for_elem_list | A list of elements that need to be loaded on the page before returning the scrape result | List of ezscrape.scraping.core.WaitForPageElem or one of its subtypes, e.g. ezscrape.scraping.core.WaitForXpathElem | N/A | User is interested in multiple elements of a Javascript/Ajax page and needs to wait for all of them to load completely |
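To make the defaults in the table concrete, here is a minimal stand-in for the configuration surface as a plain dataclass. This mirrors the documented options only for illustration; it is not ezscrape's actual ScrapeConfig class.

```python
import dataclasses
from typing import List, Optional


@dataclasses.dataclass
class ConfigSketch:
    """Illustrative mirror of the documented ScrapeConfig options."""
    url: str                                  # required, no default
    request_timeout: int = 15                 # seconds
    page_load_wait: float = 5.0               # seconds
    proxy_http: Optional[str] = None
    proxy_https: Optional[str] = None
    useragent: Optional[str] = None           # library chooses one if unset
    max_pages: int = 15                       # only relevant when next_button is set
    next_button: Optional[object] = None      # e.g. a WaitForXpathElem
    wait_for_elem_list: List[object] = dataclasses.field(default_factory=list)


# Only the URL is required; everything else falls back to the documented defaults
config = ConfigSketch('http://www.website.com', request_timeout=30,
                      proxy_http='http://myproxy:8080')
```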
Scrape Status
The following statuses are supported in ezscrape.scraping.core.ScrapeStatus
Status | Meaning |
---|---|
ScrapeStatus.SUCCESS | Scrape successful |
ScrapeStatus.TIMEOUT | A timeout error occurred |
ScrapeStatus.PROXY_ERROR | A proxy error occurred |
ScrapeStatus.ERROR | A generic error occurred |
For non-success cases, additional error details are given in the ScrapeResult object.
Scrape Result
The following attributes are available in ezscrape.scraping.core.ScrapeResult
Attribute | Purpose | Type |
---|---|---|
ScrapeResult.url | The URL scraped | str |
ScrapeResult.caller_ip | The caller IP. This is not set in all cases, but where it is it should be reliable (e.g. if scraped through a proxy, the proxy IP is shown) | str |
ScrapeResult.status | The overall status of the scrape | ezscrape.scraping.core.ScrapeStatus |
ScrapeResult.error_msg | The error message if the result is not SUCCESS | str |
ScrapeResult.request_time_ms | The combined scrape time of all pages scraped | float |
ScrapeResult.first_page | The ScrapePage scraped (the first one if there are multiple pages) | ezscrape.scraping.core.ScrapePage |
Scrape Page
The following attributes are available in ezscrape.scraping.core.ScrapePage
Attribute | Purpose | Type |
---|---|---|
html | The HTML content scraped | str |
request_time_ms | the scrape duration for this page | float |
status | The scrape status for this page. ScrapePage doesn't have its own error message; for details check ScrapeResult.error_msg | ezscrape.scraping.core.ScrapeStatus |
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
License
File details
Details for the file ezscrape-0.4.tar.gz.
File metadata
- Download URL: ezscrape-0.4.tar.gz
- Upload date:
- Size: 26.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.0
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | ee60cf1ee4f28315d729b7554e41c311ebf66a39a39d45610f1726374ebcdea4 |
MD5 | 66c136f503941758d163913be31136b8 |
BLAKE2b-256 | a1e259cbfb6e533db865cfa39f6ec3a8dadfa0e99ab07003f67ebce46313d9a3 |
File details
Details for the file ezscrape-0.4-py2.py3-none-any.whl.
File metadata
- Download URL: ezscrape-0.4-py2.py3-none-any.whl
- Upload date:
- Size: 26.1 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.0
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 2397c23dc08e72f0cf7e67b21c1156ae64ddd569b83127d6b5d0b43400c97ed2 |
MD5 | b583e87b086527fc275a4aa38cf64ca3 |
BLAKE2b-256 | cdbb5c21c91cda33b5cde96cdb7f7b2103fa27a7d9af68c9e32487b3b3303591 |