unisos.wsf: Somewhat General purpose Web Scraping Framework (WSF)
Overview
Support
Documentation
There is no silver bullet for universally providing web scraping capabilities.
Each web site is different and various technologies, encodings and paradigms are used.
What can be done is to create a framework that addresses some of the common aspects of this problem domain.
This is what Web Scraping Framework (WSF) tries to do.
Common aspects of Web scraping include:
- Configuration:
WSF uses Python function invocation as its configuration syntax.
See ./unisos/wsf/wsf_config.py for details.
- Simultaneous Parallel Dispatch Of Multiple URLs:
WSF provides a rudimentary mechanism for parallel dispatch.
See ./unisos/wsf/wsf_parallelProc.py for details.
For larger systems, an external task queue such as Celery is a better solution for dispatching workers.
- Retrieving Web Content As HTML:
WSF uses requests to obtain HTML.
See ./unisos/wsf/wsf_inputs.py for details.
- Digesting HTML:
WSF uses Beautiful Soup 4 to digest HTML.
See ./unisos/wsf/wsf_digestHtml.py for details.
- Basic Scraping Facilities:
WSF provides some generic facilities for basic scraping as an abstract class.
See ./unisos/wsf/wsf_scraperBasic.py for details.
ScraperBasic is an abstract, incomplete class that must be subclassed to become concrete.
Features provided by this principal class are:
Capturing of Config parameters.
Facilities for simple state transition.
Facilities for maintaining results.
- Multipage Scraping Facilities:
WSF provides some facilities for web information that has been paginated.
See ./unisos/wsf/wsf_scraperMultipage.py for details.
ScraperMultipage is itself a subclass of wsf_scraperBasic.ScraperBasic. Like its parent, it is an abstract, incomplete class that must be subclassed to become concrete.
- Capturing Results And Writing Results:
WSF provides a rudimentary mechanism for capturing intermediate results and their output.
See ./unisos/wsf/wsf_results.py for details.
- Command Line Mapping:
WSF can be used in combination with unisos.icm (Interactive Command Modules).
ICM can be thought of as a superset of click that supports plugins as “load” parameters.
Both config files and concrete scraper classes can be passed to ICMs as “load” parameters.
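The retrieve, digest, dispatch and collect steps listed above can be sketched in a few lines. This is only an illustration of the general pattern, not WSF's actual code or API: every name in it (FAKE_SITE, fetch_html, digest_html, scrape_one, dispatch) is hypothetical. To stay runnable without a network or third-party packages, it substitutes an in-memory dictionary for requests and the standard library's html.parser for Beautiful Soup 4, and it uses a thread pool as one possible form of rudimentary parallel dispatch.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    "https://example.test/a":
        "<html><head><title>Page A</title></head>"
        "<body><a href='/x'>x</a></body></html>",
    "https://example.test/b":
        "<html><head><title>Page B</title></head><body></body></html>",
}


def fetch_html(url):
    """Stand-in for an HTTP GET; a real scraper would call requests here."""
    return FAKE_SITE[url]


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href=...> value."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def digest_html(html):
    """Turn raw HTML into a small dictionary of extracted fields."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


def scrape_one(url):
    """Retrieve one URL and digest it into a result record."""
    return {"url": url, **digest_html(fetch_html(url))}


def dispatch(urls, max_workers=4):
    """Scrape several URLs concurrently; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_one, urls))


results = dispatch(sorted(FAKE_SITE))
for record in results:
    print(record["url"], record["title"], record["links"])
```

A thread pool suits this kind of work because fetching pages is mostly waiting on I/O. For larger workloads the same scrape_one function could be handed to an external task queue instead.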
Installation
From PyPI
pip install unisos.wsf
From File System
Go to the wsf/py3 directory.
Run: ./setup.py sdist
Run: pip install --no-cache-dir ./dist/unisos.wsf-0.1.tar.gz
Usage
import unisos.wsf
Using unisos.wsf involves creating concrete subclasses of the abstract classes that WSF provides.
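The subclassing pattern can be sketched as follows. This is a self-contained illustration under stated assumptions and does not import or reproduce WSF's real classes: BasicScraperSketch, MultipageScraperSketch, LettersScraper and the methods fetch, extract and run are all hypothetical names, and the "pages" are plain strings held in memory rather than HTML fetched from a site. See wsf_scraperBasic.py and wsf_scraperMultipage.py for the methods the real base classes expect.

```python
from abc import ABC, abstractmethod


class BasicScraperSketch(ABC):
    """Hypothetical abstract base: captures config, state and results."""

    def __init__(self, **config):
        self.config = dict(config)  # captured config parameters
        self.state = "initial"      # simple state transition
        self.results = []           # accumulated results

    @abstractmethod
    def fetch(self, url):
        """Return the raw page content for url."""

    @abstractmethod
    def extract(self, page):
        """Return (records, next_url), with next_url None on the last page."""

    def run(self):
        """Scrape the single start page."""
        self.state = "running"
        records, _ = self.extract(self.fetch(self.config["start_url"]))
        self.results.extend(records)
        self.state = "done"
        return self.results


class MultipageScraperSketch(BasicScraperSketch):
    """Still abstract: follows next-page references until none remain."""

    def run(self):
        self.state = "running"
        url = self.config["start_url"]
        max_pages = self.config.get("max_pages", 10)  # guard against loops
        pages_seen = 0
        while url is not None and pages_seen < max_pages:
            records, url = self.extract(self.fetch(url))
            self.results.extend(records)
            pages_seen += 1
        self.state = "done"
        return self.results


class LettersScraper(MultipageScraperSketch):
    """Concrete subclass: supplies the site-specific fetch and extract."""

    # Hypothetical paginated content: "items -> next page", or just "items".
    PAGES = {"page1": "alpha,beta -> page2", "page2": "gamma"}

    def fetch(self, url):
        return self.PAGES[url]

    def extract(self, page):
        body, _, next_url = page.partition(" -> ")
        return body.split(","), (next_url or None)


scraper = LettersScraper(start_url="page1", max_pages=5)
items = scraper.run()
print(scraper.state, items)
```

Because the two base classes leave fetch and extract abstract, Python refuses to instantiate them directly. Only a subclass that fills in the site-specific parts becomes concrete.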