

Project description

Overview

unisos.wsf: Somewhat General purpose Web Scraping Framework (WSF)

Support

For support, criticism, comments, and questions, please contact the author/maintainer.

Documentation

There is no silver bullet that universally provides web scraping capabilities.

Each web site is different, and a variety of technologies, encodings, and paradigms are in use.

What can be done is to create a framework that addresses some of the common aspects of this problem domain.

This is what Web Scraping Framework (WSF) tries to do.

Common aspects of Web scraping include:

Configuration:

WSF uses Python function invocation as its configuration syntax.

See ./unisos/wsf/wsf_config.py for details.
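
As an illustration only (the function and parameter names below are hypothetical, not the actual wsf_config interface), a configuration expressed as a Python function invocation might look like this:

    # Hypothetical sketch: "scraperConfig" and its parameters are
    # illustrative names, not part of the wsf_config interface.
    def scraperConfig(
            baseUrl=None,        # site to be scraped
            pageRange=None,      # (first, last) page numbers, if paginated
            outputFormat="csv",  # how captured results should be written
            parallelism=4,       # number of simultaneous fetches
    ):
        """Return the configuration as a plain dict."""
        return dict(
            baseUrl=baseUrl,
            pageRange=pageRange,
            outputFormat=outputFormat,
            parallelism=parallelism,
        )

    # A concrete configuration is then just a function call:
    config = scraperConfig(
        baseUrl="https://example.com/listings",
        pageRange=(1, 10),
    )
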
Simultaneous Parallel Dispatch Of Multiple URLs:

WSF provides a rudimentary mechanism for parallel dispatch.

See ./unisos/wsf/wsf_parallelProc.py for details.

For larger systems, using an external system such as Celery for worker dispatch is a better solution.
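
The sketch below shows the general idea of simultaneous dispatch using only the standard library's concurrent.futures; it is an assumption for illustration and not necessarily the mechanism wsf_parallelProc uses:

    # Minimal sketch of parallel URL dispatch; wsf_parallelProc may be
    # implemented differently.
    import concurrent.futures
    import requests

    def fetch(url):
        """Fetch one URL and return (url, status code, body text)."""
        resp = requests.get(url, timeout=30)
        return url, resp.status_code, resp.text

    def dispatchUrls(urls, maxWorkers=4):
        """Fetch several URLs simultaneously and collect the results."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=maxWorkers) as pool:
            return list(pool.map(fetch, urls))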

Retrieving Web Content As HTML:

WSF uses requests to obtain HTML.

See ./unisos/wsf/wsf_inputs.py for details.
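
A minimal sketch of HTML retrieval with requests follows; the helper name is hypothetical, and the wsf_inputs interface itself may differ:

    # Hypothetical helper showing HTML retrieval with requests.
    import requests

    def getHtml(url):
        """Return the page body as text; raise on HTTP errors."""
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return resp.text
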
Digesting HTML:

WSF uses Beautiful Soup 4 to digest HTML.

See ./unisos/wsf/wsf_digestHtml.py for details.
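
The sketch below shows typical Beautiful Soup 4 digestion calls; wsf_digestHtml wraps similar operations, but the exact helpers it exposes may differ:

    # Hypothetical helper showing HTML digestion with Beautiful Soup 4.
    from bs4 import BeautifulSoup

    def digestHtml(html):
        """Parse HTML and pull out link targets and table rows."""
        soup = BeautifulSoup(html, "html.parser")
        links = [a.get("href") for a in soup.find_all("a")]
        rows = [
            [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in soup.find_all("tr")
        ]
        return {"links": links, "rows": rows}
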
Basic Scraping Facilities:

WSF provides some generic facilities for basic scraping as an abstract class.

See ./unisos/wsf/wsf_scraperBasic.py for details.

ScraperBasic is an abstract, incomplete class which must be subclassed to become concrete.

Features provided by this principal class are:

  • Capturing of Config parameters.

  • Facilities for simple state transition.

  • Facilities for maintaining results.
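
A minimal subclassing sketch is shown below; the hook names digest and run are assumptions chosen for illustration, not the verbatim abstract methods of ScraperBasic:

    # Hypothetical concrete subclass of ScraperBasic; the method names
    # are illustrative assumptions, not the actual abstract interface.
    from unisos.wsf import wsf_scraperBasic

    class MyScraper(wsf_scraperBasic.ScraperBasic):
        """A concrete scraper supplying the site-specific parts."""

        def digest(self, html):
            # Site-specific extraction of the items of interest.
            ...

        def run(self):
            # Drive the fetch/digest/record cycle, using the captured
            # config parameters and the base class's state and results
            # facilities.
            ...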

Multipage Scraping Facilities:

WSF provides some facilities for web information that has been paginated.

See ./unisos/wsf/wsf_scraperMultipage.py for details.

ScraperMultipage is likewise an abstract, incomplete class which must be subclassed to become concrete.

ScraperMultipage itself is a subclass of wsf_scraperBasic.ScraperBasic.
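
The sketch below illustrates a paginated scraper; the hook names nextPageUrl and digestPage are illustrative assumptions, not the verbatim ScraperMultipage interface:

    # Hypothetical concrete subclass of ScraperMultipage.
    from unisos.wsf import wsf_scraperMultipage

    class ListingScraper(wsf_scraperMultipage.ScraperMultipage):
        """Walks a paginated listing, one page at a time."""

        def nextPageUrl(self, pageNumber):
            # Map a page number to its URL (assumed hook).
            return f"https://example.com/listings?page={pageNumber}"

        def digestPage(self, html):
            # Extract the per-page records (assumed hook).
            ...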

Capturing Results And Writing Results:

WSF provides a rudimentary mechanism for capturing intermediate results and their output.

See ./unisos/wsf/wsf_results.py for details.
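
As a rough illustration of the idea (the class and method names here are assumptions, not the wsf_results API), capturing and writing results can be as simple as:

    # Hypothetical results holder: accumulate dict records in memory,
    # then write them out as CSV.
    import csv

    class Results:
        def __init__(self):
            self.records = []

        def capture(self, record):
            """Record one intermediate result (a dict of field values)."""
            self.records.append(record)

        def write(self, path, fieldnames):
            """Write all captured records to a CSV file."""
            with open(path, "w", newline="") as out:
                writer = csv.DictWriter(out, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(self.records)
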
Command Line Mapping:

WSF can be used in combination with unisos.icm (Interactive Command Modules).

ICM can be thought of as a superset of click that supports plugins as “load” parameters.

Both config files and concrete scraper classes can be passed to ICMs as “load” parameters.

Installation

From PyPi

pip install unisos.wsf

From File System

Go to the wsf/py3 directory.

Run: ./setup.py sdist
Run: pip install --no-cache-dir ./dist/unisos.wsf-0.1.tar.gz

Usage

import unisos.wsf

Use of unisos.wsf involves creating concrete subclasses of the set of abstract classes that wsf provides.
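
Tying the pieces together might look like the following; scraperConfig and MyScraper refer to the hypothetical sketches above and are not names shipped by unisos.wsf:

    # Hypothetical end-to-end flow, reusing names from the sketches above.
    config = scraperConfig(
        baseUrl="https://example.com/listings",
        pageRange=(1, 10),
    )
    scraper = MyScraper(config)   # concrete subclass of ScraperBasic
    scraper.run()                 # fetch, digest, and record results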

Project details


Release history

This version

0.1

Download files

Download the file for your platform.

Source Distribution

unisos.wsf-0.1.tar.gz (22.0 kB)

Uploaded Source

File details

Details for the file unisos.wsf-0.1.tar.gz.

File metadata

  • Download URL: unisos.wsf-0.1.tar.gz
  • Upload date:
  • Size: 22.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.5.0 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.2

File hashes

Hashes for unisos.wsf-0.1.tar.gz
Algorithm Hash digest
SHA256 b3beca8c98494fb94acd71cf9a8db88975435798a763dbfc8339e91c69ae8113
MD5 304b616980e303c188ee398425722049
BLAKE2b-256 e4efb17f966c2cfd672e8097da525307f16c9c7ff287df9fa1715072f6d7501b

