unisos.wsfClassicCars: Classic Cars App Based On WSF (Web Scraping Framework)
Overview
unisos.wsfClassicCars: Classic Cars App Based On WSF (Web Scraping Framework)
Support
Documentation
This unisos.wsfClassicCars module is a usage example of unisos.wsf.
For details of Web Scraping Framework (wsf), see that module’s documentation.
Installation
This module is provided as a tar file.
Go to the wsfClassicCars/py3 directory.
Run: ./setup.py sdist
Run: pip install --no-cache-dir ./dist/unisos.wsfClassicCars-0.1.tar.gz
Four files will be created in your venv/bin directory. These are copies of the files in the ./bin directory.
Usage
In wsfClassicCars/py3, the following files control and run the wsfClassicCars scraper.
- ./bin/classicCarsScraperParams.py:
This is the configuration file for this App. WSF uses Python function invocation as its configuration syntax. Even for those unfamiliar with Python, the syntax is intuitive. You can modify these parameters to your liking.
- ./bin/scraperClassicCars.py:
This file implements the concrete class that scrapes the inputs.
The last invocation in that file:
wsf_config.scrapingProcessor( scraperClass=ClassicCars, )
passes:
class ClassicCars(wsf_scraperMultipage.ScraperMultipage):
to the config machinery.
- ./bin/scrapeExample.py:
This is an example of how to run the scraper in full with minimal code.
The entire relevant code is:
    import classicCarsScraperParams
    import scraperClassicCars
    from unisos.wsf import wsf_parallelProc

    if __name__ == '__main__':
        wsf_parallelProc.dispatchWorkersUsingParams()
The first two imports set the configuration parameters and bring in the concrete class.
The main entry to wsf is wsf_parallelProc.dispatchWorkersUsingParams()
- ./bin/icmClassicCarsWebScraper.py:
This is the preferred way of running this App on the command line.
Running the ICM (Interactive Command Module) by itself as:
icmClassicCarsWebScraper.py
gives you a list of commands that you can pick from and run.
Choose:
icmClassicCarsWebScraper.py --load classicCarsScraperParams.py --load scraperClassicCars.py -i scrape
(Run all of it on one line.)
Parameters and the concrete class are first loaded, then the “scrape” command is executed.
For debugging purposes, if needed, you can enable verbosity and callTracking with:
icmClassicCarsWebScraper.py -v 1 --callTrackings monitor+ --callTrackings invoke+ --load classicCarsScraperParams.py --load scraperClassicCars.py -i scrape
(Run all of it on one line.)
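The division of labor among the files above (a parameters file, a scraper file whose last line registers the concrete class, and an entry point that dispatches the work) can be sketched in a single self-contained file. This is only an illustration of the pattern of configuration by function invocation. Every name in it (ScrapeConfig, scraping_params, scraping_processor, DemoScraper, dispatch_using_params, and the example URL) is hypothetical and is not the unisos.wsf API.

```python
# Hypothetical stand-in for a configuration machinery driven by function calls.
# None of these names come from unisos.wsf; the real API may differ.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScrapeConfig:
    start_urls: List[str] = field(default_factory=list)
    workers: int = 1
    scraper_class: Optional[type] = None


# Module-level state that the "configuration calls" below fill in.
CONFIG = ScrapeConfig()


def scraping_params(start_urls, workers=1):
    """Parameters-file style call: record what to scrape and with how many workers."""
    CONFIG.start_urls = list(start_urls)
    CONFIG.workers = workers


def scraping_processor(scraper_class):
    """Scraper-file style call: register the concrete scraper class."""
    CONFIG.scraper_class = scraper_class


class DemoScraper:
    """Placeholder concrete class; a real one would fetch and parse pages."""

    def __init__(self, url):
        self.url = url

    def scrape(self):
        return {"url": self.url}


def dispatch_using_params():
    """Entry-point style call: run the registered class over every configured URL."""
    if CONFIG.scraper_class is None:
        raise RuntimeError("no scraper class registered")
    return [CONFIG.scraper_class(url).scrape() for url in CONFIG.start_urls]


# What a parameters file would contain (assumed example values).
scraping_params(start_urls=["https://example.com/page1"], workers=2)
# What the last invocation of a scraper file would look like.
scraping_processor(scraper_class=DemoScraper)

if __name__ == "__main__":
    print(dispatch_using_params())
```

Because the configuration is ordinary Python executed at import or load time, simply importing (or loading) the two files is enough to prepare the run, which is why the minimal example needs nothing beyond the imports and one dispatch call.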
Context And History
I, Mohsen BANAN, have put this together as a sample of my Python code.
I needed a web scraper development framework for a project that I was working on and decided to make this part of it public.
Here is the process that I went through to put this together in 2020.
Initial Web Searches
I first searched the web to see if this, or something similar, had been done before. I found the following relevant pointers:
https://github.com/nneibaue/yukon_cornelius
This is a scraper for oldclassiccar.co.uk.
The design and modeling quality is not great, but the code and some of the design are reusable, and I have used them. I will revisit these later.
-
Nothing useful there.
PyPI Web Scraping Engines/Tools/Packages.
There are several there. But I did not find any that I liked.
The “Web Scraping Development Framework” Model
I decided to build a web scraping development framework and then immediately use it for my own projects and also have it scrape oldclassiccar.co.uk.
Very much by choice, I avoided calling it a “web scraping engine”. The domain of web scraping is too broad and too diverse to be reasonably codified as an “engine”.
Using a web scraping development framework (wsdf), a developer can quickly customize the specifics of a particular site’s scraping. The common aspects of web scraping go into the wsdf.
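The split between common and site-specific code described above can be sketched with an abstract base class. This is a minimal illustration under stated assumptions, not the wsf implementation: MultipageScraper, FakeSiteScraper, their method names, and the in-memory FAKE_SITE data are all hypothetical, and no network access is involved.

```python
from abc import ABC, abstractmethod


class MultipageScraper(ABC):
    """Framework side (hypothetical): owns the page-walking loop shared by all sites."""

    def __init__(self, start_url, max_pages=100):
        self.start_url = start_url
        self.max_pages = max_pages

    @abstractmethod
    def fetch(self, url):
        """Return a page object for the given URL."""

    @abstractmethod
    def extract_items(self, page):
        """Return the list of items found on one page."""

    @abstractmethod
    def next_url(self, page):
        """Return the URL of the next page, or None when there is none."""

    def run(self):
        # Common logic: follow "next" links until exhausted or the page cap is hit.
        items = []
        url = self.start_url
        pages_seen = 0
        while url is not None and pages_seen < self.max_pages:
            page = self.fetch(url)
            items.extend(self.extract_items(page))
            url = self.next_url(page)
            pages_seen += 1
        return items


# In-memory stand-in for a site, so the sketch runs without a network.
FAKE_SITE = {
    "p1": {"items": ["item-1", "item-2"], "next": "p2"},
    "p2": {"items": ["item-3"], "next": None},
}


class FakeSiteScraper(MultipageScraper):
    """Site side (hypothetical): only the site-specific details live here."""

    def fetch(self, url):
        return FAKE_SITE[url]

    def extract_items(self, page):
        return page["items"]

    def next_url(self, page):
        return page["next"]


if __name__ == "__main__":
    print(FakeSiteScraper("p1").run())
```

In this arrangement a new site needs only a new subclass, while the traversal loop stays in one place.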
About unisos.wsf
unisos.wsf is a pip package included in this repo.
It is a generalized scraping framework that can be considered a public resource. There is nothing in wsf that is specific to oldclassiccar.co.uk or any other site.
In this case, unisos is just a namespace to avoid name conflicts.
About unisos.wsfClassicCars
unisos.wsfClassicCars is also a pip package. unisos.wsfClassicCars uses unisos.wsf.
The code in unisos.wsfClassicCars is very minimal.
The configuration file, the concrete ClassicCars class and the executable are all in the bin directory.
About Contents Of This Repo
After untar-ing, you will have two directories.
wsf
wsfClassicCars
There are two files that you need to read.
wsfClassicCars/py3/README.pdf
wsf/py3/README.pdf
Installation
I have tested these with Python 3.9. Both packages will likely work fine with earlier Python 3.x releases.
Create a fresh virtual environment. Install the two packages by following these instructions:
Go to wsf/py3. Follow the instructions in README.pdf Section 4.2.
Go to wsfClassicCars/py3. Follow the instructions in README.pdf Section 4.
The “requires” section of wsf/py3/setup.py enumerates all other package dependencies.
A pip list after the installation should produce something like:
Package               Version
--------------------- ---------
beautifulsoup4        4.10.0
certifi               2021.10.8
charset-normalizer    2.0.7
enum34                1.1.10
idna                  3.3
lxml                  4.6.4
pip                   21.3.1
requests              2.26.0
setuptools            58.3.0
soupsieve             2.3.1
unisos.icm            0.25
unisos.ucf            0.15
unisos.wsf            0.1
unisos.wsfClassicCars 0.1
urllib3               1.26.7
wheel                 0.37.0
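As an alternative to reading the pip list output by eye, the presence of the unisos packages can be checked from Python. This is a small sketch that uses only the standard library importlib.metadata module (Python 3.8 and later); the helper name installed_versions and the EXPECTED list are illustrative choices, not part of any unisos package.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip list output above.
EXPECTED = ["unisos.icm", "unisos.ucf", "unisos.wsf", "unisos.wsfClassicCars"]

if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        print(f"{name:24} {version or 'NOT INSTALLED'}")
```

Run it inside the virtual environment that was used for the installation; any line reporting NOT INSTALLED points to a step that needs to be repeated.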
About unisos.icm
I want Web Scraping Applications (WS-Apps) to function as pluggable modules on the command line interface.
unisos.wsfClassicCars is a WS-App.
ICM (Interactive Command Modules) is a pip package that I have developed. It is similar to “click” but it also supports “--load fileName”. fileName can be any Python code. This is how wsfClassicCars becomes a pluggable command line module.
Also, the flexibility that ICM provides allows for regression testing of whole or parts of the code. This renders the usual traditions of unit testing obsolete.
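The idea of loading arbitrary Python files before invoking a command can be sketched with the standard library alone. This is not how ICM is implemented; it is a minimal illustration that uses argparse and runpy, and the names build_parser, load_files, main, and demo, as well as the behavior of looking up the command as a function defined in the loaded files, are assumptions made for the example.

```python
import argparse
import os
import runpy
import tempfile


def build_parser():
    parser = argparse.ArgumentParser(description="sketch of a --load style CLI")
    # Each --load adds one file; files are executed in the order given.
    parser.add_argument("--load", action="append", default=[], metavar="fileName",
                        help="Python file to execute before the command runs")
    parser.add_argument("-i", dest="command", default=None,
                        help="name of the command to invoke")
    return parser


def load_files(paths):
    """Execute each file and merge the top-level names it defines."""
    namespace = {}
    for path in paths:
        namespace.update(runpy.run_path(path))
    return namespace


def main(argv=None):
    args = build_parser().parse_args(argv)
    loaded = load_files(args.load)
    if args.command is None:
        # With no command, list the public names that the loaded files provide.
        return sorted(name for name in loaded if not name.startswith("_"))
    return loaded[args.command]()


def demo():
    """Write a throwaway plug-in file, load it, and invoke its 'scrape' function."""
    with tempfile.TemporaryDirectory() as tmp:
        plugin = os.path.join(tmp, "plugin.py")
        with open(plugin, "w") as handle:
            handle.write("def scrape():\n    return 'scraped'\n")
        return main(["--load", plugin, "-i", "scrape"])
```

Calling demo() exercises the whole path: the plug-in file is executed first, and only then is the requested command looked up and run, which mirrors the order described for the icmClassicCarsWebScraper.py command line.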
About COMEEGA And Dynamic Blocks
Parts of my code are written as COMEEGA. COMEEGA stands for “Collaborative Org-Mode Enhanced Emacs Generalized Authorship”. Think of it as the inverse of Literate Programming, where the code is also a document. You can switch between code mode and document mode by switching between org-mode and python-mode.
Without emacs and org-mode, such code is not pleasant. I won’t use COMEEGA on other people’s code.
Dynamic Blocks are a feature of org-mode. What is between +BEGIN: and +END: is controlled with lisp code and will be overwritten if edited.
This allows me to add visible macro capabilities to Python.
Both COMEEGA and Dynamic Blocks are mostly used in icmClassicCarsWebScraper.py. If you find that unpleasant, I suggest that you just consider it as awareness of other powerful ways of doing things …
Design And Implementation Considerations
I did all of this in a rush, so the code is weak in terms of error handling and robustness. But there is a proper starting point in place, and over time it can improve and expand.