Data Retrieval Web Engine - A queryable web scraping engine built in Python on top of the lxml and Selenium packages, using JSON as the query construct.


Data Retrieval Web Engine

Context

Multiple technologies are used as web parsers, web scrapers, spiders and so forth. Comparative studies in the literature categorise them by methods and technologies. We took a different perspective and looked at the queryability feature. Our inspiration comes from OXPath, where an extension of XPath is used to "query" and extract semi-structured data from the web.

Objectives

Similarly to OXPath, our objective is to create a tool for data retrieval from the web based on a "query" mechanism. We opted to use JSON constructs for our query definitions, augmented with keywords, filters and actions.

Technology stack

The tool is written in Python 3. It can be included in other Python projects by installing it from the Python Package Index with pip3 install dr-web-engine, or used through its command line interface by running python3 -m web_engine.runner. The tool is built on top of several other packages, which are installed automatically. These are:

  • Selenium
  • Geckodriver Autoinstaller
  • LXML
  • Python Interface
  • ArgParse
  • XVFBWrapper

XVFB only works on Linux; if the parameter is set to True on a Windows or macOS system, you will get an error message.

The package page can be found on the Python Package Index.

Usage

To use the integrated CLI run python3 -m web_engine.runner. This will display the following help message:

usage: runner.py [-h] [-q QUERY] [-e [ENGINE]] [-ht [HEIGHT]] [-wh [WIDTH]]
                 [-lat [LAT]] [-lon [LON]] [-img [IMG]] [-l [LOG]]
                 [-xvfb [XVFB]]

Web Scrap Engine for semi-structured web data retrieval using JSON query constructs

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        JSON query
  -e [ENGINE], --engine [ENGINE]
                        Engine: use [lxml] for parser engine (default),
                        [selenium] for action based web scraping
  -ht [HEIGHT], --height [HEIGHT]
                        specify the browser window height (default is 800,
                        only used with Selenium engine)
  -wh [WIDTH], --width [WIDTH]
                        specify the browser window width (default is 1280,
                        only used with Selenium engine)
  -lat [LAT], --lat [LAT]
                        Latitude (not specified by default)
  -lon [LON], --lon [LON]
                        Longitude (not specified by default)
  -img [IMG], --img [IMG]
                        Load images
  -l [LOG], --log [LOG]
                        Set flag to True to see verbose logging output
  -xvfb [XVFB], --xvfb [XVFB]
                        Set flag to False to see Firefox when using Selenium
                        engine
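Combining the optional flags shown above, one might, for instance, run python3 -m web_engine.runner -q test.json -e selenium -l True -xvfb False to use the Selenium engine with verbose logging and a visible browser window (a hypothetical combination of the documented flags).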

There is only one required parameter: -q QUERY, the JSON query.

For example, to run the web data retrieval with the following JSON query (saved, say, in a file test.json):

{
    "_doc":"https://www.google.com/search?q=Donald+Duck",
    "links":[{
        "_base_path": "//div[@id='search'][1]//div[@class='g']",
        "_follow": "//a[@id='pnnext'][1]/@href",
        "link": "//div[@class='rc']/div[@class='r']/a/@href",
        "title": "//h3/text()"
    }]
}

use the following command: python3 -m web_engine.runner -q test.json. The outcome will look like the following JSON result:

{"links": [{"link": ["https://en.wikipedia.org/wiki/Donald_Duck"], 
           "title": ["Donald Duck - Wikipedia"]},
           {"link": ["https://cosleyzoo.org/white-pekin-duck/"],
            "title": ["White Pekin Duck – Cosley Zoo"]},
           {"link": ["https://www.cheatsheet.com/entertainment/donald-duck-turned-85-years-old.html/"],
            "title": ["Donald Duck Turned 85-Years-Old and Disney Fans Are Quacking ..."]},
           {"link": ["https://en.wikipedia.org/wiki/Daisy_Duck"],
            "title": ["Daisy Duck - Wikipedia"]},
           {"link": ["https://www.headstuff.org/culture/history/disney-studios-war-story-donald-duck-became-sgt/"],
            "title": ["Disney Studios At War - the story of how Donald Duck became a Sgt ..."]}

In the JSON query provided, the items starting with _ are keywords and can be filters, actions or instructions. If we remove all the keywords, the remaining JSON represents the structure of the expected output, as the sketch below illustrates.
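As a rough illustration (not part of the package), the following Python sketch strips the underscore-prefixed keywords from the first example query; what remains mirrors the structure of the result shown above.

import json

def strip_keywords(node):
    """Recursively drop '_'-prefixed keywords from a query structure."""
    if isinstance(node, dict):
        return {k: strip_keywords(v) for k, v in node.items() if not k.startswith("_")}
    if isinstance(node, list):
        return [strip_keywords(item) for item in node]
    return node

query = {
    "_doc": "https://www.google.com/search?q=Donald+Duck",
    "links": [{
        "_base_path": "//div[@id='search'][1]//div[@class='g']",
        "_follow": "//a[@id='pnnext'][1]/@href",
        "link": "//div[@class='rc']/div[@class='r']/a/@href",
        "title": "//h3/text()"
    }]
}

print(json.dumps(strip_keywords(query), indent=4))
# -> {"links": [{"link": "...", "title": "..."}]}  -- the shape of the output above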

In another, more complex query we use additional keywords and actions:

{
    "_doc": "https://www.checkatrade.com/trades/WayreHouseElectricalServices",
    "data": {
        "ld_data": "//head/script[@type=\"application/ld+json\"][1]"
    },
    "reviews": [{
            "_doc": "https://www.checkatrade.com/trades/WayreHouseElectricalServices/reviews",
            "_base_path": "//div[contains(@class, 'ReviewsPage__Content')]//div[contains(@class, 'ReviewsItem__Wrapper')]",
            "_key": "review",
            "_pre_remove": "//*[contains(@class,'alert-box')]",
            "_follow": "//a[contains(@class,\"Chevrons__Wrapper\")][2]/@href",
            "_follow_action": "//a[contains(@class,\"Chevrons__Wrapper\")][2]{click }",
            "title": "//h3[contains(@class, 'ReviewsItem__Title')]",
            "score": "//*[name()='svg']//text()[normalize-space()]",
            "verified": "//div[contains(@class, 'ReviewsItem__Verified')]/text()[normalize-space()]",
            "content": "//p[contains(@class, 'ReviewsItem__P')]",
            "review_by": "//div[contains(@class, 'ReviewsItem__Byline')]/text()[normalize-space()]"
        }]
}

Keywords

_doc: Represents the document to follow; it is usually a URL to a web page. It is compulsory at the top level and can also be provided at lower levels of the hierarchical structure (as in the reviews sub-query above).

_base_path: To be used in an array extraction. Arrays are lists of elements and are defined in the query as a JSON array []. When _base_path is provided, every element of the query inside the array is looked up within the HTML element selected by _base_path.

_key: Used to assign each extracted element of the array to the key named by the _key value (for example, each item above is emitted under the key review).

_pre_xxx: All actions that start with _pre_ are to be executed before data extraction.

_pre_remove: Removes the matched elements from the page before extraction.

_follow: Follow the link if and when it exists.

_follow_action: If the element in _follow exists, perform the action rather than following the link. The action is defined as the last part of the XPath query and is always enclosed in curly brackets. In this case the action {click } means click on the element.
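As a rough sketch (mirroring the parsing code shown in the Extendability section below), a composite _follow_action value can be thought of as decomposing into an action name and a plain XPath:

import re

follow_action = "//a[contains(@class,\"Chevrons__Wrapper\")][2]{click }"

# Split the composite value into the action name and the XPath part.
pattern = '{(.+?)}'
match = re.search(pattern, follow_action)
action_name = match.group(1).strip() if match else None   # "click"
action_xpath = re.sub(pattern, '', follow_action)          # the XPath without the {click } suffix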

Extendability

The package is intended to be easily extendable. For example, the {click } action used in the _follow_action keyword above is implemented by a corresponding ClickAction class:

import logging

from interface import implements
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Action is the package's own action interface.


class ClickAction(implements(Action)):

    def __init__(self, receiver, log: logging = None):
        self._log = log
        self._receiver = receiver  # the scraper that holds the Selenium driver

    def execute(self, *args):
        if args is None or len(args) != 1:
            return
        xpath_selector: str = args[0]
        # Wait up to 10 seconds for the element to become clickable, then click it.
        wait = WebDriverWait(self._receiver.driver, 10)
        elem = wait.until(EC.element_to_be_clickable((By.XPATH, xpath_selector)))
        elem.click()

In the Scraper implementations, actions are registered against keywords as follows:

     click = ClickAction(self, self.log)
     filter_remove = FilterRemoveAction(self, self.log)
     self.register('click', click)
     self.register('remove', filter_remove)

Actions are then invoked by matching the action keyword found in the query:

    def action_get(self, actions: list):
        for x in actions:
            self.execute(x)
        return self.get()


    def execute(self, action_composite: str):
        action_name, action_path = SeleniumScraper.__get_action(action_composite)
        action_name = action_name.strip()
        if action_name in self._actions.keys():
            self._history.append((time.time(), action_name))
            self._actions[action_name].execute(action_path)
        else:
            self.log.warn(f"Command [{action_name}] not recognised")


    @staticmethod
    def __get_action(action_composite):
        # Separate the action name in curly brackets from the XPath part,
        # e.g. "//a[...]{click }" -> ("click ", "//a[...]").
        pattern = '{(.+?)}'
        matches = re.search(pattern, action_composite)
        if not matches:
            return None, None
        action_name = matches.group(1)
        action_xpath = re.sub(pattern, '', action_composite)
        return action_name, action_xpath
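Following the same pattern, a new action could be added by implementing the Action interface and registering it under a keyword of its own. The sketch below is purely hypothetical and not part of the package: a ScrollAction that scrolls the page with Selenium's execute_script, registered next to the existing actions in the Scraper implementation so that it would become available as {scroll } in queries.

# Hypothetical example, not part of the package.
class ScrollAction(implements(Action)):

    def __init__(self, receiver, log: logging = None):
        self._log = log
        self._receiver = receiver

    def execute(self, *args):
        # Scroll to the bottom of the page so lazily loaded content appears.
        self._receiver.driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);")

and, inside the Scraper implementation:

     scroll = ScrollAction(self, self.log)
     self.register('scroll', scroll)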

Future work

This work, whilst a working beta, is by no means complete and is focused on a rather narrow, specific problem. However, special effort has been made to keep the solution generic, universal and extendable, so that it can potentially grow into a mature Data Retrieval Web Engine based on JSON queries.

