Templated scraping syntax

Scrapepath

Scrapepath is a templated web scraping syntax. Install it with pip: pip install scrapepath.

Requirements

Install the required Python dependencies from the provided requirements.txt file:

pip install -r requirements.txt

Usage

To run an example, execute the parser on the command line without arguments:

./parser.py

To use within Python:

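import json

from parser import NodeParser

# soup_template, soup and live_url are described below.
np = NodeParser(soup_template, soup, live_url)
np.hop_template()  # run the template against the scraped page
print(json.dumps(np.result_dict, indent=2, default=str))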

Here, soup_template is a BeautifulSoup object built from the template file, soup is a BeautifulSoup object built from the scraped page, and live_url is the URL of the scraped page.
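As a minimal sketch of how the three arguments might be prepared, the snippet below builds the two BeautifulSoup objects from inline strings. The HTML strings and the example.com URL are placeholders chosen for illustration; in practice the strings would come from the template file and the fetched page. The NodeParser call is left as a comment because it needs the project's parser module.

```python
from bs4 import BeautifulSoup

# Placeholder inputs: in practice, read the template from a file and
# fetch the page from live_url.
template_html = '<span class="cuppa" record="text as favorite"></span>'
page_html = (
    '<ul class="my_list">'
    '<li class="my_item"><span class="cuppa">Tea</span></li>'
    "</ul>"
)
live_url = "https://example.com/drinks.html"  # hypothetical URL

# "html.parser" is the parser bundled with Python; other BeautifulSoup
# parsers may be used instead.
soup_template = BeautifulSoup(template_html, "html.parser")
soup = BeautifulSoup(page_html, "html.parser")

# With the project's parser module available, the objects are passed on:
# np = NodeParser(soup_template, soup, live_url)
print(soup_template.find("span")["record"])
print(soup.find("span", class_="cuppa").get_text())
```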

Templates

HTML pages are scraped using HTML templates. A template is an HTML file that mixes scraping statements with the most important tags of the scraped page, nested so that they lead to the element of interest.

The parser is based on BeautifulSoup.

Example 1: Scraping data

The following examples use the template examples/example1a.html and the scraped page examples/scraped1.html. Run the example using:

./parser.py examples/example1a.html examples/scraped1.html

This scrapes the target page scraped1.html using the template example1a.html. The text item "Tea" is scraped from the target page using the record attribute in the template page. A path to the target text ("Tea") is specified in the template using tags that correspond to the target page. So, to scrape from:

<ul class = "my_list">
  <li class = "my_item">Coffee</li>
  <li class = "my_item"><span class = "cuppa">Tea</span></li>
  <li class = "my_item">Milk</li>
</ul>

Use template:

<ul class = "my_list">
  <span class = "cuppa" record = "text as favorite"></span>
</ul>

This yields a dictionary containing the scraped data under the key "favorite" as specified in the record attribute:

{
  "favorite": "Tea"
}

The text statement within the record attribute corresponds to a function that obtains text from inside the HTML tag, and favorite is the key to record the data against. The text function can be replaced with custom Python functions.
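One way to picture a record attribute is as a small "FUNC as KEY" instruction. The sketch below is not Scrapepath's implementation: the registry RECORD_FUNCTIONS and the helpers parse_record and apply_record are hypothetical names, and the node is reduced to its raw text to keep the example short.

```python
# Hypothetical registry mapping the function name used in a record
# attribute to a Python callable. Not part of Scrapepath's API.
RECORD_FUNCTIONS = {
    "text": lambda node_text: node_text.strip(),
    # A custom function registered alongside the built-in one.
    "upper": lambda node_text: node_text.strip().upper(),
}


def parse_record(record):
    """Split a 'FUNC as KEY' statement into (FUNC, KEY)."""
    func_name, separator, key = record.partition(" as ")
    if not separator or not func_name.strip() or not key.strip():
        raise ValueError(f"expected 'FUNC as KEY', got {record!r}")
    return func_name.strip(), key.strip()


def apply_record(record, node_text, result):
    """Run the named function on node_text and store it under KEY."""
    func_name, key = parse_record(record)
    result[key] = RECORD_FUNCTIONS[func_name](node_text)
    return result


result = apply_record("text as favorite", " Tea ", {})
print(result)  # {'favorite': 'Tea'}
```

Swapping "text" for "upper" in the statement would store "TEA" instead, which is the kind of substitution a custom function allows.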

Starting from the outer node in the template, <ul>, the parser looks for the first node in the scraped page that matches the template node in type and attributes: here, a ul with a ul, and class my_list with class my_list. The same search then takes place for the children of the template node, now confined to the contents of the matched scraped node. Nested template nodes therefore represent paths. The <li> node is not included in the template, as it would point the search to the first element of the list (Coffee), which does not contain the target <span>.
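The first-match search described above can be sketched with the standard library. This is an illustration under assumptions, not the parser's code: it uses xml.etree.ElementTree instead of BeautifulSoup, compares attribute values as whole strings, and the function names are made up for the example. The second template includes the <li> step to show why it is left out.

```python
import xml.etree.ElementTree as ET

SCRAPED = """
<ul class = "my_list">
  <li class = "my_item">Coffee</li>
  <li class = "my_item"><span class = "cuppa">Tea</span></li>
  <li class = "my_item">Milk</li>
</ul>
"""

TEMPLATE = """
<ul class = "my_list">
  <span class = "cuppa" record = "text as favorite"></span>
</ul>
"""

# Same template with the <li> step included.
TEMPLATE_WITH_LI = """
<ul class = "my_list">
  <li class = "my_item">
    <span class = "cuppa" record = "text as favorite"></span>
  </li>
</ul>
"""


def matches(node, template_node):
    """True if the tag and all template attributes except record agree."""
    if node.tag != template_node.tag:
        return False
    wanted = {k: v for k, v in template_node.attrib.items() if k != "record"}
    return all(node.get(k) == v for k, v in wanted.items())


def find_first(scope, template_node):
    """First descendant of scope, in document order, that matches."""
    for node in scope.iter(template_node.tag):
        if node is not scope and matches(node, template_node):
            return node
    return None


def scrape(scope, template_node, result):
    """Match template_node inside scope, record it, then descend."""
    node = find_first(scope, template_node)
    if node is None:
        return result
    record = template_node.get("record")
    if record:
        func_name, key = record.split(" as ")
        if func_name == "text":
            result[key] = "".join(node.itertext()).strip()
    for template_child in template_node:
        scrape(node, template_child, result)
    return result


def run(template_source, scraped_source):
    # Synthetic root so that the page's <ul> is itself a descendant.
    page = ET.Element("document")
    page.append(ET.fromstring(scraped_source))
    return scrape(page, ET.fromstring(template_source), {})


print(run(TEMPLATE, SCRAPED))          # {'favorite': 'Tea'}
print(run(TEMPLATE_WITH_LI, SCRAPED))  # {}
```

With the <li> step in the template, the search settles on the first list item (Coffee), finds no matching <span> inside it, and records nothing.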

In this case, nesting the template nodes is needlessly specific. There are no other nodes of class "cuppa" in the scraped page, so the enclosing <ul> node can be omitted, and the following template records the same data:

<span class = "cuppa" record = "text as favorite"></span>

So paths along many nested nodes in the scraped page can be summarized by only a few nodes that define a unique path to the scraped data.

Loops:

A for loop scrapes all items in the list. In this simple example, we record only one variable (item_text) per item:

Template:

    <ul class = "my_list">
      <for items = "items" condition = "i < 5">
        <li class ="my_item" record = "text as item_text">
        </li>
      </for>
    </ul>

This results in the output:

{
  "items": [
    {
      "item_text": "Coffee"
    },
    {
      "item_text": "Tea"
    },
    {
      "item_text": "Milk"
    },
    {
      "item_text": "Biscuits"
    },
    {
      "item_text": "Chocolate"
    }
  ]
}

Here, the parser matches all the children of the <for> template node against the children of the <ul> node in the scraped page scraped1.html. The condition attribute indicates that only the first 5 items should be recorded, where i is the loop counter variable. Run the example using: ./parser.py examples/example1b.html examples/scraped1.html
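The loop and its condition can be sketched as follows. Several things here are assumptions made for illustration: the page is parsed with xml.etree.ElementTree rather than BeautifulSoup, the counter i is taken to start at 0 and to count matched items only, the condition is limited to the form "i OP NUMBER", and a sixth item ("Sugar") is invented so that the cut-off is visible.

```python
import json
import operator
import xml.etree.ElementTree as ET

# The first five items mirror the output shown above; "Sugar" is a
# made-up sixth item that the condition should exclude.
SCRAPED = """
<ul class="my_list">
  <li class="my_item">Coffee</li>
  <li class="my_item"><span class="cuppa">Tea</span></li>
  <li class="my_item">Milk</li>
  <li class="my_item">Biscuits</li>
  <li class="my_item">Chocolate</li>
  <li class="my_item">Sugar</li>
</ul>
"""

COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def condition_holds(condition, i):
    """Evaluate a condition of the form 'i OP NUMBER' for counter i."""
    name, op, value = condition.split()
    if name != "i":
        raise ValueError(f"unsupported variable in condition: {name!r}")
    return COMPARATORS[op](i, int(value))


def for_loop(parent, tag, attrs, key, condition=None):
    """Record the text of each matching child of parent while the condition holds."""
    records = []
    i = 0  # counts matching children only (an assumption)
    for child in parent:
        if child.tag != tag or any(child.get(k) != v for k, v in attrs.items()):
            continue
        if condition is None or condition_holds(condition, i):
            records.append({key: "".join(child.itertext()).strip()})
        i += 1
    return records


ul = ET.fromstring(SCRAPED)
result = {"items": for_loop(ul, "li", {"class": "my_item"}, "item_text", "i < 5")}
print(json.dumps(result, indent=2))
```

The loop arguments are passed as Python values rather than parsed from the template, because a literal "<" inside an attribute value is not valid XML and ElementTree would reject it.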

Example 2: for loops on mixed nodes

In the following HTML, the list items differ in tag (div and p) and class (my_item and milk_class), so the <for> template loop node needs to enclose two template nodes, one for each combination.

To scrape from:

<div class = "my_list">
  <div class = "my_item">Coffee</div>
  <div class = "my_item"><span class = "cuppa">Tea</span></div>
  <p class = "milk_class">Milk</p>
  <div class = "my_item">Biscuits</div>
  Chocolate
</div>

Use template:

<div class = "my_list">
  <for items = "items" >
    <div class ="my_item" record = "text as item_text"></div>
    <p class ="milk_class" record = "text as item_text"></p>
  </for>
</div>

However, the <for> template loop node is unable to record the text element "Chocolate", as <for> only looks for tag nodes among the children of the <div class = "my_list"> node. To record it, a <forchild> template loop node is needed, along with a <str> template node that matches the NavigableString element "Chocolate":

Template:

<div class = "my_list">
  <forchild items = "items_with_string" >
    <div class ="my_item" record = "text as item_text"></div>
    <p class ="milk_class" record = "text as item_text"></p>
    <str record = "text as item_text"></str>
  </forchild>
</div>

In this case, the parser looks for the first match to the first template node (the first child of the loop node), then loops over its siblings, probing each one with all the template nodes (the children of the loop node). Run this example using examples/example1b.html and examples/scraped1.html.
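The distinction between tag nodes and bare text is visible directly in BeautifulSoup. The sketch below does not use Scrapepath; it only shows, on the page from this example, that asking for child tags skips "Chocolate", while iterating over all children also yields it as a NavigableString.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<div class="my_list">
  <div class="my_item">Coffee</div>
  <div class="my_item"><span class="cuppa">Tea</span></div>
  <p class="milk_class">Milk</p>
  <div class="my_item">Biscuits</div>
  Chocolate
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="my_list")

# Direct child tags only: the bare text "Chocolate" is not a tag.
tags_only = [
    child.get_text(strip=True)
    for child in outer.find_all(True, recursive=False)
]

# All direct children: tags plus non-empty NavigableString nodes.
with_strings = []
for child in outer.children:
    if isinstance(child, Tag):
        with_strings.append(child.get_text(strip=True))
    elif isinstance(child, NavigableString) and child.strip():
        with_strings.append(child.strip())

print(tags_only)     # ['Coffee', 'Tea', 'Milk', 'Biscuits']
print(with_strings)  # ['Coffee', 'Tea', 'Milk', 'Biscuits', 'Chocolate']
```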

Example 3: Jumping to linked pages

Follow links on pages using the <jump> template node:

To scrape from:

<a href="example_linked.html"></a>

Use template:

    <a record = "href as my_link">
      <jump on = "my_link">
        <ibody>
          <div class = "message" record = "text as msg_from_link"></div>
        </ibody>
      </jump>
    </a>

Here, the nodes within the <jump> node act on the linked page.

This example is invoked with:

./parser.py examples/example3a.html examples/scraped3.html
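The link in this example is relative, so following it requires the URL of the page it was found on. A plausible way to resolve it, assuming live_url plays that role (the source does not say how the parser does this), is urljoin from the standard library. The helper name resolve_jump and the example.com URL are hypothetical.

```python
from urllib.parse import urljoin


def resolve_jump(live_url, href):
    """Resolve a recorded href against the URL of the scraped page."""
    return urljoin(live_url, href)


live_url = "https://example.com/pages/scraped3.html"  # hypothetical URL
recorded = {"my_link": "example_linked.html"}

target = resolve_jump(live_url, recorded["my_link"])
print(target)  # https://example.com/pages/example_linked.html
```

Absolute links pass through unchanged, and links starting with "/" are resolved against the site root.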
