scrapepath

Templated scraping syntax

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Scrapepath

Scrapepath is a templated web scraping syntax. Install it with pip install scrapepath.

Requirements

Install the required Python dependencies from the provided requirements.txt file:

pip install -r requirements.txt

Usage

To run an example, execute on the command line without arguments:

./parser.py

To use within Python:

import json
from parser import NodeParser

np = NodeParser(soup_template, soup, live_url)
np.hop_template()
print(json.dumps(np.result_dict, indent=2, default=str))

Here, soup_template is a BeautifulSoup object built from the template file, soup is a BeautifulSoup object built from the scraped page, and live_url is the URL of the scraped page.

Templates

HTML pages are scraped using HTML templates. A template is an HTML file that mixes the most important tags of the scraped page with scraping statements.

The nested tags in a template lead to the element to be scraped.

The parser is based on BeautifulSoup.

Example 1: Scraping data

The following examples use the template examples/example1a.html and the scraped page examples/scraped1.html. Run the example using:

./parser.py examples/example1a.html examples/scraped1.html

This scrapes the target page scraped1.html using the template example1a.html. The text item "Tea" is scraped from the target page using the record attribute in the template page. A path to the target text ("Tea") is specified in the template using tags that correspond to the target page. So, to scrape from:

<ul class = "my_list">
  <li class = "my_item">Coffee</li>
  <li class = "my_item"><span class = "cuppa">Tea</span></li>
  <li class = "my_item">Milk</li>
</ul>

Use template:

<ul class = "my_list">
  <span class = "cuppa" record = "text as favorite"></span>
</ul>

This yields a dictionary containing the scraped data under the key "favorite" as specified in the record attribute:

{
  "favorite": "Tea"
}

The text statement within the record attribute corresponds to a function that obtains text from inside the HTML tag, and favorite is the key to record the data against. The text function can be replaced with custom Python functions.
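The record attribute can be read as a small "function as key" statement. The sketch below only illustrates that reading and is not Scrapepath's own code: parse_record, get_text, and the FUNCTIONS table are hypothetical names, and it assumes the two parts are separated by the word "as".

```python
# Hypothetical illustration of a record statement such as "text as favorite".
# None of these names come from Scrapepath itself.

def get_text(raw):
    """Stand-in for the built-in text function: return the stripped text."""
    return raw.strip()

# Assumed lookup table from statement name to Python function.
FUNCTIONS = {"text": get_text}

def parse_record(statement):
    """Split 'function as key' into (function_name, key)."""
    function_name, separator, key = statement.partition(" as ")
    if not separator or not function_name.strip() or not key.strip():
        raise ValueError(f"expected 'function as key', got {statement!r}")
    return function_name.strip(), key.strip()

function_name, key = parse_record("text as favorite")
result = {key: FUNCTIONS[function_name](" Tea ")}
print(result)

# A custom function could be plugged in the same way (again, an assumption).
FUNCTIONS["shout"] = lambda raw: raw.strip().upper()
shout_name, shout_key = parse_record("shout as loud_favorite")
custom = {shout_key: FUNCTIONS[shout_name](" Tea ")}
print(custom)
```

Keeping the function lookup separate from the key makes it clear which part of the statement selects behavior and which part only names the output field.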

Starting from the outer template node, <ul>, the parser looks for the first node in the scraped page that matches the template node in type and attributes: here, a ul with a ul, and class my_list with class my_list. The same search is then repeated for the template node's children, now confined to the contents of the matched scraped node, so nested template nodes represent paths. The <li> node is left out of the template because it would point the search to the first item of the list (Coffee), which does not contain the target span.

In this case, nesting the template nodes is needlessly specific. There are no other nodes of class "cuppa", so the enclosing <ul> node can be omitted as well, and the following template records the same data:

<span class = "cuppa" record = "text as favorite"></span>

So paths along many nested nodes in the scraped page can be summarized by only a few nodes that define a unique path to the scraped data.
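The matching described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not Scrapepath's implementation: it uses xml.etree.ElementTree instead of BeautifulSoup (so the snippets must be well-formed), the helper names matches, first_match, scrape, and run are hypothetical, and only the text function is supported.

```python
import xml.etree.ElementTree as ET

PAGE = """
<ul class="my_list">
  <li class="my_item">Coffee</li>
  <li class="my_item"><span class="cuppa">Tea</span></li>
  <li class="my_item">Milk</li>
</ul>
"""

NESTED_TEMPLATE = """
<ul class="my_list">
  <span class="cuppa" record="text as favorite"></span>
</ul>
"""

FLAT_TEMPLATE = '<span class="cuppa" record="text as favorite"></span>'

def matches(template_node, page_node):
    """Same tag, and every template attribute except record has the same value."""
    if template_node.tag != page_node.tag:
        return False
    return all(page_node.get(name) == value
               for name, value in template_node.attrib.items()
               if name != "record")

def first_match(template_node, scope):
    """First node inside scope, in document order, that matches the template node."""
    for candidate in scope.iter():
        if candidate is not scope and matches(template_node, candidate):
            return candidate
    return None

def scrape(template_node, scope, result):
    """Match one template node, record its data, then recurse into its children."""
    found = first_match(template_node, scope)
    if found is None:
        return
    record = template_node.get("record")
    if record:
        function_name, key = record.split(" as ")
        if function_name == "text":
            result[key] = "".join(found.itertext()).strip()
    for child in template_node:
        scrape(child, found, result)  # search is now confined to the match

def run(template_html, page_html):
    page_root = ET.fromstring(f"<root>{page_html}</root>")
    result = {}
    scrape(ET.fromstring(template_html.strip()), page_root, result)
    return result

nested = run(NESTED_TEMPLATE, PAGE)
flat = run(FLAT_TEMPLATE, PAGE)
print(nested, flat)
```

Both templates give the same dictionary here because the class "cuppa" is unique in the page, so the shorter path is already unambiguous.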

Loops:

A for loop scrapes all items in the list. In this simple example, we record only one variable (item_text) per item:

Template:

    <ul class = "my_list">
      <for items = "items" condition = "i < 5">
        <li class ="my_item" record = "text as item_text">
        </li>
      </for>
    </ul>

This results in the output:

{
  "items": [
    {
      "item_text": "Coffee"
    },
    {
      "item_text": "Tea"
    },
    {
      "item_text": "Milk"
    },
    {
      "item_text": "Biscuits"
    },
    {
      "item_text": "Chocolate"
    }
  ]
}

Here, the parser matches all the children of the <for> template node to the children of the <ul> node in the scraped page scraped1.html. Run the example using: ./parser.py examples/example1b.html examples/scraped1.html. The condition attribute indicates that only the first 5 items should be recorded, where i is the loop counter variable.
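The loop behavior can be approximated as follows. This is a sketch under assumptions, not the library's code: loop_items is a hypothetical helper, the counter i is assumed to start at 0, the condition is assumed to be a Python expression over i, the loop is assumed to stop at the first false condition, and the sixth list item ("Sugar") is invented here only to show the cut-off.

```python
import json
import xml.etree.ElementTree as ET

# Sample page; "Sugar" is a hypothetical sixth item added to show the cut-off.
PAGE = """
<ul class="my_list">
  <li class="my_item">Coffee</li>
  <li class="my_item"><span class="cuppa">Tea</span></li>
  <li class="my_item">Milk</li>
  <li class="my_item">Biscuits</li>
  <li class="my_item">Chocolate</li>
  <li class="my_item">Sugar</li>
</ul>
"""

def loop_items(parent, tag, css_class, key, condition):
    """Record one dictionary per matching child while the condition holds."""
    items = []
    i = 0  # loop counter made visible to the condition expression
    for child in parent:
        if child.tag != tag or child.get("class") != css_class:
            continue
        # Assumption: the condition is evaluated as a Python expression over i.
        if not eval(condition, {"__builtins__": {}}, {"i": i}):
            break
        items.append({key: "".join(child.itertext()).strip()})
        i += 1
    return items

ul = ET.fromstring(PAGE.strip())
result = {"items": loop_items(ul, "li", "my_item", "item_text", "i < 5")}
print(json.dumps(result, indent=2))
```

Evaluating template text with eval is only acceptable for templates you wrote yourself; a parser handling untrusted templates would need a safer expression evaluator.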

Example 2: for loops on mixed nodes

In the following HTML, the list items use two different tags (div and p) and classes (my_item and milk_class), so the <for> template loop node needs to enclose two template nodes, one for each combination.

To scrape from:

<div class = "my_list">
  <div class = "my_item">Coffee</div>
  <div class = "my_item"><span class = "cuppa">Tea</span></div>
  <p class = "milk_class">Milk</p>
  <div class = "my_item">Biscuits</div>
  Chocolate
</div>

Use template:

<div class = "my_list">
  <for items = "items" >
    <div class ="my_item" record = "text as item_text"></div>
    <p class ="milk_class" record = "text as item_text"></p>
  </for>
</div>

However, the <for> template loop node cannot record the text element "Chocolate", because <for> only looks at tag nodes among the children of the <div class = "my_list"> node. Recording it requires a <forchild> template loop node, along with a <str> template node that matches the NavigableString element "Chocolate":

Template:

<div class = "my_list">
  <forchild items = "items_with_string" >
    <div class ="my_item" record = "text as item_text"></div>
    <p class ="milk_class" record = "text as item_text"></p>
    <str record = "text as item_text"></str>
  </forchild>
</div>

In this case, the parser looks for the first match to the first template node (the first child of the <forchild> node), then loops over its siblings, probing each one with all the template nodes (the children of the <forchild> node). Run this example using examples/example1b.html and examples/scraped1.html.
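The difference between the two loop nodes comes down to whether bare text between tags is visited. The sketch below shows this with the standard library, where such text appears as the tail of the preceding element rather than as a NavigableString. It is an assumption-laden illustration: element_items, child_items, and TEMPLATES are hypothetical names and do not reflect how Scrapepath is written.

```python
import xml.etree.ElementTree as ET

PAGE = """
<div class="my_list">
  <div class="my_item">Coffee</div>
  <div class="my_item"><span class="cuppa">Tea</span></div>
  <p class="milk_class">Milk</p>
  <div class="my_item">Biscuits</div>
  Chocolate
</div>
"""

# The two template nodes enclosed by the loop, as (tag, class) pairs.
TEMPLATES = [("div", "my_item"), ("p", "milk_class")]

def matches_any(node):
    """True if the node matches one of the enclosed template nodes."""
    return (node.tag, node.get("class")) in TEMPLATES

def element_items(parent):
    """Like <for>: only tag nodes among the children are visited."""
    return ["".join(child.itertext()).strip()
            for child in parent if matches_any(child)]

def child_items(parent):
    """Like <forchild> with <str>: bare text between tags is visited too."""
    items = []
    if parent.text and parent.text.strip():
        items.append(parent.text.strip())
    for child in parent:
        if matches_any(child):
            items.append("".join(child.itertext()).strip())
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())  # text following this tag
    return items

root = ET.fromstring(PAGE.strip())
tags_only = element_items(root)
with_strings = child_items(root)
print(tags_only)
print(with_strings)
```

Only the second function picks up "Chocolate", which mirrors why the <str> template node is needed for text that is not wrapped in a tag.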

Example 3: Jumping to linked pages

Follow links on pages using the <jump> template node:

To scrape from:

<a href="example_linked.html"></a>

Use template:

    <a record = "href as my_link">
      <jump on = "my_link">
        <ibody>
          <div class = "message" record = "text as msg_from_link"></div>
        </ibody>
      </jump>
    </a>

Here, the nodes within the <jump> node act on the linked page.

This example is invoked with:

./parser.py examples/example3a.html examples/scraped3.html
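Before a linked page can be scraped, the recorded href has to be turned into a full address. One plausible way to do this, assuming live_url serves as the base for relative links, is the standard library's urljoin. The helper resolve_jump and the example.com addresses are hypothetical and are not taken from Scrapepath.

```python
from urllib.parse import urljoin

def resolve_jump(live_url, href):
    """Resolve a recorded href against the URL of the page it was found on."""
    return urljoin(live_url, href)

# Hypothetical address of the scraped page.
live_url = "https://example.com/pages/scraped3.html"

# Value recorded by record = "href as my_link" in the template.
recorded = {"my_link": "example_linked.html"}

target = resolve_jump(live_url, recorded["my_link"])
print(target)  # https://example.com/pages/example_linked.html

# An absolute href is left unchanged by urljoin.
absolute = resolve_jump(live_url, "https://other.example/page.html")
print(absolute)  # https://other.example/page.html
```

The nodes inside <jump> would then be applied to the page fetched from the resolved address.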

Project details


This version

0.1.1

Jun 13, 2019

Download files


Source Distribution

scrapepath-0.1.1.tar.gz (3.7 kB)

Uploaded Jun 13, 2019 Source

File details

Details for the file scrapepath-0.1.1.tar.gz.

File metadata

Download URL: scrapepath-0.1.1.tar.gz
Upload date: Jun 13, 2019
Size: 3.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.2

File hashes

Hashes for scrapepath-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`0aa647d004eedb7b8a5805ec184d1bb5f2dc4e4184b6d44196a9fdeb42ac6a6e`
MD5	`f5bde24507caa196a3a8d744c93653a8`
BLAKE2b-256	`441067c3eee880479002ce7c58b7b92664a71bd50aed17d44acc25ca922fd0bd`

