Scrape HTML to dictionaries

Usage overview

  1. Create a dictionary with rules to extract the right information
  2. Create the extraction function with the rules dictionary
  3. Use the extraction function on a BeautifulSoup object

Examples

  1. To start, let's cook the HTML to get a BeautifulSoup object. For this example, the string in sample is used as the target HTML; normally you would fetch the HTML with your favorite HTTP library or read it from disk. The soup variable holds the BeautifulSoup object.
from pprint import pprint
import scrapedict as sd


sample = """<html>
<head>
  <title>Page title</title>
</head>
<body>
  <header>
    <h1>Big Header</h1>
  </header>
  <article>
    <p class="abstract">Paragraph of the article.</p>
    <ol>
      <li>
        <span class='description'>first</span>
        <a href="http://first.example.com">link</a>
      </li>
      <li>
        <span class='description'>second</span>
        <a href="http://second.example.com">link</a>
      </li>
      <li>
        <span class='description'>third</span>
        <a href="http://third.example.com">link</a>
      </li>
    </ol>
  </article>
  <footer><i>Page footer - <a href="http://example.com/">link</a></i></footer>
</body>
</html>"""

soup = sd.cook(sample)

Extract a single item

  1. Define a rules dictionary
rules = {
    "header": sd.html("header"),
    "article": sd.text(".abstract"),
    "footer_link": sd.attr("footer a", "href"),
}
  2. Create an extractor function, feeding it the rules

item_extractor is a function that knows how to extract your item from any soup:

item_extractor = sd.extract(rules)
  3. Use the extractor function, feeding it the soup
item = item_extractor(soup)

pprint(item)

output:

{'article': '\nParagraph of the article.\n',
 'footer_link': 'http://example.com/',
 'header': <header>
<h1>Big Header</h1>
</header>}
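The steps above can be sketched in plain Python. This is not the library's actual implementation, just an illustration of the closure pattern the API suggests: each rule is a function from a soup to a value, and extract() builds a function that applies every rule. The fake_soup dict and the lambda rules are hypothetical stand-ins for a real BeautifulSoup object and scrapedict's rule helpers.

```python
# Sketch of the extract() pattern (an assumption, not scrapedict's source):
# rules maps field names to functions soup -> value.
def extract(rules):
    def extractor(soup):
        # Apply every rule to the same soup, collecting a dict.
        return {key: rule(soup) for key, rule in rules.items()}
    return extractor


# Toy "soup": a plain dict standing in for a parsed document.
fake_soup = {"title": "Page title", "link": "http://example.com/"}

rules = {
    "title": lambda s: s["title"],
    "url": lambda s: s["link"],
}

item = extract(rules)(fake_soup)
# item == {"title": "Page title", "url": "http://example.com/"}
```

The payoff of this design is that the rules dictionary is data: you can build, merge, or reuse it independently of any particular page, then apply the resulting extractor to as many soups as you like.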

Extract multiple items

  1. Define the rules to extract each item
rules = dict(description=sd.text(".description"), url=sd.attr("a", "href"))
  2. Create the extractor function, passing in a selector where the rules should be applied

This selector should emit multiple items:

list_item_extractor = sd.extract_all("article ol li", rules)
  3. Use the extractor function, feeding it the soup
items_list = list_item_extractor(soup)

pprint(items_list)

output:

[{'description': 'first', 'url': 'http://first.example.com'},
 {'description': 'second', 'url': 'http://second.example.com'},
 {'description': 'third', 'url': 'http://third.example.com'}]
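As with the single-item case, the multi-item flow can be sketched in plain Python. Again this is a hypothetical illustration of the pattern, not scrapedict's implementation: a selector yields a sequence of elements, and the per-item rules are applied to each one. The fake_soup list and the lambdas stand in for real BeautifulSoup elements and rule helpers.

```python
# Sketch of the extract_all() pattern (an assumption, not scrapedict's
# source): select yields the elements, rules are applied per element.
def extract_all(select, rules):
    def extractor(soup):
        return [
            {key: rule(element) for key, rule in rules.items()}
            for element in select(soup)
        ]
    return extractor


# Toy "soup": list items represented as plain dicts.
fake_soup = [
    {"description": "first", "href": "http://first.example.com"},
    {"description": "second", "href": "http://second.example.com"},
]

rules = {
    "description": lambda el: el["description"],
    "url": lambda el: el["href"],
}

# Identity "selector": the toy soup is already the element sequence.
items = extract_all(lambda s: s, rules)(fake_soup)
# items[0] == {"description": "first", "url": "http://first.example.com"}
```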

Testing

For testing, use tox. To run all environments in parallel:

$ tox -p
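A minimal tox configuration for this workflow might look like the following. The environment list and test command here are assumptions for illustration; check the project's own tox.ini for the real setup.

```ini
# Hypothetical minimal tox.ini; envlist and commands are assumptions.
[tox]
envlist = py38, py39, py310

[testenv]
deps = pytest
commands = pytest
```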
