Scrape HTML to dictionaries

Usage overview

  1. Create a dictionary with rules to extract the right information
  2. Create the extraction function with the rules dictionary
  3. Use the extraction function on a BeautifulSoup object

Examples

  1. To start, let's cook the HTML to get a BeautifulSoup object. For this example, the string in sample is used as the target HTML; normally you would fetch the HTML with your favorite HTTP library or read it from disk. The soup variable holds the BeautifulSoup object.
from pprint import pprint
import scrapedict as sd


sample = """<html>
<head>
  <title>Page title</title>
</head>
<body>
  <header>
    <h1>Big Header</h1>
  </header>
  <article>
    <p class="abstract">Paragraph of the article.</p>
    <ol>
      <li>
        <span class='description'>first</span>
        <a href="http://first.example.com">link</a>
      </li>
      <li>
        <span class='description'>second</span>
        <a href="http://second.example.com">link</a>
      </li>
      <li>
        <span class='description'>third</span>
        <a href="http://third.example.com">link</a>
      </li>
    </ol>
  </article>
  <footer><i>Page footer - <a href="http://example.com/">link</a></i></footer>
</body>
</html>"""

soup = sd.cook(sample)

Extract a single item

  1. Define a rules dictionary
rules = {
    "header": sd.html("header"),
    "article": sd.text(".abstract"),
    "footer_link": sd.attr("footer a", "href"),
}
  2. Create an extractor function, feeding it the rules

item_extractor is a function that knows how to extract your item from any soup:

item_extractor = sd.extract(rules)
  3. Use the extractor function, feeding it the soup
item = item_extractor(soup)

pprint(item)

output:

{'article': '\nParagraph of the article.\n',
 'footer_link': 'http://example.com/',
 'header': <header>
<h1>Big Header</h1>
</header>}
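The steps above can be sketched in plain Python. This is not the library's actual implementation, just an illustration of the closure pattern the API suggests: each rule is a function from a soup to a value, and extract() builds a function that applies every rule. The fake_soup dict and the lambda rules are hypothetical stand-ins for a real BeautifulSoup object and scrapedict's rule helpers.

```python
# Sketch of the extract() pattern (an assumption, not scrapedict's source):
# rules maps field names to functions soup -> value.
def extract(rules):
    def extractor(soup):
        # Apply every rule to the same soup, collecting a dict.
        return {key: rule(soup) for key, rule in rules.items()}
    return extractor


# Toy "soup": a plain dict standing in for a parsed document.
fake_soup = {"title": "Page title", "link": "http://example.com/"}

rules = {
    "title": lambda s: s["title"],
    "url": lambda s: s["link"],
}

item = extract(rules)(fake_soup)
# item == {"title": "Page title", "url": "http://example.com/"}
```

The payoff of this design is that the rules dictionary is data: you can build, merge, or reuse it independently of any particular page, then apply the resulting extractor to as many soups as you like.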

Extract multiple items

  1. Define the rules to extract each item
rules = dict(description=sd.text(".description"), url=sd.attr("a", "href"))
  2. Create the extractor function, passing in a selector where the rules should be applied

This selector should emit multiple items:

list_item_extractor = sd.extract_all("article ol li", rules)
  3. Use the extractor function, feeding it the soup
items_list = list_item_extractor(soup)

pprint(items_list)

output:

[{'description': 'first', 'url': 'http://first.example.com'},
 {'description': 'second', 'url': 'http://second.example.com'},
 {'description': 'third', 'url': 'http://third.example.com'}]
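As with the single-item case, the multi-item flow can be sketched in plain Python. Again this is a hypothetical illustration of the pattern, not scrapedict's implementation: a selector yields a sequence of elements, and the per-item rules are applied to each one. The fake_soup list and the lambdas stand in for real BeautifulSoup elements and rule helpers.

```python
# Sketch of the extract_all() pattern (an assumption, not scrapedict's
# source): select yields the elements, rules are applied per element.
def extract_all(select, rules):
    def extractor(soup):
        return [
            {key: rule(element) for key, rule in rules.items()}
            for element in select(soup)
        ]
    return extractor


# Toy "soup": list items represented as plain dicts.
fake_soup = [
    {"description": "first", "href": "http://first.example.com"},
    {"description": "second", "href": "http://second.example.com"},
]

rules = {
    "description": lambda el: el["description"],
    "url": lambda el: el["href"],
}

# Identity "selector": the toy soup is already the element sequence.
items = extract_all(lambda s: s, rules)(fake_soup)
# items[0] == {"description": "first", "url": "http://first.example.com"}
```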

Testing

For testing, use tox. To run all environments in parallel:

$ tox -p
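A minimal tox configuration for this workflow might look like the following. The environment list and test command here are assumptions for illustration; check the project's own tox.ini for the real setup.

```ini
# Hypothetical minimal tox.ini; envlist and commands are assumptions.
[tox]
envlist = py38, py39, py310

[testenv]
deps = pytest
commands = pytest
```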
