Scrape HTML to dictionaries
Usage overview
- Create a dictionary with rules to extract the right information
- Create the extraction function with the rules dictionary
- Use the extraction function on a BeautifulSoup object
Examples
- To start, let's `cook` the HTML to get a BeautifulSoup object. For this example the string in `sample` will be used as the target HTML; normally you would fetch the HTML with your favorite HTTP library or read it from disk. The `soup` variable holds the BeautifulSoup object.
from pprint import pprint
import scrapedict as sd
sample = """<html>
<head>
<title>Page title</title>
</head>
<body>
<header>
<h1>Big Header</h1>
</header>
<article>
<p class="abstract">Paragraph of the article.</p>
<ol>
<li>
<span class='description'>first</span>
<a href="http://first.example.com">link</a>
</li>
<li>
<span class='description'>second</span>
<a href="http://second.example.com">link</a>
</li>
<li>
<span class='description'>third</span>
<a href="http://third.example.com">link</a>
</li>
</ol>
</article>
<footer><i>Page footer - <a href="http://example.com/">link</a></i></footer>
</body>
</html>"""
soup = sd.cook(sample)
Extract a single item
- Define a `rules` dictionary
rules = {
"header": sd.html("header"),
"article": sd.text(".abstract"),
"footer_link": sd.attr("footer a", "href"),
}
- Create an extractor function, feeding it the rules. `item_extractor` is a function that knows how to extract your item from any soup
item_extractor = sd.extract(rules)
- Use the extractor function feeding it the soup
item = item_extractor(soup)
pprint(item)
output:
{'article': '\nParagraph of the article.\n',
'footer_link': 'http://example.com/',
'header': <header>
<h1>Big Header</h1>
</header>}
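The rules-dictionary pattern above can be imitated with the standard library alone. The sketch below is not scrapedict's actual implementation: it uses `xml.etree.ElementTree` in place of BeautifulSoup and XPath-style paths in place of CSS selectors, and the helper names `text`, `attr`, and `extract` only mirror the API shown above.

```python
import xml.etree.ElementTree as ET


def text(path):
    # Rule: return the text content of the first element matching path.
    return lambda root: root.find(path).text


def attr(path, name):
    # Rule: return an attribute value of the first element matching path.
    return lambda root: root.find(path).get(name)


def extract(rules):
    # Build an extractor that applies every rule to a parsed tree.
    return lambda root: {key: rule(root) for key, rule in rules.items()}


sample = (
    "<html><body>"
    "<p class='abstract'>Hi</p>"
    "<footer><a href='http://example.com/'>link</a></footer>"
    "</body></html>"
)
root = ET.fromstring(sample)

rules = {
    "abstract": text(".//p[@class='abstract']"),
    "footer_link": attr(".//footer/a", "href"),
}
item = extract(rules)(root)
# item == {"abstract": "Hi", "footer_link": "http://example.com/"}
```

The design point is the same as scrapedict's: rules are just callables stored in a dictionary, so an extractor is nothing more than a dictionary comprehension over them.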
Extract multiple items
- Define the `rules` to extract each item
rules = dict(description=sd.text(".description"), url=sd.attr("a", "href"))
- Create the extractor function, passing in a selector where the rules should be applied. This selector should emit multiple items
list_item_extractor = sd.extract_all("article ol li", rules)
- Use the extractor function feeding it the soup
items_list = list_item_extractor(soup)
pprint(items_list)
output:
[{'description': 'first', 'url': 'http://first.example.com'},
{'description': 'second', 'url': 'http://second.example.com'},
{'description': 'third', 'url': 'http://third.example.com'}]
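The multi-item variant follows the same idea: apply one rules dictionary to every node a selector matches. A stdlib sketch (again using `xml.etree.ElementTree` rather than scrapedict's internals, with an illustrative `extract_all` written from scratch):

```python
import xml.etree.ElementTree as ET


def extract_all(path, rules):
    # Apply the rules dictionary to every element matching path,
    # producing one dictionary per matched element.
    def extractor(root):
        return [
            {key: rule(node) for key, rule in rules.items()}
            for node in root.findall(path)
        ]

    return extractor


sample = """<ol>
  <li><span class='description'>first</span><a href='http://first.example.com'>link</a></li>
  <li><span class='description'>second</span><a href='http://second.example.com'>link</a></li>
</ol>"""
root = ET.fromstring(sample)

rules = {
    "description": lambda li: li.find("span").text,
    "url": lambda li: li.find("a").get("href"),
}
items = extract_all("./li", rules)(root)
# items == [{"description": "first", "url": "http://first.example.com"},
#           {"description": "second", "url": "http://second.example.com"}]
```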
Testing
For testing, use tox. To run all environments in parallel:
$ tox -p
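tox reads its environments from a `tox.ini` at the project root. A minimal sketch of one; the Python versions and the pytest dependency here are assumptions, not taken from this project:

```ini
[tox]
envlist = py310, py311

[testenv]
deps = pytest
commands = pytest
```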