SoupCan : Your BeautifulSoup in a Can
SoupCan aims to simplify the process of designing a Python tool for extracting and displaying webpage content.
It builds on the wonderful library, Beautiful Soup, allowing you to leverage this library's powerful features in your tool.
To design your tool with SoupCan, you define the kinds of content that you wish to extract and select which parts of that content you wish to display; SoupCan does the rest.
SoupCan is ideal for designing a tool that works in a Jupyter notebook, as SoupCan, out of the box, supports HTML rendering of content in notebook cells.
Prerequisites
To get started with SoupCan, you'll need to have:
- some familiarity with HTML generally, and particularly with the HTML of the webpage that you wish to extract and display content from.
- knowledge of the Beautiful Soup library, and especially its search method, find().
- an understanding of object-oriented programming concepts, and how you apply them in Python.
Software Requirements
To use SoupCan in your tool, you'll need to have:
- Python 3.6+; and
- the Beautiful Soup library
For information on the various ways you can install the Beautiful Soup library, see this library's own documentation.
Installation
To install SoupCan, execute the following in your local or virtual Python environment:
pip install soupcan
If you don't already have the Beautiful Soup library installed in your environment, this command will install this library (from the PyPI repository) too.
Under the hood, the Beautiful Soup library, by default, uses the HTML parser that comes with the standard library, html.parser. If you wish to use a third-party parser, like lxml or html5lib, instead, you'll have to install it yourself, which you can do by adding lxml html5lib to the above command.
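Before choosing a parser, it can help to check which ones are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name available_parsers and the PARSER_MODULES mapping are illustrative and not part of SoupCan or Beautiful Soup.

```python
from importlib.util import find_spec

# Parser names as Beautiful Soup expects them, mapped to the module each one needs.
PARSER_MODULES = {
    "html.parser": "html.parser",  # standard library, always present
    "lxml": "lxml",                # third-party, optional
    "html5lib": "html5lib",        # third-party, optional
}


def available_parsers():
    """Return the parser names whose backing module can be imported."""
    found = []
    for parser, module in PARSER_MODULES.items():
        try:
            if find_spec(module) is not None:
                found.append(parser)
        except ModuleNotFoundError:
            # Raised if the parent package of a dotted module name is missing.
            pass
    return found


print(available_parsers())
```

Any name in the returned list should be usable as the parser argument when you build your Content objects.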
It is also up to you how you get the HTML content from a webpage. SoupCan is not a webscraper package, and so you'll have to implement those procedures yourself when designing your tool.
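As one possible way to obtain the HTML, here is a minimal sketch that uses only the standard library's urllib.request. The helper name fetch_html is hypothetical, and the sketch leaves out concerns a real tool would likely need to handle, such as request headers, retries, and error handling.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the document at `url` as text (hypothetical helper, not part of SoupCan)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response; fall back to UTF-8 if none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be handed to whichever constructor your tool exposes for building its Content objects from markup.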
Finally, SoupCan does not include any Jupyter software as a dependency. You or your tool users will need to have Jupyter, if you wish to make the most of SoupCan's display features.
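For background, Jupyter renders an object as HTML when the object defines a _repr_html_() method. The sketch below illustrates that general hook with a hypothetical Snippet class; whether SoupCan itself relies on this hook is an assumption, not something stated in this document.

```python
from html import escape


class Snippet:
    """Hypothetical object that renders as HTML in a notebook cell."""

    def __init__(self, title, text):
        self.title = title
        self.text = text

    def _repr_html_(self):
        # Jupyter calls this method, if present, to display the object as HTML.
        return f"<h4>{escape(self.title)}</h4><p>{escape(self.text)}</p>"

    def __repr__(self):
        # Plain-text fallback for consoles without rich display.
        return f"Snippet(title={self.title!r})"


snippet = Snippet("The Dormouse's story", "Once upon a time...")
print(snippet._repr_html_())
```

Outside a notebook the object falls back to its plain __repr__, which is why Jupyter is only needed for the richer display.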
Basic example
Let's design a simple package with SoupCan, and apply it to a basic HTML document:
import soupcan as sc

class Paragraph(sc.Content):
    """Return a Content-typed object for a <p></p> element"""
    _KIND = {"name": "p"}
    text = sc.Property(lambda self: self.text, doc="Return text of paragraph")

class Paragraphs(sc.Contents):
    """Return a Contents-typed object with Paragraph-typed objects"""
    _CONTENT = Paragraph

class Body(sc.Content):
    """Return a Content-typed object for a <body></body> element"""
    _KIND = {"name": "body"}
    from_string = sc.AltConstructor()
    paragraphs = sc.Property(Paragraphs, doc="Return paragraphs")

## example HTML document (originally used in the BeautifulSoup documentation)
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

body = Body.from_string(html, 'html.parser')
print(body.paragraphs[0].text)  # print "The Dormouse's story" as text
You could instead create a Body-typed object by initialising it with a BeautifulSoup object, as follows:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, features='html.parser')
body2 = Body(soup)
However, the from_string() alternative-constructor method, implemented using SoupCan's AltConstructor class, does all of this for you, under the hood.
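To make the idea of an alternative constructor concrete, here is a rough sketch of the same pattern written as a plain classmethod. It requires the Beautiful Soup library. The Wrapper class is a hypothetical stand-in and is not how SoupCan implements Content or AltConstructor; in particular, passing the _KIND entries to find() as keyword arguments is an assumption made for illustration.

```python
from bs4 import BeautifulSoup


class Wrapper:
    """Hypothetical stand-in for a Content type; not SoupCan's implementation."""

    _KIND = {"name": "body"}

    def __init__(self, soup):
        # Assumption: the _KIND entries are passed to find() as keyword arguments.
        self._element = soup.find(**self._KIND)

    @classmethod
    def from_string(cls, markup, parser="html.parser"):
        # Build the BeautifulSoup object, then defer to the normal constructor.
        return cls(BeautifulSoup(markup, features=parser))


doc = "<html><body><p class='title'>Hello</p></body></html>"
wrapped = Wrapper.from_string(doc)
print(wrapped._element.p.text)
```

The classmethod saves the caller from importing Beautiful Soup and constructing the soup object by hand, which is the convenience described above.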
You can extend this example package:
- by adding other Content types, say a Hyperlink class:
class Hyperlink(sc.Content):
    """Return a Content-typed object for an <a></a> element"""
    _KIND = {"name": "a"}
- by creating a separate Content type for a more specific piece of content, say for a class="title" paragraph:
class TitleParagraph(sc.Content):
    """Return a Content-typed object for a <p class='title'></p> element"""
    _KIND = {"name": "p", "class_": "title"}
- by extending an existing Content type, by subclassing it and then adding properties (say, using the Property class):
class Link(Hyperlink):
    """Return an extended Content-typed object for an <a></a> element."""
    href = sc.Property(lambda self: self.href, doc="Return hyperlink reference")
    text = sc.Property(lambda self: self.text, doc="Return hyperlink text")
The Property class is a (non-data) descriptor class. It works much like a property method:
class ExtendedTitleParagraph(TitleParagraph):
    @property
    def text(self):
        "Return text string"
        return self._element.text
In the above, self._element is the underlying BeautifulSoup object at the <p> element (with the class="title" attribute).
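For readers new to descriptors, the following sketch shows what a non-data descriptor looks like in plain Python. The Property class here is a simplified, hypothetical stand-in written for illustration; it is not SoupCan's actual implementation, and the Example class is likewise invented.

```python
class Property:
    """Hypothetical non-data descriptor: defines __get__ but not __set__."""

    def __init__(self, func, doc=None):
        self.func = func
        self.__doc__ = doc

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself: return the descriptor.
            return self
        return self.func(instance)


class Example:
    shout = Property(lambda self: self.word.upper(), doc="Return word in capitals")

    def __init__(self, word):
        self.word = word


e = Example("soup")
print(e.shout)             # computed through Property.__get__
e.__dict__["shout"] = "x"  # an instance attribute shadows a non-data descriptor
print(e.shout)
```

This is the main practical difference from the built-in property, which is a data descriptor: a property always takes precedence over an instance attribute of the same name, whereas a non-data descriptor can be shadowed by one.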
License