
SoupCan: Your BeautifulSoup in a Can

SoupCan aims to simplify the process of designing a Python tool for extracting and displaying webpage content.

It builds on the wonderful library, Beautiful Soup, allowing you to leverage this library's powerful features in your tool.

To design your tool with SoupCan, you define the kinds of content that you wish to extract and select which parts of that content you wish to display; SoupCan does the rest.

SoupCan is ideal for designing a tool that works in a Jupyter notebook, as SoupCan, out of the box, supports HTML rendering of content in notebook cells.

Prerequisites

To get started with SoupCan, you'll need to have:

  • some familiarity with HTML generally, and particularly with the HTML of the webpage that you wish to extract and display content from.

  • knowledge of the Beautiful Soup library, and especially its search method, find().

  • an understanding of object-oriented programming concepts, and how you apply them in Python.
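As a quick refresher on the find() method mentioned above, here is a minimal sketch that uses only the Beautiful Soup library (not SoupCan). The HTML snippet and the variable names are illustrative assumptions, not part of SoupCan.

```python
from bs4 import BeautifulSoup

# A small illustrative document (an assumption, not taken from a real page)
html = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None if nothing matches
first_paragraph = soup.find("p")

# Search criteria can also be given as keyword arguments;
# class_ is used because class is a reserved word in Python
title_paragraph = soup.find(name="p", class_="title")
story_paragraph = soup.find("p", class_="story")
missing_link = soup.find("a")

print(title_paragraph.get_text())
print(story_paragraph.get_text())
print(missing_link)
```

The keyword-argument form is worth noting because the search criteria in the examples further down are written as dictionaries with similar keys.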

Software Requirements

To use SoupCan in your tool, you'll need to have:

  • Python 3.6+; and
  • the Beautiful Soup library

For information on the various ways you can install the Beautiful Soup library, see this library's own documentation.

Installation

To install SoupCan, execute the following in your local or virtual Python environment:

pip install soupcan

If you don't already have the Beautiful Soup library installed in your environment, this command will install this library (from the PyPI repository) too.

Under the hood, the Beautiful Soup library, by default, uses the HTML parser that comes with the standard library, html.parser. If you wish to use a third-party parser instead, like lxml or html5lib, you'll have to install it yourself, which you can do by adding lxml html5lib to the above command.

It is also up to you how you get the HTML content from a webpage. SoupCan is not a webscraper package, and so you'll have to implement those procedures yourself when designing your tool.
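One possible way to get the HTML content, sketched here with only the standard library, is shown below. The function name fetch_html and its timeout default are hypothetical and not part of SoupCan; a real tool might prefer a third-party HTTP client and would need its own error handling.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10.0):
    """Return the content at `url` decoded to a string (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the response headers, if any;
        # falling back to UTF-8 is an assumption of this sketch
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL keeps this example runnable without network access
sample = fetch_html("data:text/html;charset=utf-8,%3Cp%3EHello%3C%2Fp%3E")
print(sample)  # <p>Hello</p>
```

The returned string can then be handed to whatever constructor your tool defines for turning HTML text into content objects.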

Finally, SoupCan does not include any Jupyter software as a dependency. You or your tool users will need to have Jupyter, if you wish to make the most of SoupCan's display features.

Basic example

Let's design a simple package with SoupCan, and apply it to a basic HTML document:

import soupcan as sc

class Paragraph(sc.Content):
    """Return a Content-typed object for <p></p> element"""
    _KIND = {"name": "p"}
    
    text = sc.Property(lambda self: self.text, doc="Return text of paragraph")
    
class Paragraphs(sc.Contents):
    """Return a Contents-typed object with Paragraph-typed object"""
    _CONTENT = Paragraph
    
class Body(sc.Content):
    """Return a Content-typed object for a <body></body> element"""
    _KIND = {"name": "body"}  

    from_string = sc.AltConstructor() 
    
    paragraphs = sc.Property(Paragraphs, doc="Return paragraphs")    

## example html document (originally used in the BeautifulSoup documentation)
    
html= """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

body = Body.from_string(html,'html.parser')

print(body.paragraphs[0].text)  # print The Dormouse's story as text

You could instead create a Body-typed object by initialising it with a BeautifulSoup object, as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, features='html.parser')
body2  = Body(soup)

However, the from_string() alternative-constructor method, implemented using SoupCan's AltConstructor class, does all of this for you, under the hood.

You can extend this example package:

  • by adding other Content types, say a Hyperlink class:
class Hyperlink(sc.Content):
    """Return a Content-type object for an <a></a> element"""
    _KIND = {"name": "a"}
  • by creating a separate Content type for a more specific piece of content, say a class="title" paragraph:
class TitleParagraph(sc.Content):
    """Return a Content-type object for a <p class='title'></p> element"""
    _KIND = {"name": "p", "class_": "title"}

You can also extend an existing Content type by subclassing it and then adding properties (say, using the Property class):

class Link(Hyperlink):
    """Return an extended Content-type object for an <a></a> element."""

    href = sc.Property(lambda self: self.href, doc="Return hyperlink reference")
    text = sc.Property(lambda self: self.text, doc="Return hyperlink text")

The Property class is a (non-data) descriptor class. It works much like Python's built-in property decorator:

class ExtendedTitleParagraph(TitleParagraph):

    @property
    def text(self):
        """Return text string"""
        return self._element.text

In the above, self._element is the underlying BeautifulSoup object for the <p> element (with the class="title" attribute).
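To illustrate what "non-data descriptor" means in general, here is a minimal plain-Python sketch. The SketchProperty and Note classes are hypothetical and are not SoupCan's actual implementation; they only show the mechanism of a class that defines __get__ but not __set__.

```python
class SketchProperty:
    """Hypothetical non-data descriptor: it defines __get__ only."""

    def __init__(self, func, doc=None):
        self.func = func
        self.__doc__ = doc

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself: return the descriptor object
            return self
        # Accessed on an instance: compute the value on demand
        return self.func(instance)


class Note:
    def __init__(self, body):
        self._body = body

    upper = SketchProperty(lambda self: self._body.upper(), doc="Return upper-cased body")


note = Note("hello")
print(note.upper)  # HELLO

# With no __set__ defined, assignment stores the value in the instance
# dictionary, which then takes precedence over the descriptor
note.upper = "shadowed"
print(note.upper)  # shadowed
```

This shadowing behaviour is the main practical difference from the built-in property, which is a data descriptor and raises AttributeError on assignment when no setter is defined.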

License

BSD 3
