A wrapper for BeautifulSoup4 that restores the ability to work with HTML fragments
Project description
This is a thin wrapper for BeautifulSoup4 that restores the ability to work with HTML fragments. For example:
from bs4 import BeautifulSoup
from fragmentsoup import FragmentSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', features='html5lib')
soup
# <html><head></head><body><b class="boldest">Extremely bold</b></body></html>
# Note that the fragment is wrapped to make it a valid html document
soup = FragmentSoup('<b class="boldest">Extremely bold</b>', features='html5lib')
soup
# <b class="boldest">Extremely bold</b>
# FragmentSoup keeps it as a fragment
In almost all cases, a FragmentSoup instance should work exactly like a BeautifulSoup instance. The one notable exception is that calling ‘wrap’ on the FragmentSoup itself wraps the entire fragment and returns that same FragmentSoup:
from fragmentsoup import FragmentSoup
soup = FragmentSoup('<b class="boldest">Extremely bold</b>', features='html5lib')
soup
# <b class="boldest">Extremely bold</b>
soup.wrap(soup.new_tag('div'))
# <div><b class="boldest">Extremely bold</b></div>
If you wrap a subelement, ‘wrap’ returns a BeautifulSoup “Tag” instance. To use the returned element as a fragment, pass that Tag to FragmentSoup:
from fragmentsoup import FragmentSoup
soup = FragmentSoup('<div><b class="boldest">Extremely bold</b></div>', features='html5lib')
subdocument = soup.b.wrap(soup.new_tag('h1'))
subdocument
# <h1><b class="boldest">Extremely bold</b></h1>
type(subdocument)
# <class 'bs4.element.Tag'>
subdocument = FragmentSoup(subdocument)
type(subdocument)
# <class 'fragmentsoup.FragmentSoup'>
This also applies to Tags returned as a result of unwrapping a part of the document.
What if I pass in a well-formed document?
If you pass in a full document (defined as markup starting with a <!DOCTYPE> or <html> tag), FragmentSoup assumes that the resulting tree is well-formed and acts exactly like a regular BeautifulSoup instance. It will not allow you to wrap the well-formed document with a tag; doing so raises a ValueError (just as regular BeautifulSoup does).
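The "starts with a <!DOCTYPE> or <html> tag" rule can be illustrated with a small check. This is only a sketch: the helper name `looks_like_full_document` is hypothetical, and ignoring leading whitespace and letter case are assumptions made here, not confirmed details of how FragmentSoup decides.

```python
import re

# Hypothetical helper, not part of fragmentsoup's API.
# Assumption: leading whitespace and letter case are ignored.
_FULL_DOCUMENT = re.compile(r"\s*<(?:!doctype|html)\b", re.IGNORECASE)


def looks_like_full_document(markup: str) -> bool:
    """Return True if the markup starts with a <!DOCTYPE> or <html> tag."""
    return _FULL_DOCUMENT.match(markup) is not None


print(looks_like_full_document("<!DOCTYPE html><html><body></body></html>"))  # True
print(looks_like_full_document('<b class="boldest">Extremely bold</b>'))      # False
```

Under this rule, markup that merely contains an <html> tag somewhere after other content would still be treated as a fragment.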
How does it work?
FragmentSoup wraps the incoming snippet in a dummy <fragmentsoup> tag, which it removes (along with all context outside the <fragmentsoup> tag) before rendering. Otherwise, it defers any attribute accesses to an internal BeautifulSoup instance.
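The approach described above can be sketched roughly as follows. This is not the fragmentsoup source: the class name `FragmentSketch` and its internals are hypothetical, and the sketch uses the standard-library `html.parser` builder rather than `html5lib` so that it has no dependency beyond bs4.

```python
from bs4 import BeautifulSoup


class FragmentSketch:
    """Hypothetical illustration of the dummy-tag technique."""

    _MARKER = "fragmentsoup"  # dummy tag name taken from the description above

    def __init__(self, markup, features="html.parser"):
        # Wrap the snippet so the fragment can be located after parsing.
        wrapped = f"<{self._MARKER}>{markup}</{self._MARKER}>"
        self._soup = BeautifulSoup(wrapped, features)

    def __getattr__(self, name):
        # Called only when normal lookup fails; defer to the internal soup.
        if name == "_soup":  # avoid infinite recursion before __init__ has run
            raise AttributeError(name)
        return getattr(self._soup, name)

    def __str__(self):
        # Render only what is inside the dummy tag, discarding anything
        # the parser placed around it.
        return self._soup.find(self._MARKER).decode_contents()


frag = FragmentSketch('<b class="boldest">Extremely bold</b>')
print(str(frag))        # <b class="boldest">Extremely bold</b>
print(frag.b["class"])  # ['boldest']
```

With a parser such as html5lib, the parsed tree would also contain the <html>, <head>, and <body> elements it adds; looking up the dummy tag and rendering only its contents drops that outer context in the same way.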
Bugs
Aside from the differences noted above, any difference in behavior from regular BeautifulSoup4 is a bug. Reports and patches welcome.
Change Log
Version History
- 0.6.0
Initial release to GitHub and PyPI