A type-safe wrapper around BeautifulSoup and related HTML parsing utilities
typed-soup
A type-safe wrapper around BeautifulSoup and utilities for parsing HTML/XML with robust return types and error handling. Extracted from Open-Gov Crawlers.
Motivation
Before
Here are the first five errors. There are 16 in total.
error: Type of "rows" is partially unknown
Type of "rows" is "list[PageElement | Tag | NavigableString] | Unknown" (reportUnknownVariableType)
error: Type of "find_all" is partially unknown
Type of "find_all" is "Unknown | ((name: str | bytes | Pattern[str] | bool | ((Tag) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((Tag) -> bool)] | ElementFilter | None = None, attrs: Dict[str, str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)]] = {}, recursive: bool = True, string: str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)] | None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)]) -> ResultSet[PageElement | Tag | NavigableString])" (reportUnknownMemberType)
error: Cannot access attribute "find_all" for class "PageElement"
Attribute "find_all" is unknown (reportAttributeAccessIssue)
error: Cannot access attribute "find_all" for class "NavigableString"
Attribute "find_all" is unknown (reportAttributeAccessIssue)
error: Type of "row" is partially unknown
Type of "row" is "PageElement | Tag | NavigableString | Unknown" (reportUnknownVariableType)
After
Changing one line of code to use TypedSoup instead of BeautifulSoup resolves the errors.
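To illustrate why a wrapper helps, here is a minimal sketch of the pattern that uses only the standard library, with xml.etree.ElementTree standing in for BeautifulSoup. The names `TypedNode` and `cell_texts` are hypothetical and this is not typed-soup's actual implementation. It only shows how methods that return a precise type (`Optional[TypedNode]`, `list[TypedNode]`, `str`) let chained calls such as `row.find_all(...)` type-check without union or unknown-type errors.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Optional


class TypedNode:
    """Hypothetical wrapper whose methods all return precise types."""

    def __init__(self, element: ET.Element) -> None:
        self._element = element

    def find(self, name: str) -> Optional[TypedNode]:
        # Search descendants; wrap the hit so callers get TypedNode or None.
        found = self._element.find(f".//{name}")
        return TypedNode(found) if found is not None else None

    def find_all(self, name: str) -> list[TypedNode]:
        # Always a list of TypedNode (descendants only), never a union type.
        return [
            TypedNode(e)
            for e in self._element.iter(name)
            if e is not self._element
        ]

    def get_text(self) -> str:
        # Concatenate all text contained in this element.
        return "".join(self._element.itertext())


def cell_texts(markup: str) -> list[list[str]]:
    """Return the text of every cell, grouped by table row."""
    root = TypedNode(ET.fromstring(markup))
    rows = root.find_all("tr")  # list[TypedNode]
    # 'row' is a TypedNode, so row.find_all is known to the type checker.
    return [[cell.get_text() for cell in row.find_all("td")] for row in rows]


table = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>"
print(cell_texts(table))  # [['a', 'b'], ['c']]
```

Because `find` returns `Optional[TypedNode]`, the type checker still requires a `None` check before the result is used, which matches the `if element:` guard in the usage examples.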
Installation
pip install typed-soup
Usage
If you're using Scrapy, you can use the from_response function to create a TypedSoup object from a Scrapy response:
from typed_soup import from_response
from scrapy.http.response.html import HtmlResponse

# Assume 'response' is an HtmlResponse object
soup = from_response(response)

# Find an element
element = soup.find("div", class_="example")
if element:
    print(element.get_text())

# Find all elements
elements = soup.find_all("p")
for elem in elements:
    print(elem.get_text())
Or, without Scrapy, you can explicitly wrap a BeautifulSoup object in TypedSoup:
from typed_soup import TypedSoup
from bs4 import BeautifulSoup
soup = TypedSoup(BeautifulSoup(html_content, "html.parser"))
Supported Functions
I'm adding functions as I need them. If you have a request, please open an issue. These are the ones that I needed for a dozen spiders:
- find
- find_all
- get_text
- children
- tag_name
- parent
- next_sibling
- get_content_after_element
- string
And then these help create a TypedSoup object:
- from_response
- TypedSoup
License
This project is licensed under the MIT License.
File details
Details for the file typed_soup-0.1.4.tar.gz.
File metadata
- Download URL: typed_soup-0.1.4.tar.gz
- Upload date:
- Size: 3.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.13.3 Darwin/24.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 34dda006d6ed45fe3aca6ba4ed9fa2fbdb549b7bf4f86d2fa7cd8cf3639c09ad |
| MD5 | b78365afcaba12d8eb96040e3ceefa22 |
| BLAKE2b-256 | cd13477576e8f78bfc80abb3c680c42b1fe20933cdd921deb8d101b1370656ba |
File details
Details for the file typed_soup-0.1.4-py3-none-any.whl.
File metadata
- Download URL: typed_soup-0.1.4-py3-none-any.whl
- Upload date:
- Size: 3.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.13.3 Darwin/24.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 11c26ef036cc2acf2550b3166845cbc8384e9ac6422431bb50f1b1637d705b84 |
| MD5 | 18518aa2d8c3c363fcd11bd2a5bf1b69 |
| BLAKE2b-256 | ef61a921b22dfa85e2364e53ad461c570e61b75a6b2e695d1028e37dc9bf528e |