A type-safe wrapper around BeautifulSoup and related HTML parsing utilities

Project description

typed-soup

A type-safe wrapper around BeautifulSoup and utilities for parsing HTML/XML with robust return types and error handling. Extracted from Open-Gov Crawlers.

Motivation

Before

Type-checking code that uses BeautifulSoup directly with Pyright reports 16 errors in total. Here are the first five:

  error: Type of "rows" is partially unknown
    Type of "rows" is "list[PageElement | Tag | NavigableString] | Unknown" (reportUnknownVariableType)
  error: Type of "find_all" is partially unknown
    Type of "find_all" is "Unknown | ((name: str | bytes | Pattern[str] | bool | ((Tag) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((Tag) -> bool)] | ElementFilter | None = None, attrs: Dict[str, str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)]] = {}, recursive: bool = True, string: str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)] | None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: str | bytes | Pattern[str] | bool | ((str) -> bool) | Iterable[str | bytes | Pattern[str] | bool | ((str) -> bool)]) -> ResultSet[PageElement | Tag | NavigableString])" (reportUnknownMemberType)
  error: Cannot access attribute "find_all" for class "PageElement"
    Attribute "find_all" is unknown (reportAttributeAccessIssue)
  error: Cannot access attribute "find_all" for class "NavigableString"
    Attribute "find_all" is unknown (reportAttributeAccessIssue)
  error: Type of "row" is partially unknown
    Type of "row" is "PageElement | Tag | NavigableString | Unknown" (reportUnknownVariableType)

After

Changing one line of code to use TypedSoup instead of BeautifulSoup resolves all of the errors.
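The mechanism behind that one-line fix can be sketched with a toy, standard-library-only model (hypothetical names; not typed-soup's actual code). A BeautifulSoup-style find() returns a broad union, so every call site must narrow the type itself; the wrapper performs the narrowing once, internally, and exposes a precise return type:

```python
from __future__ import annotations

# Toy model of the problem and the fix -- NOT typed-soup's real implementation.

class Tag:
    def __init__(self, name: str, children: list[Tag | str] | None = None) -> None:
        self.name = name
        self.children: list[Tag | str] = children or []

    def find(self, name: str) -> Tag | str | None:
        # BeautifulSoup-style: the return type is a union of tag, text
        # node, and None, so chained calls fail strict type checking.
        for child in self.children:
            if isinstance(child, Tag) and child.name == name:
                return child
        return None

class TypedTag:
    """Wrapper whose find() returns only TypedTag | None, so callers
    need a single None check instead of manual isinstance narrowing."""

    def __init__(self, tag: Tag) -> None:
        self._tag = tag

    @property
    def name(self) -> str:
        return self._tag.name

    def find(self, name: str) -> TypedTag | None:
        found = self._tag.find(name)
        # Narrow once, here, instead of at every call site.
        return TypedTag(found) if isinstance(found, Tag) else None

doc = TypedTag(Tag("table", [Tag("tr", ["cell text"])]))
row = doc.find("tr")
if row:
    print(row.name)  # -> tr
```

The real library applies the same idea to BeautifulSoup's actual element classes.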

Installation

pip install typed-soup

Usage

If you're using Scrapy, you can use the from_response function to create a TypedSoup object from a Scrapy response:

from typed_soup import from_response
from scrapy.http.response.html import HtmlResponse

# Assume 'response' is an HtmlResponse object
soup = from_response(response)

# Find an element
element = soup.find("div", class_="example")
if element:
    print(element.get_text())

# Find all elements
elements = soup.find_all("p")
for elem in elements:
    print(elem.get_text())

Or, without Scrapy, you can explicitly wrap a BeautifulSoup object in TypedSoup:

from typed_soup import TypedSoup
from bs4 import BeautifulSoup

soup = TypedSoup(BeautifulSoup(html_content, "html.parser"))

Supported Functions

I'm adding functions as I need them. If you have a request, please open an issue. These are the ones that I needed for a dozen spiders:

  • find
  • find_all
  • get_text
  • children
  • tag_name
  • parent
  • next_sibling
  • get_content_after_element
  • string

And then these help create a TypedSoup object:

  • from_response
  • TypedSoup
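To illustrate the kind of narrowing these wrappers perform, here is a hypothetical, standard-library-only sketch of a find_all wrapper (toy names; not typed-soup's actual implementation):

```python
from __future__ import annotations

# Toy sketch of find_all narrowing -- NOT typed-soup's real code.

class Tag:
    def __init__(self, name: str, children: list[Tag | str] | None = None) -> None:
        self.name = name
        self.children: list[Tag | str] = children or []

    def find_all(self, name: str) -> list[Tag | str]:
        # BeautifulSoup-style signature: the annotation is a broad union,
        # so type checkers flag every use of the returned elements.
        return [c for c in self.children if isinstance(c, Tag) and c.name == name]

class TypedTag:
    """Narrowing wrapper: find_all returns list[TypedTag], nothing else."""

    def __init__(self, tag: Tag) -> None:
        self._tag = tag

    def find_all(self, name: str) -> list[TypedTag]:
        # The isinstance filter proves to the checker that only Tags remain.
        return [TypedTag(c) for c in self._tag.find_all(name) if isinstance(c, Tag)]

table = TypedTag(Tag("table", [Tag("tr"), "whitespace text", Tag("tr")]))
rows = table.find_all("tr")
print(len(rows))  # -> 2
```

With this shape, the `for row in rows:` loop from the motivation section type-checks cleanly, because every element is statically known to be a wrapped tag.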

License

This project is licensed under the MIT License.

Project details


Download files

Download the file for your platform.

Source Distribution

typed_soup-0.1.4.tar.gz (3.2 kB)

Uploaded Source

Built Distribution


typed_soup-0.1.4-py3-none-any.whl (3.8 kB)

Uploaded Python 3

File details

Details for the file typed_soup-0.1.4.tar.gz.

File metadata

  • Download URL: typed_soup-0.1.4.tar.gz
  • Upload date:
  • Size: 3.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.3 Darwin/24.4.0

File hashes

Hashes for typed_soup-0.1.4.tar.gz
  • SHA256: 34dda006d6ed45fe3aca6ba4ed9fa2fbdb549b7bf4f86d2fa7cd8cf3639c09ad
  • MD5: b78365afcaba12d8eb96040e3ceefa22
  • BLAKE2b-256: cd13477576e8f78bfc80abb3c680c42b1fe20933cdd921deb8d101b1370656ba


File details

Details for the file typed_soup-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: typed_soup-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 3.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.3 Darwin/24.4.0

File hashes

Hashes for typed_soup-0.1.4-py3-none-any.whl
  • SHA256: 11c26ef036cc2acf2550b3166845cbc8384e9ac6422431bb50f1b1637d705b84
  • MD5: 18518aa2d8c3c363fcd11bd2a5bf1b69
  • BLAKE2b-256: ef61a921b22dfa85e2364e53ad461c570e61b75a6b2e695d1028e37dc9bf528e

