Parse entire web pages into JSON/Markdown.

Project description

Rich Soup

Inspired by BeautifulSoup. Instead of parsing static HTML and relying on tags, Rich Soup fully renders the page with Playwright, including the entire DOM (JS, CSS, and slop). It then uses layout semantics, e.g. average versus larger font sizes, lines, gaps, spacing, and hierarchy/reading order, to reconstruct the page into a clean JSON/Markdown format. Currently, the options are:

  • BeautifulSoup: static only, messy.

  • Playwright: lower level, manual.

  • Rich Soup: builds on Playwright to give the DX of BeautifulSoup while rendering pages properly like Playwright.

Primarily intended for document-like pages, e.g. Microsoft Learn, whitepapers (PDF-like), and wiki-like sites. The best part is that it uses the layout, not tags, and it's not static. It can extract from garbled DOMs with hundreds of divs, hydration from React and Astro islands, Tailwind, and so on, just fine.
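
To make the layout-over-tags idea concrete, here is a minimal sketch of the kind of heuristic described above. It is not Rich Soup's actual API or implementation. It assumes the rendered page has already been reduced to text blocks with a position and a computed font size (values a Playwright-driven browser could supply), and the names TextBlock, blocks_to_markdown, and heading_ratio are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBlock:
    """Hypothetical record for one rendered text block (not a Rich Soup type)."""
    text: str
    x: float          # left edge, in CSS pixels
    y: float          # top edge, in CSS pixels
    font_size: float  # computed font size, in CSS pixels


def blocks_to_markdown(blocks, heading_ratio=1.2):
    """Rebuild Markdown from layout alone: reading order plus relative font size."""
    if not blocks:
        return ""
    # Reading order for a single-column page: top to bottom, then left to right.
    ordered = sorted(blocks, key=lambda b: (b.y, b.x))
    # The median font size approximates the body text size.
    body_size = median(b.font_size for b in ordered)
    # Distinct sizes clearly larger than body text become heading levels,
    # largest first (largest -> "#", next -> "##", and so on).
    heading_sizes = sorted(
        {b.font_size for b in ordered if b.font_size >= body_size * heading_ratio},
        reverse=True,
    )
    lines = []
    for block in ordered:
        if block.font_size in heading_sizes:
            level = min(heading_sizes.index(block.font_size) + 1, 6)
            lines.append("#" * level + " " + block.text)
        else:
            lines.append(block.text)
    return "\n\n".join(lines)


if __name__ == "__main__":
    # Blocks are deliberately out of order; no tag information is used.
    sample = [
        TextBlock("Body paragraph one.", 40, 120, 16),
        TextBlock("Install", 40, 200, 24),
        TextBlock("Rich Soup", 40, 20, 32),
        TextBlock("Run pip install.", 40, 240, 16),
    ]
    print(blocks_to_markdown(sample))
```

The sketch only covers single-column pages and two signals (position and font size). The gaps, lines, and spacing mentioned above would need additional rules, and the 1.2 threshold is an arbitrary assumption rather than a value taken from the project.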

Download files

Source Distribution

rich_soup-0.1.0.tar.gz (1.4 kB)

Uploaded Source

Built Distribution

rich_soup-0.1.0-py3-none-any.whl (1.5 kB)

Uploaded Python 3

File details

Details for the file rich_soup-0.1.0.tar.gz.

File metadata

  • Download URL: rich_soup-0.1.0.tar.gz
  • Upload date:
  • Size: 1.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for rich_soup-0.1.0.tar.gz
Algorithm Hash digest
SHA256 ecc56cbac05e0e7d56569b89ef6e3e92b12c12c486a5e66dd16d05cf5c860628
MD5 3c949b2d3dd85d5c02ef89617e109c99
BLAKE2b-256 33de5d524f9a34b03afc5c448b32cb804cd554fb36c2be502de12eb12b9e6131

File details

Details for the file rich_soup-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: rich_soup-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 1.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for rich_soup-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 3eff46d93d465850dda46fe6f2afe687057bd2ab816a8eec221cd458595874c8
MD5 236318e086dc03a90fa869b8c80b9dfa
BLAKE2b-256 e445287a8752add771d774c9cc7c263b78ad5cde1ce2bcea60d0c4d417b86742
