Turn any HTML page into a pydantic object

Project description

pyrser

Transform any HTML page into a Pydantic-based schema using pyrser. The tool extracts data from both static and dynamic HTML pages and maps it to structured Pydantic models.

Installation

By default, pyrser installs only the necessary libraries for parsing static HTML pages (those not requiring JavaScript execution). To parse dynamic pages (those that rely on JavaScript), additional dependencies are required.

Installation for Static HTML Only

pip install pyrser-ai

Installation for Both Static and Dynamic HTML

pip install pyrser-ai[full]
playwright install

Usage

pyrser leverages LlamaIndex under the hood to parse HTML content and automatically populate schemas. By default, it uses OpenAI’s gpt-4o-mini model, but you can customize this by passing a model name via the model parameter of the extractor, or by supplying your own LlamaIndex LLM instance via the llm parameter.

You can also exclude specific HTML tags from parsing by passing a list via the tags_to_remove parameter of the parse function. If no list is provided, the default set of tags below is removed.

Default Tags Excluded

TAGS_TO_REMOVE = [
    "img",
    "head",
    "button",
    "svg",
    "style",
    "iframe",
    "header",
    "aside",
    "footer",
    "nav",
    "form",
    "link",
    "noscript",
    "input",
    "textarea",
    "menu",
    "track",
    "canvas",
    "video",
    "audio",
    "source",
]
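These exclusions are applied before the page text is handed to the LLM. As a rough illustration of the effect (not the library's actual implementation), stripping the excluded tags and their contents can be sketched with the standard library:

```python
from html.parser import HTMLParser

# Default exclusions, copied from the list above.
TAGS_TO_REMOVE = {
    "img", "head", "button", "svg", "style", "iframe", "header", "aside",
    "footer", "nav", "form", "link", "noscript", "input", "textarea",
    "menu", "track", "canvas", "video", "audio", "source",
}

# Void elements never produce a closing tag, so they must not open a skip scope.
VOID_ELEMENTS = {"img", "input", "link", "source", "track", "br", "hr", "meta"}


class TagStripper(HTMLParser):
    """Collect text outside the excluded tags, dropping everything inside them."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAGS_TO_REMOVE and tag not in VOID_ELEMENTS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in TAGS_TO_REMOVE and tag not in VOID_ELEMENTS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_excluded_tags(html: str) -> str:
    stripper = TagStripper()
    stripper.feed(html)
    return "".join(stripper.parts)


html = "<body><p>Hello world</p><footer>site footer</footer></body>"
print(strip_excluded_tags(html))  # -> Hello world
```

Removing boilerplate elements like these keeps the prompt small and focused on page content, which is why navigation, media, and form tags are dropped by default.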

Example Usages

Parsing Static HTML with the Default Configuration

import asyncio

import aiohttp

from pyrser_ai.core.parsers.static_html_parser import StaticHTMLParser
from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pydantic import BaseModel


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    async with aiohttp.ClientSession() as session:
        extractor = LlamaIndexExtractor()
        parser = StaticHTMLParser(extractor=extractor, http_client=session)

        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


if __name__ == "__main__":
    asyncio.run(main())

Parsing Static HTML with a Custom Model

import asyncio

import aiohttp

from pyrser_ai.core.parsers.static_html_parser import StaticHTMLParser
from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pydantic import BaseModel


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    async with aiohttp.ClientSession() as session:
        extractor = LlamaIndexExtractor(model="gpt-3.5-turbo")
        parser = StaticHTMLParser(extractor=extractor, http_client=session)

        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


if __name__ == "__main__":
    asyncio.run(main())

Parsing Static HTML with a Custom LLM Instance

import asyncio

import aiohttp

from pyrser_ai.core.parsers.static_html_parser import StaticHTMLParser
from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pydantic import BaseModel
from llama_index.llms.anthropic import Anthropic


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    llm = Anthropic(model="claude-3-sonnet-20240229")

    async with aiohttp.ClientSession() as session:
        extractor = LlamaIndexExtractor(llm=llm)
        parser = StaticHTMLParser(extractor=extractor, http_client=session)

        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


if __name__ == "__main__":
    asyncio.run(main())

Parsing Dynamic HTML with a Custom Model

import asyncio

from pyrser_ai.core.parsers.dynamic_html_parser import DynamicHTMLParser
from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pydantic import BaseModel


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    extractor = LlamaIndexExtractor(model="gpt-3.5-turbo")
    parser = DynamicHTMLParser(extractor=extractor)

    output_model = await parser.parse("https://www.example.com", MyModel)
    print(output_model)


if __name__ == "__main__":
    asyncio.run(main())

Additional Information

  • Model Customization: You can specify any supported LLM model in LlamaIndexExtractor by passing the model name to the model parameter.
  • Excluded Tags: To change the HTML tags excluded during parsing, provide a custom tags_to_remove list. By default, common non-content tags (e.g., img, button, style) are excluded.
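If you only want to add a few exclusions on top of the defaults, one approach is to copy the default list from this README and extend it before passing it to parse. The list below mirrors the defaults shown above; the keyword-argument call shape in the comment is an assumption based on the parameter description, so check the package source for the canonical constant and signature:

```python
# Default exclusions, copied from this README (the canonical TAGS_TO_REMOVE
# constant lives in the package source).
DEFAULT_TAGS_TO_REMOVE = [
    "img", "head", "button", "svg", "style", "iframe", "header", "aside",
    "footer", "nav", "form", "link", "noscript", "input", "textarea",
    "menu", "track", "canvas", "video", "audio", "source",
]

# Keep the defaults and additionally drop script and table content.
custom_tags = DEFAULT_TAGS_TO_REMOVE + ["script", "table"]

# Hypothetical call shape, based on the tags_to_remove parameter described above:
# output_model = await parser.parse(
#     "https://www.example.com", MyModel, tags_to_remove=custom_tags
# )
```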
