Turn any HTML page into a Pydantic object
pyrser
Transform any HTML page into a Pydantic-based schema using pyrser. This tool allows you to easily extract data from HTML pages, including both static and dynamic content, and map it to structured Pydantic models.
Installation
By default, pyrser installs only the necessary libraries for parsing static HTML pages (those not requiring JavaScript execution). To parse dynamic pages (those that rely on JavaScript), additional dependencies are required.
Installation for Static HTML Only
pip install pyrser-ai
Installation for Both Static and Dynamic HTML
pip install "pyrser-ai[full]"
playwright install
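Because the dynamic-parsing dependencies are optional, it can help to check at startup which install variant is present. The following is a minimal sketch, not part of pyrser itself: the helper name `dynamic_parsing_available` is hypothetical, and it assumes that the `full` extra provides the `playwright` Python package (as the `playwright install` step suggests).

```python
from importlib.util import find_spec


def dynamic_parsing_available() -> bool:
    """Return True if the `playwright` package can be imported.

    Hypothetical helper: assumes the `full` extra installs `playwright`.
    """
    return find_spec("playwright") is not None


if __name__ == "__main__":
    if dynamic_parsing_available():
        print("playwright found: dynamic pages should be parseable")
    else:
        print('playwright missing: run pip install "pyrser-ai[full]"')
```

This only checks that the package is importable. It does not verify that `playwright install` has downloaded the browser binaries.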
Usage
pyrser uses LlamaIndex under the hood to parse HTML content and populate the Pydantic schema you provide. By default, it uses OpenAI’s gpt-4o-mini model. You can choose a different model by passing the model parameter to the extractor, or supply your own LlamaIndex LLM instance through the llm parameter.
You can also choose which HTML tags are stripped before parsing by passing a list via the tags_to_remove parameter of the parse function. If no list is specified, the default tags listed below are removed.
Default Tags Excluded
TAGS_TO_REMOVE = [
    "img",
    "head",
    "button",
    "svg",
    "style",
    "iframe",
    "header",
    "aside",
    "footer",
    "nav",
    "form",
    "link",
    "noscript",
    "input",
    "textarea",
    "menu",
    "track",
    "canvas",
    "video",
    "audio",
    "source",
]
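To illustrate what excluding tags means in practice, here is a rough, standard-library-only sketch that drops the listed elements (and everything nested inside them) from an HTML string. It is not pyrser's actual implementation, which is not shown in this description; the names `TagStripper`, `strip_tags`, and `VOID_TAGS` are made up for this example.

```python
from html.parser import HTMLParser

# Default list copied from the section above.
TAGS_TO_REMOVE = [
    "img", "head", "button", "svg", "style", "iframe", "header",
    "aside", "footer", "nav", "form", "link", "noscript", "input",
    "textarea", "menu", "track", "canvas", "video", "audio", "source",
]

# Void elements have no closing tag, so they must not open a skipped region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagStripper(HTMLParser):
    """Rebuilds the HTML while dropping excluded tags and their contents."""

    def __init__(self, tags_to_remove):
        super().__init__(convert_charrefs=True)
        self.excluded = set(tags_to_remove)
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit once, never change depth.
        if tag not in self.excluded and self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.excluded:
            if tag not in VOID_TAGS and self.skip_depth > 0:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references arrive already decoded here.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html, tags_to_remove=TAGS_TO_REMOVE):
    stripper = TagStripper(tags_to_remove)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


if __name__ == "__main__":
    page = ('<html><head><title>T</title></head><body><nav>Menu</nav>'
            '<h1>Hello</h1><img src="a.png"><p>World</p></body></html>')
    print(strip_tags(page))
```

Removing navigation, styling, and media elements before extraction shortens the text sent to the LLM, which is presumably why these tags are excluded by default.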
Example Usages
Parsing Static HTML with the default configuration
import asyncio
import aiohttp
from pyrser_ai.core.parsers.static_html_parser import StaticHTMLParser
from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pydantic import BaseModel


# Target schema: the fields to extract from the page.
class MyModel(BaseModel):
    title: str
    description: str


async def main():
    async with aiohttp.ClientSession() as session:
        # With no arguments, the extractor uses the default model (gpt-4o-mini).
        extractor = LlamaIndexExtractor()
        parser = StaticHTMLParser(extractor=extractor, http_client=session)
        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


asyncio.run(main())
Parsing Static HTML with a Custom Model
import asyncio
import aiohttp
from pyrser_ai.core.parsers.static_html_parser import StaticHTMLParser
from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pydantic import BaseModel


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    async with aiohttp.ClientSession() as session:
        # Override the default model by name.
        extractor = LlamaIndexExtractor(model="gpt-3.5-turbo")
        parser = StaticHTMLParser(extractor=extractor, http_client=session)
        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


asyncio.run(main())
Parsing Static HTML with a Custom LLM Instance
import asyncio
import aiohttp
from pyrser_ai.core.parsers.static_html_parser import StaticHTMLParser
from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pydantic import BaseModel
# Requires the llama-index-llms-anthropic package.
from llama_index.llms.anthropic import Anthropic


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    # Any LlamaIndex LLM instance can be passed through the llm parameter.
    llm = Anthropic(model="claude-3-sonnet-20240229")
    async with aiohttp.ClientSession() as session:
        extractor = LlamaIndexExtractor(llm=llm)
        parser = StaticHTMLParser(extractor=extractor, http_client=session)
        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


asyncio.run(main())
Parsing Dynamic HTML with a Custom Model
import asyncio
from pyrser_ai.core.parsers.dynamic_html_parser import DynamicHTMLParser
from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pydantic import BaseModel


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    extractor = LlamaIndexExtractor(model="gpt-3.5-turbo")
    parser = DynamicHTMLParser(extractor=extractor)
    output_model = await parser.parse("https://www.example.com", MyModel)
    print(output_model)


asyncio.run(main())
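The examples above stop at the `parse` call, so it may help to see what can be done with the result. The sketch below assumes that `parse` returns an instance of the model class passed to it and that Pydantic v2 is installed. The instance is built by hand with made-up values as a stand-in, so the snippet runs without network access or an API key.

```python
from pydantic import BaseModel, ValidationError


class MyModel(BaseModel):
    title: str
    description: str


# Stand-in for the object returned by parser.parse(...); values are made up.
output_model = MyModel(title="Example Domain", description="A placeholder page.")

print(output_model.title)         # typed attribute access
print(output_model.model_dump())  # plain dict, e.g. for JSON serialization

# A payload missing a required field fails validation instead of
# producing a partially filled object.
try:
    MyModel.model_validate({"title": "Only a title"})
except ValidationError as exc:
    missing = [err["loc"][0] for err in exc.errors()]
    print("missing fields:", missing)
```

If some fields may legitimately be absent from a page, declaring them as optional with a default (for example `description: str | None = None`) avoids this validation error.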
Additional Information
- Model Customization: You can use any supported model in LlamaIndexExtractor by passing its name to the model parameter.
- Excluded Tags: To change the HTML tags excluded during parsing, provide a custom tags_to_remove list. By default, common non-content tags (e.g., img, button, style) are excluded.
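A custom exclusion list is often easiest to derive from the defaults. The helper below is a small sketch: `customize_tags` is a hypothetical name, not a pyrser function, and it assumes that a list passed as tags_to_remove replaces the defaults rather than extending them.

```python
# Default list copied from the "Default Tags Excluded" section.
TAGS_TO_REMOVE = [
    "img", "head", "button", "svg", "style", "iframe", "header",
    "aside", "footer", "nav", "form", "link", "noscript", "input",
    "textarea", "menu", "track", "canvas", "video", "audio", "source",
]


def customize_tags(defaults, keep=(), also_remove=()):
    """Return a new list: `defaults` minus `keep`, plus `also_remove`."""
    keep_set = set(keep)
    result = [tag for tag in defaults if tag not in keep_set]
    for tag in also_remove:
        if tag not in result:
            result.append(tag)
    return result


# Keep images and forms in the parsed HTML, but additionally drop scripts.
custom_tags = customize_tags(
    TAGS_TO_REMOVE, keep=["img", "form"], also_remove=["script"]
)
print(custom_tags)

# The list would then be passed to the parser, for example:
#     await parser.parse("https://www.example.com", MyModel,
#                        tags_to_remove=custom_tags)
```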
File details
Details for the file pyrser_ai-0.1.2.tar.gz.
File metadata
- Download URL: pyrser_ai-0.1.2.tar.gz
- Upload date:
- Size: 9.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.5.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2618c7211661c6b6b4634193d1ec545d3d7b01dcd38456fcd89e6b716bd246c9
MD5 | 82f6f1db851529b199973a1c0d7c274c
BLAKE2b-256 | a96ade54bf7f0ae99b646d5dd47698049dac02235cf35548daae3b81d5fb0c63
File details
Details for the file pyrser_ai-0.1.2-py3-none-any.whl.
File metadata
- Download URL: pyrser_ai-0.1.2-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.5.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | c5c445ebbd9b7cf1c05b0e5e1cfd294d922c0d5135aa85148e845f992a8a671d
MD5 | e79e42079336f570c202531579b9796a
BLAKE2b-256 | 80b6a7c806ea06671bcfb6971119e120b3b8960896742c7f57ca3b6cbf646a91