
Turn anything into a Pydantic object


pyrser

Transform any HTML page or document into a Pydantic-based schema using pyrser. This tool lets you extract data from HTML pages and documents (PDF, DOC, DOCX, TXT, and more), including both static and dynamic content, and map it to structured Pydantic models.

Installation

By default, pyrser installs only the necessary libraries for parsing static HTML pages (those not requiring JavaScript execution) and documents. To parse dynamic pages (those that rely on JavaScript), additional dependencies are required.

Installation for Static HTML Only

pip install pyrser-ai

Installation for Both Static and Dynamic HTML

pip install pyrser-ai[full]
playwright install

Requirements

To use the default configuration, make sure you have an OpenAI account and a LlamaParse account.

Usage

pyrser leverages LlamaIndex and LlamaParse under the hood to parse documents and HTML content and automatically populate schemas. By default, it uses OpenAI’s gpt-4o-mini model, but you can customize this by passing the model parameter to the extractor or by supplying your own LlamaIndex LLM instance.

You can also define HTML tags to exclude from parsing by providing a list via the tags_to_remove parameter in the parse function. If no list is specified, a default set of tags is removed.

Default HTML Tags Excluded

TAGS_TO_REMOVE = [
    "img",
    "head",
    "button",
    "svg",
    "style",
    "iframe",
    "header",
    "aside",
    "footer",
    "nav",
    "form",
    "link",
    "noscript",
    "input",
    "textarea",
    "menu",
    "track",
    "canvas",
    "video",
    "audio",
    "source",
]

Example Usages

Parsing a PDF Document with the default configuration

import asyncio
import io
from datetime import datetime

from pydantic import BaseModel, Field

from pyrser_ai.core.parsers.file.file_parser import FileParser


class MyModel(BaseModel):
    cnpj: str = Field(description="The CNPJ of the company")
    due_date: datetime = Field(description="The due date of the document")
    total: float = Field(description="The total value of the document")


async def main():
    with open("doc_pdf.pdf", "rb") as f:
        f_bytes = io.BytesIO(f.read())

    result = await FileParser().parse_document(
        document=f_bytes, file_name="impostos.pdf", output_model=MyModel
    )
    print(result)


asyncio.run(main())

Parsing Static HTML with the default configuration

import asyncio

import aiohttp
from pydantic import BaseModel

from pyrser_ai.core.parsers.html.static_html_parser import StaticHTMLParser


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    async with aiohttp.ClientSession() as session:
        parser = StaticHTMLParser(http_client=session)

        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


asyncio.run(main())

Parsing Static HTML with a Custom LLM Model

import asyncio

import aiohttp
from pydantic import BaseModel

from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pyrser_ai.core.parsers.html.static_html_parser import StaticHTMLParser


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    async with aiohttp.ClientSession() as session:
        extractor = LlamaIndexExtractor(model="gpt-3.5-turbo")
        parser = StaticHTMLParser(extractor=extractor, http_client=session)

        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


asyncio.run(main())

Parsing Static HTML with a Custom LLM Instance

import asyncio

import aiohttp
from llama_index.llms.anthropic import Anthropic
from pydantic import BaseModel

from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pyrser_ai.core.parsers.html.static_html_parser import StaticHTMLParser


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    llm = Anthropic(model="claude-3-sonnet-20240229")

    async with aiohttp.ClientSession() as session:
        extractor = LlamaIndexExtractor(llm=llm)
        parser = StaticHTMLParser(extractor=extractor, http_client=session)

        output_model = await parser.parse("https://www.example.com", MyModel)
        print(output_model)


asyncio.run(main())

Parsing Dynamic HTML with a Custom LLM Model

import asyncio

from pydantic import BaseModel

from pyrser_ai.core.extractors.llama_index_extractor import LlamaIndexExtractor
from pyrser_ai.core.parsers.html.dynamic_html_parser import DynamicHTMLParser


class MyModel(BaseModel):
    title: str
    description: str


async def main():
    extractor = LlamaIndexExtractor(model="gpt-3.5-turbo")
    parser = DynamicHTMLParser(extractor=extractor)

    output_model = await parser.parse("https://www.example.com", MyModel)
    print(output_model)


asyncio.run(main())

Additional Information

  • The default configuration requires the following environment variables:
    • OPENAI_API_KEY=
    • LLAMA_CLOUD_API_KEY=
  • Model Customization: You can specify any supported LLM model in LlamaIndexExtractor by passing the model name to the model parameter.
  • Excluded Tags: To change the HTML tags excluded during parsing, provide a custom tags_to_remove list. By default, common non-content tags (e.g., img, button, style) are excluded.
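For a quick local experiment, both keys can also be set from Python before constructing any parser. A minimal sketch; the values shown are placeholders, not real credentials, and in practice you would export these variables in your shell or a .env file instead:

```python
import os

# Placeholder values -- substitute your real keys. setdefault leaves
# any key already exported in the shell untouched.
os.environ.setdefault("OPENAI_API_KEY", "<your-openai-key>")
os.environ.setdefault("LLAMA_CLOUD_API_KEY", "<your-llamacloud-key>")
```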
