🤖 Oxy® Parser

Parse HTML automatically just by describing a Pydantic model.


Oxy® Parser does the heavy lifting of HTML parsing for you. You describe the expected structure with a Pydantic model, and Oxy® Parser automatically parses the HTML into instances of that model:

  • Describe a Pydantic model of your expected HTML structure
  • Pass the URL or HTML to Oxy® Parser together with the Pydantic model
  • Oxy® Parser will parse the HTML and return the parsed data as Pydantic models
  • Oxy® Parser will also cache the selectors for later reuse, so you don't need to call the LLM API every time you want to parse the same HTML

Supported cache backends:

  • Memory
  • File
  • Redis
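The caching idea can be illustrated independently of OxyParser's internals. This is a hedged sketch, not the library's actual implementation: the expensive LLM call happens once per (url, model) pair, and later parses reuse the cached selectors. `SelectorCache` and the `build` callback are illustrative names, not part of the real API.

```python
# Illustrative sketch of selector caching (NOT OxyParser's real internals):
# the expensive LLM call runs once per (url, model) pair; later parses of
# the same page reuse the cached XPath selectors.
from typing import Callable


class SelectorCache:
    """Minimal in-memory cache keyed by (url, model_name)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], dict[str, str]] = {}

    def get_or_create(
        self,
        url: str,
        model_name: str,
        build: Callable[[], dict[str, str]],
    ) -> dict[str, str]:
        key = (url, model_name)
        if key not in self._store:
            # Cache miss: ask the LLM (here, any callback) for selectors once.
            self._store[key] = build()
        return self._store[key]
```

The File and Redis backends follow the same pattern; only where the `(url, model)` → selectors mapping is stored changes.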

See the flowchart (flowchart.png) for a detailed view.

Installation

pip install oxyparser

Supported LLMs

This project uses LiteLLM; please refer to its documentation for the list of supported LLMs: https://docs.litellm.ai/docs/providers

Usage

You will need to set up a .env file with the following variables:

LLM_API_KEY: Your LLM provider API key — OpenAI, Claude, or any other provider. (See the LiteLLM docs above for the full list of supported LLMs.)

LLM_MODEL: The model you want to use. (See the LiteLLM docs above for the full list of supported LLMs.)

Scraper keys: these are optional, but if they are not provided, you will need to scrape and pass the HTML to Oxy® Parser yourself. We highly recommend using the Oxylabs scraper, since it removes the hassle of fetching HTML and provides other benefits such as rotating IPs, captcha handling, and block bypassing.

Click here to sign up for Oxylabs free trial: https://dashboard.oxylabs.io/en/registration?productToBuy=SCRAPI_WEB

OXYLABS_SCRAPER_USER: Your Oxylabs scraper user (optional)

OXYLABS_SCRAPER_PASSWORD: Your Oxylabs scraper password (optional)

LLM_API_KEY=your_openai_api_key
LLM_MODEL=gpt-3.5-turbo
OXYLABS_SCRAPER_USER=your_oxylabs_scraper_user  # optional
OXYLABS_SCRAPER_PASSWORD=your_oxylabs_scraper_pass  # optional
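As a hedged illustration of how a .env file like the one above maps to environment variables, here is a minimal stdlib-only loader. In practice a package such as python-dotenv (or the library itself) typically handles this; `load_env` is a hypothetical helper, not part of OxyParser.

```python
# Minimal sketch of parsing the .env format shown above (KEY=value lines,
# blank lines, and trailing "# ..." comments). Hypothetical helper; real
# projects usually use python-dotenv instead.
import os


def load_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, ignoring blanks and inline '# ...' comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


if __name__ == "__main__":
    sample = "LLM_API_KEY=sk-test\nLLM_MODEL=gpt-3.5-turbo\n"
    # Make the parsed values visible to the process, like dotenv would.
    os.environ.update(load_env(sample))
```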

Then you can use the following code to parse the website into structured data:

For full examples see the examples directory.

from pydantic import BaseModel
from oxyparser.oxyparser import OxyParser

class JobItem(BaseModel):
    title: str
    recruiter_name: str
    location: str
    description: str


# this page might expire
# if it does, please replace it with a new one
# https://career.oxylabs.io
# also if you're a python dev and looking for job, hit us up!
URL: str = "https://career.oxylabs.io/job/813b9ac5/python-developer-mid-senior/"


async def main() -> None:
    parser = OxyParser()
    job_item = await parser.parse(URL, JobItem)
    print(job_item)


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

If you have an HTML string instead of a URL, you can pass it to the parser as well, like so:

parser = OxyParser()

html = (
    "<html><body>"
    "<h1>John</h1>"
    "<h2>Smith</h2>"
    "<p>Svitrigailos st.</p>"
    "<span>2 years old</span>"
    "</body></html>"
)
url = "https://example.com"  # a URL is still needed to cache the selectors
parsed_item = await parser.parse(url=url, model=JobItem, html=html)
print(parsed_item)

Known Issues

There are known nuances where the extracted XPath sometimes fails to extract certain data, for example when a description is very long or nested inside many elements. In such cases we currently recommend manually editing the selectors in the cache. These cases should be rare, though. Create an issue if you encounter any other problems.
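To make "editing the selectors in the cache" concrete, here is a purely hypothetical sketch. It assumes the file cache stores selectors as a JSON object keyed by field name; the real on-disk format may differ, and `patch_selector` is an illustrative helper, not part of OxyParser.

```python
# Hypothetical sketch: patch one failing selector in a JSON file cache.
# Assumes (without confirmation) a {"field": "xpath"} layout on disk.
import json
from pathlib import Path


def patch_selector(cache_file: Path, field: str, xpath: str) -> None:
    data = json.loads(cache_file.read_text())
    data[field] = xpath  # overwrite the failing selector by hand
    cache_file.write_text(json.dumps(data, indent=2))
```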

Contributing

We welcome all contributions. To contribute, clone the repo, create a new branch, make your changes, and open a pull request.
