Lightweight library for scraping websites with LLMs

Parsera

Parsera is a lightweight library for scraping websites with LLMs. You can test it on the Parsera website.

Why Parsera?

Because it is simple and lightweight. By keeping token use minimal, it runs faster and costs less.

Installation

pip install parsera
playwright install

Basic usage

If you want to use OpenAI, remember to set the OPENAI_API_KEY environment variable. You can do this from Python with:

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY_HERE"
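Hardcoding the key in source code makes it easy to leak. As an alternative, here is a minimal sketch that checks the variable up front, so a missing key fails with a clear message before any scraping starts. The helper name require_env is hypothetical and is not part of Parsera:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage: call this before constructing Parsera.
# api_key = require_env("OPENAI_API_KEY")
```

With this approach the key is exported in the shell or loaded by your deployment tooling, and the script only verifies that it is present.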

Next, you can run a basic version that uses gpt-4o-mini:

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scrapper = Parsera()
result = scrapper.run(url=url, elements=elements)

The result variable will contain a list of records in JSON form:

[
   {
      "Title":"Hacking the largest airline and hotel rewards platform (2023)",
      "Points":"104",
      "Comments":"24"
   },
    ...
]
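In the sample output, the numeric fields are returned as strings. Here is a minimal sketch of converting them to integers before further processing, assuming result is a list of dictionaries shaped like the record above. The function name normalize_records is hypothetical and is not part of Parsera:

```python
def normalize_records(records):
    """Convert the "Points" and "Comments" strings to integers where possible."""
    cleaned = []
    for record in records:
        row = dict(record)  # copy so the original result is left untouched
        for key in ("Points", "Comments"):
            try:
                row[key] = int(str(row.get(key, "")).strip())
            except ValueError:
                row[key] = None  # missing or non-numeric value
        cleaned.append(row)
    return cleaned


# Sample record copied from the output shown above.
sample = [
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24",
    }
]
print(normalize_records(sample))
```

Falling back to None rather than raising keeps one malformed record from discarding the whole batch. Whether that is the right trade-off depends on your use case.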

There is also an async method, arun:

result = await scrapper.arun(url=url, elements=elements)
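A bare await only works inside a coroutine (or in a notebook cell). In a plain script, the call can be wrapped with asyncio.run. This is a minimal sketch that assumes arun accepts the url and elements keyword arguments shown above. The helper names scrape and scrape_sync are hypothetical:

```python
import asyncio


async def scrape(scraper, url, elements):
    """Await the scraper's arun method and return its result."""
    return await scraper.arun(url=url, elements=elements)


def scrape_sync(scraper, url, elements):
    """Run the coroutine to completion from ordinary, non-async code."""
    return asyncio.run(scrape(scraper, url, elements))


# Hypothetical usage in a script (not in Jupyter, where a loop is already running):
# from parsera import Parsera
# result = scrape_sync(Parsera(), "https://news.ycombinator.com/", {"Title": "News title"})
```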

Running with Jupyter Notebook:

Either place this code at the beginning of your notebook:

import nest_asyncio
nest_asyncio.apply()

Or, instead of calling the run method, use the async arun method.

Run with custom model

You can instantiate Parsera with any chat model supported by LangChain. For example, to run a model from Azure:

import os

from langchain_openai import AzureChatOpenAI
from parsera import Parsera

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_GPT_BASE_URL"),
    openai_api_version="2023-05-15",
    deployment_name=os.getenv("AZURE_GPT_DEPLOYMENT_NAME"),
    openai_api_key=os.getenv("AZURE_GPT_API_KEY"),
    openai_api_type="azure",
    temperature=0.0,
)

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}
scrapper = Parsera(model=llm)
result = scrapper.run(url=url, elements=elements)
