Lightweight library for scraping websites with LLMs

📦 Parsera

Lightweight Python library for scraping websites with LLMs. You can test it on the Parsera website.

Why Parsera?

Because it's simple and lightweight, and it keeps token use to a minimum, which boosts speed and reduces expenses.

Installation

pip install parsera
playwright install

Basic usage

If you want to use OpenAI, remember to set the OPENAI_API_KEY environment variable. You can do this from Python with:

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY_HERE"

Next, you can run a basic version that uses gpt-4o-mini:

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scrapper = Parsera()
result = scrapper.run(url=url, elements=elements)

The result variable will contain a JSON-style list of records:

[
   {
      "Title":"Hacking the largest airline and hotel rewards platform (2023)",
      "Points":"104",
      "Comments":"24"
   },
    ...
]
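Note that the values in the sample output are strings, so numeric fields usually need converting before you sort or aggregate them. The following is a minimal sketch, not part of Parsera: the helpers to_int and normalize are hypothetical names, the first record is copied from the sample above, and the second record is made up to show the fallback for values that are not numbers.

```python
# Records in the shape shown above. The first is taken from the sample output;
# the second is a made-up record that exercises the non-numeric fallback.
records = [
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24",
    },
    {"Title": "Made-up story without counts", "Points": "", "Comments": "discuss"},
]


def to_int(value, default=0):
    """Convert a scraped string such as "104" to an int, or return default."""
    try:
        return int(str(value).strip())
    except (TypeError, ValueError):
        return default


def normalize(record):
    """Return a copy of the record with Points and Comments as integers."""
    return {
        "Title": record.get("Title", ""),
        "Points": to_int(record.get("Points")),
        "Comments": to_int(record.get("Comments")),
    }


cleaned = [normalize(r) for r in records]

# With integer fields, ordinary comparisons work as expected.
top = max(cleaned, key=lambda r: r["Points"])
print(top["Title"])
```

In real use, records would be the value returned by scrapper.run(...) rather than a hard-coded list.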

An async method, arun, is also available:

result = await scrapper.arun(url=url, elements=elements)
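Because arun is a coroutine, several pages can be awaited together with asyncio.gather. The sketch below assumes only what is shown above, namely that arun accepts url and elements keyword arguments. The helper scrape_many is a hypothetical name, not part of Parsera, and whether a single Parsera instance can safely serve concurrent calls is not stated here; if it cannot, create one instance per URL instead.

```python
import asyncio


async def scrape_many(scraper, urls, elements):
    """Call scraper.arun for every URL concurrently and return {url: records}.

    `scraper` is any object exposing an async arun(url=..., elements=...)
    method, such as a Parsera instance.
    """
    tasks = [scraper.arun(url=url, elements=elements) for url in urls]
    # gather returns results in the same order as the tasks it was given.
    results = await asyncio.gather(*tasks)
    return dict(zip(urls, results))


# Usage sketch (requires parsera installed and OPENAI_API_KEY set):
#   from parsera import Parsera
#   results = asyncio.run(
#       scrape_many(Parsera(), ["https://news.ycombinator.com/"], elements)
#   )
```

From a plain script, asyncio.run is the entry point; inside a notebook, await scrape_many(...) directly.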

Running with Jupyter Notebook:

Either place this code at the beginning of your notebook:

import nest_asyncio
nest_asyncio.apply()

Or, instead of calling the run method, use the async arun method.

Run with custom model

You can instantiate Parsera with any chat model supported by LangChain. For example, to run a model from Azure:

import os

from langchain_openai import AzureChatOpenAI
from parsera import Parsera

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_GPT_BASE_URL"),
    openai_api_version="2023-05-15",
    deployment_name=os.getenv("AZURE_GPT_DEPLOYMENT_NAME"),
    openai_api_key=os.getenv("AZURE_GPT_API_KEY"),
    openai_api_type="azure",
    temperature=0.0,
)

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}
scrapper = Parsera(model=llm)
result = scrapper.run(url=url, elements=elements)
