
Lightweight library for scraping websites with LLMs

Project description

📦 Parsera


Lightweight Python library for scraping websites with LLMs. You can test it on Parsera website.

Why Parsera?

Because it's simple and lightweight, with minimal token use, which boosts speed and reduces expenses.

Installation

pip install parsera
playwright install

Basic usage

If you want to use OpenAI, remember to set the OPENAI_API_KEY env variable. You can do this from Python with:

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY_HERE"

Next, you can run a basic version that uses gpt-4o-mini:

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scraper = Parsera()
result = scraper.run(url=url, elements=elements)

The result variable will contain a JSON list of records:

[
   {
      "Title":"Hacking the largest airline and hotel rewards platform (2023)",
      "Points":"104",
      "Comments":"24"
   },
    ...
]
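
Since the result is a plain list of records, you can pass it straight to whatever tooling you use downstream. For example, assuming you have pandas installed, a minimal sketch:

import pandas as pd

# Each record becomes one row; the element names become column names
df = pd.DataFrame(result)
print(df.head())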

There is also an async arun method available:

result = await scraper.arun(url=url, elements=elements)
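
Outside a notebook, you can drive arun with asyncio. A minimal sketch, using nothing beyond what is shown above:

import asyncio

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

async def main():
    scraper = Parsera()
    # arun is the async counterpart of run and takes the same arguments
    return await scraper.arun(url=url, elements=elements)

result = asyncio.run(main())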

Using proxy

You can route the traffic via a proxy server when calling the run method:

proxy_settings = {
    "server": "https://1.2.3.4:5678",
    "username": <PROXY_USERNAME>,
    "password": <PROXY_PASSWORD>,
}
result = scraper.run(url=url, elements=elements, proxy_settings=proxy_settings)
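
For example, you can keep the credentials out of your code by reading them from environment variables. The variable names below are only placeholders, not something Parsera requires:

import os

proxy_settings = {
    "server": "https://1.2.3.4:5678",
    "username": os.environ["PROXY_USERNAME"],  # placeholder variable name
    "password": os.environ["PROXY_PASSWORD"],  # placeholder variable name
}
result = scraper.run(url=url, elements=elements, proxy_settings=proxy_settings)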

Run with custom model

You can instantiate Parsera with any chat model supported by LangChain, for example, to run the model from Azure:

import os
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_GPT_BASE_URL"),
    openai_api_version="2023-05-15",
    deployment_name=os.getenv("AZURE_GPT_DEPLOYMENT_NAME"),
    openai_api_key=os.getenv("AZURE_GPT_API_KEY"),
    openai_api_type="azure",
    temperature=0.0,
)

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}
scraper = Parsera(model=llm)
result = scraper.run(url=url, elements=elements)
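
The same pattern should work with other LangChain chat models. For example, a sketch with Anthropic's chat model (assumes langchain_anthropic is installed and ANTHROPIC_API_KEY is set; this is not part of Parsera itself):

from langchain_anthropic import ChatAnthropic

from parsera import Parsera

# Any LangChain chat model can be passed via the model argument
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0.0)

scraper = Parsera(model=llm)
result = scraper.run(url=url, elements=elements)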

Run a local model with HuggingFace Transformers

Currently, we only support models that include a system token.

You should install Transformers with either PyTorch (recommended) or TensorFlow 2.0.

Transformers Installation Guide

Example:

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from parsera.engine.model import HuggingFaceModel
from parsera import Parsera

# Define the URL and elements to scrape
url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

# Initialize model with transformers pipeline
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=5000)

# Initialize HuggingFaceModel
llm = HuggingFaceModel(pipeline=pipe)

# Scraper with HuggingFace model
scraper = Parsera(model=llm)
result = scraper.run(url=url, elements=elements)

Using different extractor types

By default, a tabular extractor is used, but you can also use the list or item extractors:

from parsera import Parsera

scraper = Parsera(extractor=Parsera.ExtractorType.LIST)
# or
scraper = Parsera(extractor=Parsera.ExtractorType.ITEM)

The tabular extractor is used to find rows of tabular data and has output of the form:

[
    {"name": "name1", "price": "100"},
    {"name": "name2", "price": "150"},
    {"name": "name3", "price": "300"},
]

The list extractor is used to find lists of different values and has output of the form:

{
    "name": ["name1", "name2", "name3"],
    "price": ["100", "150", "300"]
}

The item extractor is used to get singular items from a page like a title or price and has output of the form:

{
    "name": "name1",
    "price": "100"
}
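
As a quick sketch tying this together, here is the list extractor applied to the same Hacker News elements used above; the output should follow the list form shown earlier:

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
}

scraper = Parsera(extractor=Parsera.ExtractorType.LIST)
result = scraper.run(url=url, elements=elements)
# Expected shape: {"Title": [...], "Points": [...]}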

Running with Jupyter Notebook:

Either place this code at the beginning of your notebook:

import nest_asyncio
nest_asyncio.apply()

Or, instead of calling the run method, use the async arun method.
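
For example, with the async route you can await the call directly in a cell, since Jupyter already runs an event loop. A minimal sketch, reusing the url and elements defined earlier:

from parsera import Parsera

scraper = Parsera()
# Top-level await works inside a notebook cell
result = await scraper.arun(url=url, elements=elements)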

