Lightweight library for scraping websites with LLMs
📦 Parsera
Lightweight Python library for scraping websites with LLMs. You can test it on the Parsera website.
Why Parsera?
Because it's simple and lightweight, with minimal token use, which boosts speed and reduces expenses.
Table of Contents
- Installation
- Documentation
- Basic usage
- Running with Jupyter Notebook
- Running with CLI
- Running in Docker
Installation
pip install parsera
playwright install
Documentation
Check out the documentation to learn more about other features, such as running custom models and Playwright scripts.
Basic usage
If you want to use OpenAI, remember to set the OPENAI_API_KEY environment variable.
You can do this from Python with:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY_HERE"
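A missing key is easy to overlook until the first request fails. The following is a minimal sketch of a fail-fast check, not part of Parsera itself; the helper name `require_api_key` is hypothetical and uses only the standard library.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of the environment variable, or raise a clear error.

    Hypothetical helper for illustration; Parsera does not ship this function.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or assign os.environ[{name!r}] first."
        )
    return value
```

Calling `require_api_key()` once at the top of a script surfaces a configuration problem before any scraping starts.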
Next, you can run a basic version that uses gpt-4o-mini:
from parsera import Parsera
url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}
scraper = Parsera()
result = scraper.run(url=url, elements=elements)
The result variable will contain JSON with a list of records:
[
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24"
    },
    ...
]
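Note that the numeric fields in the sample output arrive as strings. The sketch below shows one way to convert them, assuming `result` is a Python list of dicts shaped like the sample (if you receive a JSON string instead, pass it through `json.loads` first). The helpers `to_int` and `normalize` are hypothetical names, not part of Parsera.

```python
import json

# Sample shaped like the output shown above; real results depend on the page.
sample = [
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24",
    },
]


def to_int(value):
    """Convert a scraped value such as "104" to int, or None if it is not numeric."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None


def normalize(records, numeric_fields=("Points", "Comments")):
    """Return copies of the records with the given fields converted to int."""
    cleaned = []
    for record in records:
        row = dict(record)  # copy so the original record is left untouched
        for field in numeric_fields:
            if field in row:
                row[field] = to_int(row[field])
        cleaned.append(row)
    return cleaned


print(json.dumps(normalize(sample), indent=2))
```

Values that cannot be parsed become `None` rather than raising, which keeps one malformed record from discarding the whole batch.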
There is also an async arun method available:
result = await scraper.arun(url=url, elements=elements)
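An async method makes it possible to scrape several pages concurrently. The sketch below is one possible pattern using `asyncio.gather`; the helper `scrape_many` is a hypothetical name, and it accepts any object exposing an `arun(url=..., elements=...)` coroutine like the one above. Whether a single Parsera instance can safely be shared across concurrent calls is not stated here, so treat that as an assumption to verify.

```python
import asyncio


async def scrape_many(scraper, urls, elements):
    """Run scraper.arun for every URL concurrently and map each URL to its result.

    Hypothetical helper: `scraper` is any object with an async
    arun(url=..., elements=...) method.
    """
    tasks = [scraper.arun(url=url, elements=elements) for url in urls]
    results = await asyncio.gather(*tasks)  # results keep the order of `urls`
    return dict(zip(urls, results))
```

In a plain script this could be driven with `asyncio.run(scrape_many(Parsera(), urls, elements))`; inside a notebook, `await scrape_many(...)` directly instead.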
Running with Jupyter Notebook
Jupyter already runs its own event loop, so either place this code at the beginning of your notebook:
import nest_asyncio
nest_asyncio.apply()
Or, instead of calling the run method, use the async arun method.
Running with CLI
Before you run Parsera as a command-line tool, put your OPENAI_API_KEY in your environment variables or a .env file.
Usage
You can configure the elements to parse using a JSON string (--scheme) or a file (--file).
Optionally, you can provide a file to write the output to (--output) and the number of scrolls to perform on the page (--scrolls).
python -m parsera.main URL {--scheme '{"title":"h1"}' | --file FILENAME} [--scrolls SCROLLS] [--output FILENAME]
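Quoting a JSON string by hand on the command line is error-prone. As a rough sketch, the argument list can be assembled in Python using only the flags shown in the usage line above; `build_command` is a hypothetical helper, and the URL, element description, and `result.json` file name are illustrative values.

```python
import json
import shlex
import sys


def build_command(url, elements, scrolls=None, output=None):
    """Assemble the argument list for `python -m parsera.main`.

    Hypothetical helper; it only uses the flags from the usage line above.
    """
    args = [sys.executable, "-m", "parsera.main", url,
            "--scheme", json.dumps(elements)]
    if scrolls is not None:
        args += ["--scrolls", str(scrolls)]
    if output is not None:
        args += ["--output", output]
    return args


command = build_command(
    "https://news.ycombinator.com/",
    {"Title": "News title"},
    scrolls=1,
    output="result.json",
)
# shlex.join produces a shell-safe string for display or copy-pasting.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` avoids shell quoting entirely, since each argument is handed to the process as-is.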
Running in Docker
If you have issues with your local environment, you can run Parsera with Docker; see the documentation.