Lightweight library for scraping websites with LLMs

📦 Parsera

Lightweight Python library for scraping websites with LLMs. You can test it on the Parsera website.

Why Parsera?

Because it's simple and lightweight, with an interface as simple as:

scraper = Parsera()
result = scraper.run(url=url, elements=elements)

Installation

pip install parsera
playwright install

Documentation

Check out the documentation to learn more about other features, such as running custom models and Playwright scripts.

Basic usage

First, set the PARSERA_API_KEY environment variable (to run a custom LLM instead, see Custom Models). You can do this from Python with:

import os

os.environ["PARSERA_API_KEY"] = "YOUR_PARSERA_API_KEY_HERE"

Next, you can run a basic version:

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scraper = Parsera()
result = scraper.run(url=url, elements=elements)

The result variable will contain JSON with a list of records:

[
   {
      "Title":"Hacking the largest airline and hotel rewards platform (2023)",
      "Points":"104",
      "Comments":"24"
   },
    ...
]
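The values in the example output are strings, so numeric fields usually need converting before you sort or filter on them. The sketch below is one way to do that. It assumes the result is either a list of dicts or a JSON string encoding one; the helper names `to_int` and `normalize` are hypothetical and not part of Parsera.

```python
import json


def to_int(value, default=0):
    """Convert a scraped numeric string such as "104" to int."""
    try:
        return int(str(value).replace(",", "").strip())
    except ValueError:
        # Missing or non-numeric values fall back to the default.
        return default


def normalize(result):
    """Return records with integer Points and Comments fields.

    Assumption: result is a list of dicts, or a JSON string of one.
    """
    records = json.loads(result) if isinstance(result, str) else result
    return [
        {
            "Title": record.get("Title", ""),
            "Points": to_int(record.get("Points")),
            "Comments": to_int(record.get("Comments")),
        }
        for record in records
    ]


# Sample record copied from the example output above.
sample = [
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24",
    }
]

cleaned = normalize(sample)
top = sorted(cleaned, key=lambda record: record["Points"], reverse=True)
```

Falling back to a default instead of raising keeps one malformed record from discarding the rest of the page.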

There is also an async arun method available:

result = await scraper.arun(url=url, elements=elements)
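Outside a notebook, `await` has to run inside a coroutine. A minimal sketch of wrapping the call for a regular script follows. The `scrape` wrapper is a hypothetical name, and `parsera` is imported inside it only so that the module loads even where the package is not installed.

```python
import asyncio


async def scrape(url, elements):
    """Fetch one page with Parsera's async arun method."""
    # Lazy import: an illustration choice, not a Parsera requirement.
    from parsera import Parsera

    scraper = Parsera()
    return await scraper.arun(url=url, elements=elements)


# In a script, drive the coroutine with the standard event loop runner:
# result = asyncio.run(scrape("https://news.ycombinator.com/", {"Title": "News title"}))
```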

Running with Jupyter Notebook:

Either place this code at the beginning of your notebook:

import nest_asyncio
nest_asyncio.apply()

Or call the async arun method instead of run.

Running with CLI

Before you run Parsera as a command-line tool, put your OPENAI_API_KEY in your environment variables or in a .env file.

Usage

You can specify the elements to parse with a JSON string or a FILE. Optionally, you can provide a FILE to write the output to and the number of SCROLLS to perform on the page.

python -m parsera.main URL {--scheme '{"title":"h1"}' | --file FILENAME} [--scrolls SCROLLS] [--output FILENAME]
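Quoting a JSON string by hand on the command line is error-prone, so it can help to build the argument list in Python. The sketch below uses only the flags shown in the usage line above; the `build_command` helper is hypothetical, and passing element descriptions as the scheme values is an assumption carried over from the basic usage example.

```python
import json
import shlex


def build_command(url, elements, scrolls=None, output=None):
    """Build an argument list for `python -m parsera.main`."""
    cmd = ["python", "-m", "parsera.main", url, "--scheme", json.dumps(elements)]
    if scrolls is not None:
        cmd += ["--scrolls", str(scrolls)]
    if output is not None:
        cmd += ["--output", output]
    return cmd


cmd = build_command(
    "https://news.ycombinator.com/",
    {"Title": "News title"},
    scrolls=2,
    output="result.json",
)

# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

The same list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting altogether.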

Running in Docker

If you have issues with your local environment, you can run Parsera with Docker; see the documentation.
