Data extraction SDK for Playwright 🐒🍌


🦍 Harambe Web extraction SDK 🦍

Harambe

Harambe is the extraction SDK for Reworkd. It provides a simple, unified interface and runtime for interacting with the web, shared by both manually written and automatically generated web extractors.

Setup and Installation
Folder Structure
Example Scraper
Running a Scraper
Submitting a PR

Setup and Installation

To install Harambe, clone the repository and install the requirements. All requirements are managed via poetry.

git clone https://github.com/reworkd/harambe.git
cd harambe
poetry install

Folder Structure

The scrapers folder contains all the scrapers. The harambe folder contains the SDK and utility functions.

Example Scraper

Scrapers generally come in two types: listing scrapers and detail scrapers. A listing scraper collects the list of items to scrape. A detail scraper extracts the details of a single item.

If all the items that you want to scrape are available on a single page, then you can use a detail scraper to scrape all the items. If the items are spread across multiple pages, then you will need to use a listing scraper to collect the items and then use a detail scraper to scrape the details of each item.
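
The two-stage flow described above can be sketched without the SDK at all. The snippet below is a toy model, not Harambe code: FAKE_SITE, listing_stage, detail_stage, and run_pipeline are hypothetical names invented for illustration, and a plain in-memory queue stands in for whatever the SDK uses to hand URLs from the listing stage to the detail stage.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Tuple

# Hypothetical stand-in for a website: one listing page pointing at two detail pages.
FAKE_SITE: Dict[str, Any] = {
    "https://example.org/locations": ["https://example.org/1", "https://example.org/2"],
    "https://example.org/1": {"name": "Clinic A"},
    "https://example.org/2": {"name": "Clinic B"},
}


def listing_stage(url: str, enqueue: Callable[[str, Dict[str, Any]], None]) -> None:
    """Collect detail URLs from a listing page and queue each one with some context."""
    for detail_url in FAKE_SITE[url]:
        enqueue(detail_url, {"source": url})


def detail_stage(url: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Scrape one item and merge in the context passed from the listing stage."""
    return {**context, **FAKE_SITE[url], "url": url}


def run_pipeline(start_url: str) -> List[Dict[str, Any]]:
    """Run the listing stage once, then the detail stage for every queued URL."""
    queue: Deque[Tuple[str, Dict[str, Any]]] = deque()
    listing_stage(start_url, lambda u, ctx: queue.append((u, ctx)))
    results = []
    while queue:
        url, context = queue.popleft()
        results.append(detail_stage(url, context))
    return results


rows = run_pipeline("https://example.org/locations")
```

Each entry in rows combines the context supplied by the listing stage with the fields found on the detail page, which is the same shape of hand-off that the listing and detail examples later in this README perform through the SDK.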

ALL scrapers must be decorated with SDK.scraper. This decorator registers the scraper with the SDK and provides the SDK with the necessary information to run the scraper.
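
For readers unfamiliar with registering decorators, here is a minimal sketch of the general pattern. It is an assumption-laden illustration, not Harambe's implementation: REGISTRY and this standalone scraper function are hypothetical, and the real SDK.scraper may store different information.

```python
from typing import Any, Callable, Dict

# Hypothetical registry keyed by function name; not Harambe's actual internals.
REGISTRY: Dict[str, Dict[str, Any]] = {}


def scraper(domain: str, stage: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Return a decorator that records the function with its domain and stage."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[func.__name__] = {"domain": domain, "stage": stage, "func": func}
        return func  # The function itself is returned unchanged.

    return decorator


@scraper(domain="https://example.org", stage="listing")
def scrape_listing(url: str) -> str:
    return f"listing:{url}"


@scraper(domain="https://example.org", stage="detail")
def scrape_detail(url: str) -> str:
    return f"detail:{url}"


# A runner could then look scrapers up by stage.
detail_scrapers = [name for name, meta in REGISTRY.items() if meta["stage"] == "detail"]
```

The point of the pattern is that decorating a function is enough for a runner to discover it later, along with the metadata it needs to decide when to call it.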

Detail Only Scraper

Shown below is an example detail scraper. The context parameter carries data from a listing scraper to the detail scraper (for example, a phone number collected on the listing page). This detail-only example has no listing stage, so it accepts extra arguments but does not use them.

import asyncio
from typing import Any

from playwright.async_api import Page

from harambe import SDK
from harambe import PlaywrightUtils as Pu

SELECTORS = {
    "last_page": "",
    "list_view": "//div[@class='et_pb_blurb_content']",
    "name": "//h4/*[self::span or self::a]",
    "fax": ">Fax.*?strong>(.*?)<br>",
    # etc...
}


# Annotation registers the scraper with the SDK
@SDK.scraper(domain="https://apprhs.org/our-locations/", stage="detail")
async def scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page

    locations = await page.locator(SELECTORS["list_view"]).all()
    for location in locations:
        # Save the data to the database or file
        await sdk.save_data(
            {
                "name": await Pu.get_text(location, SELECTORS["name"]),
                "fax": await Pu.parse_by_regex(location, SELECTORS["fax"]),
                # etc...
            }
        )


if __name__ == "__main__":
    asyncio.run(SDK.run(scrape, "https://apprhs.org/our-locations/"))
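
To see what the "fax" pattern in SELECTORS is meant to capture, the regular expression can be tried on its own with Python's re module. This is only a sketch: SAMPLE_HTML is made-up markup, first_group is a hypothetical helper, and nothing here claims to show how Pu.parse_by_regex works internally.

```python
import re
from typing import Optional

# Same pattern as SELECTORS["fax"] in the example above.
FAX_PATTERN = r">Fax.*?strong>(.*?)<br>"

# Made-up markup for illustration; real pages will differ.
SAMPLE_HTML = (
    "<p><strong>Phone:</strong> 555-0100<br>"
    "<strong>Fax:</strong> 555-0199<br></p>"
)


def first_group(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None if nothing matches."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


fax = first_group(FAX_PATTERN, SAMPLE_HTML)
missing = first_group(FAX_PATTERN, "<p>No fax listed</p>")
```

The lazy quantifiers (.*?) keep the match short: the pattern skips from the "Fax" label to the end of its strong tag and captures only the text up to the next line break.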

Listing Scraper

Shown below is an example listing scraper. Use SDK.enqueue to add URLs that need to be scraped by the detail scraper. The context parameter is used to pass data from the listing scraper to the detail scraper.

import asyncio
from typing import Any

from playwright.async_api import Page

from harambe import SDK

SELECTORS = {}


@SDK.scraper(domain="https://example.org", stage="listing")
async def scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page

    for url in [
        "https://example.org/1",
        "https://example.org/2",
        "https://example.org/3",
    ]:  # Imagine these are locators
        await sdk.enqueue(
            url,
            context={
                "phone": "123-456-7890",
                # Some data from the listing page that we want to pass to the detail page, (optional)
                "foo": "bar",
                "baz": "qux",
            },
        )


@SDK.scraper(domain="https://example.org", stage="detail")
async def scrape_detail(sdk: SDK, url: str, context: Any) -> None:
    page: Page = sdk.page

    # Grab all properties from the context
    detail = {**context}

    detail["fax"] = "123-456-7890"  # Some data grabbed from the detail page
    detail["type"] = "Hospital"  # Some data grabbed from the detail page
    await sdk.save_data(detail)  # Save the data to the database


if __name__ == "__main__":
    asyncio.run(SDK.run(scrape, "https://navicenthealth.org/locations"))
    asyncio.run(SDK.run_from_file(scrape_detail))

Using Cache

The code below is an example detail scraper that relies on a HAR cache. The cache is created during the initial run; subsequent runs use it as the data source, which improves speed and consumes less bandwidth.

import asyncio
import os.path
from typing import Any

from playwright.async_api import Page

from harambe import SDK
from harambe import PlaywrightUtils as Pu

HAR_FILE_PATH = "bananas.har"
SELECTORS = {
    "last_page": "",
    "list_view": "//div[@class='et_pb_blurb_content']",
    "name": "//h4/*[self::span or self::a]",
    "fax": ">Fax.*?strong>(.*?)<br>",
    # etc...
}


async def setup(sdk: SDK) -> None:
    page: Page = sdk.page

    already_cached = os.path.isfile(HAR_FILE_PATH)

    if already_cached:
        await page.route_from_har(HAR_FILE_PATH, not_found="fallback")
    else:
        await page.route_from_har(HAR_FILE_PATH, not_found="fallback", update=True)


# Annotation registers the scraper with the SDK
@SDK.scraper(domain="https://apprhs.org/our-locations/", stage="detail")
async def scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page

    locations = await page.locator(SELECTORS["list_view"]).all()
    for location in locations:
        # Save the data to the database or file
        await sdk.save_data(
            {
                "name": await Pu.get_text(location, SELECTORS["name"]),
                "fax": await Pu.parse_by_regex(location, SELECTORS["fax"]),
                # etc...
            }
        )


if __name__ == "__main__":
    asyncio.run(SDK.run(scrape, "https://apprhs.org/our-locations/", setup=setup))
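
The branching in setup can also be expressed as a small pure function, which makes the caching decision easy to test without launching a browser. This is a sketch under assumptions: har_route_kwargs is a hypothetical helper name, and the temporary file below only simulates a HAR file existing on disk.

```python
import os
import tempfile
from typing import Any, Dict


def har_route_kwargs(har_path: str) -> Dict[str, Any]:
    """Build keyword arguments for Playwright's page.route_from_har."""
    kwargs: Dict[str, Any] = {"not_found": "fallback"}
    if not os.path.isfile(har_path):
        # No cache yet: ask Playwright to record network traffic into the HAR file.
        kwargs["update"] = True
    return kwargs


with tempfile.TemporaryDirectory() as tmp:
    har_path = os.path.join(tmp, "bananas.har")
    first_run = har_route_kwargs(har_path)  # HAR file does not exist yet
    with open(har_path, "w") as fh:
        fh.write("{}")  # Placeholder content, just so the file exists
    later_run = har_route_kwargs(har_path)  # HAR file now exists
```

Inside setup, the result could then be passed along as await page.route_from_har(HAR_FILE_PATH, **har_route_kwargs(HAR_FILE_PATH)), which behaves the same as the if/else above.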

Running a Scraper

You can use poetry to run a scraper. SDK.run takes the scraper function and the URL to scrape. SDK.run_from_file takes the scraper function and the path to the file containing the URLs to scrape.

poetry run python scrapers/medical/apprhs.py

Submitting a PR

Before submitting a PR, please run the following commands to ensure that your code is formatted correctly.

make FORMAT LINT

Happy extraction! 🦍
