Data extraction SDK for Playwright 🐒🍌

🦍 Harambe Web extraction SDK 🦍

Harambe

Harambe is the extraction SDK for Reworkd. It provides a unified interface and runtime for both manually written and automatically generated web extractors.



Setup and Installation

To install Harambe, clone the repository and install the dependencies from inside the project directory. All dependencies are managed with Poetry.

git clone https://github.com/reworkd/harambe.git
cd harambe
poetry install

Example Scraper

Scrapers generally come in two types: listing scrapers and detail scrapers. A listing scraper collects the URLs of the items to scrape. A detail scraper extracts the details of a single item.

If every item you want is available on a single page, a detail scraper alone can extract them all. If the items are spread across multiple pages, use a listing scraper to collect the item URLs, then a detail scraper to extract the details of each item.
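
The two-stage flow described above can be sketched without a browser. The block below is a minimal illustration, not Harambe code: the in-memory `LISTING_PAGES` and `DETAIL_PAGES` tables and the `listing_stage`, `detail_stage`, and `run_pipeline` functions are all hypothetical names standing in for real pages and for the SDK.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical in-memory "site": listing pages link to detail pages.
# None of these names come from Harambe; they only illustrate the flow.
LISTING_PAGES = {
    "/list?page=1": ["/item/1", "/item/2"],
    "/list?page=2": ["/item/3"],
}
DETAIL_PAGES = {
    "/item/1": {"title": "First"},
    "/item/2": {"title": "Second"},
    "/item/3": {"title": "Third"},
}


async def listing_stage(url: str, enqueue: Callable[[str], Awaitable[None]]) -> None:
    """Collect the item URLs found on one listing page."""
    for item_url in LISTING_PAGES[url]:
        await enqueue(item_url)


async def detail_stage(url: str) -> dict:
    """Extract the fields of a single item."""
    return {"url": url, **DETAIL_PAGES[url]}


async def run_pipeline() -> list[dict]:
    queue: list[str] = []

    async def enqueue(item_url: str) -> None:
        queue.append(item_url)

    # Stage 1: walk every listing page and gather item URLs.
    for listing_url in LISTING_PAGES:
        await listing_stage(listing_url, enqueue)

    # Stage 2: visit each gathered URL and extract its details.
    return [await detail_stage(item_url) for item_url in queue]


if __name__ == "__main__":
    for row in asyncio.run(run_pipeline()):
        print(row)
```

The listing stage never extracts item fields and the detail stage never discovers new items, which is the separation the two scraper types are meant to keep.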

ALL scrapers must be decorated with SDK.scraper. This decorator registers the scraper with the SDK and provides the SDK with the necessary information to run the scraper.
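
To make the idea of a registering decorator concrete, here is a minimal sketch of how such a decorator could attach a domain and stage to an async function. It is an assumption for illustration only: the `scraper` function and `REGISTRY` dict below are hypothetical and do not show how `SDK.scraper` is actually implemented.

```python
import asyncio
from typing import Any, Awaitable, Callable

ScraperFn = Callable[..., Awaitable[None]]

# Hypothetical registry keyed by (domain, stage); not Harambe's internals.
REGISTRY: dict[tuple[str, str], ScraperFn] = {}


def scraper(domain: str, stage: str) -> Callable[[ScraperFn], ScraperFn]:
    """Attach metadata to an async function and record it in REGISTRY."""

    def decorator(func: ScraperFn) -> ScraperFn:
        func.domain = domain  # type: ignore[attr-defined]
        func.stage = stage  # type: ignore[attr-defined]
        REGISTRY[(domain, stage)] = func
        return func

    return decorator


@scraper(domain="https://example.com", stage="detail")
async def scrape(url: str, *args: Any, **kwargs: Any) -> None:
    print(f"scraping {url}")


if __name__ == "__main__":
    # A runner can look the function up by its metadata and call it.
    registered = REGISTRY[("https://example.com", "detail")]
    asyncio.run(registered("https://example.com/page"))
```

Because the decorator returns the original function unchanged, the scraper can still be called directly in tests.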

Detail Only Scraper

Shown below is an example detail scraper. The context parameter is used to pass data from the listing scraper to the detail scraper. In this example, the context parameter is used to pass the phone

import asyncio
from typing import Any
from playwright.async_api import Page
from harambe import SDK

@SDK.scraper(
    domain="https://food.kp.gov.pk",
    stage="detail",
)
async def scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page
    await page.goto(url)
    await page.wait_for_selector("#main_content a")
    cards = await page.query_selector_all("#main_content a")
    for card in cards:
        title = await card.inner_text()
        href = await card.get_attribute("href")
        if title and href:
            await sdk.save_data(
                {
                    "title": title,
                    "document_url": href,
                }
            )


if __name__ == "__main__":
    asyncio.run(
        SDK.run(
            scrape,
            "https://food.kp.gov.pk/page/rules_and_regulations",
            schema={},
        )
    )
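
The `if title and href` check above skips cards with a missing field. Text read from a page often carries stray newlines and padding, so a slightly stricter variant might clean the values before deciding. The helper below is a hypothetical sketch (`build_record` is not part of Harambe) and its result could be passed to `sdk.save_data` when it is not `None`.

```python
from typing import Optional


def build_record(title: Optional[str], href: Optional[str]) -> Optional[dict]:
    """Return a cleaned record, or None when a required field is missing."""
    clean_title = " ".join((title or "").split())  # collapse runs of whitespace
    clean_href = (href or "").strip()
    if not clean_title or not clean_href:
        return None
    return {"title": clean_title, "document_url": clean_href}
```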

Listing Scraper

Shown below is an example listing scraper. Use SDK.enqueue to add URLs that need to be scraped by the detail scraper. The context parameter is used to pass data from the listing scraper to the detail scraper.

import asyncio
from typing import Any
from playwright.async_api import Page
from harambe import SDK
from harambe import PlaywrightUtils as Pu

@SDK.scraper(
    domain="https://kpcode.kp.gov.pk",
    stage="listing",
)
async def listing_scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page
    await page.wait_for_selector(".artlist a")
    docs = await page.query_selector_all(".artlist a")
    for doc in docs:
        href = await doc.get_attribute("href")
        await sdk.enqueue(href)

    async def pager():
        next_page_element = await page.query_selector("li[title='Next'] > a")
        return next_page_element

    await sdk.paginate(pager)


@SDK.scraper(
    domain="https://kpcode.kp.gov.pk",
    stage="detail",
)
async def detail_scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page
    await page.wait_for_selector(".header_h2")
    title = await Pu.get_text(page, ".header_h2")
    link = await Pu.get_link(page, "a[href*=pdf]")
    await sdk.save_data({"title": title, "document_url": link})


if __name__ == "__main__":
    asyncio.run(
        SDK.run(
            listing_scrape,
            "https://kpcode.kp.gov.pk/homepage/list_all_law_and_rule/879351",
            headless=False,
            schema={},
        )
    )
    asyncio.run(SDK.run_from_file(detail_scrape, schema={}))
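
The `href` values read in the listing scraper may be relative, empty, or repeated. Whether SDK.enqueue resolves or deduplicates links itself is not stated here, so the sketch below assumes it does not and shows one way to normalize links with the standard library first. The `absolute_links` helper is a hypothetical name, not a Harambe function.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin


def absolute_links(page_url: str, hrefs: list[Optional[str]]) -> list[str]:
    """Resolve hrefs against page_url, dropping blanks and duplicates."""
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href yield None
        # Resolve relative links, then drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href.strip()))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    print(absolute_links("https://example.com/laws/list/1", ["/a/1", None, "b/2#top"]))
```

Inside a listing scraper, the helper would be called with `page.url` and the collected `href` values before each result is enqueued.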

Using Cache

The code below is an example detail scraper that relies on a HAR cache. It records the cache during the initial run, then uses it as the data source on subsequent runs to improve speed and consume less bandwidth.

import asyncio
import os.path
from typing import Any

from playwright.async_api import Page

from harambe import SDK
from harambe import PlaywrightUtils as Pu

HAR_FILE_PATH = "bananas.har"
SELECTORS = {
    "last_page": "",
    "list_view": "//div[@class='et_pb_blurb_content']",
    "name": "//h4/*[self::span or self::a]",
    "fax": ">Fax.*?strong>(.*?)<br>",
    # etc...
}


async def setup(sdk: SDK) -> None:
    page: Page = sdk.page

    already_cached = os.path.isfile(HAR_FILE_PATH)

    if already_cached:
        await page.route_from_har(HAR_FILE_PATH, not_found="fallback")
    else:
        await page.route_from_har(HAR_FILE_PATH, not_found="fallback", update=True)


# The decorator registers the scraper with the SDK
@SDK.scraper(domain="https://apprhs.org/our-locations/", stage="detail")
async def scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page

    locations = await page.locator(SELECTORS["list_view"]).all()
    for location in locations:
        # Save the data to the database or file
        await sdk.save_data(
            {
                "name": await Pu.get_text(location, SELECTORS["name"]),
                "fax": await Pu.parse_by_regex(location, SELECTORS["fax"]),
                # etc...
            }
        )


if __name__ == "__main__":
    asyncio.run(SDK.run(scrape, "https://apprhs.org/our-locations/", setup=setup, schema={}))
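
After the first run it can be useful to check what the recording actually captured. A HAR file is JSON with the recorded requests under `log.entries`, so a small standard-library helper can list the cached URLs. This is a sketch under the assumption that the recording is a plain `.har` file (not a zipped archive); `cached_urls` is a hypothetical helper, not part of Harambe or Playwright.

```python
import json
from pathlib import Path


def cached_urls(har_path: str) -> list[str]:
    """Return the request URLs recorded in a HAR file, in recorded order."""
    path = Path(har_path)
    if not path.is_file():
        return []  # nothing has been recorded yet
    har = json.loads(path.read_text(encoding="utf-8"))
    return [entry["request"]["url"] for entry in har["log"]["entries"]]


if __name__ == "__main__":
    for url in cached_urls("bananas.har"):
        print(url)
```

An empty result on the second run suggests the recording step did not write the file, in which case the scraper is still fetching everything from the network.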

Running a Scraper

You can use Poetry to run a scraper. SDK.run takes the scraper function and the URL to scrape. SDK.run_from_file takes the scraper function and the path to the file containing the URLs to scrape.

poetry run python <path_to_your_file>

Happy extraction! 🦍
