Data extraction SDK for Playwright 🐒🍌
🦍 Harambe Web extraction SDK 🦍
Harambe
Harambe is the extraction SDK for Reworkd. It provides a simple, unified interface and runtime for interacting with the web, shared by both manually written and automatically generated web extractors.
Setup and Installation
To install Harambe, clone the repository and install the requirements. All requirements are managed via poetry.
git clone https://github.com/reworkd/harambe.git
cd harambe
poetry install
Folder Structure
The scrapers folder contains all the scrapers. The harambe folder contains the SDK and utility functions.
Example Scraper
Generally scrapers come in two types, listing and detail scrapers. Listing scrapers are used to collect a list of items to scrape. Detail scrapers are used to scrape the details of a single item.
If all the items that you want to scrape are available on a single page, then you can use a detail scraper to scrape all the items. If the items are spread across multiple pages, then you will need to use a listing scraper to collect the items and then use a detail scraper to scrape the details of each item.
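The two-stage flow described above can be sketched without the SDK at all. The example below is a minimal, hypothetical illustration using only the standard library: listing_stage, detail_stage, run_pipeline, and the in-memory queue are made-up names that are not part of Harambe, and the URLs are placeholders.

```python
import asyncio
from typing import Any

# Hypothetical in-memory stand-in; Harambe's real queue and storage are not shown here.
Queue = list[tuple[str, dict[str, Any]]]


async def listing_stage(queue: Queue) -> None:
    """Collect the item URLs found on a listing page (hard-coded placeholders here)."""
    for item_id in (1, 2, 3):
        queue.append((f"https://example.org/{item_id}", {"source": "listing"}))


async def detail_stage(url: str, context: dict[str, Any]) -> dict[str, Any]:
    """Build one record for a single item, starting from the listing context."""
    return {**context, "url": url}


async def run_pipeline() -> list[dict[str, Any]]:
    queue: Queue = []
    await listing_stage(queue)  # stage 1: discover the items
    # stage 2: visit each discovered item once
    return [await detail_stage(url, ctx) for url, ctx in queue]


records = asyncio.run(run_pipeline())
print(len(records))
```

When every item lives on one page, stage 1 is unnecessary and a single detail function can emit all the records itself.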
ALL scrapers must be decorated with SDK.scraper. This decorator registers the scraper with the SDK and provides the SDK with the necessary information to run the scraper.
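To make the idea of registration concrete, here is a minimal sketch of how a registering decorator could work. It is an illustration only: the REGISTRY dictionary and this standalone scraper function are hypothetical and are not Harambe's actual implementation of SDK.scraper.

```python
from typing import Any, Callable

# Hypothetical registry keyed by (domain, stage); not Harambe's real data structure.
REGISTRY: dict[tuple[str, str], Callable[..., Any]] = {}


def scraper(domain: str, stage: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Return a decorator that records the function under its domain and stage."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[(domain, stage)] = func
        return func  # the function itself is left unchanged

    return decorator


@scraper(domain="https://example.org", stage="detail")
async def scrape(url: str) -> None:
    ...


print(REGISTRY[("https://example.org", "detail")] is scrape)
```

Keeping the metadata next to the function means a runner can look up the right scraper for a given domain and stage without any extra configuration file.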
Detail Only Scraper
Shown below is an example detail scraper. When a listing scraper is also used, the context parameter passes data from the listing scraper to the detail scraper (for example, a phone number). This detail-only example does not use it.
import asyncio
from typing import Any

from playwright.async_api import Page

from harambe import SDK
from harambe import PlaywrightUtils as Pu

SELECTORS = {
    "last_page": "",
    "list_view": "//div[@class='et_pb_blurb_content']",
    "name": "//h4/*[self::span or self::a]",
    "fax": ">Fax.*?strong>(.*?)<br>",
    # etc...
}


# Annotation registers the scraper with the SDK
@SDK.scraper(domain="https://apprhs.org/our-locations/", stage="detail")
async def scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page
    locations = await page.locator(SELECTORS["list_view"]).all()
    for location in locations:
        # Save the data to the database or file
        await sdk.save_data(
            {
                "name": await Pu.get_text(location, SELECTORS["name"]),
                "fax": await Pu.parse_by_regex(location, SELECTORS["fax"]),
                # etc...
            }
        )


if __name__ == "__main__":
    asyncio.run(SDK.run(scrape, "https://apprhs.org/our-locations/"))
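Note that the "fax" entry in SELECTORS is a regular expression, not an XPath. As a rough illustration of what that pattern captures, the sketch below applies it with the standard re module to a made-up HTML fragment. The fragment, the fax number, and the first_group helper are assumptions for illustration; how Pu.parse_by_regex applies the pattern internally is not shown in this document.

```python
import re
from typing import Optional

FAX_PATTERN = r">Fax.*?strong>(.*?)<br>"

# Made-up markup resembling a location card; the real page may differ.
SAMPLE_HTML = (
    "<p><strong>Phone:</strong> 555-0100<br>"
    "<strong>Fax:</strong> 555-0199<br></p>"
)


def first_group(pattern: str, html: str) -> Optional[str]:
    """Return the first capture group of the first match, stripped, or None."""
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


fax = first_group(FAX_PATTERN, SAMPLE_HTML)
print(fax)
```

The lazy quantifiers keep the match short: the pattern skips from the "Fax" label to the closing strong tag and then captures only the text up to the next line break tag.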
Listing Scraper
Shown below is an example listing scraper. Use SDK.enqueue to add URLs that need to be scraped by the detail scraper. The context parameter is used to pass data from the listing scraper to the detail scraper.
import asyncio
from typing import Any

from playwright.async_api import Page

from harambe import SDK

SELECTORS = {}


@SDK.scraper(domain="https://example.org", stage="listing")
async def scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page
    for url in [
        "https://example.org/1",
        "https://example.org/2",
        "https://example.org/3",
    ]:  # Imagine these are locators
        await sdk.enqueue(
            url,
            context={
                "phone": "123-456-7890",
                # Some data from the listing page that we want to pass to the detail page, (optional)
                "foo": "bar",
                "baz": "qux",
            },
        )


@SDK.scraper(domain="https://example.org", stage="detail")
async def scrape_detail(sdk: SDK, url: str, context: Any) -> None:
    page: Page = sdk.page
    # Grab all properties from the context
    detail = {**context}
    detail["fax"] = "123-456-7890"  # Some data grabbed from the detail page
    detail["type"] = "Hospital"  # Some data grabbed from the detail page
    await sdk.save_data(detail)  # Save the data to the database


if __name__ == "__main__":
    asyncio.run(SDK.run(scrape, "https://navicenthealth.org/locations"))
    asyncio.run(SDK.run_from_file(scrape_detail))
Using Cache
The code below is an example detail scraper that records a HAR cache during its first run and serves requests from that cache on subsequent runs, which improves speed and reduces bandwidth use. Requests that are not found in the HAR file fall back to the network.
import asyncio
import os.path
from typing import Any

from playwright.async_api import Page

from harambe import SDK
from harambe import PlaywrightUtils as Pu

HAR_FILE_PATH = "bananas.har"
SELECTORS = {
    "last_page": "",
    "list_view": "//div[@class='et_pb_blurb_content']",
    "name": "//h4/*[self::span or self::a]",
    "fax": ">Fax.*?strong>(.*?)<br>",
    # etc...
}


async def setup(sdk: SDK) -> None:
    page: Page = sdk.page
    already_cached = os.path.isfile(HAR_FILE_PATH)
    if already_cached:
        await page.route_from_har(HAR_FILE_PATH, not_found="fallback")
    else:
        await page.route_from_har(HAR_FILE_PATH, not_found="fallback", update=True)


# Annotation registers the scraper with the SDK
@SDK.scraper(domain="https://apprhs.org/our-locations/", stage="detail")
async def scrape(sdk: SDK, url: str, *args: Any, **kwargs: Any) -> None:
    page: Page = sdk.page
    locations = await page.locator(SELECTORS["list_view"]).all()
    for location in locations:
        # Save the data to the database or file
        await sdk.save_data(
            {
                "name": await Pu.get_text(location, SELECTORS["name"]),
                "fax": await Pu.parse_by_regex(location, SELECTORS["fax"]),
                # etc...
            }
        )


if __name__ == "__main__":
    asyncio.run(SDK.run(scrape, "https://apprhs.org/our-locations/", setup=setup))
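The record-or-replay decision made in setup can also be isolated into a small pure function, which makes it easy to check without launching a browser. This is a sketch, not part of Harambe: har_route_kwargs is a hypothetical helper that only builds the keyword arguments passed to Playwright's page.route_from_har.

```python
import os
import tempfile
from typing import Any


def har_route_kwargs(har_path: str) -> dict[str, Any]:
    """Build keyword arguments for page.route_from_har (hypothetical helper)."""
    kwargs: dict[str, Any] = {"not_found": "fallback"}
    if not os.path.isfile(har_path):
        # No cache yet: ask Playwright to record traffic into the HAR file.
        kwargs["update"] = True
    return kwargs


# Exercise both branches with a temporary directory instead of a real cache.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bananas.har")
    first_run = har_route_kwargs(path)  # file missing, so record
    open(path, "w").close()  # simulate the HAR written by the first run
    later_run = har_route_kwargs(path)  # file present, so replay

print(first_run, later_run)
```

Inside setup, such a helper would be used as await page.route_from_har(HAR_FILE_PATH, **har_route_kwargs(HAR_FILE_PATH)).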
Running a Scraper
You can use poetry to run a scraper. The SDK.run method takes the scraper function and the URL to scrape. The SDK.run_from_file method takes the scraper function and the path to the file containing the URLs to scrape.
poetry run python scrapers/medical/apprhs.py
Submitting a PR
Before submitting a PR, please run the following commands to ensure that your code is formatted correctly.
make FORMAT LINT
Happy extraction! 🦍