A simple and light Playwright-based scraper

pw-simple-scraper

A lightweight, easy-to-use web scraper built with Python and Playwright

PyPI Python License: MIT

View in Korean


Overview

  • pw-simple-scraper extracts the elements you select from a web page.
  • Provide a URL and a CSS selector, and it returns the matching elements as a list of strings.
  • The result is wrapped in a ScrapeResult object. You can access the extracted values via .result (List[str]).


Installation

# 1. Install Playwright
pip install playwright

# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium

# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium

# 3. Install pw-simple-scraper
pip install pw-simple-scraper
  • Since this scraper is built on top of Playwright, both the Playwright library and the Chromium browser are required.
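  • To confirm that the Python packages from the steps above are importable before running a scrape, a small check like the following can help. This is only a sketch: the helper name missing_packages is hypothetical, and the import name pw_simple_scraper is an assumption based on the distribution file name. It does not verify that the Chromium browser binary itself was downloaded.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "pw_simple_scraper" is the assumed import name of this package.
    missing = missing_packages(["playwright", "pw_simple_scraper"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```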


Usage

from pw_simple_scraper import scrape_context, scrape_href

# Extract text
res = scrape_context("https://example.com", "h3")
print(res.result)   # ['h3-type-content1', 'h3-type-content2', ...]
print(res.count)    # n (number of scraped elements)

# Extract links
links = scrape_href("https://example.com", "a")
print(links.result) # ['https://www.iana.org/domains/example', ...]

# Apply timeout option (default: 30 seconds)
scrape_context("https://example.com", "something", timeout=10) # 10 seconds
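
Link lists often contain duplicates, fragments, or relative paths. The README does not say whether scrape_href normalizes its output, so the sketch below shows one way to clean up a list of href strings with the standard library only. The helper normalize_links is hypothetical and not part of pw-simple-scraper.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urljoin


def normalize_links(base_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against base_url, drop #fragments, and de-duplicate in order."""
    seen = set()
    cleaned: List[str] = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if absolute and absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


if __name__ == "__main__":
    # In real use, pass links.result from scrape_href instead of this sample list.
    sample = ["/a", "b#top", "https://example.com/a", "b"]
    print(normalize_links("https://example.com/docs/", sample))
```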

The result is a ScrapeResult object

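from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class ScrapeResult:
    url: str                # Page that was scraped
    selector: str           # CSS selector that was applied
    result: List[str]       # Extracted values
    count: int              # Number of values
    fetched_at: datetime    # Execution timestamp (UTC)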
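
Because ScrapeResult is shown as a dataclass, it can likely be serialized with dataclasses.asdict. The sketch below uses a local stand-in class that mirrors the fields listed above so that it runs without a browser; the helper to_json is hypothetical and not part of the package.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class ScrapeResult:  # local stand-in mirroring the documented fields
    url: str
    selector: str
    result: List[str]
    count: int
    fetched_at: datetime


def to_json(res: ScrapeResult) -> str:
    """Serialize a result to JSON, writing the timestamp in ISO 8601 form."""
    payload = asdict(res)
    # datetime is not JSON-serializable by default, so convert it explicitly.
    payload["fetched_at"] = res.fetched_at.isoformat()
    return json.dumps(payload, ensure_ascii=False)


sample = ScrapeResult(
    url="https://example.com",
    selector="h1",
    result=["Example Domain"],
    count=1,
    fetched_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
)

if __name__ == "__main__":
    print(to_json(sample))
```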


FAQ

  • Installed but browser fails to launch

    • Install the browser with python -m playwright install chromium. On Linux, add the --with-deps option so the required system dependencies are installed as well.
  • RuntimeError: All strategies failed

    • This may happen if the selector doesn’t exist or the page loads slowly. Double-check your selector and try increasing the timeout.
  • Scraping inside iframe

    • Planned for future support.
  • XPath support

    • Planned for future support.
  • robots.txt support

    • Will be added as a configurable option in the future.
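
For the "RuntimeError: All strategies failed" entry above, one way to "try increasing the timeout" is to retry with progressively longer values. The wrapper below is a minimal sketch: scrape_with_retries is a hypothetical helper, and it only assumes that the scrape function accepts (url, selector, timeout=...) and raises RuntimeError on failure, as the Usage and FAQ sections suggest.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def scrape_with_retries(
    scrape: Callable[..., T],
    url: str,
    selector: str,
    timeouts: Sequence[float] = (10, 30, 60),
) -> T:
    """Call scrape(url, selector, timeout=t) with each timeout until one succeeds."""
    last_error: Optional[RuntimeError] = None
    for timeout in timeouts:
        try:
            return scrape(url, selector, timeout=timeout)
        except RuntimeError as exc:
            # Remember the failure and move on to the next, longer timeout.
            last_error = exc
    raise RuntimeError(
        f"gave up on {url!r} after timeouts {list(timeouts)}"
    ) from last_error
```

In real use this would be called as scrape_with_retries(scrape_context, "https://example.com", "h3"). If every attempt fails, the selector itself is the more likely culprit.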

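Until robots.txt handling is available as an option, a check can be done by hand before scraping with Python's standard urllib.robotparser. This is a sketch, not a feature of pw-simple-scraper; allowed_by_robots is a hypothetical helper that takes robots.txt content you have already fetched.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt content permits fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    for target in ("https://example.com/public", "https://example.com/private/page"):
        print(target, allowed_by_robots(rules, target))
```

To check a live site instead, RobotFileParser also provides set_url() and read(), which download the robots.txt file for you.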


Download files

Source Distribution

pw_simple_scraper-0.1.0.tar.gz (15.4 kB)

Uploaded Source

Built Distribution

pw_simple_scraper-0.1.0-py3-none-any.whl (10.4 kB)

Uploaded Python 3

File details

Details for the file pw_simple_scraper-0.1.0.tar.gz.

File metadata

  • Download URL: pw_simple_scraper-0.1.0.tar.gz
  • Upload date:
  • Size: 15.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pw_simple_scraper-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d1bbbe34064fabaa709f845be263d672d02b5827a8d60fd8ad4de781af3c8b44
MD5 854b9f14e3474efab61b190d2c06a23c
BLAKE2b-256 ddbf0c992046be3a97549aa66d00dfb2ae2cac291e9369e0b3d68a4807e0fe96

Provenance

The following attestation bundles were made for pw_simple_scraper-0.1.0.tar.gz:

Publisher: release.yml on elecbrandy/pw-simple-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pw_simple_scraper-0.1.0-py3-none-any.whl.

File hashes

Hashes for pw_simple_scraper-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1312a574c01f47e91f8a9f458c3f65feb06fc4072f53967cb94baf838cd2ea6c
MD5 05f3d5f85de91fc4f0b83e0e4e51620e
BLAKE2b-256 9bfd7160511ca6de12bf704daad2e2f770c9e33b39d1aafbb2d3899347f95839

Provenance

The following attestation bundles were made for pw_simple_scraper-0.1.0-py3-none-any.whl:

Publisher: release.yml on elecbrandy/pw-simple-scraper
