A simple and light Playwright-based scraper

pw-simple-scraper

A lightweight, easy-to-use web scraper built with Python and Playwright

License: MIT

View in Korean


Overview

  • pw-simple-scraper extracts the elements you select from a web page.
  • Provide a URL and a CSS selector, and it returns the matching elements as a list of strings.
  • The result is wrapped in a ScrapeResult object. The extracted values are available via .result (List[str]).


Installation

# 1. Install Playwright
pip install playwright

# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium

# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium

# 3. Install pw-simple-scraper
pip install pw-simple-scraper
  • Since this scraper is built on top of Playwright, both the Playwright library and the Chromium browser are required.
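
As a quick sanity check after the steps above, the following sketch reports whether the two Python packages can be located. It uses only the standard library. The helper name is_installed is hypothetical and not part of pw-simple-scraper, and the import name pw_simple_scraper is an assumption based on the distribution file name. It does not verify that the Chromium browser itself was downloaded.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names below are assumptions about the installed import names.
    for name in ("playwright", "pw_simple_scraper"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If either package is reported as missing, rerun the corresponding pip install step in the same Python environment.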


Usage

from pw_simple_scraper import scrape_context, scrape_attr

# Extract text
res = scrape_context("https://example.com", "h3")
print(res.result)   # ['h3-type-content1', 'h3-type-content2', ...]
print(res.count)    # n (number of scraped elements)

# Extract links by attribute (href, ...)
links = scrape_attr("https://example.com", "a", "href")
print(links.result) # ['https://www.iana.org/domains/example', ...]

# Apply timeout option (default: 30 seconds)
scrape_context("https://example.com", "something", timeout=10) # 10 seconds
links = scrape_attr("https://example.com", "a", "href", timeout=20) # 20 seconds

Result is a ScrapeResult object

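from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class ScrapeResult:
    url: str                # Page that was scraped
    selector: str           # CSS selector that was used
    result: List[str]       # Extracted values
    count: int              # Number of values
    fetched_at: datetime    # Execution timestamp (UTC)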
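
Because ScrapeResult is a plain dataclass, its fields can be converted to a dictionary and written out as JSON. The sketch below is an illustration only: it defines a local stand-in with the same fields instead of importing the library's class, fills it with hand-written values rather than a real scrape, and the helper to_json is hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class ScrapeResult:
    """Local stand-in mirroring the fields listed above (not the library's class)."""
    url: str
    selector: str
    result: List[str]
    count: int
    fetched_at: datetime


def to_json(res: ScrapeResult) -> str:
    """Serialize a result; datetime is not JSON-native, so emit ISO 8601 text."""
    data = asdict(res)
    data["fetched_at"] = res.fetched_at.isoformat()
    return json.dumps(data, ensure_ascii=False)


# Hand-built example values, used here instead of a real scrape.
sample = ScrapeResult(
    url="https://example.com",
    selector="h1",
    result=["Example Domain"],
    count=1,
    fetched_at=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(to_json(sample))
```

The same to_json helper should work on an object returned by the library, provided its fields match the definition shown above.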


FAQ

  • Installed but browser fails to launch

    • Install the browser with python -m playwright install chromium. On Linux, add the --with-deps option so the required system dependencies are installed as well.
  • RuntimeError: All strategies failed

    • This may happen if the selector doesn’t exist or the page loads slowly. Double-check your selector and try increasing the timeout.
  • Scraping inside iframe

    • Planned for future support.
  • xpath support

    • Planned for future support.
  • robots.txt support

    • Will be added as a configurable option in the future.
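
For the RuntimeError: All strategies failed case above, one way to "try increasing the timeout" is to retry with progressively longer timeouts. This is a minimal sketch, not part of the library: scrape_with_retry is a hypothetical helper, and it assumes the scrape function accepts timeout as a keyword argument in seconds (as in the Usage section) and raises RuntimeError on failure. A stand-in function replaces the real scraper so the example runs without a browser.

```python
from typing import Any, Callable, Optional, Sequence


def scrape_with_retry(
    scrape_fn: Callable[..., Any],
    url: str,
    selector: str,
    timeouts: Sequence[int] = (10, 30, 60),
) -> Any:
    """Call scrape_fn with each timeout in turn until one succeeds."""
    last_error: Optional[RuntimeError] = None
    for timeout in timeouts:
        try:
            return scrape_fn(url, selector, timeout=timeout)
        except RuntimeError as exc:
            last_error = exc  # remember the failure and try a longer timeout
    raise RuntimeError(
        f"all timeouts {tuple(timeouts)} failed for {url}"
    ) from last_error


# Stand-in for demonstration only: fails unless given at least 30 seconds.
def fake_scrape(url: str, selector: str, timeout: int = 30) -> list:
    if timeout < 30:
        raise RuntimeError("All strategies failed")
    return [f"{selector} from {url} (timeout={timeout})"]


print(scrape_with_retry(fake_scrape, "https://example.com", "h3"))
```

With the library installed, the same helper could be called as scrape_with_retry(scrape_context, "https://example.com", "h3"). If every attempt fails, double-check the selector before raising the timeouts further.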



Download files

Source Distribution

pw_simple_scraper-0.1.2.tar.gz (16.4 kB)

Uploaded Source

Built Distribution

pw_simple_scraper-0.1.2-py3-none-any.whl (10.9 kB)

Uploaded Python 3

File details

Details for the file pw_simple_scraper-0.1.2.tar.gz.

File metadata

  • Download URL: pw_simple_scraper-0.1.2.tar.gz
  • Size: 16.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pw_simple_scraper-0.1.2.tar.gz
Algorithm Hash digest
SHA256 af1f5c925ffc399205a8e23de85b23744680c5d517d484052db1ed91ca674023
MD5 85041c43589156f67f4ce46634a812fc
BLAKE2b-256 770c8cb06162b5a59dd3b70a51fb1a2d9e920d1c2e098410d3bfddd8dc36659e

Provenance

The following attestation bundles were made for pw_simple_scraper-0.1.2.tar.gz:

Publisher: release.yml on elecbrandy/pw-simple-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pw_simple_scraper-0.1.2-py3-none-any.whl.

File hashes

Hashes for pw_simple_scraper-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 d56ff6e66044eadfee4b08fafa437ea773db4365a08093c0ba17b868997d3706
MD5 1ce2409e8cbc40552ee20daf02f201e0
BLAKE2b-256 c2975e3231261a22153a96d6a7d03a291976296fce24c87fd3d8817d978332a6

Provenance

The following attestation bundles were made for pw_simple_scraper-0.1.2-py3-none-any.whl:

Publisher: release.yml on elecbrandy/pw-simple-scraper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
