# pw-simple-scraper

A simple, lightweight, easy-to-use web scraper built with Python and Playwright.
## Overview

`pw-simple-scraper` scrapes desired elements from a web page.

- Provide a URL + CSS selector, and it will return the matching elements as a list of strings.
- The result is wrapped in a `ScrapeResult` object. You can access the extracted values via `.result` (`List[str]`).
## Installation

```bash
# 1. Install Playwright
pip install playwright

# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium

# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium

# 3. Install pw-simple-scraper
pip install pw-simple-scraper
```

Since this scraper is built on top of Playwright, both the Playwright library and the Chromium browser are required.
## Usage

```python
from pw_simple_scraper import scrape_context, scrape_attr

# Extract text
res = scrape_context("https://example.com", "h3")
print(res.result)  # ['h3-type-content1', 'h3-type-content2', ...]
print(res.count)   # n (number of scraped elements)

# Extract links by attribute (href, ...)
links = scrape_attr("https://example.com", "a", "href")
print(links.result)  # ['https://www.iana.org/domains/example', ...]

# Apply timeout option (default: 30 seconds)
scrape_context("https://example.com", "something", timeout=10)       # 10 seconds
links = scrape_attr("https://example.com", "a", "href", timeout=20)  # 20 seconds
```
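Slow pages may need more than one attempt before the selector resolves. Below is a minimal, hypothetical retry sketch: `retry_with_timeouts` is not part of the library, it simply wraps any callable that accepts a `timeout` keyword and retries with progressively larger values (the `flaky` function stands in for a real scrape call):

```python
def retry_with_timeouts(fn, timeouts=(10, 20, 30)):
    """Call fn(timeout=t) for each timeout, returning the first success."""
    last_error = None
    for t in timeouts:
        try:
            return fn(timeout=t)
        except RuntimeError as err:  # e.g. "All strategies failed"
            last_error = err
    raise last_error

# Stand-in for scrape_context: fails until given a generous timeout.
def flaky(timeout):
    if timeout < 20:
        raise RuntimeError("All strategies failed")
    return ["h3-type-content1"]

print(retry_with_timeouts(flaky))  # ['h3-type-content1']
```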
## Result: the `ScrapeResult` object

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class ScrapeResult:
    url: str
    selector: str
    result: List[str]      # Extracted values
    count: int             # Number of values
    fetched_at: datetime   # Execution timestamp (UTC)
```
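For illustration, a `ScrapeResult` can be constructed and inspected by hand. This is a self-contained sketch that assumes the dataclass shape shown above; in real use the scrape functions build it for you:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

@dataclass
class ScrapeResult:
    url: str
    selector: str
    result: List[str]
    count: int
    fetched_at: datetime

# Build a result by hand to show field access.
res = ScrapeResult(
    url="https://example.com",
    selector="h3",
    result=["h3-type-content1", "h3-type-content2"],
    count=2,
    fetched_at=datetime.now(timezone.utc),
)
print(res.count)      # 2
print(res.result[0])  # h3-type-content1
```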
## FAQ

- **Installed, but the browser fails to launch**
  - You must install the browser with `python -m playwright install chromium` (be mindful of the `--with-deps` option on Linux).
- **`RuntimeError: All strategies failed`**
  - This may happen if the selector doesn't exist or the page loads slowly. Double-check your selector and try increasing the `timeout`.
- **Scraping inside an iframe**
  - Planned for future support.
- **XPath support**
  - Planned for future support.
- **robots.txt support**
  - Will be added as a configurable option in the future.
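Until built-in robots.txt handling lands, permissions can be checked before scraping with Python's standard library. A sketch, using an inline rules string as a made-up example (in practice you would fetch `https://<host>/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real check would read the site's own robots.txt.
rules = """\
User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```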
## Project details
### File details: `pw_simple_scraper-0.1.2.tar.gz`

File metadata:

- Download URL: pw_simple_scraper-0.1.2.tar.gz
- Upload date:
- Size: 16.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | af1f5c925ffc399205a8e23de85b23744680c5d517d484052db1ed91ca674023 |
| MD5 | 85041c43589156f67f4ce46634a812fc |
| BLAKE2b-256 | 770c8cb06162b5a59dd3b70a51fb1a2d9e920d1c2e098410d3bfddd8dc36659e |
### Provenance

The following attestation bundles were made for `pw_simple_scraper-0.1.2.tar.gz`:

Publisher: `release.yml` on `elecbrandy/pw-simple-scraper`

- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: pw_simple_scraper-0.1.2.tar.gz
  - Subject digest: af1f5c925ffc399205a8e23de85b23744680c5d517d484052db1ed91ca674023
- Sigstore transparency entry: 523015054
- Sigstore integration time:
- Permalink: elecbrandy/pw-simple-scraper@f07c6084baa635c6ca6b226850d199d404bf1b05
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/elecbrandy
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f07c6084baa635c6ca6b226850d199d404bf1b05
- Trigger Event: push
### File details: `pw_simple_scraper-0.1.2-py3-none-any.whl`

File metadata:

- Download URL: pw_simple_scraper-0.1.2-py3-none-any.whl
- Upload date:
- Size: 10.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | d56ff6e66044eadfee4b08fafa437ea773db4365a08093c0ba17b868997d3706 |
| MD5 | 1ce2409e8cbc40552ee20daf02f201e0 |
| BLAKE2b-256 | c2975e3231261a22153a96d6a7d03a291976296fce24c87fd3d8817d978332a6 |
### Provenance

The following attestation bundles were made for `pw_simple_scraper-0.1.2-py3-none-any.whl`:

Publisher: `release.yml` on `elecbrandy/pw-simple-scraper`

- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: pw_simple_scraper-0.1.2-py3-none-any.whl
  - Subject digest: d56ff6e66044eadfee4b08fafa437ea773db4365a08093c0ba17b868997d3706
- Sigstore transparency entry: 523015072
- Sigstore integration time:
- Permalink: elecbrandy/pw-simple-scraper@f07c6084baa635c6ca6b226850d199d404bf1b05
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/elecbrandy
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f07c6084baa635c6ca6b226850d199d404bf1b05
- Trigger Event: push