# pw-simple-scraper

A simple and lightweight Playwright-based scraper.

‼️ Forget the hassle of launching browsers or setting headers. Just focus on scraping ‼️
## Table of Contents

1. Main Features
2. Installation
3. How to Use
4. Examples
5. Playwright Method Reference
6. FAQ
## 1. Main Features

- A scraper library built on top of Playwright.
- Automatically manages the lifecycle of browsers and pages with `async with`.
- Returns Playwright objects, so you can use all the powerful Playwright features as they are.
- ⚡️ Fast ⚡️
## 2. Installation

```bash
# 1. Install Playwright
pip install playwright

# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium

# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium

# 3. Install pw-simple-scraper
pip install pw-simple-scraper
```

- Since this scraper is based on Playwright, you need both the Playwright library and the Chromium browser.
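If you want to fail fast with a clear message instead of a browser-launch traceback, a minimal sketch of a pre-flight check (standard library only; the message strings are my own, not part of pw-simple-scraper):

```python
# Sanity check (sketch): verify the Playwright package is importable before
# running the scraper. Chromium itself is only checked at browser launch time.
import importlib.util

if importlib.util.find_spec("playwright") is None:
    msg = "playwright is missing; run: pip install playwright"
else:
    msg = "playwright package found"
print(msg)
```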
## 3. How to Use

Not sure how to handle the `Page` object returned by `get_page`? → See Section 5, Playwright Method Reference.

- `async with PlaywrightScraper() as scraper:`
  - Creates an instance of the scraper.
- `async with scraper.get_page("http://www.example.com/") as page:`
  - Gets a page context using the `get_page` method.
  - Now you can use all the Playwright features directly on `page`.
🖥️ Code Example

```python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    # Create scraper instance
    async with PlaywrightScraper() as scraper:
        # Get page context
        async with scraper.get_page("http://www.example.com/") as page:
            # >>>> Use `page` in this block! <<<<
            ...

if __name__ == "__main__":
    asyncio.run(main())
```
## 4. Examples

Not sure how to handle the `Page` object returned by `get_page`? → See Section 5, Playwright Method Reference.

### 4-1. Extract title / text / attributes
🖥️ Code Example

```python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://quotes.toscrape.com/") as page:
            title = await page.title()
            first_quote = await page.locator("span.text").first.text_content()
            quotes = await page.locator("span.text").all_text_contents()
            first_author_link = await page.locator(".quote a").first.get_attribute("href")

            print("Page Title:", title)
            print("First Quote:", first_quote)
            print("Quote List (first 3):", quotes[:3])
            print("First Author Link:", first_author_link)

if __name__ == "__main__":
    asyncio.run(main())
```
⬇️ Example Output

```text
Page Title: Quotes to Scrape
First Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote List (first 3): ["The world as we have created it is a process of our thinking...", "It is our choices, Harry, that show what we truly are...", "There are only two ways to live your life..."]
First Author Link: /author/Albert-Einstein
```
### 4-2. Images & links — collect absolute paths

🖥️ Code Example

```python
import asyncio
from urllib.parse import urljoin
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            img_urls = await page.locator("article.product_pod img").evaluate_all(
                "els => els.map(el => el.getAttribute('src'))"
            )
            abs_imgs = [urljoin(page.url, u) for u in img_urls if u]

            book_urls = await page.locator("article.product_pod h3 a").evaluate_all(
                "els => els.map(el => el.getAttribute('href'))"
            )
            abs_books = [urljoin(page.url, u) for u in book_urls if u]

            print("Image URLs (5):", abs_imgs[:5])
            print("Book Links (5):", abs_books[:5])

if __name__ == "__main__":
    asyncio.run(main())
```
⬇️ Example Output

```text
Image URLs (5): [
    'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
    'https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
    'https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
    'https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
    'https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg'
]
Book Links (5): [
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
    'https://books.toscrape.com/catalogue/soumission_998/index.html',
    'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
    'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'
]
```
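The `urljoin` calls above are what turn the relative `src`/`href` values into absolute URLs. A quick standalone illustration with Python's standard library (the paths here are made up for the example):

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/"  # corresponds to page.url above

# A relative path resolves against the page URL
abs_img = urljoin(base, "media/cache/2c/da/pic.jpg")
print(abs_img)  # → https://books.toscrape.com/media/cache/2c/da/pic.jpg

# An already-absolute URL passes through unchanged
abs_ext = urljoin(base, "https://example.com/x.png")
print(abs_ext)  # → https://example.com/x.png
```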
### 4-3. Evaluate JSON — convert DOM to JSON

🖥️ Code Example

```python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            cards = page.locator("article.product_pod")
            items = await cards.evaluate_all("""
                els => els.map(el => ({
                    title: el.querySelector("h3 a")?.getAttribute("title"),
                    price: el.querySelector(".price_color")?.innerText.trim(),
                    inStock: !!el.querySelector(".instock.availability"),
                }))
            """)
            print(items[:5])

if __name__ == "__main__":
    asyncio.run(main())
```
⬇️ Example Output

```json
[
    {"title": "A Light in the Attic", "price": "£51.77", "inStock": true},
    {"title": "Tipping the Velvet", "price": "£53.74", "inStock": true},
    {"title": "Soumission", "price": "£50.10", "inStock": true},
    {"title": "Sharp Objects", "price": "£47.82", "inStock": true},
    {"title": "Sapiens: A Brief History of Humankind", "price": "£54.23", "inStock": true}
]
```
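Once `evaluate_all` has returned plain Python dicts, post-processing needs no browser at all. A sketch using the output shape above (`parse_price` is a hypothetical helper invented here, not part of pw-simple-scraper):

```python
# Post-process the dicts returned by evaluate_all (no browser needed).
items = [
    {"title": "A Light in the Attic", "price": "£51.77", "inStock": True},
    {"title": "Soumission", "price": "£50.10", "inStock": True},
]

def parse_price(price: str) -> float:
    """Strip the currency symbol and parse the number. Hypothetical helper."""
    return float(price.lstrip("£"))

cheapest = min(items, key=lambda it: parse_price(it["price"]))
print(cheapest["title"])  # → Soumission
```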
## 5. Playwright Method Reference

- If you’re not sure how to handle the `Page` object returned by `get_page`, check the table below.
- 🚨 Note: HTML attributes and JS properties can differ.
  - HTML Attribute: `<input value="default">` → `get_attribute('value')` always returns `"default"`
  - JS Property: `input.value` → changes to `"user input"` when the user types

| Category | Method | Description | Notes / Comparison |
|---|---|---|---|
| Text | `all_text_contents()` | Returns a list of text from all elements | Similar to `all_inner_texts()` |
| | `text_content()` | Returns the full text content of the first element | Uses `textContent` (includes hidden text) |
| | `inner_text()` | Returns the rendered text of the first element | Uses `innerText` (visible text only) |
| | `all_inner_texts()` | Returns a list of visible text from all elements | Similar to `all_text_contents()` |
| Attribute | `get_attribute('attr')` | Returns an HTML attribute (`href`, `src`, `class`) | Static, as written in the HTML |
| Property | `get_property('prop')` | Returns a live DOM property (`value`, `checked`) | Useful for dynamic state |
| HTML / Value | `inner_html()` | Returns the element’s inner HTML | Inner structure only |
| | `outer_html()` | Returns the element’s full HTML | Includes the element itself |
| | `input_value()` | Returns the current value of form elements | More accurate than `get_attribute('value')` |
| | `select_option()` | Selects `<option>`s in a `<select>` | Reflects the selected state |
| State (Boolean) | `is_visible()` | Is the element visible | True/False |
| | `is_hidden()` | Is the element hidden | True/False |
| | `is_enabled()` | Is the element enabled (clickable) | True/False |
| | `is_disabled()` | Is the element disabled | True/False |
| | `is_editable()` | Is the element editable | True/False |
| | `is_checked()` | Is the checkbox/radio checked | True/False |
| Advanced | `evaluate("JS func", arg)` | Runs JS on the first element | Flexible extraction |
| | `evaluate_all("JS func", arg)` | Runs JS on all elements and returns a list | Useful for batch data |
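The attribute-vs-property distinction can be mimicked without a browser. This toy class is purely hypothetical (not Playwright) and only models why `get_attribute('value')` and `input_value()` can disagree after the user types:

```python
# Toy model (not Playwright): the HTML attribute stays fixed while the
# live DOM property tracks user input.
class FakeInput:
    def __init__(self, value_attr: str):
        self.attr_value = value_attr   # like get_attribute('value'): static
        self.prop_value = value_attr   # like input_value(): live, starts equal

    def type_text(self, text: str) -> None:
        self.prop_value = text         # typing updates only the property

box = FakeInput("default")
box.type_text("user input")
print(box.attr_value)  # → default
print(box.prop_value)  # → user input
```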
## 6. FAQ

- **Browser launch error after install**
  - You must install the browser with `python -m playwright install chromium` (on Linux, add `--with-deps`).
- **Doesn’t work on some URLs**
  - Please open a GitHub issue so we can check.