# pw-simple-scraper

A simple and lightweight Playwright-based scraper.

‼️ Forget the hassle of launching browsers or setting headers. Just focus on scraping ‼️
## Table of Contents

1. Main Features
2. Installation
3. How to Use
4. Examples
5. Playwright Method Reference
6. FAQ
## 1. Main Features

- A scraper library built on top of Playwright.
- Automatically manages the lifecycle of browsers and pages with `async with`.
- Returns Playwright objects, so you can use all the powerful Playwright features as they are.
- ⚡️ Fast ⚡️
## 2. Installation

```bash
# 1. Install Playwright
pip install playwright

# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium

# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium

# 3. Install pw-simple-scraper
pip install pw-simple-scraper
```

- Since this scraper is based on Playwright, you need both the Playwright library and the Chromium browser.
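If you want to fail fast with a clear message instead of a browser-launch traceback, a minimal sketch of a pre-flight check (standard library only; the message strings are my own, not part of pw-simple-scraper):

```python
# Sanity check (sketch): verify the Playwright package is importable before
# running the scraper. Chromium itself is only checked at browser launch time.
import importlib.util

if importlib.util.find_spec("playwright") is None:
    msg = "playwright is missing; run: pip install playwright"
else:
    msg = "playwright package found"
print(msg)
```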
## 3. How to Use

Not sure how to handle the `Page` object returned by `get_page`? → See Section 5, Playwright Method Reference.

- `async with PlaywrightScraper() as scraper:`
  - Creates an instance of the scraper.
- `async with scraper.get_page("http://www.example.com/") as page:`
  - Gets a page context using the `get_page` method.
  - Now you can use all the Playwright features directly on `page`.
🖥️ Code Example

```python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    # Create scraper instance
    async with PlaywrightScraper() as scraper:
        # Get page context
        async with scraper.get_page("http://www.example.com/") as page:
            # >>>> Use `page` in this block! <<<<
            ...

if __name__ == "__main__":
    asyncio.run(main())
```
## 4. Examples

Not sure how to handle the `Page` object returned by `get_page`? → See Section 5, Playwright Method Reference.

### 4-1. Extract title / text / attributes
🖥️ Code Example

```python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://quotes.toscrape.com/") as page:
            title = await page.title()
            first_quote = await page.locator("span.text").first.text_content()
            quotes = await page.locator("span.text").all_text_contents()
            first_author_link = await page.locator(".quote a").first.get_attribute("href")

            print("Page Title:", title)
            print("First Quote:", first_quote)
            print("Quote List (first 3):", quotes[:3])
            print("First Author Link:", first_author_link)

if __name__ == "__main__":
    asyncio.run(main())
```
⬇️ Example Output

```text
Page Title: Quotes to Scrape
First Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote List (first 3): ["The world as we have created it is a process of our thinking...", "It is our choices, Harry, that show what we truly are...", "There are only two ways to live your life..."]
First Author Link: /author/Albert-Einstein
```
### 4-2. Images & links — collect absolute paths

🖥️ Code Example

```python
import asyncio
from urllib.parse import urljoin
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            img_urls = await page.locator("article.product_pod img").evaluate_all(
                "els => els.map(el => el.getAttribute('src'))"
            )
            abs_imgs = [urljoin(page.url, u) for u in img_urls if u]

            book_urls = await page.locator("article.product_pod h3 a").evaluate_all(
                "els => els.map(el => el.getAttribute('href'))"
            )
            abs_books = [urljoin(page.url, u) for u in book_urls if u]

            print("Image URLs (5):", abs_imgs[:5])
            print("Book Links (5):", abs_books[:5])

if __name__ == "__main__":
    asyncio.run(main())
```
⬇️ Example Output

```text
Image URLs (5): [
    'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
    'https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
    'https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
    'https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
    'https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg'
]
Book Links (5): [
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
    'https://books.toscrape.com/catalogue/soumission_998/index.html',
    'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
    'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'
]
```
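The `urljoin` calls above are what turn the relative `src`/`href` values into absolute URLs. A quick standalone illustration with Python's standard library (the paths here are made up for the example):

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/"  # corresponds to page.url above

# A relative path resolves against the page URL
abs_img = urljoin(base, "media/cache/2c/da/pic.jpg")
print(abs_img)  # → https://books.toscrape.com/media/cache/2c/da/pic.jpg

# An already-absolute URL passes through unchanged
abs_ext = urljoin(base, "https://example.com/x.png")
print(abs_ext)  # → https://example.com/x.png
```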
### 4-3. Evaluate JSON — convert DOM to JSON

🖥️ Code Example

```python
import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            cards = page.locator("article.product_pod")
            items = await cards.evaluate_all("""
                els => els.map(el => ({
                    title: el.querySelector("h3 a")?.getAttribute("title"),
                    price: el.querySelector(".price_color")?.innerText.trim(),
                    inStock: !!el.querySelector(".instock.availability"),
                }))
            """)
            print(items[:5])

if __name__ == "__main__":
    asyncio.run(main())
```
⬇️ Example Output

```json
[
    {"title": "A Light in the Attic", "price": "£51.77", "inStock": true},
    {"title": "Tipping the Velvet", "price": "£53.74", "inStock": true},
    {"title": "Soumission", "price": "£50.10", "inStock": true},
    {"title": "Sharp Objects", "price": "£47.82", "inStock": true},
    {"title": "Sapiens: A Brief History of Humankind", "price": "£54.23", "inStock": true}
]
```
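Once `evaluate_all` has returned plain Python dicts, post-processing needs no browser at all. A sketch using the output shape above (`parse_price` is a hypothetical helper invented here, not part of pw-simple-scraper):

```python
# Post-process the dicts returned by evaluate_all (no browser needed).
items = [
    {"title": "A Light in the Attic", "price": "£51.77", "inStock": True},
    {"title": "Soumission", "price": "£50.10", "inStock": True},
]

def parse_price(price: str) -> float:
    """Strip the currency symbol and parse the number. Hypothetical helper."""
    return float(price.lstrip("£"))

cheapest = min(items, key=lambda it: parse_price(it["price"]))
print(cheapest["title"])  # → Soumission
```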
## 5. Playwright Method Reference

- If you’re not sure how to handle the `Page` object returned by `get_page`, check the table below.
- 🚨 Note: HTML attributes and JS properties can differ.
  - HTML Attribute: `<input value="default">` → `get_attribute('value')` always returns `"default"`
  - JS Property: `input.value` → changes to `"user input"` when the user types

| Category | Method | Description | Notes / Comparison |
|---|---|---|---|
| Text | `all_text_contents()` | Returns a list of text from all elements | Similar to `all_inner_texts()` |
| | `text_content()` | Returns the full text content of the first element | Uses `textContent` (includes hidden text) |
| | `inner_text()` | Returns the rendered text of the first element | Uses `innerText` (visible text only) |
| | `all_inner_texts()` | Returns a list of visible text from all elements | Similar to `all_text_contents()` |
| Attribute | `get_attribute('attr')` | Returns an HTML attribute (`href`, `src`, `class`) | Static, as written in the HTML |
| Property | `get_property('prop')` | Returns a live DOM property (`value`, `checked`) | Useful for dynamic state |
| HTML / Value | `inner_html()` | Returns the element’s inner HTML | Inner structure only |
| | `outer_html()` | Returns the element’s full HTML | Includes the element itself |
| | `input_value()` | Returns the current value of form elements | More accurate than `get_attribute('value')` |
| | `select_option()` | Selects `<option>`s in a `<select>` | Reflects the selected state |
| State (Boolean) | `is_visible()` | Is the element visible | True/False |
| | `is_hidden()` | Is the element hidden | True/False |
| | `is_enabled()` | Is the element enabled (clickable) | True/False |
| | `is_disabled()` | Is the element disabled | True/False |
| | `is_editable()` | Is the element editable | True/False |
| | `is_checked()` | Is the checkbox/radio checked | True/False |
| Advanced | `evaluate("JS func", arg)` | Runs JS on the first element | Flexible extraction |
| | `evaluate_all("JS func", arg)` | Runs JS on all elements and returns a list | Useful for batch data |
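The attribute-vs-property distinction can be mimicked without a browser. This toy class is purely hypothetical (not Playwright) and only models why `get_attribute('value')` and `input_value()` can disagree after the user types:

```python
# Toy model (not Playwright): the HTML attribute stays fixed while the
# live DOM property tracks user input.
class FakeInput:
    def __init__(self, value_attr: str):
        self.attr_value = value_attr   # like get_attribute('value'): static
        self.prop_value = value_attr   # like input_value(): live, starts equal

    def type_text(self, text: str) -> None:
        self.prop_value = text         # typing updates only the property

box = FakeInput("default")
box.type_text("user input")
print(box.attr_value)  # → default
print(box.prop_value)  # → user input
```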
## 6. FAQ

- **Browser launch error after install**
  - You must install the browser with `python -m playwright install chromium` (on Linux, add `--with-deps`).
- **Doesn’t work on some URLs**
  - Please open a GitHub issue so we can check.