A simple and light Playwright-based scraper

Project description

pw-simple-scraper

License: MIT


‼️ Forget the hassle of creating browsers or setting headers. Just focus on scraping ‼️


Table of Contents

  1. Main Features
  2. Installation
  3. How to Use
  4. Examples
  5. Playwright Method Reference
  6. FAQ

1. Main Features

  • A scraper library built on top of Playwright.
  • Automatically manages the lifecycle of browsers and pages with async with.
  • Returns Playwright objects, so you can use all the powerful Playwright features as they are.
  • ⚡️ Fast ⚡️


2. Installation

# 1. Install Playwright
pip install playwright

# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium

# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium

# 3. Install pw-simple-scraper
pip install pw-simple-scraper
  • Since this scraper is based on Playwright, you need both the Playwright library and the Chromium browser.


3. How to Use

Not sure how to handle the Page object returned by get_page? See Section 5, Playwright Method Reference.


  1. async with PlaywrightScraper() as scraper
    Create an instance of the scraper.
  2. async with scraper.get_page("http://www.example.com/") as page:
    Get a page context using the get_page method.
  3. Now you can directly use all the Playwright features on page.

🖥️ Code Example

import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    # Create scraper instance
    async with PlaywrightScraper() as scraper:
        # Get page context
        async with scraper.get_page("http://www.example.com/") as page:
            # >>>> Use `page` in this block! <<<<
            print(await page.title())

if __name__ == "__main__":
    asyncio.run(main())
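If you are curious what the two async with layers are doing, the pattern is ordinary async context management: the outer block owns the browser's lifetime, the inner block owns a page's. Below is a Playwright-free sketch using stdlib-only stand-ins (DummyScraper and the string "page for ..." are illustrative, not part of pw-simple-scraper):

```python
import asyncio
from contextlib import asynccontextmanager

class DummyScraper:
    """Stand-in for PlaywrightScraper: acquire on enter, release on exit."""
    async def __aenter__(self):
        self.browser = "browser"   # imagine launching Chromium here
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.browser = None        # teardown runs even if the block raises

    @asynccontextmanager
    async def get_page(self, url):
        page = f"page for {url}"   # imagine opening a new page here
        try:
            yield page
        finally:
            pass                   # imagine closing the page here

async def main():
    async with DummyScraper() as scraper:
        async with scraper.get_page("http://www.example.com/") as page:
            return page

print(asyncio.run(main()))  # page for http://www.example.com/
```

Because cleanup lives in `__aexit__` / `finally`, the browser and page are released even when your scraping code throws, which is the hassle the library takes off your hands.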


4. Examples

4-1. Extract title / text / attributes

🖥️ Code Example

import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://quotes.toscrape.com/") as page:
            title = await page.title()
            first_quote = await page.locator("span.text").first.text_content()
            quotes = await page.locator("span.text").all_text_contents()
            first_author_link = await page.locator(".quote a").first.get_attribute("href")

            print("Page Title:", title)
            print("First Quote:", first_quote)
            print("Quote List (first 3):", quotes[:3])
            print("First Author Link:", first_author_link)

if __name__ == "__main__":
    asyncio.run(main())

⬇️ Example Output

Page Title: Quotes to Scrape
First Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote List (first 3): ["The world as we have created it is a process of our thinking...", "It is our choices, Harry, that show what we truly are...", "There are only two ways to live your life..."]
First Author Link: /author/Albert-Einstein
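Note that the scraped quote text comes back exactly as rendered, typographic curly quotes included. If you want plain ASCII quotes for storage or comparison, a small stdlib-only cleanup step works (the translation table below is an assumption about which characters you care about):

```python
# Map typographic quotes to their ASCII equivalents
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})

def clean(text: str) -> str:
    """Normalize curly quotes and trim surrounding whitespace."""
    return text.translate(SMART_QUOTES).strip()

quote = "\u201cThe world as we have created it is a process of our thinking.\u201d"
print(clean(quote))  # "The world as we have created it is a process of our thinking."
```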

4-2. Images & links — collect absolute paths

🖥️ Code Example

import asyncio
from urllib.parse import urljoin
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            img_urls = await page.locator("article.product_pod img").evaluate_all(
                "els => els.map(el => el.getAttribute('src'))"
            )
            abs_imgs = [urljoin(page.url, u) for u in img_urls if u]

            book_urls = await page.locator("article.product_pod h3 a").evaluate_all(
                "els => els.map(el => el.getAttribute('href'))"
            )
            abs_books = [urljoin(page.url, u) for u in book_urls if u]

            print("Image URLs (5):", abs_imgs[:5])
            print("Book Links (5):", abs_books[:5])

if __name__ == "__main__":
    asyncio.run(main())
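The urljoin step is what turns the relative src/href values into absolute URLs; it is plain stdlib and easy to check in isolation (the paths below are made-up examples):

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/"

# Relative paths resolve against the page URL
print(urljoin(base, "media/cache/2c/da/example.jpg"))
# https://books.toscrape.com/media/cache/2c/da/example.jpg

# Already-absolute URLs pass through unchanged
print(urljoin(base, "https://cdn.example.com/a.png"))
# https://cdn.example.com/a.png
```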

⬇️ Example Output

Image URLs (5): [
  'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
  'https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
  'https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
  'https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
  'https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg'
]
Book Links (5): [
  'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'https://books.toscrape.com/catalogue/soumission_998/index.html',
  'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'
]

4-3. Evaluate JSON — convert DOM to JSON

🖥️ Code Example

import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            cards = page.locator("article.product_pod")
            items = await cards.evaluate_all("""
                els => els.map(el => ({
                    title: el.querySelector("h3 a")?.getAttribute("title"),
                    price: el.querySelector(".price_color")?.innerText.trim(),
                    inStock: !!el.querySelector(".instock.availability"),
                }))
            """)
            print(items[:5])

if __name__ == "__main__":
    asyncio.run(main())

⬇️ Example Output

[
  {"title": "A Light in the Attic", "price": "£51.77", "inStock": true},
  {"title": "Tipping the Velvet", "price": "£53.74", "inStock": true},
  {"title": "Soumission", "price": "£50.10", "inStock": true},
  {"title": "Sharp Objects", "price": "£47.82", "inStock": true},
  {"title": "Sapiens: A Brief History of Humankind", "price": "£54.23", "inStock": true}
]
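Because evaluate_all hands back plain Python lists and dicts, post-processing is ordinary Python. A sketch that parses the £-prefixed price strings into Decimal and filters in-stock items (field names match the JS mapping above; the input list here is hard-coded for illustration, including one out-of-stock entry):

```python
from decimal import Decimal

items = [
    {"title": "A Light in the Attic", "price": "£51.77", "inStock": True},
    {"title": "Soumission", "price": "£50.10", "inStock": False},
]

def parse_price(price: str) -> Decimal:
    """Strip the currency symbol and parse the numeric part exactly."""
    return Decimal(price.lstrip("£"))

# Keep only items in stock, with prices as Decimal for safe arithmetic
in_stock = [(i["title"], parse_price(i["price"])) for i in items if i["inStock"]]
print(in_stock)  # [('A Light in the Attic', Decimal('51.77'))]
```

Decimal is used instead of float so prices round-trip without binary floating-point error.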


5. Playwright Method Reference

  • If you’re not sure how to handle the Page object returned by get_page, check the table below.
  • 🚨 Note
    • HTML attribute: <input value="default"> → get_attribute('value') always returns "default", exactly as written in the HTML
    • JS property: input.value → reflects the live state, e.g. "user input" after the user types
| Category | Method | Description | Notes / Comparison |
| --- | --- | --- | --- |
| Text | all_text_contents() | List of textContent from all matched elements | Includes hidden text; compare all_inner_texts() |
| Text | text_content() | textContent of the first matched element | Includes hidden text (unlike innerText) |
| Text | inner_text() | Rendered (visible) text of the first element | Behaves like innerText |
| Text | all_inner_texts() | List of rendered text from all elements | Visible text only; compare all_text_contents() |
| Attribute | get_attribute('attr') | Returns an HTML attribute (href, src, class) | Static, as written in the HTML |
| Property | get_property('prop') | Returns a live DOM property (value, checked) | ElementHandle method; on a Locator, use evaluate instead |
| HTML / Value | inner_html() | Inner HTML of the element | Inner structure only |
| HTML / Value | evaluate("el => el.outerHTML") | Full HTML of the element | Includes the element itself; there is no outer_html() shortcut |
| HTML / Value | input_value() | Current value of form elements | More reliable than get_attribute('value') |
| HTML / Value | select_option() | Selects <option>s in a <select> and returns the selected values | Accepts value, label, or index |
| State (Boolean) | is_visible() | Is the element visible | True/False |
| State (Boolean) | is_hidden() | Is the element hidden | True/False |
| State (Boolean) | is_enabled() | Is the element enabled (clickable) | True/False |
| State (Boolean) | is_disabled() | Is the element disabled | True/False |
| State (Boolean) | is_editable() | Is the element editable | True/False |
| State (Boolean) | is_checked() | Is the checkbox/radio checked | True/False |
| Advanced | evaluate("JS func", arg) | Runs JS against the first matched element | Flexible extraction |
| Advanced | evaluate_all("JS func", arg) | Runs JS against all matched elements, returns a list | Useful for batch data |


6. FAQ

  • Browser launch error after install

    • You must install the browser first:
      python -m playwright install chromium (on Linux, add --with-deps as shown in the Installation section)
  • Doesn’t work on some URLs

    • Please open a GitHub issue so we can check.


Project details


Download files

Source Distribution

pw_simple_scraper-0.1.3.tar.gz (41.4 kB)

Built Distribution

pw_simple_scraper-0.1.3-py3-none-any.whl (8.6 kB)

File details: pw_simple_scraper-0.1.3.tar.gz

  • Size: 41.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

Hashes for pw_simple_scraper-0.1.3.tar.gz

  • SHA256: 80c115b18549ce6f6b181147ba7663a408f2ea6bd2df420a9f607b07045d81b2
  • MD5: 71d3c382c646bd0c67396eacc531e628
  • BLAKE2b-256: cda2f520bda508b3ce2e424e5335e3608a3f898e1b90c62840eb548777452d94

Provenance: attestation bundle published by release.yml on elecbrandy/pw-simple-scraper. Values shown reflect the state when the release was signed and may no longer be current.

File details: pw_simple_scraper-0.1.3-py3-none-any.whl

  • Uploaded: Python 3

Hashes for pw_simple_scraper-0.1.3-py3-none-any.whl

  • SHA256: acac4fccdd7643b3ed0aa6ced4e8100e6151c5641c75c84c9f318c4ca61dbdae
  • MD5: 3b4a00318bec97a76841e14968e0ad94
  • BLAKE2b-256: ebf7048131db6c0f72d21d417b2b6c2c6d0f40756cebd7228ad921046723e1a9

Provenance: attestation bundle published by release.yml on elecbrandy/pw-simple-scraper.
