A simple and light Playwright-based scraper

pw-simple-scraper

License: MIT


‼️ Forget the hassle of creating browsers or setting headers. Just focus on scraping ‼️


1. Main Features

  • A scraper library built on top of Playwright.
  • Automatically manages the lifecycle of browsers and pages with async with.
  • Returns Playwright objects, so you can use all the powerful Playwright features as they are.
  • ⚡️ Fast ⚡️


2. Installation

# 1. Install Playwright
pip install playwright

# 2-1. Install Chromium (macOS / Windows)
python -m playwright install chromium

# 2-2. Install Chromium (Linux)
python -m playwright install --with-deps chromium

# 3. Install pw-simple-scraper
pip install pw-simple-scraper
  • Since this scraper is based on Playwright, you need both the Playwright library and the Chromium browser.


3. How to Use

Not sure how to handle the Page object returned by get_page? See the Playwright Method Reference (section 5).


  1. async with PlaywrightScraper() as scraper:
    Create an instance of the scraper.
  2. async with scraper.get_page("http://www.example.com/") as page:
    Open a page context via the get_page method.
  3. Use any Playwright feature directly on page.

🖥️ Code Example

import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    # Create scraper instance
    async with PlaywrightScraper() as scraper:
        # Get page context
        async with scraper.get_page("http://www.example.com/") as page:
            # >>>> Use `page` in this block! <<<<
            print(await page.title())

if __name__ == "__main__":
    asyncio.run(main())


4. Examples



4-1. Extract title / text / attributes

🖥️ Code Example

import asyncio
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://quotes.toscrape.com/") as page:
            title = await page.title()
            first_quote = await page.locator("span.text").first.text_content()
            quotes = await page.locator("span.text").all_text_contents()
            first_author_link = await page.locator(".quote a").first.get_attribute("href")

            print("Page Title:", title)
            print("First Quote:", first_quote)
            print("Quote List (first 3):", quotes[:3])
            print("First Author Link:", first_author_link)

if __name__ == "__main__":
    asyncio.run(main())

⬇️ Example Output

Page Title: Quotes to Scrape
First Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Quote List (first 3): ["The world as we have created it is a process of our thinking...", "It is our choices, Harry, that show what we truly are...", "There are only two ways to live your life..."]
First Author Link: /author/Albert-Einstein


4-2. Images & links — collect absolute paths

🖥️ Code Example

import asyncio
from urllib.parse import urljoin
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            img_urls = await page.locator("article.product_pod img").evaluate_all(
                "els => els.map(el => el.getAttribute('src'))"
            )
            abs_imgs = [urljoin(page.url, u) for u in img_urls if u]

            book_urls = await page.locator("article.product_pod h3 a").evaluate_all(
                "els => els.map(el => el.getAttribute('href'))"
            )
            abs_books = [urljoin(page.url, u) for u in book_urls if u]

            print("Image URLs (5):", abs_imgs[:5])
            print("Book Links (5):", abs_books[:5])

if __name__ == "__main__":
    asyncio.run(main())

⬇️ Example Output

Image URLs (5): [
  'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
  'https://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
  'https://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
  'https://books.toscrape.com/media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg',
  'https://books.toscrape.com/media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg'
]
Book Links (5): [
  'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'https://books.toscrape.com/catalogue/soumission_998/index.html',
  'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
  'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html'
]
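The urljoin calls in the example are what turn relative src/href values into absolute URLs. A stdlib-only illustration of the three cases you will hit (the URLs here are illustrative):

```python
from urllib.parse import urljoin

base = "https://books.toscrape.com/catalogue/page-2.html"

# Relative path: resolved against the base URL's directory
print(urljoin(base, "media/cover.jpg"))
# -> https://books.toscrape.com/catalogue/media/cover.jpg

# Root-relative path: resolved against the site root
print(urljoin(base, "/index.html"))
# -> https://books.toscrape.com/index.html

# Already-absolute URL: returned unchanged
print(urljoin(base, "https://example.com/x"))
# -> https://example.com/x
```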


4-3. Evaluate JSON — convert DOM to JSON

🖥️ Code Example

import asyncio
import json
from pw_simple_scraper import PlaywrightScraper

async def main():
    async with PlaywrightScraper() as scraper:
        async with scraper.get_page("https://books.toscrape.com/") as page:
            cards = page.locator("article.product_pod")
            items = await cards.evaluate_all("""
                els => els.map(el => ({
                    title: el.querySelector("h3 a")?.getAttribute("title"),
                    price: el.querySelector(".price_color")?.innerText.trim(),
                    inStock: !!el.querySelector(".instock.availability"),
                }))
            """)
            # evaluate_all returns plain Python lists/dicts, so they serialize directly
            print(json.dumps(items[:5], ensure_ascii=False, indent=2))

if __name__ == "__main__":
    asyncio.run(main())

⬇️ Example Output

[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "inStock": true
  },
  {
    "title": "Tipping the Velvet",
    "price": "£53.74",
    "inStock": true
  },
  {
    "title": "Soumission",
    "price": "£50.10",
    "inStock": true
  },
  {
    "title": "Sharp Objects",
    "price": "£47.82",
    "inStock": true
  },
  {
    "title": "Sapiens: A Brief History of Humankind",
    "price": "£54.23",
    "inStock": true
  }
]


5. Playwright Method Reference

  • If you’re not sure how to handle the Page object returned by get_page, check the table below.
  • 🚨 Note: an HTML attribute is static, while a JS property is live.
    • HTML Attribute: <input value="default"> → get_attribute("value") always returns "default"
    • JS Property: the element's value property changes to "user input" as the user types
| Category | Method | Description | Notes / Comparison |
| --- | --- | --- | --- |
| Text | all_text_contents() | List of textContent for all matched elements | Includes hidden text; cf. all_inner_texts() |
| Text | text_content() | textContent of the first element | All text, hidden included (like DOM textContent) |
| Text | inner_text() | Rendered text of the first element | Visible text only (like DOM innerText) |
| Text | all_inner_texts() | List of innerText for all matched elements | Visible text only |
| Attribute | get_attribute('attr') | HTML attribute (href, src, class) | Static, as written in the HTML |
| Property | get_property('prop') | Live DOM property (value, checked) | On ElementHandle; for a Locator, use evaluate() |
| HTML / Value | inner_html() | Inner HTML of the element | Contents only, element itself excluded |
| HTML / Value | evaluate("el => el.outerHTML") | Full HTML of the element | Includes the element itself |
| HTML / Value | input_value() | Current value of a form element | More reliable than get_attribute('value') |
| HTML / Value | select_option() | Selects option(s) in a <select> | Returns the selected values |
| State (Boolean) | is_visible() | Is the element visible? | True/False |
| State (Boolean) | is_hidden() | Is the element hidden? | True/False |
| State (Boolean) | is_enabled() | Is the element enabled (clickable)? | True/False |
| State (Boolean) | is_disabled() | Is the element disabled? | True/False |
| State (Boolean) | is_editable() | Is the element editable? | True/False |
| State (Boolean) | is_checked() | Is the checkbox/radio checked? | True/False |
| Advanced | evaluate("JS func", arg) | Runs JS on the first element | Flexible one-off extraction |
| Advanced | evaluate_all("JS func", arg) | Runs JS on all elements, returns a list | Batch data extraction |


6. FAQ

  • Browser launch error after install

    • Install the browser binary first:
      python -m playwright install chromium (on Linux, add --with-deps as shown in the installation section)
  • Doesn’t work on some URLs

    • Please open a GitHub issue so we can look into it.

