
Playwright Request: make requests through a Playwright browser to sites that block regular requests, such as www.amazon.com or www.tripadvisor.com

Project description

Playwright Request for Python

This library helps programmers make requests through a Playwright browser to sites such as www.amazon.com, www.airbnb.com or www.tripadvisor.com and, in general, to any site that blocks regular requests or requires a proxy to crawl pages in parallel.

With PlaywrightRequest you can process many URLs asynchronously (at high speed) and parse the resulting HTML, or supply a function that runs on every open page that requires extra work. This is useful for getting information that stays hidden until a user interacts with the page. For example, to scrape images from a site that shows them only in a popup window, the function can click the button that opens the popup, read the src of each image, and then close the popup.
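
To make the idea of processing many URLs asynchronously concrete, here is a minimal sketch that uses only the standard library. It is not the library's implementation: fetch_stub is a hypothetical stand-in for opening a browser page, and max_concurrency is an assumed parameter that caps how many pages are open at once.

```python
import asyncio


async def fetch_stub(url: str) -> str:
    # Hypothetical stand-in for opening a browser page and reading its HTML.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls: list[str], max_concurrency: int = 5) -> list[str]:
    # The semaphore caps how many pages are "open" at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:
            return await fetch_stub(url)

    # gather() returns the results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


pages = asyncio.run(crawl([f"SITE-{k}" for k in range(20)]))
print(len(pages), pages[0])
```

Bounding the concurrency keeps memory use predictable, since each open browser page is far heavier than a plain HTTP connection.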

This library contains the code to perform requests, with the ability to extend and manipulate pages.

Installation

pip install playwright-request
playwright install
playwright install-deps

Installation on docker images

Docker images can cause problems with the installation. To avoid these issues, include the following code in your Dockerfile

RUN apt-get update && \
    apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    build-essential \
    python3-dev \
    python3-setuptools \
    gcc \
    make \
    apt-utils \
    libxcb-shm0  \
    libx11-xcb-dev  \
    libxext-dev  \
    libxrandr-dev  \
    libxcomposite-dev  \
    libxcursor-dev  \
    libxdamage-dev  \
    libxi-dev \
    libxtst-dev \
    libgtk-3-dev \
    libasound-dev \
    libdbus-glib-1-dev
    
RUN pip install playwright && \
    playwright install && \
    playwright install-deps

Use in cloud environments like GCP or AWS

To use PlaywrightRequest in AWS or GCP you need to create a docker image with your code (include the code shown above). Running the docker image locally is straightforward, but in the cloud there is a small issue: the commands playwright install and playwright install-deps run as the root user when the image is built, so the browsers are installed for root. In the cloud, the docker image is executed by a random user, and that user cannot find the Playwright browsers. The solution is to add the following command to the code that uses PlaywrightRequest, so that the browsers you need are installed locally for the running user

import playwright
import os

# use only one of the following commands depending on your needs
os.system("playwright install")  # use this to install all browsers
# os.system("playwright install firefox")  # to install only firefox
# os.system("playwright install chromium")  # to install only chromium
# os.system("playwright install webkit")  # to install only webkit
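
Running the install command on every start can be slow, so one option is to skip it when the browser is already present. The sketch below rests on several assumptions: a Linux container where Playwright caches browsers under ~/.cache/ms-playwright (or under the directory named by the PLAYWRIGHT_BROWSERS_PATH environment variable), and browser folders named like chromium-1234. The helper names browsers_dir, is_installed, install_command and ensure_browser are hypothetical and are not part of PlaywrightRequest.

```python
import os
import subprocess
import sys
from pathlib import Path


def browsers_dir() -> Path:
    # Assumption: PLAYWRIGHT_BROWSERS_PATH, when set, points at the browser cache;
    # otherwise the Linux default location is used.
    custom = os.environ.get("PLAYWRIGHT_BROWSERS_PATH")
    return Path(custom) if custom else Path.home() / ".cache" / "ms-playwright"


def is_installed(browser: str, root: Path) -> bool:
    # Assumption: each browser lives in a folder such as "chromium-1234".
    return root.is_dir() and any(
        p.name.startswith(f"{browser}-") for p in root.iterdir()
    )


def install_command(browser: str) -> list[str]:
    # Same effect as "playwright install <browser>", run with this interpreter.
    return [sys.executable, "-m", "playwright", "install", browser]


def ensure_browser(browser: str = "chromium") -> None:
    # Install only when no matching browser folder is found.
    if not is_installed(browser, browsers_dir()):
        subprocess.run(install_command(browser), check=True)
```

Calling ensure_browser("chromium") once, before creating the requester, would then install Chromium for the current user only when it is missing.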

Usage

Example #1: simple usage

from playwright_request.playwright_request import PlaywrightRequest
#crawl
requester = PlaywrightRequest()
responses = requester.get(urls=["SITE1","SITE2"])

print(responses[0].status_code, responses[0].html)
print(responses[1].status_code, responses[1].html)

Example #2: simple usage with Chromium

from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.browser_type import BrowserType
#crawl
requester = PlaywrightRequest(browser=BrowserType.CHROMIUM, headless=False)
responses = requester.get(urls=["SITE1"])

print(responses[0].status_code, responses[0].html)

Example #3: define interceptor to avoid loading images

from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.route_interceptor import RouteInterceptor
#crawl
interceptor = RouteInterceptor().set_default_exclusions()
requester = PlaywrightRequest(route_interceptor=interceptor)
responses = requester.get(urls=["SITE1"])

print(responses[0].status_code, responses[0].html)
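
For readers curious about what an interceptor does, the sketch below shows the general technique in plain Playwright terms: a handler inspects each request and aborts the ones that are not needed. This is not the code of RouteInterceptor, and the set of blocked resource types is an assumption; the defaults chosen by set_default_exclusions() may differ.

```python
# Assumed block list; the defaults used by RouteInterceptor may differ.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    # Decide whether a request of this type should be dropped.
    return resource_type in BLOCKED_RESOURCE_TYPES


async def handle_route(route) -> None:
    # `route` is expected to behave like playwright.async_api.Route.
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()
```

With plain Playwright, such a handler would be registered on a page with await page.route("**/*", handle_route). Skipping images and fonts usually reduces bandwidth and page load time when only the HTML is needed.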

Example #4: extra processing results

from playwright.async_api import Page
from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.route_interceptor import RouteInterceptor


async def get_all_photos(page: Page) -> list[str]:
    # 1. Click the "Show all photos" button to open the photo popup
    await page.locator('button', has_text='Show all photos').click()
    # 2. Wait for the page to finish loading, then wait for the selector
    await page.wait_for_load_state('networkidle', timeout=3000)
    await page.wait_for_load_state(timeout=3000)
    await page.wait_for_selector('div[data-testid=photo-viewer-section]', timeout=3000)
    # 3. Get the photo section element
    photos_section = await page.query_selector('div[data-testid=photo-viewer-section]')
    # 4. Get all picture elements within it
    all_pictures = await photos_section.query_selector_all('picture')
    # 5. Get the img element of each picture and extract the attribute we need
    images = []
    for picture in all_pictures:
        img = await picture.query_selector("img")
        images.append(await img.get_attribute("data-original-uri"))
    # 6. Close the popup window and return the images
    await page.locator('//button[@aria-label="Close"]').click()
    return images

requester = PlaywrightRequest(extra_async_function_ptr=get_all_photos)
responses = requester.get(urls=[f"SITE-{k}" for k in range(100)])
all_images = [response.extra_result for response in responses]

for response in responses:
    images = response.extra_result
    print(response.status_code, len(images))
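
When crawling many pages, the per-page lists usually need to be merged afterwards. The following sketch flattens the results and removes duplicates while keeping the first-seen order. It assumes that extra_result can be None when the extra function did not complete, and that individual entries can be None when an attribute is missing; collect_unique_images is a hypothetical helper, and the SimpleNamespace objects only mimic the extra_result attribute of a response.

```python
from types import SimpleNamespace


def collect_unique_images(responses) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for response in responses:
        # Assumption: extra_result may be None if the extra function failed.
        for src in response.extra_result or []:
            if src and src not in seen:
                seen.add(src)
                unique.append(src)
    return unique


# Stand-in objects that mimic only the extra_result attribute.
fake_responses = [
    SimpleNamespace(extra_result=["a.jpg", "b.jpg"]),
    SimpleNamespace(extra_result=None),
    SimpleNamespace(extra_result=["b.jpg", None, "c.jpg"]),
]
print(collect_unique_images(fake_responses))
```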

Example #5: detect Amazon error pages

from playwright_request.commom_error_page_detectors.amazon_error_page_detector import AmazonErrorPageDetector
from playwright_request.playwright_request import PlaywrightRequest

amazon_detector = AmazonErrorPageDetector()
requester = PlaywrightRequest(error_page_detectors=[amazon_detector])
responses = requester.get(urls=[f"AMAZON-ASIN-{k}" for k in range(100)])

valid_htmls = [response.html for response in responses if response.status_code == 200 and not response.error_list]
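
A common follow-up is to collect the URLs that failed so they can be requested again. The sketch below applies the same filter as above and assumes that responses are returned in the same order as the requested URLs, which the examples suggest but do not state. split_by_outcome is a hypothetical helper, and the SimpleNamespace objects stand in for real responses.

```python
from types import SimpleNamespace


def split_by_outcome(urls, responses):
    valid, retry = [], []
    # Assumption: responses come back in the same order as the requested urls.
    for url, response in zip(urls, responses):
        if response.status_code == 200 and not response.error_list:
            valid.append((url, response.html))
        else:
            retry.append(url)
    return valid, retry


# Stand-in responses with only the attributes used by the filter.
urls = ["AMAZON-ASIN-0", "AMAZON-ASIN-1", "AMAZON-ASIN-2"]
fake_responses = [
    SimpleNamespace(status_code=200, error_list=[], html="<html>ok</html>"),
    SimpleNamespace(status_code=200, error_list=["error page"], html="<html>oops</html>"),
    SimpleNamespace(status_code=503, error_list=[], html=""),
]
valid, retry = split_by_outcome(urls, fake_responses)
print(len(valid), retry)
```

The retry list could then be passed to requester.get(urls=retry) for a second attempt.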

Author

Pedro Mayorga.
