
Make requests through a Playwright browser to sites that block regular HTTP requests, such as www.amazon.com or www.tripadvisor.com.

Project description

Playwright Request for Python

This library helps programmers make requests through a Playwright browser to sites such as www.amazon.com, www.airbnb.com or www.tripadvisor.com and, more generally, to any site that blocks regular requests or requires a proxy to crawl pages in parallel.

With PlaywrightRequest you can process many URLs asynchronously (at high speed) and parse the resulting HTML. You can also supply a function that runs on every open page when extra work is needed, which is useful for information that stays hidden until a user interacts with the page. For example, to scrape images from a site that only shows them in a popup, the function can click the button that opens the popup, read each image's src attribute, and then close the popup.
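To illustrate the idea of processing many URLs asynchronously, here is a minimal sketch using only the standard library. It is not the library's implementation: `fetch_all`, `fake_fetch` and `max_concurrency` are hypothetical names introduced for this example, and `fake_fetch` stands in for opening a page in a browser.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> List[str]:
    """Run fetch_one over every URL, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return await fetch_one(url)

    # gather preserves input order, so results[i] belongs to urls[i].
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for loading a page; returns dummy HTML."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


pages = asyncio.run(fetch_all([f"SITE-{k}" for k in range(10)], fake_fetch))
print(len(pages), pages[0])
```

Capping concurrency matters in practice because every open browser page costs memory; the right limit depends on your machine and the target site.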

This library contains the code to perform requests, with the ability to extend and manipulate pages.

Installation

pip install playwright-request
playwright install
playwright install-deps

Installation on docker images

You may run into problems when working with Docker images. To avoid them, include the following in your Dockerfile:

RUN apt-get update && \
    apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    build-essential \
    python3-dev \
    python3-setuptools \
    gcc \
    make \
    apt-utils \
    libxcb-shm0  \
    libx11-xcb-dev  \
    libxext-dev  \
    libxrandr-dev  \
    libxcomposite-dev  \
    libxcursor-dev  \
    libxdamage-dev  \
    libxi-dev \
    libxtst-dev \
    libgtk-3-dev \
    libasound-dev \
    libdbus-glib-1-dev
    
RUN pip install playwright && \
    playwright install && \
    playwright install-deps

Use in cloud environments like GCP or AWS

To use PlaywrightRequest on AWS or GCP you need to build a Docker image with your code (include the Dockerfile code shown above). Running the image locally is straightforward, but in the cloud there is a small issue: the playwright install-deps command installs Playwright as the root user, while the cloud platform executes the image as a random user, and that user cannot find the Playwright browsers. The solution is to install the browsers you need from within the code that uses PlaywrightRequest:

import playwright
import os

# use only one of the following commands depending on your needs
os.system("playwright install")  # use this to install all browsers
os.system("playwright install firefox")  # to install firefox
os.system("playwright install chromium")  # to install chromium
os.system("playwright install webkit")  # to install webkit
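Because `os.system` does not raise when the command fails, a failed install can go unnoticed until the first request. The following is a minimal sketch of a stricter variant, assuming the `playwright` package is importable by the current interpreter. `build_install_command`, `install_browser` and `KNOWN_BROWSERS` are hypothetical helper names for this example, not part of PlaywrightRequest.

```python
import subprocess
import sys
from typing import List, Optional

# Browser names accepted by `playwright install` in this sketch.
KNOWN_BROWSERS = ("chromium", "firefox", "webkit")


def build_install_command(browser: Optional[str] = None) -> List[str]:
    """Return the argv for `python -m playwright install [browser]`."""
    command = [sys.executable, "-m", "playwright", "install"]
    if browser is not None:
        if browser not in KNOWN_BROWSERS:
            raise ValueError(f"unknown browser: {browser!r}")
        command.append(browser)
    return command


def install_browser(browser: Optional[str] = None) -> None:
    """Run the install; raises CalledProcessError if the command fails."""
    subprocess.run(build_install_command(browser), check=True)


# Show the command that would be executed for Chromium.
print(build_install_command("chromium"))
```

Calling `install_browser("chromium")` once at startup, before creating the requester, would then fail loudly if the browser could not be installed.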

Usage

Example #1: simple usage

from playwright_request.playwright_request import PlaywrightRequest
#crawl
requester = PlaywrightRequest()
responses = requester.get(urls=["SITE1","SITE2"])

print(responses[0].status_code, responses[0].html)
print(responses[1].status_code, responses[1].html)

Example #2: simple usage with Chromium

from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.browser_type import BrowserType
#crawl
requester = PlaywrightRequest(browser=BrowserType.CHROMIUM, headless=False)
responses = requester.get(urls=["SITE1"])

print(responses[0].status_code, responses[0].html)

Example #3: define interceptor to avoid loading images

from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.route_interceptor import RouteInterceptor
#crawl
interceptor = RouteInterceptor().set_default_exclusions()
requester = PlaywrightRequest(route_interceptor=interceptor)
responses = requester.get(urls=["SITE1"])

print(responses[0].status_code, responses[0].html)
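For readers unfamiliar with request interception, the sketch below shows the general technique in the shape Playwright expects for a `page.route(pattern, handler)` handler: abort requests for unwanted resource types and let everything else through. It does not reproduce the internals of `RouteInterceptor`. The set `BLOCKED_RESOURCE_TYPES` is an assumption (the library's real default exclusions may differ), and `FakeRoute` and `FakeRequest` are hypothetical stand-ins so the example runs without a browser.

```python
import asyncio

# Assumed exclusion set; the library's actual defaults may differ.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    """Decide whether a request of this resource type should be aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


async def handle_route(route) -> None:
    """Abort blocked resource types and continue every other request."""
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()


class FakeRequest:
    """Hypothetical stand-in exposing only resource_type."""

    def __init__(self, resource_type: str) -> None:
        self.resource_type = resource_type


class FakeRoute:
    """Hypothetical stand-in that records which action was taken."""

    def __init__(self, resource_type: str) -> None:
        self.request = FakeRequest(resource_type)
        self.action = None

    async def abort(self) -> None:
        self.action = "aborted"

    async def continue_(self) -> None:
        self.action = "continued"


image_route = FakeRoute("image")
document_route = FakeRoute("document")
asyncio.run(handle_route(image_route))
asyncio.run(handle_route(document_route))
print(image_route.action, document_route.action)
```

Skipping images and fonts usually reduces bandwidth and page load time, at the cost of pages that no longer render exactly as a user would see them.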

Example #4: extra processing of results

from playwright.async_api import Page
from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.route_interceptor import RouteInterceptor


async def get_all_photos(page: Page) -> list[str]:
    # 1. Click the "Show all photos" button to open the photo popup
    await page.locator('button', has_text='Show all photos').click()
    # 2. Wait for the page to finish loading, then wait for the selector
    await page.wait_for_load_state('networkidle', timeout=3000)
    await page.wait_for_load_state(timeout=3000)
    await page.wait_for_selector('div[data-testid=photo-viewer-section]', timeout=3000)
    # 3. Get the photo section element
    photos_section = await page.query_selector('div[data-testid=photo-viewer-section]')
    # 4. Get all picture elements within it
    all_pictures = await photos_section.query_selector_all('picture')
    # 5. Extract the attribute we need from each picture's img element
    images = []
    for picture in all_pictures:
        img = await picture.query_selector("img")
        images.append(await img.get_attribute("data-original-uri"))
    # 6. Close the popup window and return the images
    await page.locator('//button[@aria-label="Close"]').click()
    return images

requester = PlaywrightRequest(extra_async_function_ptr=get_all_photos)
responses = requester.get(urls=[f"SITE-{k}" for k in range(100)])
all_images = [response.extra_result for response in responses]

for response in responses:
    images = response.extra_result
    print(response.status_code, len(images))

Example #5: detect Amazon error pages

from playwright_request.commom_error_page_detectors.amazon_error_page_detector import AmazonErrorPageDetector
from playwright_request.playwright_request import PlaywrightRequest

amazon_detector = AmazonErrorPageDetector()
requester = PlaywrightRequest(error_page_detectors=[amazon_detector])
responses = requester.get(urls=[f"AMAZON-ASIN-{k}" for k in range(100)])

valid_htmls = [response.html for response in responses if response.status_code == 200 and not response.error_list]
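When crawling many pages it can also be useful to know which URLs failed, so they can be retried. The sketch below applies the same filter as above and collects the failed URLs. `FakeResponse` is a hypothetical stand-in carrying only the attributes used in this example (`status_code`, `html`, `error_list`), `split_responses` is a hypothetical helper, and the code assumes that `responses[i]` corresponds to `urls[i]`.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class FakeResponse:
    """Hypothetical stand-in with the attributes used in Example #5."""
    status_code: int
    html: str
    error_list: List[str] = field(default_factory=list)


def split_responses(
    urls: Sequence[str], responses: Sequence[FakeResponse]
) -> Tuple[List[str], List[str]]:
    """Return (valid_htmls, urls_to_retry); assumes responses[i] matches urls[i]."""
    valid_htmls: List[str] = []
    urls_to_retry: List[str] = []
    for url, response in zip(urls, responses):
        if response.status_code == 200 and not response.error_list:
            valid_htmls.append(response.html)
        else:
            urls_to_retry.append(url)
    return valid_htmls, urls_to_retry


urls = ["AMAZON-ASIN-0", "AMAZON-ASIN-1", "AMAZON-ASIN-2"]
responses = [
    FakeResponse(200, "<html>ok</html>"),
    FakeResponse(503, ""),
    FakeResponse(200, "<html>error page</html>", ["error page detected"]),
]
valid_htmls, urls_to_retry = split_responses(urls, responses)
print(len(valid_htmls), urls_to_retry)
```

The retry list could then be passed to another `requester.get(urls=...)` call, ideally after a delay.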

Author

Pedro Mayorga.

