Playwright request to make regular request for sites that blocks regular requests like www.amazon.com or www.tripadvisor.com

These details have not been verified by PyPI

Project description

Playwright Request for Python

This is a library aiming to help programmers to create requests by using playwright browser and bypass sites like www.amazon.com, www.airbnb.com or www.tripadvisor.com in general, all sites that block regular requests or require a proxy to crawl pages in parallel.

With PlaywrightRequest you can process many urls asynchronously (at high speed) and parse the htmls or create a function to process every open page that requires extra work, useful to get information hidden until a user interacts with the page, for example when you need scrape images but the site requires you to click on a button to open a popup window and then get the images-src and then close the popup window.

This library contains: the code to perform requests with the ability to extend and manipulate pages.

Installation

pip install playwright-request
playwright install
playwright install-deps

Installation on docker images

It's probable to have caveats when working with docker images, to avoid this issues, you must include the following code in your Dockerfile

RUN apt-get update && \
    apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    build-essential \
    python3-dev \
    python3-setuptools \
    gcc \
    make \
    apt-utils \
    libxcb-shm0  \
    libx11-xcb-dev  \
    libxext-dev  \
    libxrandr-dev  \
    libxcomposite-dev  \
    libxcursor-dev  \
    libxdamage-dev  \
    libxi-dev \
    libxtst-dev \
    libgtk-3-dev \
    libasound-dev \
    libdbus-glib-1-dev
    
pip install playwright
playwright install
playwright install-deps

Use in cloud environments like GCP or AWS

In order to use PlaywrightRequest in AWS or GCP it's necessary to create a docker image with your code (include the code shown above). Running your docker image locally is straightforward but in the cloud there is a little issue because the command playwright install-deps install playwright with the user root. In the cloud, the docker image is execute by a random user and that user is not able to find playwright browsers... the solution is to include the following command in the code you use PlaywrightRequest and install the browsers you need locally

import playwright
import os

# use only one of the following commands depending on your needs
os.system("playwright install")  # use this to install all browsers
os.system("playwright install firefox")  # to install firefox
os.system("playwright install chromium")  # to install chromium
os.system("playwright install webkit")  # to install webkit

Usage

Example #1: simple usage

from playwright_request.playwright_request import PlaywrightRequest
#crawl
requester = PlaywrightRequest()
responses = requester.get(urls=["SITE1","SITE2"])

print(responses[0].status_code, responses[0].html)
print(responses[1].status_code, responses[1].html)

Example #2: simple usage with Chromium

from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.browser_type import BrowserType
#crawl
requester = PlaywrightRequest(browser=BrowserType.CHROMIUM, headless=False)
responses = requester.get(urls=["SITE1"])

print(responses[0].status_code, responses[0].html)

Example #3: define interceptor to avoid loading images

from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.route_interceptor import RouteInterceptor
#crawl
interceptor = RouteInterceptor().set_default_exclusions()
requester = PlaywrightRequest(route_interceptor=interceptor)
responses = requester.get(urls=["SITE1"])

print(responses[0].status_code, responses[0].html)

Example #4: extra processing results

from playwright.async_api import Page
from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.route_interceptor import RouteInterceptor


async def get_all_photos(page: Page) -> list[str]:
    # 1. Click on show all button and popup photo window
    page.locator('button', has_text='Show all photos').click()
    # 2. Wait for state is loaded and then wait for the selector
    page.wait_for_load_state('networkidle', timeout=3000)
    page.wait_for_load_state(timeout=3000)
    page.wait_for_selector('div[data-testid=photo-viewer-section]', timeout=3000)
    # 3. get photo section selector
    photos_section = page.query_selector('div[data-testid=photo-viewer-section]')
    # 4. get all picture elements within 
    all_pictures = photos_section.query_selector_all('picture')
    # 5. get all selector images and extract the attribute we need
    images = [a.query_selector("img").get_attribute("data-original-uri") for a in all_pictures]
    # 6. close the popup window and return the images
    page.locator('//button[@aria-label="Close"]').click()
    return images

requester = PlaywrightRequest(extra_async_function_ptr=get_all_photos)
responses = requester.get(urls=[f"SITE-{k}" for k in range(100)])
images = [response.extra_result for response in responses] 

for response in responses:
    images = response.extra_result
    print(response.status_code, len(images))

Example 5: detect amazon error pages

from playwright_request.commom_error_page_detectors.amazon_error_page_detector import AmazonErrorPageDetector
from playwright_request.playwright_request import PlaywrightRequest

amazon_detector = AmazonErrorPageDetector()
requester = PlaywrightRequest(error_page_detectors=[amazon_detector])
responses = requester.get(urls=[f"AMAZON-ASIN-{k}" for k in range(100)])

valid_htmls = [response.html for response in responses if response.status_code==200 and not response.error_list]

Author

Pedro Mayorga.

Project details

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

1.5.0

Feb 29, 2024

1.4.0

Dec 20, 2023

1.3.0

Dec 2, 2023

1.2.7

Nov 30, 2023

1.2.6

Nov 30, 2023

This version

1.2.5

Nov 29, 2023

1.2.4

Nov 28, 2023

1.2.3

Nov 26, 2023

1.2.2

Nov 26, 2023

1.2.1

Nov 25, 2023

1.2.0

Nov 23, 2023

1.1.4

Nov 21, 2023

1.1.3

Nov 21, 2023

1.1.2

Nov 21, 2023

1.1.1

Nov 19, 2023

1.1.0

Nov 19, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

playwright-request-1.2.5.tar.gz (15.7 kB view details)

Uploaded Nov 29, 2023 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

playwright_request-1.2.5-py3-none-any.whl (14.9 kB view details)

Uploaded Nov 29, 2023 Python 3

File details

Details for the file playwright-request-1.2.5.tar.gz.

File metadata

Download URL: playwright-request-1.2.5.tar.gz
Upload date: Nov 29, 2023
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for playwright-request-1.2.5.tar.gz
Algorithm	Hash digest
SHA256	`5f3992b1ed376774f26060847f75631707769d7eed169c73f59111ec01dc8e6a`
MD5	`6d6dcdb70b938ac1314c025cb8b9f0e7`
BLAKE2b-256	`6b19e5ce3c64815c7cf44112735088414ab6a63092bd78b2d637a7ec5ce5e1bb`

See more details on using hashes here.

File details

Details for the file playwright_request-1.2.5-py3-none-any.whl.

File metadata

Download URL: playwright_request-1.2.5-py3-none-any.whl
Upload date: Nov 29, 2023
Size: 14.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for playwright_request-1.2.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa34d141041594c3f9cf35708757d8246523c3e792290d4097f00487ea0ba14b`
MD5	`82ec4506837efd50dad7b31eef6bb35d`
BLAKE2b-256	`14c45aaa84bf83d99be777c791a1e2ee21efc16fc933a5c344ee3aeeabc9e05b`

See more details on using hashes here.

playwright-request 1.2.5

Navigation

Verified details

Maintainers

Unverified details

Meta

Classifiers

Project description

Playwright Request for Python

Installation

Installation on docker images

Use in cloud environments like GCP or AWS

Usage

Example #1: simple usage

Example #2: simple usage with Chromium

Example #3: define interceptor to avoid loading images

Example #4: extra processing results

Example 5: detect amazon error pages

Author

Project details

Verified details

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes