Make requests through a Playwright browser to sites that block regular requests, such as www.amazon.com or www.tripadvisor.com
Playwright Request for Python
This library helps programmers make requests through a Playwright browser and get past the blocking on sites such as www.amazon.com, www.airbnb.com or www.tripadvisor.com and, in general, on any site that blocks regular requests or requires a proxy to crawl pages in parallel.
With PlaywrightRequest you can process many URLs asynchronously (at high speed) and parse the HTML, or supply a function that processes every open page that requires extra work. This is useful for getting information that stays hidden until a user interacts with the page. For example, when you need to scrape images but the site requires you to click a button to open a popup window, the function can click the button, read the image src attributes and then close the popup window.
This library contains the code to perform requests, with the ability to extend and manipulate pages.
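The idea of processing many URLs asynchronously can be illustrated with the standard library alone. The following is a minimal sketch, not PlaywrightRequest's implementation: fetch_page is a hypothetical stand-in that sleeps instead of opening a browser page, and max_concurrency is an assumed parameter that caps how many pages are in flight at once.

```python
import asyncio


async def fetch_page(url: str) -> str:
    # Hypothetical stand-in for real browser work: sleeps instead of loading a page.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all(urls: list[str], max_concurrency: int = 5) -> list[str]:
    # The semaphore limits how many fetches run at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        async with semaphore:
            return await fetch_page(url)

    # asyncio.gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


pages = asyncio.run(fetch_all([f"SITE-{k}" for k in range(10)]))
print(len(pages))
```

Capping concurrency matters with real browsers, because each open page costs memory and many sites throttle clients that open too many connections at once.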
Installation
pip install playwright-request
playwright install
playwright install-deps
Installation on docker images
Docker images can cause problems. To avoid them,
include the following code in your Dockerfile:
RUN apt-get update && \
    apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg \
    build-essential \
    python3-dev \
    python3-setuptools \
    gcc \
    make \
    apt-utils \
    libxcb-shm0 \
    libx11-xcb-dev \
    libxext-dev \
    libxrandr-dev \
    libxcomposite-dev \
    libxcursor-dev \
    libxdamage-dev \
    libxi-dev \
    libxtst-dev \
    libgtk-3-dev \
    libasound-dev \
    libdbus-glib-1-dev
RUN pip install playwright && \
    playwright install && \
    playwright install-deps
Use in cloud environments like GCP or AWS
To use PlaywrightRequest in AWS or GCP you need to create a Docker image with your code (include the code shown above).
Running your Docker image locally is straightforward, but in the cloud there is a small issue: the command playwright install-deps installs Playwright as the root user. In the cloud, the Docker image is executed by a random user, and that user cannot find the Playwright browsers.
The solution is to add the following command to the code where you use PlaywrightRequest, so that the browsers you need are installed locally:
import playwright
import os
# use only one of the following commands depending on your needs
os.system("playwright install") # use this to install all browsers
os.system("playwright install firefox") # to install firefox
os.system("playwright install chromium") # to install chromium
os.system("playwright install webkit") # to install webkit
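A variant of the same idea uses subprocess, so that a failed installation raises an error instead of passing silently. This is a sketch under assumptions: build_install_command and install_browser are hypothetical helper names, the list of accepted browsers is limited to the three shown above, and the optional browsers_path argument relies on Playwright's PLAYWRIGHT_BROWSERS_PATH environment variable to place the browsers in a directory the runtime user can write to.

```python
import os
import subprocess
import sys
from typing import Optional

SUPPORTED_BROWSERS = ("chromium", "firefox", "webkit")


def build_install_command(browser: Optional[str] = None) -> list[str]:
    """Return the argv for `python -m playwright install [browser]`."""
    command = [sys.executable, "-m", "playwright", "install"]
    if browser is not None:
        if browser not in SUPPORTED_BROWSERS:
            raise ValueError(f"unsupported browser: {browser!r}")
        command.append(browser)
    return command


def install_browser(browser: Optional[str] = None,
                    browsers_path: Optional[str] = None) -> None:
    """Install the browser(s); raises CalledProcessError if the command fails."""
    env = dict(os.environ)
    if browsers_path is not None:
        # Assumption: the same variable must also be set when the browser is launched.
        env["PLAYWRIGHT_BROWSERS_PATH"] = browsers_path
    subprocess.run(build_install_command(browser), check=True, env=env)


# Example call (not executed here): install_browser("chromium")
print(build_install_command("chromium")[1:])
```

Calling install_browser once at startup, before creating the PlaywrightRequest object, mirrors what the os.system lines do.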
Usage
Example #1: simple usage
from playwright_request.playwright_request import PlaywrightRequest
#crawl
requester = PlaywrightRequest()
responses = requester.get(urls=["SITE1","SITE2"])
print(responses[0].status_code, responses[0].html)
print(responses[1].status_code, responses[1].html)
Example #2: simple usage with Chromium
from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.browser_type import BrowserType
#crawl
requester = PlaywrightRequest(browser=BrowserType.CHROMIUM, headless=False)
responses = requester.get(urls=["SITE1"])
print(responses[0].status_code, responses[0].html)
Example #3: define interceptor to avoid loading images
from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.route_interceptor import RouteInterceptor
#crawl
interceptor = RouteInterceptor().set_default_exclusions()
requester = PlaywrightRequest(route_interceptor=interceptor)
responses = requester.get(urls=["SITE1"])
print(responses[0].status_code, responses[0].html)
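What RouteInterceptor().set_default_exclusions() blocks internally is not documented here. For comparison, the following sketch shows how the same effect is commonly achieved with plain Playwright route handlers. The names BLOCKED_RESOURCE_TYPES, should_block and handle_route are hypothetical, and the set of blocked types is an assumption, not the library's default.

```python
import asyncio

# Assumption: resource types worth skipping when only the HTML is needed.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    """Decide whether a request of this Playwright resource type is aborted."""
    return resource_type in BLOCKED_RESOURCE_TYPES


async def handle_route(route) -> None:
    # `route` is expected to behave like playwright.async_api.Route.
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()


# With a real Playwright page this handler would be registered as:
#     await page.route("**/*", handle_route)
print(should_block("image"), should_block("document"))
```

Skipping images and fonts usually reduces bandwidth and load time, at the cost of pages that no longer render as a user would see them.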
Example #4: extra processing results
from playwright.async_api import Page
from playwright_request.playwright_request import PlaywrightRequest
from playwright_request.route_interceptor import RouteInterceptor
async def get_all_photos(page: Page) -> list[str]:
    # 1. Click the "Show all photos" button to open the photo popup window
    await page.locator('button', has_text='Show all photos').click()
    # 2. Wait for the page to finish loading, then wait for the selector
    await page.wait_for_load_state('networkidle', timeout=3000)
    await page.wait_for_load_state(timeout=3000)
    await page.wait_for_selector('div[data-testid=photo-viewer-section]', timeout=3000)
    # 3. Get the photo section element
    photos_section = await page.query_selector('div[data-testid=photo-viewer-section]')
    # 4. Get all picture elements within it
    all_pictures = await photos_section.query_selector_all('picture')
    # 5. Get the img element of each picture and extract the attribute we need
    images = []
    for picture in all_pictures:
        img = await picture.query_selector("img")
        images.append(await img.get_attribute("data-original-uri"))
    # 6. Close the popup window and return the images
    await page.locator('//button[@aria-label="Close"]').click()
    return images

requester = PlaywrightRequest(extra_async_function_ptr=get_all_photos)
responses = requester.get(urls=[f"SITE-{k}" for k in range(100)])
# One list of image URLs per requested page
all_images = [response.extra_result for response in responses]
for response in responses:
    images = response.extra_result
    print(response.status_code, len(images))
Example #5: detect Amazon error pages
from playwright_request.commom_error_page_detectors.amazon_error_page_detector import AmazonErrorPageDetector
from playwright_request.playwright_request import PlaywrightRequest
amazon_detector = AmazonErrorPageDetector()
requester = PlaywrightRequest(error_page_detectors=[amazon_detector])
responses = requester.get(urls=[f"AMAZON-ASIN-{k}" for k in range(100)])
valid_htmls = [response.html for response in responses if response.status_code==200 and not response.error_list]
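When crawling many pages it is often useful to know which URLs failed, not only which succeeded. The sketch below separates the two. It uses FakeResponse, a hypothetical stand-in that carries only the status_code, html and error_list attributes used above, and it assumes that responses come back in the same order as the requested URLs.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResponse:
    # Hypothetical stand-in for the library's response object.
    status_code: int
    html: str = ""
    error_list: list = field(default_factory=list)


def split_responses(urls, responses):
    """Return (valid_htmls, retry_urls), assuming responses[i] belongs to urls[i]."""
    valid_htmls, retry_urls = [], []
    for url, response in zip(urls, responses):
        if response.status_code == 200 and not response.error_list:
            valid_htmls.append(response.html)
        else:
            retry_urls.append(url)
    return valid_htmls, retry_urls


urls = ["AMAZON-ASIN-0", "AMAZON-ASIN-1", "AMAZON-ASIN-2"]
responses = [
    FakeResponse(200, "<html>ok</html>"),
    FakeResponse(200, "<html>captcha</html>", ["captcha detected"]),
    FakeResponse(503),
]
valid_htmls, retry_urls = split_responses(urls, responses)
print(len(valid_htmls), retry_urls)
```

The URLs in retry_urls can then be passed to another requester.get call, possibly after a delay.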
Author
Pedro Mayorga.