An async Scrapy downloader middleware that supports arbitrary request and response manipulation.

Scrapy Manipulate Request Downloader Middleware

This is an async Scrapy downloader middleware that supports arbitrary request and response manipulation.

With it, you can change the request and response however you like: send the request with tls_client, pyhttpx, requests-go, etc., or even drive Chrome through selenium, undetected_chrome, playwright, etc., without having to think about the async logic behind Scrapy.

Installation

pip3 install scrapy-manipulate-request

Usage

You need to enable ManipulateRequestDownloaderMiddleware in DOWNLOADER_MIDDLEWARES first:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_manipulate_request.downloadermiddlewares.ManipulateRequestDownloaderMiddleware': 543,
}

Note that this middleware is async, so it is affected by Scrapy's concurrency settings, such as:

CONCURRENT_REQUESTS = 16
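
To make the effect of a concurrency cap concrete, here is a minimal standard-library sketch. It is not the middleware's actual implementation: it only assumes that blocking callbacks are moved off the event loop into worker threads and that a cap limits how many run at once. The names run_with_cap and blocking_fetch, and the cap value, are hypothetical and exist only for illustration.

```python
import asyncio
import threading
import time


def blocking_fetch(url, state):
    """Stand-in for a blocking HTTP or browser call (hypothetical)."""
    with state["lock"]:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
    time.sleep(0.05)  # simulate network latency
    with state["lock"]:
        state["active"] -= 1
    return f"body of {url}"


async def run_with_cap(urls, cap):
    """Run blocking_fetch for every URL, with at most `cap` calls in flight."""
    state = {"lock": threading.Lock(), "active": 0, "peak": 0}
    semaphore = asyncio.Semaphore(cap)

    async def one(url):
        async with semaphore:
            # Offload the blocking call so the event loop stays responsive.
            return await asyncio.to_thread(blocking_fetch, url, state)

    bodies = await asyncio.gather(*(one(u) for u in urls))
    return bodies, state["peak"]


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(8)]
    bodies, peak = asyncio.run(run_with_cap(urls, cap=2))
    print(len(bodies), "responses, peak concurrency", peak)
```

In this sketch a lower cap means fewer blocking calls overlap, so a slow manipulate_request callback would lengthen the whole crawl more noticeably. Whether the middleware behaves the same way under CONCURRENT_REQUESTS should be checked against its source.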

Manipulating the request and response is simple: add a manipulate_request method to your spider and pass it in the request's meta under the 'manipulate_request' key. It works much like a parse callback.

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        # meta must be a dict that maps the 'manipulate_request' key to the callback
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response handling will start.
        pass
    
    def parse(self, response):
        pass
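
The return-value contract described in the comments above can be sketched without Scrapy installed. This is only an illustration of the documented behaviour, not the middleware's code: FakeRequest, FakeResponse, and dispatch are hypothetical stand-ins, and the "passthrough" branch for requests without a callback is an assumption that the source does not confirm.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request (hypothetical, for illustration only)."""
    url: str
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Stand-in for scrapy.http.TextResponse (hypothetical)."""
    url: str
    status: int
    body: str


def dispatch(request, spider=None):
    """Apply the documented contract to one request.

    Returns ("passthrough", None) when no callback is attached (assumed),
    ("ignored", None) when the callback returns None, and
    ("handled", response) when it returns a response object.
    """
    callback = request.meta.get("manipulate_request")
    if callback is None:
        return "passthrough", None
    response = callback(request, spider)
    if response is None:
        return "ignored", None
    return "handled", response


def fetch(request, spider):
    """Example callback that always produces a response."""
    return FakeResponse(url=request.url, status=200, body="ok")


if __name__ == "__main__":
    req = FakeRequest("https://tls.browserleaks.com/json",
                      meta={"manipulate_request": fetch})
    print(dispatch(req))
```

The lookup relies on meta being a dict, which is why the callback has to be stored as a key-value pair rather than as a set element.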

Useful Example

Send requests with tls_client to bypass JA3 verification

import scrapy
import tls_client
from scrapy.http import TextResponse

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        # meta must be a dict that maps the 'manipulate_request' key to the callback
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
        url = request.url
        headers = request.headers.to_unicode_dict()
        tls_session = tls_client.Session(
            client_identifier='chrome_112',
            random_tls_extension_order=True
        )
        proxy = 'http://username:password@ip:port'
        raw_response = tls_session.get(url=url, headers=headers, proxy=proxy)
        response = TextResponse(url=request.url, status=raw_response.status_code, headers=raw_response.headers,
                                body=raw_response.text, request=request, encoding='utf-8')
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response handling will start.
        return response
    
    def parse(self, response):
        pass

For more detailed tls_client usage, see Python-Tls-Client.
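
The proxy string in the tls_client example embeds the credentials directly in the URL, which breaks if the username or password contains characters such as '@' or ':'. A minimal sketch of a helper that percent-encodes them follows. The function name build_proxy_url is hypothetical, and whether a given HTTP client decodes percent-encoded proxy credentials is an assumption worth verifying for the client you use.

```python
from urllib.parse import quote, urlsplit


def build_proxy_url(username, password, host, port, scheme="http"):
    """Return scheme://user:pass@host:port with the credentials percent-encoded."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"


if __name__ == "__main__":
    proxy = build_proxy_url("username", "p@ss:word", "127.0.0.1", 8080)
    parts = urlsplit(proxy)
    # The host and port still parse correctly despite '@' and ':' in the password.
    print(proxy, parts.hostname, parts.port)
```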

Use undetected Chrome to operate a web page

import scrapy
from pprint import pformat
from scrapy.http import HtmlResponse
from seleniumwire import undetected_chromedriver as uc

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        # meta must be a dict that maps the 'manipulate_request' key to the callback
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
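        chrome_options = uc.ChromeOptions()
        # Placeholders: each call below needs real arguments, so they are left commented out.
        # chrome_options.add_experimental_option(...)
        # chrome_options.add_argument(...)
        # chrome_options.add_extension(...)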
        seleniumwire_options = {
            'proxy': {
                'http': 'http://username:password@ip:port',
                'https': 'https://username:password@ip:port',
            }
        }
        browser = uc.Chrome(version_main=108, options=chrome_options, seleniumwire_options=seleniumwire_options,
                            headless=True, enable_cdp_events=True)
        browser.set_page_load_timeout(10)
        browser.maximize_window()
        browser.add_cdp_listener('Network.requestWillBeSent', self.mylousyprintfunction)
        # Placeholders: supply a script or a CDP command before enabling these calls.
        # browser.execute_script(...)
        # browser.execute_cdp_cmd(...)
        browser.request_interceptor = self.request_interceptor
        browser.get("https://tls.browserleaks.com/json")
        # Placeholder: find_elements needs a locator strategy and a value.
        # elements = browser.find_elements(...)
        ...
        raw_response = browser.page_source
        browser.quit()  # release the Chrome process once the page source is captured
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response handling will start.
        response = HtmlResponse(url=request.url, status=200, body=raw_response, request=request, encoding='utf-8')
        return response

    def mylousyprintfunction(self, message):
        print(pformat(message))

    def request_interceptor(self, request):
        request.headers['New-Header'] = 'Some Value'
        del request.headers['Referer']
        request.headers['Referer'] = 'some_referer'

For more detailed Chrome operations, see undetected-chromedriver and selenium-wire.
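
Each call to manipulate_request in the Chrome example starts a new browser, so releasing it reliably matters even when page loading fails. One possible pattern, sketched here with the standard library only, is a context manager that always calls quit(). The names managed_browser and FakeBrowser are hypothetical; FakeBrowser merely stands in for a real driver so that the sketch runs without Chrome installed.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with `factory` and always call quit() afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        browser.quit()  # runs even if the body of the with-block raised


class FakeBrowser:
    """Stand-in for a Selenium driver (hypothetical, for illustration only)."""

    def __init__(self):
        self.closed = False
        self.current_url = None
        self.page_source = "<html><body>ok</body></html>"

    def get(self, url):
        self.current_url = url

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    with managed_browser(FakeBrowser) as browser:
        browser.get("https://tls.browserleaks.com/json")
        html = browser.page_source
    print(len(html), "characters captured, browser closed:", browser.closed)
```

With a real driver, the factory would be a function that builds and returns the configured Chrome instance, and the response object would be constructed from the captured page source after the with-block.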

Release history

0.0.2: Jun 21, 2023

0.0.1: Jun 21, 2023
