An async Scrapy downloader middleware that supports arbitrary request and response manipulation.

Scrapy Manipulate Request Downloader Middleware

This is an async Scrapy downloader middleware that supports arbitrary request and response manipulation.

With it, you can easily change the request and response in any way you like. You can send requests with tls_client, pyhttpx, requests-go, etc. You can even drive Chrome with selenium, undetected_chrome, playwright, etc., without having to think about the async logic behind Scrapy.

Installation

pip3 install scrapy-manipulate-request

Usage

You need to enable ManipulateRequestDownloaderMiddleware in DOWNLOADER_MIDDLEWARES first:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_manipulate_request.downloadermiddlewares.ManipulateRequestDownloaderMiddleware': 543,
}

Note that this middleware is asynchronous, which means it is affected by some Scrapy settings, such as:

CONCURRENT_REQUESTS = 16
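
As a rough illustration of why a concurrency limit matters when the work is blocking, the sketch below uses plain asyncio. It does not use Scrapy and does not show this middleware's internals, which this README does not describe. It runs a blocking function in worker threads while a semaphore caps how many calls are in flight at once. The names blocking_fetch and fetch_all, the sleep duration, and the limit value are all hypothetical.

```python
import asyncio
import time


def blocking_fetch(url):
    """Hypothetical stand-in for a blocking HTTP client or browser call."""
    time.sleep(0.05)
    return f"body of {url}"


async def fetch_all(urls, limit):
    """Run blocking_fetch in threads, with at most `limit` calls in flight."""
    loop = asyncio.get_running_loop()
    semaphore = asyncio.Semaphore(limit)
    active = 0  # calls currently running
    peak = 0    # highest number of simultaneous calls observed

    async def fetch_one(url):
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            try:
                # Hand the blocking call to the default thread pool so the
                # event loop stays free to schedule other work.
                return await loop.run_in_executor(None, blocking_fetch, url)
            finally:
                active -= 1

    bodies = await asyncio.gather(*(fetch_one(u) for u in urls))
    return bodies, peak


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(6)]
    bodies, peak = asyncio.run(fetch_all(urls, limit=2))
    print(len(bodies), "responses, peak concurrency", peak)
```

In a Scrapy project the equivalent cap comes from settings such as CONCURRENT_REQUESTS, so a slow manipulate_request function holds one of those slots for as long as it runs.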

Manipulating the request and response is simple: add a manipulate_request function to your spider and pass it in the request meta. It works much like a parse function.

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        # meta must be a dict that maps the key to the function
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
    
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response handling will start.
        pass
    
    def parse(self, response):
        pass
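
Because returning None causes the request to be ignored, one option is to turn failures inside manipulate_request into None instead of letting them propagate. The sketch below is a minimal decorator for that, written with the standard library only. The name ignore_on_error is hypothetical, and how the middleware itself treats exceptions is not documented here, so this is an assumption about what you may want rather than a description of the library.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def ignore_on_error(func):
    """Hypothetical helper: return None when the wrapped function raises.

    Per the contract described above, a None result means the request
    is ignored.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Log the traceback so the failure is visible, then give up
            # on this request.
            logger.exception("%s failed; returning None", func.__name__)
            return None
    return wrapper


class DemoSpider:
    """Plain class standing in for a spider, for illustration only."""

    @ignore_on_error
    def manipulate_request(self, request, spider):
        if request is None:
            raise ValueError("no request given")
        return f"response for {request}"
```

In a real spider the decorator would be applied to the manipulate_request method in the same way.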

Useful Example

Send requests with tls_client to bypass JA3 verification

import scrapy
import tls_client
from scrapy.http import TextResponse

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        # meta must be a dict that maps the key to the function
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
        url = request.url
        headers = request.headers.to_unicode_dict()
        tls_session = tls_client.Session(
            client_identifier='chrome_112',
            random_tls_extension_order=True
        )
        proxy = 'http://username:password@ip:port'
        raw_response = tls_session.get(url=url, headers=headers, proxy=proxy)
        response = TextResponse(url=request.url, status=raw_response.status_code, headers=raw_response.headers,
                                body=raw_response.text, request=request, encoding='utf-8')
        return response
        
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response handling will start.
    
    def parse(self, response):
        pass

For more detailed tls_client usage, see Python-Tls-Client.
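
The proxy string in the example has the form http://username:password@ip:port. If the credentials contain characters such as @ or :, they generally need to be percent-encoded for the URL to parse correctly. The sketch below builds such a string with the standard library. The helper name build_proxy_url is hypothetical, and it assumes the HTTP client accepts standard URL-encoded credentials.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, scheme="http"):
    """Hypothetical helper: build scheme://user:pass@host:port.

    Credentials are percent-encoded so that reserved characters such as
    '@' and ':' cannot be mistaken for URL delimiters.
    """
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


if __name__ == "__main__":
    print(build_proxy_url("user", "p@ss:word", "127.0.0.1", 8080))
    # prints http://user:p%40ss%3Aword@127.0.0.1:8080
```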

Use undetected Chrome to operate a web page

import scrapy
from pprint import pformat
from scrapy.http import HtmlResponse
from seleniumwire import undetected_chromedriver as uc

class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        # meta must be a dict that maps the key to the function
        meta_data = {'manipulate_request': self.manipulate_request}
        yield scrapy.Request(url="https://tls.browserleaks.com/json", meta=meta_data)
    
    def manipulate_request(self, request, spider):
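        chrome_options = uc.ChromeOptions()
        # Configure the options as needed. Each of these calls requires arguments:
        # chrome_options.add_experimental_option(name, value)
        # chrome_options.add_argument(argument)
        # chrome_options.add_extension(extension_path)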
        seleniumwire_options = {
            'proxy': {
                'http': 'http://username:password@ip:port',
                'https': 'https://username:password@ip:port',
            }
        }
        browser = uc.Chrome(version_main=108, options=chrome_options, seleniumwire_options=seleniumwire_options,
                            headless=True, enable_cdp_events=True)
        browser.set_page_load_timeout(10)
        browser.maximize_window()
        browser.add_cdp_listener('Network.requestWillBeSent', self.mylousyprintfunction)
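        # Run scripts or CDP commands as needed. Both calls require arguments:
        # browser.execute_script(script, *args)
        # browser.execute_cdp_cmd(cmd, cmd_args)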
        browser.request_interceptor = self.request_interceptor
        browser.get("https://tls.browserleaks.com/json")
        elements = browser.find_elements()
        ...
        raw_response = browser.page_source
        response = HtmlResponse(url=request.url, status=200, body=raw_response, request=request, encoding='utf-8')
        return response
        # Return None and the request will be ignored.
        # Return a scrapy.http.HtmlResponse or scrapy.http.TextResponse object
        # and response handling will start.

    def mylousyprintfunction(self, message):
        print(pformat(message))

    def request_interceptor(self, request):
        request.headers['New-Header'] = 'Some Value'
        del request.headers['Referer']
        request.headers['Referer'] = 'some_referer'

For more detailed Chrome operations, see undetected-chromedriver and selenium-wire.
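
The browser example starts a new Chrome instance inside manipulate_request and never closes it, so each request may leave a browser process running. One way to guard against that is a small context manager that always calls quit(), sketched below. It assumes the driver object exposes a quit() method, as Selenium WebDriver objects do. The names managed_browser and FakeBrowser are hypothetical, and FakeBrowser exists only so the sketch runs without Chrome installed.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Hypothetical helper: create a browser and always quit it afterwards.

    `factory` is any zero-argument callable returning an object with a
    quit() method.
    """
    browser = factory()
    try:
        yield browser
    finally:
        # Runs on normal exit and when the body raises, so the browser
        # process is not left behind.
        browser.quit()


class FakeBrowser:
    """Stand-in for a real driver, for illustration only."""

    def __init__(self):
        self.page_source = "<html></html>"
        self.closed = False

    def quit(self):
        self.closed = True


if __name__ == "__main__":
    fake = FakeBrowser()
    with managed_browser(lambda: fake) as browser:
        html = browser.page_source
    print(fake.closed)  # prints True
```

Inside manipulate_request, the factory would be a callable that builds the real driver, and the response object would be constructed from page_source before the with block exits.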

