Proxy Components for Scrapy & Gerapy

Gerapy Proxy

This package adds asynchronous proxy support to Scrapy. It is also a module of Gerapy.

Installation

pip3 install gerapy-proxy

Usage

If you have a ProxyPool that can provide a random proxy for every request, you can use this package to integrate proxies into your Scrapy/Gerapy project.

For example, suppose there is a ProxyPool API that returns one random proxy per call. You can set GERAPY_PROXY_POOL_URL, a setting provided by this package, to enable the proxy for every Scrapy Request.

To use this package, first install it and then enable the middleware in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_proxy.middlewares.ProxyPoolMiddleware': 543,
}

Then add the ProxyPool URL to your settings:

GERAPY_PROXY_POOL_URL = 'https://proxypool.scrape.center/random'

This ProxyPool service is built from the ProxyPool repo. You can also build your own ProxyPool service.

That completes the setup.

For each request, the ProxyPoolMiddleware first fetches a proxy from GERAPY_PROXY_POOL_URL and then sets it as meta['proxy'] on the Scrapy Request.
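
The flow above can be sketched without Scrapy as two small steps: fetch a proxy string from the pool, then write it to the request meta under the proxy key, which is where Scrapy's built-in HttpProxyMiddleware looks for it. This is a minimal sketch and not the package's actual implementation. The names fetch_proxy and apply_proxy, the injectable opener parameter, and the choice to prepend http:// to a bare host:port value are all assumptions made for illustration.

```python
import io
from urllib.request import urlopen


def fetch_proxy(pool_url, timeout=5, opener=urlopen):
    """Return the proxy string served by pool_url, or None on failure.

    Hypothetical helper: `opener` is injectable so the function can be
    exercised without network access.
    """
    try:
        with opener(pool_url, timeout=timeout) as response:
            text = response.read().decode("utf-8").strip()
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None
    return text or None


def apply_proxy(meta, proxy, scheme="http"):
    """Store the proxy under meta['proxy'], adding a scheme if it is missing."""
    if proxy:
        meta["proxy"] = proxy if "://" in proxy else f"{scheme}://{proxy}"
    return meta


# Offline demonstration with a fake pool that answers with one proxy.
def fake_pool(url, timeout):
    return io.BytesIO(b"113.124.94.189:9999\n")


meta = apply_proxy(
    {}, fetch_proxy("https://proxypool.scrape.center/random", opener=fake_pool)
)
print(meta)
```

When the pool cannot be reached, fetch_proxy returns None and apply_proxy leaves the meta untouched, so the request would go out without a proxy.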

Configuration

Basic Auth

If your ProxyPool is protected by Basic Auth, you can enable it by configuring these settings:

GERAPY_PROXY_POOL_AUTH = True
GERAPY_PROXY_POOL_USERNAME = '<username>'
GERAPY_PROXY_POOL_PASSWORD = '<password>'
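
To keep credentials out of the settings file, the three values can be read from the environment instead. The following is a sketch under stated assumptions: the helper proxy_pool_auth_settings and the environment variable names PROXY_POOL_USERNAME and PROXY_POOL_PASSWORD are hypothetical and not part of this package.

```python
import os


def proxy_pool_auth_settings(env=os.environ):
    """Build the three auth settings from environment variables.

    PROXY_POOL_USERNAME and PROXY_POOL_PASSWORD are hypothetical variable
    names chosen for this sketch.
    """
    username = env.get("PROXY_POOL_USERNAME")
    password = env.get("PROXY_POOL_PASSWORD")
    return {
        # Enable auth only when both credentials are present.
        "GERAPY_PROXY_POOL_AUTH": bool(username and password),
        "GERAPY_PROXY_POOL_USERNAME": username,
        "GERAPY_PROXY_POOL_PASSWORD": password,
    }


# Demonstration with an explicit mapping instead of the real environment.
settings = proxy_pool_auth_settings(
    {"PROXY_POOL_USERNAME": "user", "PROXY_POOL_PASSWORD": "secret"}
)
print(settings["GERAPY_PROXY_POOL_AUTH"])
```

In settings.py, the returned dictionary could be merged into the module namespace with globals().update(proxy_pool_auth_settings()).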

Min Retry Times

If you want the proxy to be enabled only after a request has been retried a certain number of times, configure this setting:

GERAPY_PROXY_POOL_MIN_RETRY_TIMES = 2

The proxy is then used only when the retry count of the Request is greater than or equal to 2.
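
The rule above amounts to a threshold check on the retry counter. Scrapy's RetryMiddleware records that counter in the request meta under retry_times. The sketch below shows the comparison only; the function name retries_reached is hypothetical, and whether the package reads the counter in exactly this way is an assumption.

```python
def retries_reached(meta, min_retry_times):
    """True once the request has been retried at least min_retry_times times."""
    # A request that has never been retried has no 'retry_times' key yet.
    return meta.get("retry_times", 0) >= min_retry_times


# With a threshold of 2, the first attempt and the first retry go out directly.
for retry_times in range(4):
    print(retry_times, retries_reached({"retry_times": retry_times}, 2))
```

With this rule, first attempts use your own connection and only repeatedly failing requests are sent through a proxy.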

Random Enabled

If you want the proxy to be enabled for only a random share of requests, configure the probability of enabling it:

GERAPY_PROXY_POOL_RANDOM_ENABLE_RATE = 0.8

The probability of enabling the proxy is then 80%. If you set it to 1, the proxy is always enabled.
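
A rate like this is typically applied by comparing a uniform random draw against it. The sketch below illustrates that technique; the function name randomly_enabled and the injectable rng parameter are hypothetical, and the package may implement the draw differently.

```python
import random


def randomly_enabled(rate, rng=random.random):
    """Return True with probability `rate` (0 means never, 1 means always)."""
    # rng() is uniform on [0, 1), so the comparison holds with probability rate.
    return rng() < rate


# Estimate the share of requests that would use the proxy at a rate of 0.8.
seeded = random.Random(0)
draws = [randomly_enabled(0.8, seeded.random) for _ in range(10_000)]
print(sum(draws) / len(draws))
```

Because the draw never reaches 1, a rate of 1 enables the proxy on every request, which matches the behaviour described above.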

Fetch Timeout

You can also configure the maximum time, in seconds, to wait when fetching a proxy from the ProxyPool:

GERAPY_PROXY_POOL_TIMEOUT = 5

With this setting, if the ProxyPool does not return a result within 5 seconds, no proxy is used for that request.
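
The fall-back behaviour can be illustrated with asyncio.wait_for, which cancels an awaitable that runs past a deadline. This is a sketch of the idea rather than the package's code: fetch_with_timeout, fast_pool, and slow_pool are hypothetical names, and the short timeouts are chosen only to keep the demonstration quick.

```python
import asyncio


async def fetch_with_timeout(fetch, timeout):
    """Await fetch(); return its result, or None if it exceeds timeout seconds."""
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # caller proceeds without a proxy


async def fast_pool():
    # Stand-in for a pool that answers immediately.
    return "113.124.94.189:9999"


async def slow_pool():
    # Stand-in for a pool that is slower than the timeout used below.
    await asyncio.sleep(0.5)
    return "84.53.238.49:23500"


print(asyncio.run(fetch_with_timeout(fast_pool, timeout=0.1)))
print(asyncio.run(fetch_with_timeout(slow_pool, timeout=0.1)))
```

The first call returns the proxy string and the second returns None, which corresponds to sending the request without a proxy.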

ProxyPool Response Parser

Your ProxyPool may not return the proxy as plain text in host:port form. In that case, you can define a parser to extract the proxy from the ProxyPool response.

For example, if your ProxyPool returns this for every request:

{
  "host": "111.222.223.224",
  "port": 3128
}

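You can define a function like this:

import json


def parse_result(text):
    # text is the raw response body returned by the ProxyPool
    data = json.loads(text)
    # Join host and port into the host:port form
    return f'{data.get("host")}:{data.get("port")}'


# Register the parser in settings
GERAPY_PROXY_EXTRACT_FUNC = parse_result

The proxy is then extracted in the correct format.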
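
If a pool may answer with either JSON or plain text, a more tolerant parser can handle both. The following is a sketch, assuming the JSON shape shown above with host and port keys; the function name parse_proxy is hypothetical, and it would be registered through GERAPY_PROXY_EXTRACT_FUNC in the same way as parse_result.

```python
import json


def parse_proxy(text):
    """Return 'host:port' from a JSON object or from a plain-text reply."""
    text = text.strip()
    try:
        data = json.loads(text)
    except ValueError:  # not JSON, so assume it is already host:port
        return text
    # Only trust the JSON form when both expected keys are present.
    if isinstance(data, dict) and "host" in data and "port" in data:
        return f'{data["host"]}:{data["port"]}'
    return text


print(parse_proxy('{"host": "111.222.223.224", "port": 3128}'))
print(parse_proxy("113.124.94.189:9999\n"))
```

Checking for both keys avoids producing strings such as None:None when the response has an unexpected shape.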

Example

For more details, see the example.

You can also run the example directly with Docker:

docker run germey/gerapy-proxy-example

Output:

2020-07-15 19:17:34 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2020-07-15 19:17:34 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, May  6 2020, 04:59:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Darwin-19.4.0-x86_64-i386-64bit
2020-07-15 19:17:34 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-15 19:17:34 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'CONCURRENT_REQUESTS': 3,
 'DOWNLOAD_TIMEOUT': 10,
 'NEWSPIDER_MODULE': 'example.spiders',
 'RETRY_TIMES': 10,
 'SPIDER_MODULES': ['example.spiders']}
2020-07-15 19:17:34 [scrapy.extensions.telnet] INFO: Telnet Password: 33299ca0ce64f215
2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-07-15 19:17:34 [asyncio] DEBUG: Using selector: KqueueSelector
2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'gerapy_proxy.middlewares.ProxyPoolMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-15 19:17:34 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-15 19:17:34 [scrapy.core.engine] INFO: Spider opened
2020-07-15 19:17:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-15 19:17:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-15 19:17:34 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:34 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy 113.124.94.189:9999
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy 84.53.238.49:23500
2020-07-15 19:17:35 [gerapy_proxy.middlewares] DEBUG: get proxy 217.150.77.31:53281
2020-07-15 19:17:40 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://httpbin.org/delay/3> (referer: None)
2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:40 [example.spiders.httpbin] INFO: got request from 113.124.94.189 successfully, current page 1
2020-07-15 19:17:40 [gerapy_proxy.middlewares] DEBUG: get proxy 144.52.244.3:9999
2020-07-15 19:17:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://httpbin.org/delay/3> (failed 1 times): User timeout caused connection failure: Getting https://httpbin.org/delay/3 took longer than 10.0 seconds..
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:45 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <POST https://httpbin.org/delay/3> (failed 1 times): User timeout caused connection failure: Getting https://httpbin.org/delay/3 took longer than 10.0 seconds..
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: start to get proxy from proxy pool
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy using kwargs {'timeout': 5, 'url': 'https://proxypool.scrape.center/random'}
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy 1.20.101.149:44778
2020-07-15 19:17:45 [gerapy_proxy.middlewares] DEBUG: get proxy 105.27.116.46:56792
