Scrapy Proxies: random proxy middleware for Scrapy (supports loading proxies from IPProxyTool)

Project description

Random proxy middleware for Scrapy (http://scrapy.org/)

Based on https://github.com/aivarsk/scrapy-proxies, with added support for loading proxies from https://github.com/qiyeboy/IPProxyPool

Processes Scrapy requests using a random proxy from a list to avoid IP bans and improve crawling speed.

Get your proxy list from sites like http://www.hidemyass.com/ (copy-paste it into a text file and reformat each entry as http://host:port)
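
The reformatting step can be scripted. The following is a minimal sketch, assuming the pasted text has one proxy per line, written either as host:port or as host and port separated by whitespace. The function name normalize_proxy_lines is hypothetical and is not part of this package.

```python
def normalize_proxy_lines(lines, scheme="http"):
    """Return proxy URLs in scheme://host:port form, skipping blanks and duplicates."""
    seen = set()
    result = []
    for raw in lines:
        entry = raw.strip()
        # Ignore empty lines and comment lines.
        if not entry or entry.startswith("#"):
            continue
        parts = entry.split()
        # Copy-pasted tables often separate host and port with whitespace.
        if len(parts) >= 2 and "://" not in parts[0] and ":" not in parts[0]:
            entry = f"{parts[0]}:{parts[1]}"
        else:
            entry = parts[0]
        # Add the scheme only when the entry does not already carry one.
        if "://" not in entry:
            entry = f"{scheme}://{entry}"
        if entry not in seen:
            seen.add(entry)
            result.append(entry)
    return result
```

To use it, read the pasted file, pass its lines to normalize_proxy_lines, and write the result one entry per line to the file referenced by 'list' in PROXY_SETTINGS.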

Install

The quick way:

pip install scrapy-proxies-tool

Or check out the source and run:

python setup.py install

settings.py

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
  'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
  'scrapy_proxies.RandomProxy': 100,
  'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_SETTINGS = {
  # Proxy list file(s) containing entries like
  # http://host1:port
  # http://username:password@host2:port
  # http://host3:port
  # ...
  # If PROXY_SETTINGS['from_proxies_server'] is True, 'list' holds proxy server addresses instead
  # (see https://github.com/qiyeboy/IPProxyPool and https://github.com/awolfly9/IPProxyTool).
  # Only HTTP is supported (see https://github.com/qiyeboy/IPProxyPool#%E5%8F%82%E6%95%B0)
  # 'list': ['http://localhost:8000?protocol=0'],
  'list': ['/path/to/proxy/list.txt'],

  # Disable proxying and use the real IP when all proxies are unusable
  'use_real_when_empty': False,
  'from_proxies_server': False,

  # If proxy mode is 2, uncomment this line:
  # 'custom_proxy': "http://host1:port",

  # Proxy mode
  # 0 = Pick a random proxy for every request
  # 1 = Take only one proxy from the list and assign it to every request
  # 2 = Use the custom proxy set in the settings
  'mode': 0
}
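
The three proxy modes can be pictured with a small selection function. This is an illustrative sketch only, not the middleware's actual code. The name pick_proxy and its parameters (sticky, rng) are hypothetical, and returning None for an empty list is an assumption about how a caller might handle 'use_real_when_empty'.

```python
import random


def pick_proxy(mode, proxies, custom_proxy=None, sticky=None, rng=random):
    """Illustrative proxy selection for the three modes (not the package's code).

    mode 0: a random proxy for every request
    mode 1: one proxy chosen once and reused (pass it back via `sticky`)
    mode 2: always the custom proxy from the settings
    """
    if mode == 2:
        if not custom_proxy:
            raise ValueError("mode 2 requires custom_proxy")
        return custom_proxy
    if not proxies:
        # Nothing usable: the caller decides whether to fall back to the real IP.
        return None
    if mode == 1:
        # Keep the previously chosen proxy while it is still in the list.
        return sticky if sticky in proxies else rng.choice(proxies)
    if mode == 0:
        return rng.choice(proxies)
    raise ValueError(f"unknown mode: {mode}")
```

In mode 1 the caller stores the first returned proxy and passes it back as sticky on later calls, which keeps every request on the same proxy until that proxy is removed from the list.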

For older versions of Scrapy (before 1.0.0) you have to use scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware middlewares instead.

Your spider

In each callback, check that the proxy really returned your target page by looking for the site logo or some other significant element. If it did not, retry the request with dont_filter=True:

  # Inside a callback; requires `from scrapy import Request` at module level.
  if not response.xpath('//get/site/logo'):
    yield Request(url=response.url, dont_filter=True)
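
Retrying with dont_filter=True can loop indefinitely if no proxy ever returns the real page. A minimal sketch of one way to bound the retries is shown below, assuming a counter is carried in the request's meta dict. The helper next_retry_meta and the key name proxy_retry_count are hypothetical and not part of this package.

```python
def next_retry_meta(meta, max_retries=5, key="proxy_retry_count"):
    """Return the meta dict for the next retry, or None once the cap is reached.

    Only the counter is carried forward, so other per-request state in
    `meta` is not copied onto the retried request.
    """
    count = meta.get(key, 0)
    if count >= max_retries:
        return None
    return {key: count + 1}


# Possible use inside a Scrapy callback (shown as comments so that the
# helper above stays runnable without Scrapy installed):
#
#   if not response.xpath('//get/site/logo'):
#       meta = next_retry_meta(response.meta)
#       if meta is not None:
#           yield Request(url=response.url, meta=meta, dont_filter=True)
```

Returning a fresh dict rather than a copy of response.meta is a deliberate choice in this sketch: it avoids passing along keys that a middleware may have set on the failed request.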

Project details

Release history

0.4.0 (this version): Sep 15, 2018
0.3.0: Sep 15, 2018

Download files

Source Distribution

scrapy-proxies-tool-0.4.0.tar.gz (4.3 kB)

Uploaded Sep 15, 2018 Source

File details

Details for the file scrapy-proxies-tool-0.4.0.tar.gz.

File metadata

Download URL: scrapy-proxies-tool-0.4.0.tar.gz
Upload date: Sep 15, 2018
Size: 4.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.9.1 setuptools/40.2.0 requests-toolbelt/0.8.0 tqdm/4.26.0 CPython/2.7.12

File hashes

Hashes for scrapy-proxies-tool-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`1b5e2614b91625f52906b6f990b86efcafc3141eb3886f0de5978bbe63523915`
MD5	`42faf711cde6fd74f28394bdca1b93a8`
BLAKE2b-256	`d69a529796b44339608ff26026831bf8ebf123a804e001535ef8239f75329416`

