
Advanced Scrapy Proxies: random proxy middleware for Scrapy with advanced features

Project description

advanced-scrapy-proxies

advanced-scrapy-proxies is a Python library for dealing with proxies in your Scrapy project. Starting from Aivarsk's scrapy-proxies middleware (not updated since 2018), I'm adding more features to manage lists of proxies that are generated dynamically.

Installation

Use the package manager pip to install advanced-scrapy-proxies.

pip install advanced-scrapy-proxies
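Note that while the PyPI package name is hyphenated, the importable module name uses underscores (as the wheel filename advanced_scrapy_proxies-0.1.3-py3-none-any.whl suggests, and as Python requires for module names), so a quick post-install sanity check could be:

python -c "import advanced_scrapy_proxies"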

Usage

settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'advanced_scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110
}

## Proxy mode
	# -1: NO_PROXY, the middleware is configured but does nothing. Useful when you need to automate the selection of the mode.
	# 0: RANDOMIZE_PROXY_EVERY_REQUESTS, every request uses a different proxy.
	# 1: RANDOMIZE_PROXY_ONCE, selects one proxy from the input list for the whole execution.
	# 2: SET_CUSTOM_PROXY, uses the proxy specified with the CUSTOM_PROXY option.
	# 3: REMOTE_PROXY_LIST, uses the proxy list at the specified URL.
PROXY_MODE = 3

PROXY_LIST = 'https://yourproxylisturl/list.txt'
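The expected format of the proxy list is not shown here; assuming the usual one-proxy-per-line convention (with optional credentials), a local myproxylist.txt or the remote list.txt might look like:

http://1.2.3.4:8080
http://user:password@5.6.7.8:3128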

As with every Scrapy project, you can override the settings in settings.py when calling the scraper.

##PROXY_MODE=-1, the spider does not use the proxy list provided.
scrapy crawl myspider -s PROXY_MODE=-1 -s PROXY_LIST='myproxylist.txt'
##PROXY_MODE=0, the spider uses the proxy list provided, choosing a different proxy for every request.
scrapy crawl myspider -s PROXY_MODE=0 -s PROXY_LIST='myproxylist.txt'
##PROXY_MODE=1, the spider uses the proxy list provided, choosing only one proxy for the whole execution.
scrapy crawl myspider -s PROXY_MODE=1 -s PROXY_LIST='myproxylist.txt'
##PROXY_MODE=2, the spider uses the single proxy provided.
scrapy crawl myspider -s PROXY_MODE=2 -s PROXY_LIST='http://myproxy.com:80'
##PROXY_MODE=3, the spider uses the proxy list at the URL provided. The list is re-read at every request made by the spider, so it can be updated during the execution.
scrapy crawl myspider -s PROXY_MODE=3 -s PROXY_LIST='https://yourproxylisturl/list.txt'
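The same settings can also be set per spider instead of on the command line. Here is a minimal sketch using Scrapy's standard custom_settings attribute; the spider name, URLs and parse logic are placeholders:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    # per-spider overrides, equivalent to passing -s PROXY_MODE=... -s PROXY_LIST=... on the command line
    custom_settings = {
        'PROXY_MODE': 3,
        'PROXY_LIST': 'https://yourproxylisturl/list.txt',
    }
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'status': response.status}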

Planned new features and updates

Minor updates

  • adding more tests on the format of the input variables
  • rewriting error messages

New features

  • Adding a cooldown list: instead of deleting a proxy after a failed attempt to get data, put it on a cooldown list where it is not used by the scraper for a limited time but is ready to be reused once the cooldown expires (a sketch of the idea follows this list).
  • Adding support for reading proxy list URLs protected by a username and password.
  • Updating the proxy list at every request even for PROXY_MODE=0.
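The cooldown list is only a planned feature, so the sketch below is not the library's implementation; it is a minimal illustration of the idea in plain Python, with all names (ProxyCooldown, mark_failed, usable) hypothetical:

import time

class ProxyCooldown:
    """Hypothetical cooldown list: failed proxies rest for a while instead of being deleted."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown_seconds = cooldown_seconds
        self.cooling = {}  # proxy -> timestamp at which it becomes usable again

    def mark_failed(self, proxy):
        # instead of removing the proxy from the pool, park it until the cooldown expires
        self.cooling[proxy] = time.time() + self.cooldown_seconds

    def usable(self, proxies):
        # release proxies whose cooldown has expired, then return every proxy not cooling down
        now = time.time()
        for proxy, ready_at in list(self.cooling.items()):
            if now >= ready_at:
                del self.cooling[proxy]
        return [p for p in proxies if p not in self.cooling]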

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

License

GNU GPLv2

Download files

Download the file for your platform.

Source Distribution

advanced-scrapy-proxies-0.1.3.tar.gz (10.8 kB)

Uploaded Source

Built Distribution


advanced_scrapy_proxies-0.1.3-py3-none-any.whl (11.9 kB)

Uploaded Python 3

File details

Details for the file advanced-scrapy-proxies-0.1.3.tar.gz.

File metadata

  • Download URL: advanced-scrapy-proxies-0.1.3.tar.gz
  • Upload date:
  • Size: 10.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.7.9

File hashes

Hashes for advanced-scrapy-proxies-0.1.3.tar.gz
  • SHA256: 52356685740fb5de3a57f5c93a8353eead6e860be883453229559debdf0650ac
  • MD5: e3630984fd3fefa5f6274fc19f4f5c77
  • BLAKE2b-256: df456f04b34de1e8eadbe8a95a229753e13c2de85199dd8afb9e2850b0223881


File details

Details for the file advanced_scrapy_proxies-0.1.3-py3-none-any.whl.

File hashes

Hashes for advanced_scrapy_proxies-0.1.3-py3-none-any.whl
  • SHA256: 6101306a9a2577574bb524d3da7cb7b9f007532745cc6236a40dcf82c6d3ac0d
  • MD5: 73e48a31ccbff716419f07dfe0be9292
  • BLAKE2b-256: 0d7946e415c6ecfb64250ac0003af627696fb2d434546b032cfbc22055534242

