scrapy-scylla-proxies: Random proxy middleware for Scrapy that fetches valid proxies from Scylla.
Random proxy middleware for Scrapy
Using Scylla to fetch valid proxies.
NOTE: I am not a 'real' programmer, help always appreciated! But it works! ... for now.
Processes Scrapy requests through randomly chosen proxies to avoid IP bans and improve crawling speed. It plugs into the Scylla project, which provides a local database of proxies.
Install & run Scylla
The Scylla project needs to be set up separately! The quickest way to do this is to use the Docker container. The following command downloads and runs Scylla (provided you have Docker installed, of course).
docker run -d -p 8899:8899 -p 8081:8081 --name scylla wildcat/scylla:latest
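Once the container is up, it can help to confirm that Scylla is actually handing out proxies before wiring it into Scrapy. The following is a minimal sketch, not code from this package: the '/api/v1/proxies' path, the 'https' query parameter and the 'proxies', 'ip', 'port' field names are assumptions based on Scylla's JSON API and should be checked against your Scylla version. The helper names 'build_scylla_url' and 'parse_proxies' are hypothetical, and the sample payload uses a documentation-only IP address instead of a live request.

```python
import json
from urllib.parse import urlencode


def build_scylla_url(base_uri="http://localhost:8899", https_only=True):
    """Build the proxy-list URL. Path and query name are assumptions."""
    url = base_uri.rstrip("/") + "/api/v1/proxies"
    if https_only:
        url += "?" + urlencode({"https": "true"})
    return url


def parse_proxies(payload):
    """Turn a decoded JSON payload into a list of 'ip:port' strings."""
    return [
        "{}:{}".format(entry["ip"], entry["port"])
        for entry in payload.get("proxies", [])
    ]


# Hand-written sample in the assumed response shape (no network needed).
sample = json.loads(
    '{"proxies": [{"ip": "203.0.113.7", "port": 8080, "is_https": true}]}'
)

print(build_scylla_url())
print(parse_proxies(sample))
```

To query a running instance, the URL from 'build_scylla_url()' can be fetched with any HTTP client and the decoded JSON passed to 'parse_proxies'. An empty list right after start-up may simply mean Scylla has not finished validating proxies yet.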
Install scrapy-scylla-proxies
The quick way:
pip install scrapy-scylla-proxies
Or check out the source and run:
python setup.py install
What to put in Scrapy's 'settings.py'
These are the settings you need to integrate this middleware with Scrapy.
SSP_ENABLED - This MUST be set to True.
SSP_SCYLLA_URI - The location of the Scylla API (Default: 'http://localhost:8899').
SSP_PROXY_TIMEOUT - How often the proxy list is refreshed (Default: 60s).
SSP_HTTPS - Whether to use only HTTPS proxies. You will need this set to True if you are scraping an HTTPS site (Default: True).
SSP_SPLASH_REQUEST_ENABLED - Whether this middleware will need to set the proxy for a 'scrapy.Request' or a 'SplashRequest' (Default: False)
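To make the role of the refresh interval concrete, here is a rough sketch of how a random proxy middleware could work. It is an illustration under stated assumptions, not this package's actual RandomProxyMiddleware: the class name 'RandomProxySketch', the 'fetch_proxies' callable and the 'clock' parameter are hypothetical. The one part that is standard Scrapy behaviour is that the built-in HttpProxyMiddleware uses whatever proxy URL is stored in request.meta['proxy'].

```python
import random
import time


class RandomProxySketch:
    """Illustrative only; not the package's real middleware."""

    def __init__(self, fetch_proxies, refresh_seconds=60, clock=time.monotonic):
        self.fetch_proxies = fetch_proxies      # callable returning proxy URLs
        self.refresh_seconds = refresh_seconds  # plays the role of SSP_PROXY_TIMEOUT
        self.clock = clock                      # injectable so it can be tested
        self.proxies = []
        self.last_refresh = None

    def _refresh_if_stale(self):
        """Re-fetch the proxy list on first use or once the interval has passed."""
        now = self.clock()
        if self.last_refresh is None or now - self.last_refresh >= self.refresh_seconds:
            self.proxies = list(self.fetch_proxies())
            self.last_refresh = now

    def process_request(self, request, spider):
        """Attach a randomly chosen proxy to the outgoing request."""
        self._refresh_if_stale()
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

In this sketch 'fetch_proxies' would be a function that asks Scylla for its current list and returns URLs such as 'http://203.0.113.7:8080'. The 'http://' scheme here is an assumption for illustration.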
Example 'settings.py'
This is a sample taken directly from a working scraper of mine. I used it to scrape approximately 15000 items from a website without any 'bans'.
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # For retries
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 290,
    # For random scylla proxies
    'scrapy_scylla_proxies.random_proxy.RandomProxyMiddleware': 300,
    # For http proxy ip rotation
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
DOWNLOAD_TIMEOUT = 180
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 400, 429, 403, 404]
# scrapy-scylla-proxies settings
# Enabled
SSP_ENABLED = True
# Location of the scylla server
SSP_SCYLLA_URI = 'http://localhost:8899'
# Proxy timeout in seconds
SSP_PROXY_TIMEOUT = 60
# Get only https proxies
SSP_HTTPS = True
Tips
I also find that rotating your user agent in combination with this middleware can be helpful in minimising failures due to being banned!
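User agent rotation is not part of this package, so the following is only a sketch of one way to do it with a small downloader middleware of your own. The class name 'RandomUserAgentSketch' and the user agent strings are hypothetical placeholders. Replace them with values appropriate for your crawler.

```python
import random

# Placeholder strings for illustration only.
USER_AGENTS = [
    "ExampleCrawler/1.0 (+https://example.com/bot)",
    "ExampleCrawler/1.1 (+https://example.com/bot)",
]


class RandomUserAgentSketch:
    """Illustrative downloader middleware that picks a User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        """Overwrite the User-Agent header with a random entry from the list."""
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request
```

To try it, the class would be saved in your project and registered in DOWNLOADER_MIDDLEWARES next to the proxy middleware, for example under a hypothetical path such as 'myproject.middlewares.RandomUserAgentSketch' with a priority like 400. That should run it before Scrapy's built-in UserAgentMiddleware, which as far as I know only fills in the header when none is set.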
Donate
If you like this middleware or it was helpful to you, you can always send me a small donation, even just a token amount. It will encourage me to keep developing this middleware and improving it!
Source Distribution
File details
Details for the file scrapy-scylla-proxies-0.5.0.5.tar.gz.
File metadata
- Download URL: scrapy-scylla-proxies-0.5.0.5.tar.gz
- Upload date:
- Size: 6.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.21.0 setuptools/39.0.1 requests-toolbelt/0.8.0 tqdm/4.28.1 CPython/3.6.6
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | d6081b2bf5addd79a2c69a622031dc61095b8e1127c4d79368c111df22d5e07c |
MD5 | d1bafca2548d39f58f44b55c871f13b4 |
BLAKE2b-256 | 941d077a917b3aacea6a017b7e9c15a56394b5025f566b4a147a4c96499ce76b |