scrapy-scylla-proxies: Random proxy middleware for Scrapy that fetches valid proxies from Scylla.

Project description

Random proxy middleware for Scrapy

Using Scylla to fetch valid proxies.


NOTE: I am not a 'real' programmer, help always appreciated! But it works! ... for now.

Processes Scrapy requests using random proxies to avoid IP bans and improve crawling speed. It plugs into the Scylla project, which provides a local database of proxies.

Install & run Scylla

The Scylla project needs to be set up separately! The quickest way to do this is with the docker container. The following command will download and run Scylla (provided you have docker installed, of course).

docker run -d -p 8899:8899 -p 8081:8081 --name scylla wildcat/scylla:latest
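Once the container is up, you can sanity-check that Scylla is serving proxies before wiring it into Scrapy. A minimal sketch using only the Python standard library; the '/api/v1/proxies' path and query parameters are assumptions based on the Scylla docs, so adjust them if your Scylla version differs:

```python
import json
import urllib.request

# Location of the Scylla API (matches SSP_SCYLLA_URI below).
SCYLLA_URI = 'http://localhost:8899'


def parse_proxies(payload):
    """Turn Scylla's JSON payload into 'ip:port' strings."""
    return [f"{p['ip']}:{p['port']}" for p in payload.get('proxies', [])]


def fetch_proxies(base_uri=SCYLLA_URI, https_only=True, limit=5):
    """Ask the Scylla API for a few proxies.

    Endpoint path and parameters are assumed from the Scylla docs.
    """
    url = f"{base_uri}/api/v1/proxies?https={str(https_only).lower()}&limit={limit}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_proxies(json.load(resp))
```

If this prints a handful of 'ip:port' pairs, the middleware should be able to reach Scylla too.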

Install scrapy-scylla-proxies

The quick way:

pip install scrapy-scylla-proxies

Or checkout the source and run

python setup.py install

What to put in Scrapy's 'settings.py'

These are the settings you will need in order to integrate this middleware with Scrapy.

SSP_ENABLED - This MUST be set to True.

SSP_SCYLLA_URI - The location of the Scylla API (Default: 'http://localhost:8899').

SSP_PROXY_TIMEOUT - How often the proxy list is refreshed (Default: 60s).

SSP_HTTPS - Whether to only use HTTPS proxies. You will need this set to True if you are scraping an HTTPS site (Default: True).

SSP_SPLASH_REQUEST_ENABLED - Whether this middleware will need to set the proxy for a 'scrapy.Request' or a 'SplashRequest' (Default: False).

Example 'settings.py'

This sample is taken directly from one of my working scrapers; I used it to scrape approximately 15,000 items from a website without any 'bans'.

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # For retries
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 290,
    # For random scylla proxies
    'scrapy_scylla_proxies.random_proxy.RandomProxyMiddleware': 300,
    # For http proxy ip rotation
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

DOWNLOAD_TIMEOUT = 180
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 400, 429, 403, 404]

# scrapy-scylla-proxies settings
# Enabled
SSP_ENABLED = True
# Location of the scylla server
SSP_SCYLLA_URI = 'http://localhost:8899'
# Proxy timeout in seconds
SSP_PROXY_TIMEOUT = 60
# Get only https proxies
SSP_HTTPS = True
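Under the hood, proxy middlewares like this hand the chosen proxy to Scrapy's 'HttpProxyMiddleware' (priority 750 above) by setting 'request.meta['proxy']'. A simplified, illustrative sketch of that mechanism (not the packaged middleware itself, whose class names and internals may differ):

```python
import random


class RandomProxySketch:
    """Illustrative only: assign a random proxy to each outgoing request.

    Scrapy's built-in HttpProxyMiddleware then reads request.meta['proxy']
    and routes the request through that proxy.
    """

    def __init__(self, proxies):
        # e.g. a list refreshed from the Scylla API every SSP_PROXY_TIMEOUT seconds
        self.proxies = proxies

    def process_request(self, request, spider):
        # Setting this meta key is all that is needed; the http proxy
        # middleware later in the chain does the actual routing.
        request.meta['proxy'] = random.choice(self.proxies)
```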

Tips

I also find that rotating your user agent in combination with this middleware can be helpful in minimising failures due to being banned!
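A minimal sketch of such a user-agent rotation middleware; the class name and agent strings are illustrative, not part of this package. Register it in DOWNLOADER_MIDDLEWARES with a priority after Scrapy's built-in UserAgentMiddleware (500) so its header is not overwritten:

```python
import random

# A small pool of browser user agents (illustrative values; use a
# larger, up-to-date list in practice).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]


class RandomUserAgentMiddleware:
    """Pick a random User-Agent header for each request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
```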

Donate

If you like this middleware or it was helpful to you, you can always send me a small donation, even just a token amount. It will encourage me to keep developing and improving it! 🔥

Donate here!

