IP Rotator for Scrapy via Tor

scrapy-tor-proxy-rotation

This module lets Scrapy rotate its outgoing IP address by routing requests through Tor.

Installation

Install with pip:

pip install scrapy-tor-proxy-rotation

Configuring Tor

You need to configure Tor. First, install it:

sudo apt-get install tor

Stop the service before editing its configuration:

sudo service tor stop

Open the Tor configuration file, /etc/tor/torrc, as root, for example with nano:

sudo nano /etc/tor/torrc

Insert the lines below and save:

ControlPort 9051
CookieAuthentication 0

Restart Tor:

sudo service tor start
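
The ControlPort opened above is what lets a client ask Tor for a fresh circuit, and therefore usually a new exit IP. As a rough illustration only (this is not taken from this package's source), the sketch below speaks the Tor control protocol over a plain socket. It assumes the control port is 9051 and that no credentials are required, as configured above; the function names are hypothetical.

```python
import socket


def build_newnym_commands() -> bytes:
    """Commands sent to the control port: authenticate with an empty
    credential, request a new identity, then close the session."""
    return b'AUTHENTICATE ""\r\nSIGNAL NEWNYM\r\nQUIT\r\n'


def replies_ok(raw_reply: str) -> bool:
    """True if the replies to AUTHENTICATE and SIGNAL both carry status 250."""
    lines = [line for line in raw_reply.splitlines() if line.strip()]
    return len(lines) >= 2 and all(line.startswith("250") for line in lines[:2])


def request_new_identity(host: str = "127.0.0.1", port: int = 9051,
                         timeout: float = 5.0) -> bool:
    """Ask a local Tor instance for a new circuit (hypothetical helper)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_newnym_commands())
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # Tor closes the connection after QUIT
                break
            chunks.append(data)
    return replies_ok(b"".join(chunks).decode("ascii", errors="replace"))
```

With Tor running, calling `request_new_identity()` should return True. Tor rate-limits this signal, so repeated calls in quick succession may not yield a different IP each time.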

You can check your machine's IP and compare it with Tor's by doing the following:

  • To see your machine's IP:
    curl http://icanhazip.com/
    
  • To see Tor's IP:
    torify curl http://icanhazip.com/
    

Tor exposes a SOCKS proxy, which Scrapy does not support. To get around this, you need an intermediary HTTP proxy that forwards to Tor, in this case Privoxy.

By default, the Tor proxy server listens at 127.0.0.1:9050.

Installing and configuring Privoxy:

  • Install:
    sudo apt install privoxy
    
  • Stop its execution:
    sudo service privoxy stop
    
  • Configure it to use Tor. Open its configuration file:
    sudo nano /etc/privoxy/config
    
  • Add the following line:
    forward-socks5t / 127.0.0.1:9050 .
    
  • Start it up:
    sudo service privoxy start
    

By default, Privoxy listens at 127.0.0.1:8118.
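
Before testing with curl, it can help to confirm that the services are actually listening on the ports named in this guide. A minimal sketch, assuming the default ports (9050, 9051, 8118) and a local setup; the helper names are illustrative only.

```python
import socket

# Ports named in this guide; adjust them if your configuration differs.
SERVICES = {"Tor SOCKS proxy": 9050, "Tor control port": 9051, "Privoxy": 8118}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def check_services(host: str = "127.0.0.1") -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in SERVICES.items()}
```

Printing `check_services()` shows which of the three ports accept connections. A False entry usually means the corresponding service is stopped or configured on a different port.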

Test:

torify curl http://icanhazip.com/
curl -x 127.0.0.1:8118 http://icanhazip.com/

Both commands should print the same IP address, which confirms that Privoxy is forwarding traffic through Tor.
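
The same kind of check can be scripted. The sketch below uses only the standard library and assumes Privoxy is reachable at http://127.0.0.1:8118 and that icanhazip.com returns the caller's IP as plain text; the function names are hypothetical. It compares the IP seen through Privoxy with the machine's direct IP.

```python
import urllib.request

IP_ECHO_URL = "http://icanhazip.com/"
PRIVOXY_URL = "http://127.0.0.1:8118"  # assumed default Privoxy address


def make_opener(proxy_url=None):
    """Build an opener that uses proxy_url, or connects directly if it is None."""
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else {}
    # An explicit (possibly empty) mapping overrides proxies from the environment.
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def fetch_ip(proxy_url=None, timeout=30.0):
    """Return the public IP reported by the echo service."""
    with make_opener(proxy_url).open(IP_ECHO_URL, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def privoxy_hides_real_ip():
    """True if the IP seen through Privoxy differs from the direct IP."""
    return fetch_ip(PRIVOXY_URL) != fetch_ip()
```

If `privoxy_hides_real_ip()` returns False, requests sent through Privoxy are most likely not reaching Tor.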

How to use

Once Tor and Privoxy are configured, you can integrate Tor with Scrapy.

  • Configure the middleware in your project's configuration file (settings.py):

    DOWNLOADER_MIDDLEWARES = {
        ...,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'tor_ip_rotator.middlewares.TorProxyMiddleware': 100
    }
    
  • Enable the extension:

    TOR_IPROTATOR_ENABLED = True
    TOR_IPROTATOR_CHANGE_AFTER = # number of requests made from the same Tor IP address
    

By default, an IP can be reused only after 10 other IPs have been used. This value can be changed with the TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER setting, as below:

TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER = 0  # or another integer value

If TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER is too large, obtaining a new IP may become slow, or no acceptable IP may be found at all. If the value is 0, no record of used IPs is kept.
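
To make the interaction between the two settings concrete, here is a small sketch of the bookkeeping they describe. It is an assumption about the behaviour, not this package's actual code: the class and method names are invented, and the IPs are documentation addresses.

```python
from collections import deque


class IpRotationTracker:
    """Hypothetical model of the TOR_IPROTATOR_* settings described above."""

    def __init__(self, change_after: int, allow_reuse_after: int = 10):
        self.change_after = change_after
        self.requests_on_current_ip = 0
        # Remembers the most recently used IPs; maxlen=0 keeps no record.
        self.recent_ips = deque(maxlen=allow_reuse_after)

    def should_rotate(self) -> bool:
        """Count one request; True once the current IP has served change_after requests."""
        self.requests_on_current_ip += 1
        return self.requests_on_current_ip >= self.change_after

    def accept(self, candidate_ip: str) -> bool:
        """Accept a newly obtained IP unless it was used too recently."""
        if candidate_ip in self.recent_ips:
            return False
        self.recent_ips.append(candidate_ip)
        self.requests_on_current_ip = 0
        return True


tracker = IpRotationTracker(change_after=2, allow_reuse_after=2)
tracker.accept("192.0.2.1")
print(tracker.accept("192.0.2.1"))  # False: still in the recent-IP record
```

In this model, every rejected candidate means asking Tor for yet another circuit. That is one plausible reason why a large reuse window can slow rotation down or prevent it from finding a usable IP.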
