IP Rotator for Scrapy via Tor

scrapy-tor-proxy-rotation

This module lets Scrapy rotate its outgoing IP address via Tor.

Installation

Install via pip:

pip install scrapy-tor-proxy-rotation

Configuring Tor

Tor must be installed and configured before use. First, install it:

sudo apt-get install tor

Stop the service before editing its configuration:

sudo service tor stop

Open the configuration file, /etc/tor/torrc, as root, for example using nano:

sudo nano /etc/tor/torrc

Insert the lines below and save:

ControlPort 9051
CookieAuthentication 0

Restart Tor:

sudo service tor start
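
With ControlPort 9051 open and cookie authentication disabled, a local client can ask Tor for a fresh circuit, which is presumably what IP rotation relies on. The sketch below is not this package's code: it speaks the Tor control protocol directly over a socket using only the standard library, and it assumes no HashedControlPassword is set, so an empty AUTHENTICATE is accepted. The helper names (is_ok, send_command, request_new_identity) are hypothetical.

```python
import socket

CONTROL_HOST = "127.0.0.1"
CONTROL_PORT = 9051  # matches "ControlPort 9051" in /etc/tor/torrc


def is_ok(reply_line: bytes) -> bool:
    """Return True when a control-port reply line reports success (status 250)."""
    return reply_line.startswith(b"250")


def send_command(stream, command: str) -> bytes:
    """Send one CRLF-terminated command and return the first reply line."""
    stream.write(command.encode("ascii") + b"\r\n")
    stream.flush()
    return stream.readline()


def request_new_identity(host: str = CONTROL_HOST, port: int = CONTROL_PORT,
                         timeout: float = 10.0) -> bool:
    """Authenticate with an empty credential, then send SIGNAL NEWNYM.

    Assumes "CookieAuthentication 0" and no control password, as configured above.
    Returns True only if Tor answers 250 to both commands.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rwb") as stream:
            if not is_ok(send_command(stream, 'AUTHENTICATE ""')):
                return False
            return is_ok(send_command(stream, "SIGNAL NEWNYM"))
```

Calling request_new_identity() while Tor is running should return True. Tor may rate-limit NEWNYM requests, so the exit IP does not necessarily change on every call.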

To confirm that Tor is working, compare your machine's IP with the one seen through Tor:

  • To see your machine's IP:
    curl http://icanhazip.com/
    
  • To see Tor's IP:
    torify curl http://icanhazip.com/   
    

Tor exposes a SOCKS proxy, which Scrapy does not support. To get around this, use an HTTP proxy as an intermediary, in this case Privoxy.

By default, the Tor proxy server listens on 127.0.0.1:9050.

Installing and configuring Privoxy:

  • Install:
    sudo apt install privoxy
    
  • Stop its execution:
    sudo service privoxy stop
    
  • Configure it to use Tor. Open its configuration file:
    sudo nano /etc/privoxy/config
    
  • Add the following line (the trailing dot is part of the directive):
    forward-socks5t / 127.0.0.1:9050 .
    
  • Start it up:
    sudo service privoxy start
    

By default, Privoxy listens on 127.0.0.1:8118.

Test:

torify curl http://icanhazip.com/
curl -x 127.0.0.1:8118 http://icanhazip.com/

Both commands should print the same IP address.
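
The same check can be scripted. The following is a minimal sketch using only the Python standard library, assuming Privoxy is on its default address and that icanhazip.com answers with a bare IP address; the function names are hypothetical and not part of this package.

```python
import urllib.request

PRIVOXY_URL = "http://127.0.0.1:8118"  # Privoxy's default listen address
IP_ECHO_URL = "http://icanhazip.com/"  # same echo service used in the curl checks


def make_proxy_opener(proxy_url: str = PRIVOXY_URL) -> urllib.request.OpenerDirector:
    """Build an opener that sends plain-HTTP requests through the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def make_direct_opener() -> urllib.request.OpenerDirector:
    """Build an opener that ignores any proxy, including environment settings."""
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_ip(opener: urllib.request.OpenerDirector, url: str = IP_ECHO_URL,
             timeout: float = 30.0) -> str:
    """Return the IP address reported by the echo service, stripped of whitespace."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def tor_is_in_use() -> bool:
    """True when the IP seen through Privoxy differs from the machine's own IP."""
    return fetch_ip(make_proxy_opener()) != fetch_ip(make_direct_opener())
```

If tor_is_in_use() returns False, requests sent to Privoxy are probably not being forwarded to Tor.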

How to use

With Tor and Privoxy configured, you can integrate Tor with Scrapy.

  • Configure the middleware in your project's configuration file (settings.py):

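    DOWNLOADER_MIDDLEWARES = {
        # ... your other middlewares ...
        # Lower numbers run first: TorProxyMiddleware (100) sees each request before HttpProxyMiddleware (110).
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        'tor_ip_rotator.middlewares.TorProxyMiddleware': 100,
    }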
    
  • Enable the extension:

    TOR_IPROTATOR_ENABLED = True
    TOR_IPROTATOR_CHANGE_AFTER = # number of requests to make on the same Tor IP address
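
To illustrate how a setting like TOR_IPROTATOR_CHANGE_AFTER could drive rotation, here is a rough sketch of a downloader middleware that counts requests and asks for a new IP once the limit is reached. It is not the package's TorProxyMiddleware: the class name, the rotate callable, and FakeRequest are hypothetical, and the real implementation may differ. The only Scrapy conventions relied on are the process_request(request, spider) hook and the request.meta["proxy"] key that HttpProxyMiddleware reads.

```python
class RotatingProxyMiddleware:
    """Illustrative downloader middleware; NOT the package's TorProxyMiddleware."""

    def __init__(self, proxy_url, change_after, rotate):
        self.proxy_url = proxy_url        # e.g. Privoxy at http://127.0.0.1:8118
        self.change_after = change_after  # requests allowed on one IP
        self.rotate = rotate              # hypothetical callable that asks Tor for a new IP
        self.requests_on_current_ip = 0

    def process_request(self, request, spider):
        # Once the current IP has served its quota, rotate before sending more.
        if self.requests_on_current_ip >= self.change_after:
            self.rotate()
            self.requests_on_current_ip = 0
        self.requests_on_current_ip += 1
        # HttpProxyMiddleware picks the proxy up from request.meta["proxy"].
        request.meta["proxy"] = self.proxy_url
        return None  # let Scrapy continue processing the request


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy installed."""

    def __init__(self):
        self.meta = {}


rotations = []
middleware = RotatingProxyMiddleware(
    "http://127.0.0.1:8118",
    change_after=2,
    rotate=lambda: rotations.append("new identity"),
)
requests = [FakeRequest() for _ in range(5)]
for request in requests:
    middleware.process_request(request, spider=None)

# Five requests with change_after=2: rotation happens before the 3rd and 5th.
print(len(rotations))  # 2
```

Injecting rotate as a callable keeps the counting logic separate from how Tor is actually signalled, which makes it easy to test without a running Tor instance.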
    

By default, an IP can be reused only after 10 other IPs have been used. To change this, set TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER, as below:

TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER = 0  # or another integer value

If TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER is too large, finding a new IP may become slow or fail altogether. If the value is 0, no record of used IPs is kept.
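
The reuse rule can be pictured as a fixed-size window of recently used IPs. The sketch below is an assumption about the behaviour described above, not the package's code; RecentIPs, pick_new_ip, and draw_ip are hypothetical names, and the addresses come from the documentation range 192.0.2.0/24.

```python
from collections import deque


class RecentIPs:
    """Remember the last `allow_reuse_after` IPs; older ones may be reused."""

    def __init__(self, allow_reuse_after: int):
        # deque(maxlen=0) discards every append, so a value of 0 keeps no record.
        self._recent = deque(maxlen=allow_reuse_after)

    def is_allowed(self, ip: str) -> bool:
        return ip not in self._recent

    def record(self, ip: str) -> None:
        self._recent.append(ip)


def pick_new_ip(draw_ip, recent: RecentIPs, max_attempts: int = 5):
    """Draw candidate IPs until one is allowed; return None after max_attempts."""
    for _ in range(max_attempts):
        ip = draw_ip()
        if recent.is_allowed(ip):
            recent.record(ip)
            return ip
    return None


recent = RecentIPs(allow_reuse_after=2)
recent.record("192.0.2.1")
recent.record("192.0.2.2")
print(recent.is_allowed("192.0.2.1"))  # False: still inside the window
recent.record("192.0.2.3")
print(recent.is_allowed("192.0.2.1"))  # True: pushed out by two newer IPs
```

When the window is larger than the number of distinct IPs that can actually be drawn, every candidate is rejected and pick_new_ip returns None, which mirrors the "not find one at all" case.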
