IP Rotator for Scrapy via Tor
scrapy-tor-proxy-rotation
This module lets Scrapy rotate its outgoing IP address by routing requests through Tor.
Installation
Simple way to install, via pip:
pip install scrapy-tor-proxy-rotation
Configuring Tor
You need to configure Tor. First, install it:
sudo apt-get install tor
Stop its execution to perform configuration:
sudo service tor stop
Open the configuration file, /etc/tor/torrc, as root, for example using nano:
sudo nano /etc/tor/torrc
Insert the lines below and save:
ControlPort 9051
CookieAuthentication 0
Restart Tor:
sudo service tor start
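With the control port open and cookie authentication disabled as configured above, a client can ask Tor for a new circuit (and usually a new exit IP) by sending SIGNAL NEWNYM over the control protocol. The following is a minimal sketch using only the standard library. The function names are hypothetical, and this is not necessarily how scrapy-tor-proxy-rotation talks to Tor internally.

```python
import socket

CONTROL_HOST = "127.0.0.1"
CONTROL_PORT = 9051  # matches "ControlPort 9051" in /etc/tor/torrc


def newnym_commands():
    """Return the control-protocol lines that request a new identity."""
    # An empty AUTHENTICATE is assumed to be enough here because neither
    # cookie nor password authentication is configured.
    return [b'AUTHENTICATE ""\r\n', b"SIGNAL NEWNYM\r\n"]


def reply_ok(reply):
    """Tor answers a successful command with a line starting with 250."""
    return reply.startswith(b"250")


def request_new_identity(host=CONTROL_HOST, port=CONTROL_PORT, timeout=10):
    """Send both commands in order; return True only if each one succeeds."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for command in newnym_commands():
            conn.sendall(command)
            if not reply_ok(conn.recv(1024)):
                return False
    return True
```

Calling request_new_identity() while Tor is running should return True. Tor may rate-limit NEWNYM, so back-to-back calls do not necessarily yield a different exit IP each time.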
You can check your machine's IP and compare it with Tor's by doing the following:
- To see your machine's IP:
curl http://icanhazip.com/
- To see Tor's IP:
torify curl http://icanhazip.com/
Tor exposes a SOCKS proxy (by default at 127.0.0.1:9050), and Scrapy does not support SOCKS proxies. To get around this, an HTTP proxy is needed as an intermediary, in this case Privoxy.
Installing and configuring Privoxy:
- Install:
sudo apt install privoxy
- Stop its execution:
sudo service privoxy stop
- Configure it to use Tor by opening its configuration file:
sudo nano /etc/privoxy/config
- Add the following line (the trailing dot is part of the directive):
forward-socks5t / 127.0.0.1:9050 .
- Start it up:
sudo service privoxy start
By default, Privoxy listens on 127.0.0.1:8118
Test:
torify curl http://icanhazip.com/
curl -x 127.0.0.1:8118 http://icanhazip.com/
Both commands should print the same IP address.
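The same check can be scripted. The sketch below uses only the standard library to fetch the public IP once directly and once through Privoxy. The helper names are hypothetical, and it assumes Privoxy is listening on its default address.

```python
import urllib.request

PRIVOXY_URL = "http://127.0.0.1:8118"  # Privoxy's default listen address


def make_opener(proxy_url=None):
    """Return a urllib opener that is direct, or routed through proxy_url."""
    # An empty mapping disables proxies, including any set in the environment.
    proxies = {"http": proxy_url} if proxy_url else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def current_ip(opener, url="http://icanhazip.com/", timeout=30):
    """Fetch the public IP address as seen by the remote service."""
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def compare_ips():
    """Return (direct_ip, tor_ip); the two should differ when Tor is in use."""
    direct = current_ip(make_opener())
    via_tor = current_ip(make_opener(PRIVOXY_URL))
    return direct, via_tor
```

Calling compare_ips() requires network access and a running Tor and Privoxy. If both values come back identical, traffic is probably not going through Tor.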
How to use
After you have made these settings, you can now integrate Tor with Scrapy.
- Configure the middleware in your project's configuration file (settings.py):
DOWNLOADER_MIDDLEWARES = {
    ...,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'tor_ip_rotator.middlewares.TorProxyMiddleware': 100
}
- Enable the extension:
TOR_IPROTATOR_ENABLED = True
TOR_IPROTATOR_CHANGE_AFTER = # number of requests made on the same Tor IP address
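The priorities in DOWNLOADER_MIDDLEWARES matter because Scrapy calls each middleware's process_request in increasing order of that number, so a middleware at 100 runs before HttpProxyMiddleware at 110 and can set the proxy that HttpProxyMiddleware then applies. The sketch below shows the general shape of such a middleware. The class is a hypothetical stand-in, not the source of TorProxyMiddleware, and it uses a duck-typed request so it runs without Scrapy installed.

```python
from types import SimpleNamespace

PRIVOXY_URL = "http://127.0.0.1:8118"  # assumes Privoxy's default address


class ProxyViaPrivoxyMiddleware:
    """Hypothetical downloader middleware that routes requests via Privoxy."""

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware honours request.meta["proxy"].
        request.meta["proxy"] = PRIVOXY_URL
        return None  # returning None lets Scrapy continue processing


# Stand-in request object, used only to exercise the sketch.
fake_request = SimpleNamespace(meta={})
ProxyViaPrivoxyMiddleware().process_request(fake_request, spider=None)
print(fake_request.meta)
```

In a real project the request would be a scrapy.Request and the class would be registered in DOWNLOADER_MIDDLEWARES like the entries above.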
By default, an IP can be reused after 10 other uses. This value can be changed with the TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER setting, as below:
TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER = 0 # or another integer value
A value that is too large for TOR_IPROTATOR_ALLOW_REUSE_IP_AFTER can make finding a new usable IP slower, or make it impossible to find one at all. If the value is 0, no record of used IPs is kept.
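To make the interaction between the two settings concrete, here is a small sketch of the bookkeeping they imply. It assumes that "change after" counts requests on the current IP and that "allow reuse after" is the number of most recently used IPs that are rejected as candidates. The class and its method names are hypothetical and are not taken from the package.

```python
from collections import deque


class RotationPolicy:
    """Hypothetical bookkeeping for the two settings; not the package's code."""

    def __init__(self, change_after, allow_reuse_after=10):
        self.change_after = change_after
        # maxlen=0 keeps no history, mirroring "no record of used IPs".
        self.recent_ips = deque(maxlen=allow_reuse_after)
        self.requests_on_current_ip = 0

    def should_rotate(self):
        """Count one request; report whether the IP is due for a change."""
        self.requests_on_current_ip += 1
        return self.requests_on_current_ip >= self.change_after

    def accept(self, candidate_ip):
        """Accept a candidate IP unless it was used too recently."""
        if candidate_ip in self.recent_ips:
            return False
        self.recent_ips.append(candidate_ip)
        self.requests_on_current_ip = 0
        return True
```

Under these assumptions, a larger history window rejects more candidates. Since the number of exit IPs Tor hands out is limited, the search for an acceptable IP can take many attempts or never succeed, which matches the warning above.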