All-in-one tools for financial analysis
Project description
Author : Yoshio Yamauchi 山内義生 == SPARKLE
Twitter : @sparkle_twtt
Medium : @sparkle_mdm
Email : sparkle.official.01@gmail.com
You can ask me whatever about the usage of scrapingtools
Description
This module helps you scrape the Internet without revealing your identity. It also features parallel processing allowing you to send requests concurrently over several threads. All programs are written in Python3 and you need Ubuntu-18.04 or later.
The anonymity is backed by the Tor network. The tor network is a free proxy chain network available for anyone without any registration. We assume that you use a Linux operation system, and if you do, it's not that difficult to set up the tor. I'll show you that below.
The threading is done by a python module "multiprocessing". It's different from a similar module "threading" in a meaning that "multiprocessing" actually splits tasks over multiple cores and run them concurrently, while "threading" is just a pseudo parallelization.
Required System
Ubuntu-18.04 or later
Python3
Python Dependencies
stem
, random-user-agent
, numpy
, requests_html
, lxml
, requests
, bs4
Install Tor and Privoxy
install tor and start
$ sudo apt update
$ sudo apt install tor
$ sudo srvice tor start
change password of tor
$ kill $(pidof tor)
$ sudo bash -c 'echo "ControlPort 9051" >> /etc/tor/torrc'
$ sudo bash -c 'echo HashedControlPassword $(tor --hash-password "password" | tail -n 1) >> /etc/tor/torrc'
$ sudo service tor restrat
install privoxy
$ sudo apt update
$ sudo apt install privoxy
$ sudo bash -c 'echo "forward-socks5t / 127.0.0.1:9050 ." >> /etc/privoxy/config'
$ sudo service privoxy restart
Usage
definition
class AnonymizedConcurrentRequest():
def __init__(self, tor_password, proxies, port=9051, max_rpm=45, ipchange_interval=1,
num_processes=1, replace=True, verbose=False):
tor_password
: the password of the tor server
proxies
: the IP and port number of the tor server
port
: tor setup port (9051 as default)
max_rpm
: maximun number of requests sent per minute
ipcahge_interval
: interval of checking IP
num_processes
: number of subprocesses == degree of parallelization
replace
: if files already exists, then replace that with new ones
verbose
: show progress
runtest.py
Restart tor and privoxy
$ sudo /etc/init.d/tor restart
$ sudo /etc/init.d/privoxy restart
Import the module first
from scrapingtools import utils
Then give a dict of proxies, the setup port, and the password
PROXIES = {"https":"127.0.0.1:8118",
"http":"127.0.0.1:8118"} # default
PORT = 9051 # default
PROXY_PASSWORD="password" # default
The URLs are given as a list of lists, each of which is a pair of a URL and the destination file for saving
TASKS = [["results_apple.txt","https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch"],
["results_nvida.txt","https://finance.yahoo.com/quote/NVDA?p=NVDA&.tsrc=fin-srch"]]
Then run the program, giving the number of CPU cores, maximum number of requests sent per minute
ACR = utils.AnonymizedConcurrentRequest(PROXY_PASSWORD, max_rpm=60, ipchange_interval=1,
num_processes=1, replace=True, proxies=PROXIES,
port=PORT, verbose=True)
ACR.concurrent_request(TASKS)
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for financialanalysis-1.0.2-py3.9.egg
Algorithm | Hash digest | |
---|---|---|
SHA256 | 63b61ba11e2cd39f277883887a9418a9a59735c9c487d788341adf01acfedfd0 |
|
MD5 | 437c54c39455ca36c4616d4756715dd3 |
|
BLAKE2b-256 | 8736aeb8aa0211cd95b648126c62e310c559f168af55086d53fe2716d86b8f34 |