
All-in-one tools for financial analysis

Project description

Author : Yoshio Yamauchi 山内義生 == SPARKLE
Twitter : @sparkle_twtt
Medium : @sparkle_mdm
Email : sparkle.official.01@gmail.com
Feel free to ask me anything about using scrapingtools.

Description

This module helps you scrape the Internet without revealing your identity. It also supports parallel processing, so you can send requests concurrently from several processes. All programs are written in Python 3, and you need Ubuntu 18.04 or later.

Anonymity is provided by the Tor network, a free network of relays that anyone can use without registration. We assume that you use a Linux operating system; if you do, setting up Tor is not difficult. The steps are shown below.

Parallelization is done with the Python module "multiprocessing". It differs from the similar module "threading" in that "multiprocessing" starts separate processes, which the operating system can schedule on multiple cores at the same time. With "threading", all threads live in one process, and CPython's global interpreter lock lets only one of them execute Python bytecode at a time.
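To make the difference concrete, here is a minimal sketch of spreading work over worker processes with the standard library. It is not taken from scrapingtools; the helper names (`parallel_map`, `square`, `worker_pid`) are made up for illustration, and the "fork" start method is an assumption that matches the Ubuntu requirement above.

```python
import multiprocessing as mp
import os


def square(x):
    """A trivial task that stands in for real work."""
    return x * x


def worker_pid(_):
    """Return the PID of the process that executed this call."""
    return os.getpid()


def parallel_map(func, items, num_processes=2):
    """Apply func to every item using a pool of worker processes."""
    # Assumption: "fork" is available (Linux). It is not available on Windows.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=num_processes) as pool:
        # pool.map keeps the results in the same order as the inputs.
        return pool.map(func, items)


if __name__ == "__main__":
    print(parallel_map(square, [1, 2, 3, 4]))  # [1, 4, 9, 16]
    # The tasks run in worker processes, never in the parent process.
    pids = set(parallel_map(worker_pid, range(8)))
    print(os.getpid() not in pids)  # True
```

Each worker is a separate operating-system process with its own interpreter, which is why the reported PIDs differ from the parent's.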

Required System

Ubuntu-18.04 or later
Python3

Python Dependencies

stem, random-user-agent, numpy, requests_html, lxml, requests, bs4

Install Tor and Privoxy

install tor and start

$ sudo apt update
$ sudo apt install tor
$ sudo service tor start

set the control port and password of tor

$ kill $(pidof tor)
$ sudo bash -c 'echo "ControlPort 9051" >> /etc/tor/torrc'
$ sudo bash -c 'echo HashedControlPassword $(tor --hash-password "password" | tail -n 1) >> /etc/tor/torrc'
$ sudo service tor restart
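After these commands, /etc/tor/torrc should contain both a ControlPort line and a HashedControlPassword line. The following is a small sketch for checking that from Python. The function names are hypothetical and not part of scrapingtools, and the hash in the usage example is a placeholder, not a real one. It relies only on the fact that `tor --hash-password` prints a value starting with `16:`.

```python
def read_torrc_options(text):
    """Map lower-cased option names to the list of values given for them."""
    options = {}
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        name = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        options.setdefault(name, []).append(value)
    return options


def control_port_configured(text, port="9051"):
    """True if the control port and a hashed password are both set."""
    options = read_torrc_options(text)
    has_port = port in options.get("controlport", [])
    hashes = options.get("hashedcontrolpassword", [])
    has_password = any(h.startswith("16:") for h in hashes)
    return has_port and has_password


if __name__ == "__main__":
    # Placeholder content; on a real machine, read /etc/tor/torrc instead.
    sample = "ControlPort 9051\nHashedControlPassword 16:PLACEHOLDER\n"
    print(control_port_configured(sample))  # True
```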

install privoxy

$ sudo apt update
$ sudo apt install privoxy
$ sudo bash -c 'echo "forward-socks5t / 127.0.0.1:9050 ." >> /etc/privoxy/config'
$ sudo service privoxy restart

Usage

definition

class AnonymizedConcurrentRequest():
    def __init__(self, tor_password, proxies, port=9051, max_rpm=45, ipchange_interval=1,
                 num_processes=1, replace=True, verbose=False):

tor_password : the control password of the tor server
proxies : the IP and port number of the proxy (privoxy) that forwards requests to the tor server
port : tor control port (9051 by default)
max_rpm : maximum number of requests sent per minute
ipchange_interval : interval at which the IP is checked
num_processes : number of subprocesses == degree of parallelization
replace : if an output file already exists, replace it with the new one
verbose : show progress
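How the package enforces max_rpm internally is not documented here. One simple way to read the parameter is as a minimum gap of 60 / max_rpm seconds between consecutive requests, so the default of 45 means about 1.33 seconds. The `RateLimiter` class below is a hypothetical sketch of that idea, not the package's implementation, and it limits a single process only.

```python
import time


class RateLimiter:
    """Space out calls so that at most max_rpm of them happen per minute."""

    def __init__(self, max_rpm, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds between two consecutive calls.
        self.min_interval = 60.0 / max_rpm
        # clock and sleep are injectable so the class can be tested quickly.
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    limiter = RateLimiter(max_rpm=600)  # 0.1 s between calls
    for i in range(3):
        limiter.wait()  # call this right before sending each request
        print("request", i)
```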

runtest.py

Restart tor and privoxy

$ sudo /etc/init.d/tor restart
$ sudo /etc/init.d/privoxy restart

Import the module first

from scrapingtools import utils

Then give a dict of proxies, the setup port, and the password

PROXIES = {"https":"127.0.0.1:8118",
           "http":"127.0.0.1:8118"} # default
PORT = 9051 # default
PROXY_PASSWORD="password" # default
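For orientation, the control port and password are the values a client needs to ask Tor for a new circuit, and therefore usually a new exit IP. The sketch below shows how that is commonly done with stem, which is listed in the dependencies. Whether scrapingtools does exactly this is an assumption; `request_new_identity` and `renew_tor_ip` are hypothetical names.

```python
def request_new_identity(controller, password, signal):
    """Authenticate on the control port, then send the given signal."""
    controller.authenticate(password=password)
    controller.signal(signal)


def renew_tor_ip(password="password", port=9051):
    """Ask the local Tor daemon to switch to fresh circuits."""
    # Imported lazily so that the helper above works without stem installed.
    from stem import Signal
    from stem.control import Controller

    # Connects to 127.0.0.1 on the control port set in /etc/tor/torrc.
    with Controller.from_port(port=port) as controller:
        request_new_identity(controller, password, Signal.NEWNYM)
```

Calling `renew_tor_ip()` requires a running Tor daemon configured as described in the installation steps. Tor may delay honoring NEWNYM if it is requested too frequently.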

The tasks are given as a list of lists. Each inner list is a pair: the destination file for saving comes first, and the URL comes second

TASKS = [["results_apple.txt","https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch"],
         ["results_nvidia.txt","https://finance.yahoo.com/quote/NVDA?p=NVDA&.tsrc=fin-srch"]]
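For longer lists it may be easier to generate the pairs than to type them. This is a small sketch that reuses the URL pattern from the example above. `build_tasks` is a hypothetical helper, not part of scrapingtools, and it names files after the ticker symbol rather than the company.

```python
from urllib.parse import urlencode

# URL pattern copied from the example tasks above.
BASE_URL = "https://finance.yahoo.com/quote/{symbol}?{query}"


def build_tasks(symbols, prefix="results_"):
    """Return [destination_file, url] pairs, one per ticker symbol."""
    tasks = []
    for symbol in symbols:
        query = urlencode({"p": symbol, ".tsrc": "fin-srch"})
        url = BASE_URL.format(symbol=symbol, query=query)
        tasks.append([f"{prefix}{symbol.lower()}.txt", url])
    return tasks


if __name__ == "__main__":
    for filename, url in build_tasks(["AAPL", "NVDA"]):
        print(filename, url)
```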

Then run the program, giving the number of processes (num_processes), the maximum number of requests sent per minute (max_rpm), and the other options

ACR = utils.AnonymizedConcurrentRequest(PROXY_PASSWORD, max_rpm=60, ipchange_interval=1,
                                        num_processes=1, replace=True, proxies=PROXIES,
                                        port=PORT, verbose=True)
ACR.concurrent_request(TASKS)
