All-in-one tools for financial analysis

Project description

Author : Yoshio Yamauchi 山内義生 == SPARKLE
Twitter : @sparkle_twtt
Medium : @sparkle_mdm
Email : sparkle.official.01@gmail.com
Feel free to ask me anything about the usage of scrapingtools

Description

This module helps you scrape the Internet without revealing your identity. It also features parallel processing, allowing you to send requests concurrently from several processes. All programs are written in Python 3 and require Ubuntu 18.04 or later.

The anonymity is backed by the Tor network, a free network of chained relays that anyone can use without registration. We assume that you use a Linux operating system; if you do, setting up tor is not difficult. The steps are shown below.

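The parallelization is done with the Python module "multiprocessing". It differs from the similar module "threading" in that "multiprocessing" runs tasks in separate processes, which the operating system can schedule on multiple cores at the same time. Threads created with "threading" share a single interpreter and, in standard CPython builds, are serialized by the global interpreter lock, so they interleave Python code rather than run it in parallel.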
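As a minimal sketch of the process-based approach described above, the following spreads a list of tasks over worker processes with the standard-library "multiprocessing" module. The names fetch_stub and run_in_processes are hypothetical and not part of scrapingtools, the stub performs no network access, and the "fork" start method is an assumption that matches the Linux requirement stated in this README.

```python
import multiprocessing as mp
import os


def fetch_stub(task):
    """Stand-in for one download: returns (destination, URL length, worker PID)."""
    destination, url = task
    return destination, len(url), os.getpid()


def run_in_processes(tasks, num_processes=2):
    """Map the tasks over a pool of worker processes, preserving input order."""
    # Assumption: "fork" is available because the README targets Linux.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=num_processes) as pool:
        return pool.map(fetch_stub, tasks)


if __name__ == "__main__":
    demo_tasks = [
        ["results_a.txt", "https://example.com/a"],
        ["results_b.txt", "https://example.com/b"],
    ]
    for destination, url_length, pid in run_in_processes(demo_tasks):
        print(destination, url_length, "handled by PID", pid)
```

Each worker is a separate operating-system process with its own interpreter, which is why the reported PIDs can differ from the parent's PID.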

Required System

Ubuntu-18.04 or later
Python3

Python Dependencies

stem, random-user-agent, numpy, requests_html, lxml, requests, bs4

Install Tor and Privoxy

install tor and start

$ sudo apt update
$ sudo apt install tor
$ sudo service tor start

set the control port and control password of tor

$ sudo kill $(pidof tor)
$ sudo bash -c 'echo "ControlPort 9051" >> /etc/tor/torrc'
$ sudo bash -c 'echo HashedControlPassword $(tor --hash-password "password" | tail -n 1) >> /etc/tor/torrc'
$ sudo service tor restart
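The control port and password configured above are what let a program ask tor for fresh circuits, which usually results in a new exit IP. This README lists stem as the dependency for that job; the sketch below instead speaks tor's plain-text control protocol over a standard-library socket so that it stays self-contained. The function name request_new_identity is hypothetical, and the password is assumed to contain no double quotes or backslashes.

```python
import socket


def request_new_identity(password, host="127.0.0.1", port=9051, timeout=5.0):
    """Authenticate on the tor control port, then send SIGNAL NEWNYM.

    Returns True only if tor answers both commands with a 250 status line.
    """
    # Assumption: the password needs no escaping inside the quoted string.
    commands = ['AUTHENTICATE "{}"'.format(password), "SIGNAL NEWNYM"]
    with socket.create_connection((host, port), timeout=timeout) as conn, \
            conn.makefile("rb") as reader:
        for command in commands:
            conn.sendall(command.encode("ascii") + b"\r\n")
            reply = reader.readline().decode("ascii", "replace").strip()
            if not reply.startswith("250"):
                # Stop at the first failure, e.g. a rejected password.
                return False
    return True
```

With tor running as configured above, request_new_identity("password") would be the call to make. Tor may not build new circuits immediately if the signal is sent too often, so a True result is not a guarantee of a new IP.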

install privoxy

$ sudo apt update
$ sudo apt install privoxy
$ sudo bash -c 'echo "forward-socks5t / 127.0.0.1:9050 ." >> /etc/privoxy/config'
$ sudo service privoxy restart

Usage

definition

class AnonymizedConcurrentRequest():
   def __init__(self, tor_password, proxies, port=9051, max_rpm=45, ipchange_interval=1,
                 num_processes=1, replace=True, verbose=False):

tor_password : the control password of the tor server
proxies : the IP address and port number of the proxy in front of the tor server (Privoxy in the setup above)
port : tor control port (9051 by default)
max_rpm : maximum number of requests sent per minute
ipchange_interval : interval of checking IP
num_processes : number of subprocesses == degree of parallelization
replace : if the destination files already exist, replace them with new ones

runtest.py

Restart tor and privoxy

$ sudo /etc/init.d/tor restart
$ sudo /etc/init.d/privoxy restart

Import the module first

from scrapingtools import utils

Then define the dict of proxies, the tor control port, and the control password

PROXIES = {"https":"127.0.0.1:8118",
           "http":"127.0.0.1:8118"} # default
PORT = 9051 # default
PROXY_PASSWORD="password" # default

The tasks are given as a list of lists, each of which is a pair of the destination file for saving and the URL, in that order

TASKS = [["results_apple.txt","https://finance.yahoo.com/quote/AAPL?p=AAPL&.tsrc=fin-srch"],
         ["results_nvidia.txt","https://finance.yahoo.com/quote/NVDA?p=NVDA&.tsrc=fin-srch"]]
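The task format pairs naturally with the replace option. The sketch below shows one plausible interpretation, in which replace=False skips any task whose destination file already exists. The helper pending_tasks is hypothetical and this behaviour is an assumption, not something confirmed by the scrapingtools source.

```python
import os


def pending_tasks(tasks, replace=True):
    """Return the (destination, url) pairs that still need to be downloaded."""
    pending = []
    for destination, url in tasks:
        # Assumption: with replace=False, existing files are left untouched.
        if replace or not os.path.exists(destination):
            pending.append((destination, url))
    return pending
```

Filtering before dispatch means that skipped files cost no requests and do not count against the per-minute limit.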

Then run the program, giving the number of processes, the maximum number of requests sent per minute, and the remaining settings

ACR = utils.AnonymizedConcurrentRequest(PROXY_PASSWORD, max_rpm=60, ipchange_interval=1,
                                        num_processes=1, replace=True, proxies=PROXIES,
                                        port=PORT, verbose=True)
ACR.concurrent_request(TASKS)
