Easy multithreaded web scraper

Welcome! This is the documentation of the mt_scraper library for Python 3.

Description
This project is a multithreaded site scraper. Multithreading speeds up data collection from the Web several times over (more than tenfold on an ordinary old work laptop). To use it, override the parse method for your needs and enjoy the benefits of multithreading (with all the caveats of its implementation in Python).

The collected data is accumulated in a JSON file that stores a list of objects (dictionaries).
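For illustration only, here is what reading such a file back might look like; the field names ('header', 'article', 'link', 'url_component') are taken from the dummy parser example later in this document and are not fixed by the library:

```python
import json

# Illustrative sketch: the real out.json is produced by scraper.run();
# this sample record only shows the shape of the data.
sample = [
    {
        'header': 'Example Domain',
        'article': 'This domain is for use in illustrative examples.',
        'link': 'https://www.iana.org/domains/example',
        'url_component': 'http://example.com/',
    },
]
with open('out.json', 'w', encoding='utf-8') as f:
    json.dump(sample, f, ensure_ascii=False, indent=2)

# Reading the file back yields a plain list of dictionaries.
with open('out.json', encoding='utf-8') as f:
    records = json.load(f)
print(records[0]['header'])  # -> Example Domain
```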
Usage

Simple usage

The main usage scenario of the library:
```python
import mt_scraper

scraper = mt_scraper.Scraper()
scraper.run()
```
As you can see, there are only three lines of code.

What happens here

With this minimal usage you get a scraper that collects data from the pages in this list:
```python
url_components_list = [
    'http://example.com/',
    'http://scraper.iamengineer.ru',
    'http://scraper.iamengineer.ru/bad-file.php',
    'http://badlink-for-scarper.ru',
]
```
The last two URLs were added to demonstrate the two most common errors when retrieving data from the Internet: HTTP Error 404 (Not Found) and URLError (Name or service not known).
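mt_scraper handles these failures internally; purely for illustration, both kinds of error can be caught with the standard library's urllib exceptions. The fetch helper below is a hypothetical sketch, not part of the library:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the page HTML, or None when the URL cannot be retrieved."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode('utf-8', errors='replace')
    except HTTPError as err:   # the server answered with an error status, e.g. 404
        print('HTTP Error {} for {}'.format(err.code, url))
    except URLError as err:    # no usable answer at all, e.g. Name or service not known
        print('URL Error for {}: {}'.format(url, err.reason))
    return None
```

Note that HTTPError must be caught before URLError, since it is a subclass of it.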
The actual URL is obtained by substituting each list entry into a template:

```python
url_template = '{}'
```
Data is accumulated in the file:

```python
out_filename = 'out.json'
```
The work is carried out in 5 threads, and a task queue of 5 items is created (this matters, for example, when an operation is cancelled from the keyboard: the queue length determines how many tasks have already been dispatched for execution):

```python
threads_num = 5
queue_len = 5
```
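In practice these settings are overridden together in a Scraper subclass. A minimal sketch, assuming the names shown above are plain class attributes that a subclass may redefine (the import fallback only lets the snippet run where mt_scraper is not installed):

```python
try:
    import mt_scraper
    Base = mt_scraper.Scraper
except ImportError:   # mt_scraper not installed: use a stand-in base class
    Base = object     # so the configuration sketch below still runs

class MyScraper(Base):
    # Attribute names as shown above; the values are illustrative.
    url_template = '{}'
    url_components_list = [
        'http://example.com/',
    ]
    out_filename = 'out.json'
    threads_num = 5
    queue_len = 5
```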
The following method is used as the parser:

```python
def parse(self, num, url_component, html):
    '''You must override this method.

    It must return a dictionary, or None if the page
    cannot be parsed.
    '''
    parser = MyDummyHTMLParser()
    parser.feed(html)
    obj = parser.obj
    obj['url_component'] = url_component
    return obj
```
MyDummyHTMLParser is a simple HTML parser; it is notable only in that it uses a single standard-library module and requires no additional packages. File dummy_parser.py:
```python
from html.parser import HTMLParser


class MyDummyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.a_tag = False
        self.h1_tag = False
        self.p_tag = False
        self.obj = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.h1_tag = True
        elif tag == 'p':
            self.p_tag = True
        elif tag == 'a':
            self.a_tag = True
            for (attr, value) in attrs:
                if attr == 'href':
                    self.obj['link'] = value

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.h1_tag = False
        elif tag == 'p':
            self.p_tag = False
        elif tag == 'a':
            self.a_tag = False

    def handle_data(self, data):
        if self.h1_tag:
            self.obj['header'] = data
        elif self.p_tag and not self.a_tag:
            self.obj['article'] = data
```
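To see what this parser extracts, feed it a small HTML fragment. The class is repeated verbatim so the example runs stand-alone; the HTML snippet is made up for the demonstration:

```python
from html.parser import HTMLParser


class MyDummyHTMLParser(HTMLParser):
    # Same class as in dummy_parser.py above, repeated so this
    # example is self-contained.

    def __init__(self):
        super().__init__()
        self.a_tag = False
        self.h1_tag = False
        self.p_tag = False
        self.obj = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.h1_tag = True
        elif tag == 'p':
            self.p_tag = True
        elif tag == 'a':
            self.a_tag = True
            for (attr, value) in attrs:
                if attr == 'href':
                    self.obj['link'] = value

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.h1_tag = False
        elif tag == 'p':
            self.p_tag = False
        elif tag == 'a':
            self.a_tag = False

    def handle_data(self, data):
        if self.h1_tag:
            self.obj['header'] = data
        elif self.p_tag and not self.a_tag:
            self.obj['article'] = data


parser = MyDummyHTMLParser()
parser.feed('<h1>Example Domain</h1>'
            '<p>This domain is for use in examples. '
            '<a href="https://www.iana.org/domains/example">'
            'More information...</a></p>')
print(parser.obj['header'])  # -> Example Domain
print(parser.obj['link'])    # -> https://www.iana.org/domains/example
```

The link text itself is skipped because a_tag is set while the parser is inside the anchor, so only the surrounding paragraph text lands in 'article'.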
This approach is used here only to demonstrate the multithreading capabilities; for real projects the lxml or BeautifulSoup libraries are recommended. A more advanced example is shown below in the section "Advanced usage".
Project details

Download files
Details for the file mt_scraper-0.3.5.tar.gz

File metadata

- Download URL: mt_scraper-0.3.5.tar.gz
- Size: 6.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Algorithm | Hash digest
---|---
SHA256 | 17df48ac11b34896a25d637a0289e3873e603043b5f8ca4d7d6cc47f92e292dd
MD5 | e469403cea4c102baf9c2a992c4f66dc
BLAKE2b-256 | 473e22a8379f0facc76815e3cd62cc5ba0c297ddf182ddee18ab7c11a7fa98e9
Details for the file mt_scraper-0.3.5-py3-none-any.whl

File metadata

- Download URL: mt_scraper-0.3.5-py3-none-any.whl
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Algorithm | Hash digest
---|---
SHA256 | 9b4079e2060d732feeb45ffb52ed7c8c283d77bf26e5c3ee7f2a7d7762a617f9
MD5 | fe44e226c1b4d1c6ee8f615f6b3edbaa
BLAKE2b-256 | 6deabc0e3fd5ed7ca06f0b809531b9d970a358f77026b296605f9950c8357bbc