Easy multithreaded web scraper

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

MulTithreaded SCRAPER

Hello and welcome. This is the documentation for the mt_scraper library for Python 3.

Description

This project is a multithreaded website scraper. Multithreading speeds up data collection from the web several times over (more than 10 times on an ordinary old work laptop). To use it, override the parse method to suit your needs and enjoy the benefits of multithreading (with all the peculiarities of its implementation in Python).

Collected data is written to a JSON file, which stores a list of objects (dictionaries).

Application

Simple application

Main Library Usage Scenario

import mt_scraper

scraper = mt_scraper.Scraper()
scraper.run()

As you can see, there are only three lines of code.

What happens when this runs

With this usage, the scraper collects data from the pages in the following list:

url_components_list = [
    'http://example.com/',
    'http://scraper.iamengineer.ru',
    'http://scraper.iamengineer.ru/bad-file.php',
    'http://badlink-for-scarper.ru',
]

The last two pages were added to demonstrate the two most common errors when retrieving data from the Internet: HTTP 404 (Not Found) and URL Error (Name or service not known).
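The two error types above can be told apart with the standard library alone. The following is a minimal sketch, assuming the pages are fetched with urllib (the source does not say which HTTP client mt_scraper uses); the names fetch and describe_error are hypothetical and are not part of the library.

```python
import urllib.error
import urllib.request


def describe_error(exc):
    """Turn a urllib exception into a short message (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return 'HTTP {} - {}'.format(exc.code, exc.reason)
    if isinstance(exc, urllib.error.URLError):
        return 'URL Error: {}'.format(exc.reason)
    raise exc


def fetch(url, timeout=10):
    """Return (html, error_message); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8', errors='replace'), None
    except urllib.error.URLError as exc:  # also catches HTTPError
        return None, describe_error(exc)
```

A page that answers with status 404 raises HTTPError, while a host name that cannot be resolved raises a plain URLError, so a single except clause for URLError covers both cases.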

The actual URL is obtained by substituting each item of this list into a template:

url_template = '{}'
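To make the substitution concrete, here is a small sketch that assumes the template is filled in with str.format, which the '{}' placeholder suggests but the source does not confirm. The second template, page_template, is a hypothetical example and does not come from the library.

```python
# Default template from the text: each component is used as the full URL.
url_template = '{}'
url_components_list = [
    'http://example.com/',
    'http://scraper.iamengineer.ru',
]
urls = [url_template.format(component) for component in url_components_list]

# Hypothetical template: only the varying part of the URL is listed.
page_template = 'http://example.com/page/{}'
page_urls = [page_template.format(number) for number in range(1, 4)]
```

With the default template the resulting URLs are identical to the list items; with a longer template the list only needs to hold the part that changes from page to page.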

Data is accumulated in the file:

out_filename = 'out.json'
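Since the output is described as a JSON list of dictionaries, it can be read back with the standard json module. This is a minimal sketch; the helper name load_results and the UTF-8 encoding are assumptions, not something the library documents.

```python
import json


def load_results(path='out.json'):
    """Return the list of collected objects stored in the output file."""
    # UTF-8 is assumed here; adjust if the file uses another encoding.
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError('expected a JSON list of objects')
    return data
```

Each item of the returned list is one dictionary produced by the parse method.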

The work runs in 5 threads, and a task queue of length 5 is created. The queue length matters, for example, when an operation is cancelled from the keyboard: it indicates how many tasks had already been sent for execution.

threads_num = 5
queue_len = 5
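The interplay of the two settings can be illustrated with the standard threading and queue modules. This is only a sketch of the general worker-pool pattern with a bounded queue, not the library's actual implementation; run_pool and handle are hypothetical names.

```python
import queue
import threading

threads_num = 5  # number of worker threads, as in the text
queue_len = 5    # maximum number of tasks waiting in the queue


def run_pool(items, handle):
    """Process items in worker threads fed through a bounded queue."""
    tasks = queue.Queue(maxsize=queue_len)  # put() blocks while the queue is full
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this thread
                break
            value = handle(item)
            with lock:  # protect the shared result list
                results.append(value)

    workers = [threading.Thread(target=worker) for _ in range(threads_num)]
    for thread in workers:
        thread.start()
    for item in items:
        tasks.put(item)  # at most queue_len tasks wait at any moment
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for thread in workers:
        thread.join()
    return results


squares = run_pool(range(20), lambda n: n * n)
```

Because the queue is bounded, the producer never runs far ahead of the workers, so when a run is interrupted only a handful of tasks have been handed out. Results arrive in completion order, not in input order.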

The following is used as a parser function:

def parse(self, num, url_component, html):
    '''You must override this method.
    Must return a dictionary, or None if the page
    cannot be parsed.
    '''
    parser = MyDummyHTMLParser()
    parser.feed(html)
    obj = parser.obj
    obj['url_component'] = url_component
    return obj
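A custom parse method follows the same contract: return a dictionary, or None when the page yields nothing useful. The sketch below extracts the page title using only the standard library. TitleParser is a hypothetical helper, and the function is written standalone so it can be tried without the library; placing it in a subclass of mt_scraper.Scraper is the assumed way to use it.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            # Text may arrive in several chunks, so accumulate it.
            self.title = (self.title or '') + data


def parse(self, num, url_component, html):
    """Same signature as the method above; returns a dictionary or None."""
    parser = TitleParser()
    parser.feed(html)
    if parser.title is None:
        return None  # nothing to store for this page
    return {'url_component': url_component, 'title': parser.title.strip()}
```

Returning None for pages without a title keeps empty records out of the collected data, under the assumption that the library skips None results as the docstring above implies.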

DummyParser is a simple HTML parser. It is notable only because it relies solely on the standard library and requires no additional modules. File dummy_parser.py:

from html.parser import HTMLParser

class MyDummyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Flags that track which tag is currently open
        self.a_tag = False
        self.h1_tag = False
        self.p_tag = False
        # Collected data for the page
        self.obj = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            self.h1_tag = True
        elif tag == 'p':
            self.p_tag = True
        elif tag == 'a':
            self.a_tag = True
            for attr, value in attrs:
                if attr == 'href':
                    self.obj['link'] = value

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.h1_tag = False
        elif tag == 'p':
            self.p_tag = False
        elif tag == 'a':
            self.a_tag = False

    def handle_data(self, data):
        if self.h1_tag:
            self.obj['header'] = data
        elif self.p_tag and not self.a_tag:
            # Paragraph text outside of links
            self.obj['article'] = data

This approach is used only to demonstrate the capabilities of multithreading. In real projects it is recommended to use the lxml or BeautifulSoup (BS) libraries. A more advanced application is shown below in the section "Advanced Application".

Project details

Release history

- 0.4.0 (this version): Jul 28, 2019
- 0.3.5: Jul 18, 2019
- 0.3.4: Jul 18, 2019
- 0.3.3: Jul 16, 2019
- 0.3.2: Jul 4, 2019
- 0.3.0: Jul 3, 2019
- 0.2.3: Jul 1, 2019
- 0.2.2: Jun 29, 2019
- 0.2.1: Jun 29, 2019
- 0.2: Jun 29, 2019

Download files

Source Distribution

mt_scraper-0.4.0.tar.gz (6.9 kB)

Uploaded Jul 28, 2019 Source

Built Distribution

mt_scraper-0.4.0-py3-none-any.whl (7.5 kB)

Uploaded Jul 28, 2019 Python 3

File details

Details for the file mt_scraper-0.4.0.tar.gz.

File metadata

Download URL: mt_scraper-0.4.0.tar.gz
Upload date: Jul 28, 2019
Size: 6.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Hashes for mt_scraper-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`9d7f6754dbc164f597b6efd283118c432e100866f4db73eecab92387580ee77b`
MD5	`eaf96aff9fa6af75477111a4492e3881`
BLAKE2b-256	`5c0ee1d6361cd2eb00346137b424f8d9dc411c6dbc022f4bf162b50c298c49bb`
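A downloaded file can be checked against the SHA256 digest listed above with the standard hashlib module. This is a minimal sketch; the helper name sha256_of is hypothetical, and the file is assumed to have been saved under its original name in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published for mt_scraper-0.4.0.tar.gz in the table above.
EXPECTED_SHA256 = '9d7f6754dbc164f597b6efd283118c432e100866f4db73eecab92387580ee77b'

# Usage, once the archive has been downloaded:
# assert sha256_of('mt_scraper-0.4.0.tar.gz') == EXPECTED_SHA256
```

Reading in chunks keeps memory use constant regardless of file size.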

File details

Details for the file mt_scraper-0.4.0-py3-none-any.whl.

File metadata

Download URL: mt_scraper-0.4.0-py3-none-any.whl
Upload date: Jul 28, 2019
Size: 7.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.8

File hashes

Hashes for mt_scraper-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a475060849a58159dd22600dad08fc807e310576dc6fa15cf576614923865d30`
MD5	`978ce3c9b53a7b8a525f730405292dca`
BLAKE2b-256	`9915b7687dbf880e32fe4b5c0ec29541d029415c0824f7427b7dc4e3b07273c6`
