
Parallel HTML Scraper


Helps you scrape HTML files in parallel without async / await syntax.

Feature

This project helps you scrape HTML files in parallel without having to write async / await syntax yourself.

Installation

pip install parallelhtmlscraper

Usage

Minimal example:

from bs4 import BeautifulSoup

from parallelhtmlscraper.html_analyzer import HtmlAnalyzer
from parallelhtmlscraper.parallel_html_scraper import ParallelHtmlScraper

class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text

host_google = 'https://www.google.com'
path_and_content = [
    '',                                                           # Google Search
    '/imghp?hl=EN',                                               # Google Images
    '/shopping?hl=en',                                            # Google Shopping
    '/save',                                                      # Collection
    'https://www.google.com/maps?hl=en',                          # Google Maps
    'https://www.google.com/drive/apps.html',                     # Google drive
    'https://www.google.com/mail/help/intl/en/about.html?vm=r',   # GMail
]

list_response = ParallelHtmlScraper.execute(host_google, path_and_content, AnalyzerExample())
print(list_response)

Running it prints the scraped titles. Note that the results may not be in the same order as the input URLs, since the requests run concurrently:

$ pipenv run python test.py
['\n      Gmail - Email from Google\n    ', 'Google Images', '  Google Maps  ', 'Using Google Drive - New Features, Benefits & Advantages of Google Cloud Storage', 'Google Shopping', 'Google', 'Collections']
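
The return type is generic: ParallelHtmlScraper.execute returns a list of whatever the analyzer's execute method returns. As a minimal sketch of this (LinkCountAnalyzer is a hypothetical name, assuming the same HtmlAnalyzer interface used above), an analyzer can just as well return a dict per page:

from typing import Dict, Union

from bs4 import BeautifulSoup

from parallelhtmlscraper.html_analyzer import HtmlAnalyzer
from parallelhtmlscraper.parallel_html_scraper import ParallelHtmlScraper

class LinkCountAnalyzer(HtmlAnalyzer):
    """Hypothetical analyzer returning the page title and its number of links."""
    async def execute(self, soup: BeautifulSoup) -> Dict[str, Union[str, int]]:
        title_tag = soup.find('title')
        return {
            'title': title_tag.text.strip() if title_tag else '',
            'links': len(soup.find_all('a')),
        }

list_response = ParallelHtmlScraper.execute(
    'https://www.google.com', ['', '/save'], LinkCountAnalyzer()
)
print(list_response)  # e.g. [{'title': 'Google', 'links': ...}, ...]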

API

ParallelHtmlScraper.execute

class ParallelHtmlScraper:
    """API of parallel HTML scraping."""

    @staticmethod
    def execute(
        base_url: str,
        list_url: Iterable[str],
        analyzer: HtmlAnalyzer[_T],
        *,
        limit: int = 5,
        interval: int = 1
    ) -> List[_T]:

base_url: str

The common part of the request URLs. Relative paths in list_url (such as those extracted from HTML) are joined to this base when downloading.

list_url: Iterable[str]

The list of URLs to download in parallel. Entries may be paths relative to base_url, or absolute URLs that share the same base as base_url.

analyzer: HtmlAnalyzer[_T]

An instance of a class extending HtmlAnalyzer, which analyzes HTML using BeautifulSoup. The following example illustrates its role:

class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text

limit: int = 5

The maximum number of concurrent downloads.

interval: int = 1

The interval between requests, in seconds.
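
Because limit and interval are keyword-only parameters (note the bare * in the signature), they must be passed by name. A throttled call might look like the following sketch, reusing host_google, path_and_content, and AnalyzerExample from the usage example above; the values 2 and 3 are illustrative:

# At most 2 requests in flight at once, waiting 3 seconds between requests.
list_response = ParallelHtmlScraper.execute(
    host_google,
    path_and_content,
    AnalyzerExample(),
    limit=2,
    interval=3,
)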

