
Parallel HTML Scraper


Helps you scrape HTML files in parallel without async / await syntax.

Feature

This project helps you scrape HTML files in parallel without having to write async / await syntax yourself.

Installation

pip install parallelhtmlscraper

Usage

Minimal example:

from bs4 import BeautifulSoup

from parallelhtmlscraper.html_analyzer import HtmlAnalyzer
from parallelhtmlscraper.parallel_html_scraper import ParallelHtmlScraper

class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text

host_google = 'https://www.google.com'
path_and_content = [
    '',                                                           # Google Search
    '/imghp?hl=EN',                                               # Google Images
    '/shopping?hl=en',                                            # Google Shopping
    '/save',                                                      # Collection
    'https://www.google.com/maps?hl=en',                          # Google Maps
    'https://www.google.com/drive/apps.html',                     # Google drive
    'https://www.google.com/mail/help/intl/en/about.html?vm=r',   # GMail
]

list_response = ParallelHtmlScraper.execute(host_google, path_and_content, AnalyzerExample())
print(list_response)

Running it prints the scraped titles. Note that the results may not be in the same order as the input URLs, since the requests run concurrently:

$ pipenv run python test.py
['\n      Gmail - Email from Google\n    ', 'Google Images', '  Google Maps  ', 'Using Google Drive - New Features, Benefits & Advantages of Google Cloud Storage', 'Google Shopping', 'Google', 'Collections']
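
The return type is generic: ParallelHtmlScraper.execute returns a list of whatever the analyzer's execute method returns. As a minimal sketch of this (LinkCountAnalyzer is a hypothetical name, assuming the same HtmlAnalyzer interface used above), an analyzer can just as well return a dict per page:

from typing import Dict, Union

from bs4 import BeautifulSoup

from parallelhtmlscraper.html_analyzer import HtmlAnalyzer
from parallelhtmlscraper.parallel_html_scraper import ParallelHtmlScraper

class LinkCountAnalyzer(HtmlAnalyzer):
    """Hypothetical analyzer returning the page title and its number of links."""
    async def execute(self, soup: BeautifulSoup) -> Dict[str, Union[str, int]]:
        title_tag = soup.find('title')
        return {
            'title': title_tag.text.strip() if title_tag else '',
            'links': len(soup.find_all('a')),
        }

list_response = ParallelHtmlScraper.execute(
    'https://www.google.com', ['', '/save'], LinkCountAnalyzer()
)
print(list_response)  # e.g. [{'title': 'Google', 'links': ...}, ...]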

API

ParallelHtmlScraper.execute

class ParallelHtmlScraper:
    """API of parallel HTML scraping."""

    @staticmethod
    def execute(
        base_url: str,
        list_url: Iterable[str],
        analyzer: HtmlAnalyzer[_T],
        *,
        limit: int = 5,
        interval: int = 1
    ) -> List[_T]:

base_url: str

The common part of the request URLs. Relative paths in list_url (such as those extracted from HTML) are joined to this base when downloading.

list_url: Iterable[str]

The list of URLs to download in parallel. Entries may be paths relative to base_url, or absolute URLs that share the same base as base_url.

analyzer: HtmlAnalyzer[_T]

An instance of a class extending HtmlAnalyzer, which analyzes HTML using BeautifulSoup. The following example illustrates its role:

class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text

limit: int = 5

The maximum number of concurrent downloads.

interval: int = 1

The interval between requests, in seconds.
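
Because limit and interval are keyword-only parameters (note the bare * in the signature), they must be passed by name. A throttled call might look like the following sketch, reusing host_google, path_and_content, and AnalyzerExample from the usage example above; the values 2 and 3 are illustrative:

# At most 2 requests in flight at once, waiting 3 seconds between requests.
list_response = ParallelHtmlScraper.execute(
    host_google,
    path_and_content,
    AnalyzerExample(),
    limit=2,
    interval=3,
)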

