
This project helps you scrape HTML files.

Project description

Parallel HTML Scraper


Helps you scrape HTML files in parallel without async / await syntax.

Feature

This project helps you scrape HTML files in parallel without async / await syntax.

Installation

pip install parallelhtmlscraper

Usage

Minimum example:

from bs4 import BeautifulSoup

from parallelhtmlscraper.html_analyzer import HtmlAnalyzer
from parallelhtmlscraper.parallel_html_scraper import ParallelHtmlScraper

class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text

host_google = 'https://www.google.com'
path_and_content = [
    '',                                                           # Google Search
    '/imghp?hl=EN',                                               # Google Images
    '/shopping?hl=en',                                            # Google Shopping
    '/save',                                                      # Collection
    'https://www.google.com/maps?hl=en',                          # Google Maps
    'https://www.google.com/drive/apps.html',                     # Google drive
    'https://www.google.com/mail/help/intl/en/about.html?vm=r',   # GMail
]

list_response = ParallelHtmlScraper.execute(host_google, path_and_content, AnalyzerExample())
print(list_response)
Running the script prints the page titles (note that the order of results may differ from the order of the input URLs, as seen here):

$ pipenv run python test.py
['\n      Gmail - Email from Google\n    ', 'Google Images', '  Google Maps  ', 'Using Google Drive - New Features, Benefits & Advantages of Google Cloud Storage', 'Google Shopping', 'Google', 'Collections']

API

ParallelHtmlScraper.execute

class ParallelHtmlScraper:
    """API of parallel HTML scraping."""

    @staticmethod
    def execute(
        base_url: str,
        list_url: Iterable[str],
        analyzer: HtmlAnalyzer[_T],
        *,
        limit: int = 5,
        interval: int = 1
    ) -> List[_T]:

base_url: str

The common part of the request URLs. This is convenient when downloading URLs extracted from HTML.

list_url: Iterable[str]

The list of URLs to download in parallel. Entries may be relative paths, or absolute URLs that share the same base URL as base_url.
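Since relative paths and absolute URLs can be mixed, each entry is presumably resolved against base_url by standard URL joining; the library's exact joining logic is not shown on this page, but urllib.parse.urljoin sketches the assumed behavior:

```python
from urllib.parse import urljoin

base_url = 'https://www.google.com'
list_url = ['', '/imghp?hl=EN', 'https://www.google.com/maps?hl=en']

# Relative paths are joined onto base_url; absolute URLs pass through unchanged.
resolved = [urljoin(base_url, url) for url in list_url]
```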

analyzer: HtmlAnalyzer[_T]

An instance of a class that extends HtmlAnalyzer and analyzes HTML using BeautifulSoup. The following example will help to understand its role:

class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text

limit: int = 5

The maximum number of parallel downloads.

interval: int = 1

The interval between requests, in seconds.
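The library's internals aren't shown on this page, but a concurrency limit plus request interval of this kind is typically built from asyncio.Semaphore and asyncio.sleep. The sketch below is hypothetical (fetch_all and fake_fetch are not part of the package) and only illustrates the pattern:

```python
import asyncio

async def fetch_all(urls, fetch, *, limit=5, interval=1.0):
    """Run fetch(url) for every URL, at most `limit` at a time,
    pausing `interval` seconds between request starts."""
    semaphore = asyncio.Semaphore(limit)  # caps concurrency at `limit`
    start_gate = asyncio.Lock()           # serializes the pre-request pause

    async def bounded(url):
        async with semaphore:
            async with start_gate:             # one coroutine pauses at a time,
                await asyncio.sleep(interval)  # spacing out request starts
            return await fetch(url)

    # gather preserves the input order in its results
    return await asyncio.gather(*(bounded(url) for url in urls))

async def fake_fetch(url):
    """Stand-in for an HTTP request; returns a fake <title>."""
    await asyncio.sleep(0.01)
    return f'<title>{url}</title>'

titles = asyncio.run(fetch_all(['/a', '/b', '/c'], fake_fetch, limit=2, interval=0.01))
```

Note that the caller invokes a single blocking function and never touches async / await, which matches the package's stated goal of hiding that syntax.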

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

parallelhtmlscraper-0.1.0.tar.gz (10.0 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

parallelhtmlscraper-0.1.0-py3-none-any.whl (13.1 kB)

Uploaded Python 3

File details

Details for the file parallelhtmlscraper-0.1.0.tar.gz.

File metadata

  • Download URL: parallelhtmlscraper-0.1.0.tar.gz
  • Upload date:
  • Size: 10.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for parallelhtmlscraper-0.1.0.tar.gz

  • SHA256: 07c2bb595aadf50cbcee93f2f0deba3d699ef8a8426a4fb75c7c4993248dccb1
  • MD5: d2aff3a0946b6773afd8b03d1dfdcd7a
  • BLAKE2b-256: 4206e4f1d381d474177aae2cf685ae964362d2ca61ab16cdbf3aca291fa5159d

See more details on using hashes here.

File details

Details for the file parallelhtmlscraper-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: parallelhtmlscraper-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5

File hashes

Hashes for parallelhtmlscraper-0.1.0-py3-none-any.whl

  • SHA256: ca1b7e4ffd04422e2665ba48e164ba6aa993d3e1ea965ff5efb4e6a475cf516c
  • MD5: 5258e9af184ff5f633b0ca8e0afc72df
  • BLAKE2b-256: 9f61706e4e80a7722d72177dbb88f424ced326d7e70e68919957d04451be4564

See more details on using hashes here.
