This project helps you scrape HTML files from the web.
Project description
Parallel HTML Scraper
Helps you scrape HTML files in parallel without writing async / await orchestration yourself.
Feature
Scrapes HTML files in parallel while hiding the async / await event-loop plumbing from the caller.
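Conceptually, this style of API wraps an asyncio event loop behind a synchronous entry point, so the caller never touches asyncio.run or await. A minimal stdlib-only sketch of that pattern (the run_in_parallel name and worker signature are illustrative, not this package's API):

```python
import asyncio
from typing import Callable, Iterable, List


def run_in_parallel(items: Iterable[str], worker: Callable[[str], str], limit: int = 5) -> List[str]:
    """Run `worker` over `items` concurrently, hiding the event loop from the caller."""
    async def bounded(item: str, sem: asyncio.Semaphore) -> str:
        async with sem:
            # Offload the synchronous worker to a thread so slow items don't block each other.
            return await asyncio.to_thread(worker, item)

    async def main() -> List[str]:
        sem = asyncio.Semaphore(limit)  # cap the number of concurrent workers
        return await asyncio.gather(*(bounded(i, sem) for i in items))

    # Synchronous facade: the caller gets a plain list back, no await needed.
    return asyncio.run(main())


print(run_in_parallel(["a", "b", "c"], str.upper, limit=2))  # → ['A', 'B', 'C']
```

Results come back in input order because asyncio.gather preserves the order of its arguments regardless of completion order.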
Installation
pip install parallelhtmlscraper
Usage
Minimal example:
from bs4 import BeautifulSoup
from parallelhtmlscraper.html_analyzer import HtmlAnalyzer
from parallelhtmlscraper.parallel_html_scraper import ParallelHtmlScraper
class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text
host_google = 'https://www.google.com'
path_and_content = [
'', # Google Search
'/imghp?hl=EN', # Google Images
'/shopping?hl=en', # Google Shopping
'/save', # Collection
'https://www.google.com/maps?hl=en', # Google Maps
'https://www.google.com/drive/apps.html', # Google drive
'https://www.google.com/mail/help/intl/en/about.html?vm=r', # GMail
]
list_response = ParallelHtmlScraper.execute(host_google, path_and_content, AnalyzerExample())
print(list_response)
$ pipenv run python test.py
['\n Gmail - Email from Google\n ', 'Google Images', ' Google Maps ', 'Using Google Drive - New Features, Benefits & Advantages of Google Cloud Storage', 'Google Shopping', 'Google', 'Collections']
API
ParallelHtmlScraper.execute
class ParallelHtmlScraper:
    """API of parallel HTML scraping."""

    @staticmethod
    def execute(
        base_url: str,
        list_url: Iterable[str],
        analyzer: HtmlAnalyzer[_T],
        *,
        limit: int = 5,
        interval: int = 1
    ) -> List[_T]:
base_url: str
    Common part of the request URLs. Relative paths in list_url are resolved against it.
list_url: Iterable[str]
    List of URLs. The method downloads them in parallel.
    Absolute URLs that share the same base as base_url
    may also be specified.
analyzer: HtmlAnalyzer[_T]
    An instance of a subclass of HtmlAnalyzer
    that analyzes HTML by using BeautifulSoup.
    The following example illustrates its role:

    class AnalyzerExample(HtmlAnalyzer):
        async def execute(self, soup: BeautifulSoup) -> str:
            return soup.find('title').text
limit: int = 5
    Limits the number of parallel downloads.
interval: int = 1
    Interval between requests (seconds).
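The limit and interval parameters throttle requests: at most limit downloads run at once, and each waits interval seconds before fetching. A stdlib-only sketch of that throttling behavior (throttled_fetch and fetch_all are illustrative names, not this package's internals):

```python
import asyncio
import time


async def throttled_fetch(url: str, sem: asyncio.Semaphore, interval: float) -> str:
    async with sem:                    # at most `limit` tasks hold a slot at once
        await asyncio.sleep(interval)  # spacing between requests, in seconds
        return f"fetched {url}"        # placeholder for the real HTTP download


def fetch_all(urls, limit: int = 5, interval: float = 1.0):
    async def main():
        sem = asyncio.Semaphore(limit)
        return await asyncio.gather(*(throttled_fetch(u, sem, interval) for u in urls))
    return asyncio.run(main())


start = time.monotonic()
results = fetch_all(["/a", "/b", "/c", "/d"], limit=2, interval=0.1)
elapsed = time.monotonic() - start
print(results)  # ['fetched /a', 'fetched /b', 'fetched /c', 'fetched /d']
# Four URLs through two slots take at least two 0.1 s waves, so elapsed >= ~0.2 s.
```

Tune limit and interval to stay polite toward the target server: a lower limit and a higher interval reduce the request rate.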
Download files
Source Distribution
    parallelhtmlscraper-0.1.0.tar.gz (10.0 kB)

Built Distribution
    parallelhtmlscraper-0.1.0-py3-none-any.whl
Hashes for parallelhtmlscraper-0.1.0.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 07c2bb595aadf50cbcee93f2f0deba3d699ef8a8426a4fb75c7c4993248dccb1
MD5 | d2aff3a0946b6773afd8b03d1dfdcd7a
BLAKE2b-256 | 4206e4f1d381d474177aae2cf685ae964362d2ca61ab16cdbf3aca291fa5159d
Hashes for parallelhtmlscraper-0.1.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | ca1b7e4ffd04422e2665ba48e164ba6aa993d3e1ea965ff5efb4e6a475cf516c
MD5 | 5258e9af184ff5f633b0ca8e0afc72df
BLAKE2b-256 | 9f61706e4e80a7722d72177dbb88f424ced326d7e70e68919957d04451be4564