Parallel HTML Scraper
Helps you scrape HTML files in parallel without writing async / await syntax yourself.
Feature
Downloads and analyzes multiple HTML pages in parallel while keeping the async / await machinery out of your code.
Installation
pip install parallelhtmlscraper
Usage
Minimum example:

from bs4 import BeautifulSoup

from parallelhtmlscraper.html_analyzer import HtmlAnalyzer
from parallelhtmlscraper.parallel_html_scraper import ParallelHtmlScraper


class AnalyzerExample(HtmlAnalyzer):
    async def execute(self, soup: BeautifulSoup) -> str:
        return soup.find('title').text


host_google = 'https://www.google.com'
path_and_content = [
    '',                                                          # Google Search
    '/imghp?hl=EN',                                              # Google Images
    '/shopping?hl=en',                                           # Google Shopping
    '/save',                                                     # Collections
    'https://www.google.com/maps?hl=en',                         # Google Maps
    'https://www.google.com/drive/apps.html',                    # Google Drive
    'https://www.google.com/mail/help/intl/en/about.html?vm=r',  # Gmail
]
list_response = ParallelHtmlScraper.execute(host_google, path_and_content, AnalyzerExample())
print(list_response)
$ pipenv run python test.py
['\n Gmail - Email from Google\n ', 'Google Images', ' Google Maps ', 'Using Google Drive - New Features, Benefits & Advantages of Google Cloud Storage', 'Google Shopping', 'Google', 'Collections']

Note that, as the output above shows, results arrive in completion order, not in the order of path_and_content.
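AnalyzerExample above subclasses HtmlAnalyzer and overrides its async execute hook. The apparent contract can be pictured as a generic abstract base class; the sketch below is a hypothetical illustration of that contract, not the library's actual source:

```python
import abc
from typing import Generic, TypeVar

_T = TypeVar('_T')


class HtmlAnalyzerSketch(Generic[_T], abc.ABC):
    """Hypothetical sketch of the analyzer contract:
    one parsed page in, one result of type _T out."""

    @abc.abstractmethod
    async def execute(self, soup) -> _T:
        """Receive the parsed HTML (a BeautifulSoup object in the real
        library) and return the value extracted from that page."""
        raise NotImplementedError
```

Because execute is a coroutine, the scraper can await many analyzers concurrently while each subclass stays a plain, testable unit.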
API
ParallelHtmlScraper.execute
class ParallelHtmlScraper:
    """API of parallel HTML scraping."""

    @staticmethod
    def execute(
        base_url: str,
        list_url: Iterable[str],
        analyzer: HtmlAnalyzer[_T],
        *,
        limit: int = 5,
        interval: int = 1,
    ) -> List[_T]:
base_url: str
    Common part of the request URLs. It is prepended to each relative path in list_url.
list_url: Iterable[str]
    List of URLs to download in parallel.
    Absolute URLs may also be given, as long as they share the same base URL as base_url.
analyzer: HtmlAnalyzer[_T]
    An instance of a subclass of HtmlAnalyzer that analyzes the HTML with BeautifulSoup.
    The following example illustrates its role:

    class AnalyzerExample(HtmlAnalyzer):
        async def execute(self, soup: BeautifulSoup) -> str:
            return soup.find('title').text

limit: int = 5
    Maximum number of parallel requests.
interval: int = 1
    Interval between each request, in seconds.
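The limit and interval keywords suggest a semaphore-based throttling pattern. The following self-contained asyncio sketch illustrates that idea; run_throttled and its helpers are hypothetical names, not the library's actual implementation:

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

_T = TypeVar('_T')


def run_throttled(
    factories: Iterable[Callable[[], Awaitable[_T]]],
    *,
    limit: int = 5,
    interval: float = 1,
) -> List[_T]:
    """Run coroutine factories with at most `limit` in flight,
    sleeping `interval` seconds inside each slot (hypothetical sketch)."""
    async def _gather() -> List[_T]:
        semaphore = asyncio.Semaphore(limit)

        async def _guarded(factory: Callable[[], Awaitable[_T]]) -> _T:
            async with semaphore:  # cap concurrency at `limit`
                result = await factory()
                await asyncio.sleep(interval)  # space out successive requests
                return result

        # gather() preserves the order of its arguments in the result list
        return await asyncio.gather(*(_guarded(f) for f in factories))

    return asyncio.run(_gather())
```

The caller never touches async / await: the event loop lives entirely inside the function, which is the convenience the ParallelHtmlScraper.execute API appears to provide.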
File details
Details for the file parallelhtmlscraper-0.1.0.tar.gz.
File metadata
- Download URL: parallelhtmlscraper-0.1.0.tar.gz
- Upload date:
- Size: 10.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 07c2bb595aadf50cbcee93f2f0deba3d699ef8a8426a4fb75c7c4993248dccb1 |
| MD5 | d2aff3a0946b6773afd8b03d1dfdcd7a |
| BLAKE2b-256 | 4206e4f1d381d474177aae2cf685ae964362d2ca61ab16cdbf3aca291fa5159d |
File details
Details for the file parallelhtmlscraper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: parallelhtmlscraper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.8.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca1b7e4ffd04422e2665ba48e164ba6aa993d3e1ea965ff5efb4e6a475cf516c |
| MD5 | 5258e9af184ff5f633b0ca8e0afc72df |
| BLAKE2b-256 | 9f61706e4e80a7722d72177dbb88f424ced326d7e70e68919957d04451be4564 |