Python 3.x Web Crawler for Images, URLs, Emails, and Phone Numbers
Project description
How To Install
```
pip install HiveWebCrawler
```
v0.1.5 News
- v0.1.5 significantly reduces erroneous scraping results.
- Image scraping has been improved.
Examples
Sending requests:
```python
# Code
from HiveWebCrawler.Crawler import WebCrawler

CrawlerToolkit = WebCrawler()
request_data = CrawlerToolkit.send_request(target_url="https://google.com")
print(request_data.keys())
```

# Output

```
dict_keys(['success', 'message', 'url', 'status_code', 'timeout_val', 'method', 'data'])
```
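Since every call returns a dict with a `success` flag and a `message`, error handling can be factored into a small helper. A minimal sketch (the `sample` dict below is illustrative, mimicking the key layout shown above; `check_response` is a hypothetical helper, not part of the library):

```python
def check_response(request_data):
    """Return the response body, or raise with the crawler's message."""
    if not request_data.get("success"):
        raise RuntimeError(f"request failed: {request_data.get('message')}")
    return request_data["data"]

# Illustrative dict mimicking send_request()'s output shape
sample = {
    "success": True,
    "message": "OK",
    "url": "https://google.com",
    "status_code": 200,
    "timeout_val": 10,
    "method": "GET",
    "data": "<html>...</html>",
}
print(check_response(sample))
```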
Crawling link from response:
```python
# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# send an http/s request
request_data = CrawlerToolkit.send_request(target_url="https://google.com")

# check status
if not request_data["success"]:
    print(request_data["message"])
    exit(1)

# crawl links
crawled_links = CrawlerToolkit.crawl_links_from_pesponse_href(
    original_target_url="https://google.com",  # for feedback
    response_text=request_data["data"]
)

# check status
if not crawled_links["success"]:
    print(crawled_links["message"])
    exit(1)

# print dict keys
print(crawled_links.keys())

# print crawled links
for single_list in crawled_links["data_array"]:
    print(single_list)
```

# Output

```
dict_keys(['success', 'data_array', 'original_url', 'message'])
['https://www.google.com/imghp?hl=tr&tab=wi', None]
['https://maps.google.com.tr/maps?hl=tr&tab=wl', None]
['https://play.google.com/?hl=tr&tab=w8', None]
['https://www.youtube.com/?tab=w1', None]
['https://news.google.com/?tab=wn', None]
['https://mail.google.com/mail/?tab=wm', None]
['https://drive.google.com/?tab=wo', None]
['https://www.google.com.tr/intl/tr/about/products?tab=wh', None]
['http://www.google.com.tr/history/optout?hl=tr', None]
['https://google.com/preferences?hl=tr', None]
['https://accounts.google.com/ServiceLogin?hl=tr&passive=true&continue=https://www.google.com/&ec=GAZAAQ', None]
['https://google.com/advanced_search?hl=tr&authuser=0', None]
['https://google.com/intl/tr/ads/', None]
['http://www.google.com.tr/intl/tr/services/', None]
['https://google.com/intl/tr/about.html', None]
['https://www.google.com/setprefdomain?prefdom=TR&prev=https://www.google.com.tr/&sig=K_nBMpLM40cwVr7j5Oqk31t_0TCeo%3D', None]
['https://google.com/intl/tr/policies/privacy/', None]
['https://google.com/intl/tr/policies/terms/', None]
```
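Each entry in `data_array` is a `[url, anchor_text]` pair, so a common follow-up step is deduplicating the URLs and grouping them by host. A standard-library-only sketch, using a few of the pairs from the output above as sample data:

```python
from urllib.parse import urlparse

# A few [url, anchor_text] pairs as returned in data_array
crawled = [
    ["https://www.google.com/imghp?hl=tr&tab=wi", None],
    ["https://maps.google.com.tr/maps?hl=tr&tab=wl", None],
    ["https://google.com/preferences?hl=tr", None],
    ["https://google.com/intl/tr/about.html", None],
]

# Deduplicate URLs and bucket them by hostname
by_host = {}
for url, _anchor in crawled:
    host = urlparse(url).netloc
    by_host.setdefault(host, set()).add(url)

for host, urls in sorted(by_host.items()):
    print(host, len(urls))
```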
Crawling Images From Response:
```python
# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# send an http/s request
request_data = CrawlerToolkit.send_request(target_url="https://google.com")

# check status
if not request_data["success"]:
    print(request_data["message"])
    exit(1)

# crawl images
crawled_images = CrawlerToolkit.crawl_image_from_response(
    original_url="https://google.com",
    response_text=request_data["data"]
)

# check status
if not crawled_images["success"]:
    print(crawled_images["message"])
    exit(1)

# print dict keys
print(crawled_images.keys())

# print crawled images
for single_list in crawled_images["data_array"]:
    print(single_list)
```

# Output

```
dict_keys(['success', 'data_array', 'original_url'])
['https://google.com/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png', 'Google', None]
['https://google.com/textinputassistant/tia.png', None, None]
```
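Each image entry appears to be an `[image_url, alt_text, ...]` list. Before downloading, a local filename can be derived from the URL path; a minimal standard-library sketch (`local_name` is a hypothetical helper, not part of the library):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

# Sample [image_url, alt_text, ...] entries, as in the output above
crawled_images = [
    ["https://google.com/images/branding/googlelogo/1x/"
     "googlelogo_white_background_color_272x92dp.png", "Google", None],
    ["https://google.com/textinputassistant/tia.png", None, None],
]

def local_name(image_url):
    """Derive a filesystem-friendly file name from the URL path."""
    return PurePosixPath(urlparse(image_url).path).name

for url, alt, _ in crawled_images:
    print(local_name(url), alt or "(no alt text)")
```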
Crawling Phone Numbers or Email Addresses:
```python
# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# send an http/s request
request_data = CrawlerToolkit.send_request(target_url="https://www.hurriyet.com.tr/bizeulasin/")

# check status
if not request_data["success"]:
    print(request_data["message"])
    exit(1)

# crawl email addresses
crawled_emails = CrawlerToolkit.crawl_email_address_from_response_href(response_text=request_data["data"])

# check status
if not crawled_emails["success"]:
    print(crawled_emails["message"])
    exit(1)

# print dict keys
print(crawled_emails.keys())

# print crawled emails
for single_list in crawled_emails["data_array"]:
    print(single_list)
```

# Output

```
dict_keys(['success', 'data_array', 'message'])
[None, 'CENSORED@hurriyet.com.tr']
```
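Entries in `data_array` can contain `None` placeholders and duplicates, so a filtering pass is usually needed before the addresses are used. A minimal sketch (the addresses below are hypothetical placeholders, not scraped data):

```python
# Sample [context, email] entries mimicking data_array's shape
crawled_emails = [
    [None, "info@example.com"],
    [None, "info@example.com"],   # duplicates can occur
    [None, "press@example.com"],
]

# Keep unique, non-empty addresses, preserving first-seen order
unique = list(dict.fromkeys(
    email for _, email in crawled_emails if email
))
print(unique)
```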
Phone Numbers:
```python
# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# send an http/s request
request_data = CrawlerToolkit.send_request(target_url="https://www.hurriyet.com.tr/bizeulasin/")

# check status
if not request_data["success"]:
    print(request_data["message"])
    exit(1)

# crawl phone numbers
crawled_numbers = CrawlerToolkit.crawl_phone_number_from_response_href(response_text=request_data["data"])

# check status
if not crawled_numbers["success"]:
    print(crawled_numbers["message"])
    exit(1)

# print dict keys
print(crawled_numbers.keys())

# print crawled phone numbers
for single_list in crawled_numbers["data_array"]:
    print(single_list)
```

# Output

```
dict_keys(['success', 'data_array', 'message'])
[None, '+90XXXXXXXXXXX']
```
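Scraped phone numbers often mix spaces, parentheses, and dashes, so normalizing them to a bare digit string (keeping a leading `+`) simplifies deduplication. A minimal sketch with hypothetical numbers (`normalize_phone` is not part of the library):

```python
import re

def normalize_phone(raw):
    """Strip everything except digits, keeping a leading '+' if present."""
    digits = re.sub(r"[^\d]", "", raw)
    return "+" + digits if raw.strip().startswith("+") else digits

# Hypothetical sample numbers, not scraped data
samples = ["+90 (212) 555 00 00", "0212-555-00-00"]
print([normalize_phone(s) for s in samples])
```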
Download files
Source Distributions
No source distribution files are available for this release.
Built Distribution
File details
Details for the file HiveWebCrawler-0.1.6-py3-none-any.whl.
File metadata
- Download URL: HiveWebCrawler-0.1.6-py3-none-any.whl
- Upload date:
- Size: 8.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7a4a5ffb967d0328ce4ef1f8dcc64a2ee36a1b0767cee45b0b3ef27cdb0ebcb7` |
| MD5 | `093db8a30af426366144f6a4e877fc8e` |
| BLAKE2b-256 | `8655b64cc24a1857bb623125c0fc5e2ca954b1b676416102f13745da86fd33a1` |