Skip to main content

Python 3.x Web Crawler, Images, Urls, Emails, Phone numbers

Project description

How To Install

pip install HiveWebCrawler

v0.1.5 News

  • With v0.1.5, erroneous scraping scenarios have been significantly reduced.
  • Image scraping has been improved.

Example:

Sending requests:

# Code
from HiveWebCrawler.Crawler import WebCrawler
CrawlerToolkit = WebCrawler()
request_data = CrawlerToolkit.send_request(target_url="https://google.com")
print(request_data.keys())


# Output 
dict_keys(['success', 'message', 'url', 'status_code', 'timeout_val', 'method', 'data'])

Crawling link from response:

# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# sending http/s requests
request_data = CrawlerToolkit.send_request(target_url="https://google.com")

# checking status
if not request_data["success"]:
    print(request_data["message"])
    exit(1)
    
# Crawling links
crawled_links = CrawlerToolkit.crawl_links_from_pesponse_href(
    original_target_url="https://google.com",   # For feedback
    response_text=request_data["data"]
    )

# checking status
if not crawled_links["success"]:
    print(request_data["message"])
    exit(1)
    
# print dict keys
print(crawled_links.keys())


# print crawled links
for single_list in crawled_links["data_array"]:
    print(single_list)



# OUTPUT

dict_keys(['success', 'data_array', 'original_url', 'message']) # dict keys

# Crawled links
['https://www.google.com/imghp?hl=tr&tab=wi', None]
['https://maps.google.com.tr/maps?hl=tr&tab=wl', None]
['https://play.google.com/?hl=tr&tab=w8', None]
['https://www.youtube.com/?tab=w1', None]
['https://news.google.com/?tab=wn', None]
['https://mail.google.com/mail/?tab=wm', None]
['https://drive.google.com/?tab=wo', None]
['https://www.google.com.tr/intl/tr/about/products?tab=wh', None]
['http://www.google.com.tr/history/optout?hl=tr', None]
['https://google.com/preferences?hl=tr', None]
['https://accounts.google.com/ServiceLogin?hl=tr&passive=true&continue=https://www.google.com/&ec=GAZAAQ', None]
['https://google.com/advanced_search?hl=tr&authuser=0', None]
['https://google.com/intl/tr/ads/', None]
['http://www.google.com.tr/intl/tr/services/', None]
['https://google.com/intl/tr/about.html', None]
['https://www.google.com/setprefdomain?prefdom=TR&prev=https://www.google.com.tr/&sig=K_nBMpLM40cwVr7j5Oqk31t_0TCeo%3D', None]
['https://google.com/intl/tr/policies/privacy/', None]
['https://google.com/intl/tr/policies/terms/', None]
                                                            

Crawling Image From Response:

# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# sending http/s requests
request_data = CrawlerToolkit.send_request(target_url="https://google.com")

# checking status
if not request_data["success"]:
    print(request_data["message"])
    exit(1)
    
# Crawling Images
crawled_links = CrawlerToolkit.crawl_image_from_response(
    original_url="https://google.com",
    response_text=request_data["data"]
    )


# checking status
if not crawled_links["success"]:
    print(request_data["message"])
    exit(1)
    
# print dict keys
print(crawled_links.keys())


# print crawled Images
for single_list in crawled_links["data_array"]:
    print(single_list)


# OUTPUT 

dict_keys(['success', 'data_array', 'original_url']) # dict keys

# Crawled Images
['https://google.com/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png', 'Google', None]
['https://google.com/textinputassistant/tia.png', None, None]

Crawling Phone Number OR Email address

E-mail

# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# sending http/s requests
request_data = CrawlerToolkit.send_request(target_url="https://www.hurriyet.com.tr/bizeulasin/")

# checking status
if not request_data["success"]:
    print(request_data["message"])
    exit(1)
    
# Crawling email/s
crawled_links = CrawlerToolkit.crawl_email_address_from_response_href(response_text=request_data["data"])



# checking status
if not crawled_links["success"]:
    print(request_data["message"])
    exit(1)
    
# print dict keys
print(crawled_links.keys())


# print crawled email/s
for single_list in crawled_links["data_array"]:
    print(single_list)


# OUTPUT 

dict_keys(['success', 'data_array', 'message']) # dict keys

# Crawled emails
[None, 'CENSORED@hurriyet.com.tr']
                                      

Phone Numbers

# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# sending http/s requests
request_data = CrawlerToolkit.send_request(target_url="https://www.hurriyet.com.tr/bizeulasin/")

# checking status
if not request_data["success"]:
    print(request_data["message"])
    exit(1)
    
# Crawling phone numbers
crawled_links = CrawlerToolkit.crawl_phone_number_from_response_href(response_text=request_data["data"])



# checking status
if not crawled_links["success"]:
    print(request_data["message"])
    exit(1)
    
# print dict keys
print(crawled_links.keys())


# print crawled phone numbers
for single_list in crawled_links["data_array"]:
    print(single_list)


# OUTPUT 
dict_keys(['success', 'data_array', 'message']) # dict keys
[None, '+90XXXXXXXXXXX'] # Crawled phone numbers

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

HiveWebCrawler-0.1.6-py3-none-any.whl (8.2 kB view details)

Uploaded Python 3

File details

Details for the file HiveWebCrawler-0.1.6-py3-none-any.whl.

File metadata

  • Download URL: HiveWebCrawler-0.1.6-py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.11.9

File hashes

Hashes for HiveWebCrawler-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 7a4a5ffb967d0328ce4ef1f8dcc64a2ee36a1b0767cee45b0b3ef27cdb0ebcb7
MD5 093db8a30af426366144f6a4e877fc8e
BLAKE2b-256 8655b64cc24a1857bb623125c0fc5e2ca954b1b676416102f13745da86fd33a1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page