Skip to main content

Python 3.x web crawler for scraping images, URLs, e-mail addresses, and phone numbers

Project description

How To Install

pip install HiveWebCrawler

v0.1.5 News

  • v0.1.5 significantly reduces erroneous scraping results.
  • Image scraping has been improved.

Example:

Sending requests:

# Code
# Build a crawler toolkit, fetch a page over HTTP/S, and list the result fields.
from HiveWebCrawler.Crawler import WebCrawler

toolkit = WebCrawler()
request_data = toolkit.send_request(target_url="https://google.com")
print(request_data.keys())


# Output 
dict_keys(['success', 'message', 'url', 'status_code', 'timeout_val', 'method', 'data'])

Crawling links from a response:

# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# sending http/s requests
request_data = CrawlerToolkit.send_request(target_url="https://google.com")

# abort early if the HTTP request itself failed
if not request_data["success"]:
    print(request_data["message"])
    exit(1)

# Crawling links out of the <a href="..."> tags in the response body.
# NOTE(review): "pesponse" is how the library itself spells this method
# name — do not "correct" the spelling or the call will break.
crawled_links = CrawlerToolkit.crawl_links_from_pesponse_href(
    original_target_url="https://google.com",   # For feedback
    response_text=request_data["data"]
    )

# checking status — report the *crawl* failure here, not the earlier
# request (the original example mistakenly printed request_data["message"])
if not crawled_links["success"]:
    print(crawled_links["message"])
    exit(1)

# print dict keys
print(crawled_links.keys())


# print crawled links — each entry is a 2-element list:
# [url, extra-field] (the second element is None in the sample output below)
for single_list in crawled_links["data_array"]:
    print(single_list)



# OUTPUT

dict_keys(['success', 'data_array', 'original_url', 'message']) # dict keys

# Crawled links
['https://www.google.com/imghp?hl=tr&tab=wi', None]
['https://maps.google.com.tr/maps?hl=tr&tab=wl', None]
['https://play.google.com/?hl=tr&tab=w8', None]
['https://www.youtube.com/?tab=w1', None]
['https://news.google.com/?tab=wn', None]
['https://mail.google.com/mail/?tab=wm', None]
['https://drive.google.com/?tab=wo', None]
['https://www.google.com.tr/intl/tr/about/products?tab=wh', None]
['http://www.google.com.tr/history/optout?hl=tr', None]
['https://google.com/preferences?hl=tr', None]
['https://accounts.google.com/ServiceLogin?hl=tr&passive=true&continue=https://www.google.com/&ec=GAZAAQ', None]
['https://google.com/advanced_search?hl=tr&authuser=0', None]
['https://google.com/intl/tr/ads/', None]
['http://www.google.com.tr/intl/tr/services/', None]
['https://google.com/intl/tr/about.html', None]
['https://www.google.com/setprefdomain?prefdom=TR&prev=https://www.google.com.tr/&sig=K_nBMpLM40cwVr7j5Oqk31t_0TCeo%3D', None]
['https://google.com/intl/tr/policies/privacy/', None]
['https://google.com/intl/tr/policies/terms/', None]
                                                            

Crawling images from a response:

# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# sending http/s requests
request_data = CrawlerToolkit.send_request(target_url="https://google.com")

# abort early if the HTTP request itself failed
if not request_data["success"]:
    print(request_data["message"])
    exit(1)

# Crawling images (<img> tags) out of the response body
crawled_images = CrawlerToolkit.crawl_image_from_response(
    original_url="https://google.com",
    response_text=request_data["data"]
    )


# checking status — report the *image crawl* failure here, not the earlier
# request (the original example mistakenly printed request_data["message"]).
# The success-path key list shows no "message" key, so use .get() to avoid
# a KeyError if the failure dict also omits it — TODO confirm against library.
if not crawled_images["success"]:
    print(crawled_images.get("message"))
    exit(1)

# print dict keys
print(crawled_images.keys())


# print crawled images — each entry is a 3-element list:
# [image_url, alt-text-or-None, extra-field (None in the sample output)]
for single_list in crawled_images["data_array"]:
    print(single_list)


# OUTPUT 

dict_keys(['success', 'data_array', 'original_url']) # dict keys

# Crawled Images
['https://google.com/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png', 'Google', None]
['https://google.com/textinputassistant/tia.png', None, None]

Crawling phone numbers or e-mail addresses

E-mail

# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# sending http/s requests
request_data = CrawlerToolkit.send_request(target_url="https://www.hurriyet.com.tr/bizeulasin/")

# abort early if the HTTP request itself failed
if not request_data["success"]:
    print(request_data["message"])
    exit(1)

# Crawling e-mail addresses out of mailto:/href attributes in the response
crawled_emails = CrawlerToolkit.crawl_email_address_from_response_href(response_text=request_data["data"])


# checking status — report the *e-mail crawl* failure here, not the earlier
# request (the original example mistakenly printed request_data["message"])
if not crawled_emails["success"]:
    print(crawled_emails["message"])
    exit(1)

# print dict keys
print(crawled_emails.keys())


# print crawled e-mails — each entry is a 2-element list:
# [extra-field (None in the sample output), email_address]
for single_list in crawled_emails["data_array"]:
    print(single_list)


# OUTPUT 

dict_keys(['success', 'data_array', 'message']) # dict keys

# Crawled emails
[None, 'CENSORED@hurriyet.com.tr']
                                      

Phone Numbers

# import Crawler
from HiveWebCrawler.Crawler import WebCrawler

# toolkit init
CrawlerToolkit = WebCrawler()

# sending http/s requests
request_data = CrawlerToolkit.send_request(target_url="https://www.hurriyet.com.tr/bizeulasin/")

# abort early if the HTTP request itself failed
if not request_data["success"]:
    print(request_data["message"])
    exit(1)

# Crawling phone numbers out of tel:/href attributes in the response
crawled_numbers = CrawlerToolkit.crawl_phone_number_from_response_href(response_text=request_data["data"])


# checking status — report the *phone crawl* failure here, not the earlier
# request (the original example mistakenly printed request_data["message"])
if not crawled_numbers["success"]:
    print(crawled_numbers["message"])
    exit(1)

# print dict keys
print(crawled_numbers.keys())


# print crawled phone numbers — each entry is a 2-element list:
# [extra-field (None in the sample output), phone_number]
for single_list in crawled_numbers["data_array"]:
    print(single_list)


# OUTPUT 
dict_keys(['success', 'data_array', 'message']) # dict keys
[None, '+90XXXXXXXXXXX'] # Crawled phone numbers

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hivewebcrawler-0.1.5.tar.gz (6.1 kB view hashes)

Uploaded Source

Built Distribution

HiveWebCrawler-0.1.5-py3-none-any.whl (6.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page