Develop and build a web spider cluster in a human-friendly way.

These details have not been verified by PyPI


Project description

SmoothCrawler-Cluster

SmoothCrawler-Cluster is a Python framework that encapsulates building a cluster or decentralized crawler system, in a human-friendly way, on top of SmoothCrawler.

Overview | Quick Demo | Documentation

Overview

SmoothCrawler helps you build a crawler from multiple components, like combining LEGO bricks. SmoothCrawler-Cluster helps you build a cluster or decentralized system with those LEGO bricks. It exists for the same reason as SmoothCrawler: SoC (Separation of Concerns). Developers can focus on how to handle the HTTP request and HTTP response, how to parse the content of the HTTP response, etc. In addition to the crawler features, it also provides the cluster or decentralized system features.

Quick Demo

The demonstration is divided into 2 parts:

General crawler feature

Demonstrates the general crawling feature, without any features related to a cluster or decentralized system.

Cluster feature

Shows developers how it runs as a cluster system with high reliability.

General crawler feature

Currently, the cluster feature is only supported through the third-party application Zookeeper. So let's start the demonstration with the ZookeeperCrawler object:

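from smoothcrawler_cluster.crawler import ZookeeperCrawler

# NOTE: RequestsHTTPRequest, RequestsExampleHTTPResponseParser and ExampleDataHandler
# are not defined in this snippet. They are assumed to be your own SmoothCrawler
# components; define or import them before running this code.

zk_crawler = ZookeeperCrawler(runner=1,    # How many crawlers run tasks
                              backup=1,    # How many crawlers are backups of the runner
                              ensure_initial=True,    # Run the initialization process first
                              zk_hosts="localhost:2181")    # Zookeeper hosts
zk_crawler.register_factory(http_req_sender=RequestsHTTPRequest(),
                            http_resp_parser=RequestsExampleHTTPResponseParser(),
                            data_process=ExampleDataHandler())
zk_crawler.run()    # Blocks and waits for tasks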

After run is called, it runs in an infinite loop. To trigger the crawler instance to run a crawling task, assign the task by setting a value on the Zookeeper node.

Note: Please run the above Python code as 2 different processes, e.g., open 2 terminal tabs or windows and run the above Python code in each one.

from kazoo.client import KazooClient
from smoothcrawler_cluster.model import Initial
import json

# Initial task data
task = Initial.task(running_content=[{
    "task_id": 0,
    "url": "https://www.example.com",
    "method": "GET",
    "parameters": {},
    "header": {},
    "body": {}
}])

# Set the task value
zk_client = KazooClient(hosts="localhost:2181")
zk_client.start()
zk_client.set(path="/smoothcrawler/node/sc-crawler_1/task", value=bytes(json.dumps(task.to_readable_object()), "utf-8"))
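
To make the shape of that write explicit, here is a minimal standard-library sketch of building the node path and the byte payload. The helper names task_node_path and encode_task are hypothetical, the path layout is inferred only from the single path used above, and the plain dict is a stand-in for whatever task.to_readable_object() returns (an assumption, not the library's documented output).

```python
import json


def task_node_path(crawler_name: str) -> str:
    """Build the task node path for one crawler (layout assumed from the demo path)."""
    return f"/smoothcrawler/node/{crawler_name}/task"


def encode_task(task: dict) -> bytes:
    """Serialize a task dict to UTF-8 encoded JSON bytes, the form stored in the node."""
    return json.dumps(task).encode("utf-8")


# Hypothetical stand-in for task.to_readable_object(); only running_content is filled in.
example_task = {
    "running_content": [{
        "task_id": 0,
        "url": "https://www.example.com",
        "method": "GET",
        "parameters": {},
        "header": {},
        "body": {},
    }],
}

path = task_node_path("sc-crawler_1")
payload = encode_task(example_task)
print(path)
print(json.loads(payload.decode("utf-8"))["running_content"][0]["url"])
```

The str.encode("utf-8") call produces the same bytes as the bytes(..., "utf-8") form used in the demo; either works as the value argument.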

After the task is assigned to the crawler instance, it runs the task and saves the result back to Zookeeper.

[zk: localhost:2181(CONNECTED) 19] get /smoothcrawler/node/sc-crawler_1/task
{"running_content": [], "cookie": {}, "authorization": {}, "in_progressing_id": "-1", "running_result": {"success_count": 1,
"fail_count": 0}, "running_status": "done", "result_detail": [{"task_id": 0, "state": "done", "status_code": 200, "response":
"Example Domain", "error_msg": null}]}

From the output above, we can get the details of the running result from the result_detail field:

[
  {
    "task_id": 0,
    "state": "done",
    "status_code": 200,
    "response": "Example Domain",
    "error_msg": null
  }
]

The data above means that the task whose task_id is 0 is done, and the HTTP status code it got is 200. It also got the parsing result: Example Domain.
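
Reading those fields programmatically could look like the sketch below. It uses only the standard library and the node value copied from the zkCli output above; the function name summarize_results is hypothetical, and the field names are taken from that output rather than from a documented schema.

```python
import json

# Node value copied from the zkCli output above.
raw = (
    b'{"running_content": [], "cookie": {}, "authorization": {}, '
    b'"in_progressing_id": "-1", '
    b'"running_result": {"success_count": 1, "fail_count": 0}, '
    b'"running_status": "done", '
    b'"result_detail": [{"task_id": 0, "state": "done", "status_code": 200, '
    b'"response": "Example Domain", "error_msg": null}]}'
)


def summarize_results(node_value: bytes) -> list:
    """Return (task_id, state, status_code, response) for each entry in result_detail."""
    task = json.loads(node_value.decode("utf-8"))
    return [
        (d["task_id"], d["state"], d["status_code"], d["response"])
        for d in task["result_detail"]
    ]


for task_id, state, status_code, response in summarize_results(raw):
    print(f"task {task_id}: {state}, HTTP {status_code}, parsed: {response}")
```

With a live connection, the bytes would come from the first element of the tuple returned by KazooClient.get on the task node path.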

Cluster feature

Now we understand how to use it as a web spider, but what does the following statement mean?

... how it runs as a cluster system with high reliability.

Remember that we ran 2 crawler instances? Let's check the GroupState info of these crawler instances:

[zk: localhost:2181(CONNECTED) 10] get /smoothcrawler/group/sc-crawler-cluster/state
{"total_crawler": 2, "total_runner": 1, "total_backup": 1, "standby_id": "2", "current_crawler": ["sc-crawler_1", "sc-crawler_2"],
"current_runner": ["sc-crawler_1"], "current_backup": ["sc-crawler_2"], "fail_crawler": [], "fail_runner": [], "fail_backup": []}

It shows that only one instance is the Runner, which receives tasks to run right now. So let's stop or kill the Runner and observe how the crawler instances behave.

Note: If you opened 2 terminal tabs or windows, select the one you started first and press Control + C.

You will observe that the Backup instance activates itself to become the Runner, and the original Runner is recorded in the fail_crawler and fail_runner fields.

[zk: localhost:2181(CONNECTED) 11] get /smoothcrawler/group/sc-crawler-cluster/state
{"total_crawler": 2, "total_runner": 1, "total_backup": 0, "standby_id": "3", "current_crawler": ["sc-crawler_2"], "current_runner":
["sc-crawler_2"], "current_backup": [], "fail_crawler": ["sc-crawler_1"], "fail_runner": ["sc-crawler_1"], "fail_backup": []}

The crawler instance sc-crawler_2 becomes the new Runner, waiting for tasks to run. You can also test its crawling feature as in General crawler feature.

So far, this demonstrates that besides helping developers build a web crawler with a clean software architecture, it also has a cluster feature that makes it a highly reliable crawler.
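
The transition between the two GroupState values shown above can be reproduced with a small toy model. This is only an illustration of the observed state change, not the library's actual election logic: the function fail_over is hypothetical, and it assumes that the standby is the backup whose name ends with "_" plus standby_id and that standby_id increases by one after a promotion.

```python
import copy

# GroupState before and after the Runner was stopped, copied from the output above.
before = {
    "total_crawler": 2, "total_runner": 1, "total_backup": 1, "standby_id": "2",
    "current_crawler": ["sc-crawler_1", "sc-crawler_2"],
    "current_runner": ["sc-crawler_1"], "current_backup": ["sc-crawler_2"],
    "fail_crawler": [], "fail_runner": [], "fail_backup": [],
}
after = {
    "total_crawler": 2, "total_runner": 1, "total_backup": 0, "standby_id": "3",
    "current_crawler": ["sc-crawler_2"],
    "current_runner": ["sc-crawler_2"], "current_backup": [],
    "fail_crawler": ["sc-crawler_1"], "fail_runner": ["sc-crawler_1"], "fail_backup": [],
}


def fail_over(state: dict, failed_runner: str) -> dict:
    """Toy model: record a failed Runner and promote the standby Backup to Runner."""
    new = copy.deepcopy(state)  # leave the input state untouched
    # Move the failed Runner from the "current" lists to the "fail" lists.
    new["current_crawler"].remove(failed_runner)
    new["current_runner"].remove(failed_runner)
    new["fail_crawler"].append(failed_runner)
    new["fail_runner"].append(failed_runner)
    # Assumed naming rule: the standby is the Backup named "..._<standby_id>".
    suffix = "_" + new["standby_id"]
    standby = next(name for name in new["current_backup"] if name.endswith(suffix))
    new["current_backup"].remove(standby)
    new["current_runner"].append(standby)
    new["total_backup"] -= 1
    new["standby_id"] = str(int(new["standby_id"]) + 1)
    return new


print(fail_over(before, "sc-crawler_1") == after)
```

The model only covers a Runner failure with a standby available; a Backup failure or an empty current_backup list would need separate handling.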

Documentation

The documentation contains more details and demonstrations.

Quick Start to build your own crawler cluster with SmoothCrawler-Cluster
Detailed usage information for the functions, classes and methods of SmoothCrawler-Cluster is in the API References
I know what I need and want to customize something in SmoothCrawler-Cluster
Not sure how to use SmoothCrawler-Cluster or how to design your crawler cluster? The Usage Guides could be a good guide for you
Curious about the details of SmoothCrawler-Cluster development? The Development Documentation would be helpful to you
The Release Notes of SmoothCrawler-Cluster

Download

SmoothCrawler is still a young open source project that keeps growing. Here's its download state:


Release history

0.2.0 (this version): Feb 23, 2023

0.1.0: Jan 23, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SmoothCrawler-Cluster-0.2.0.tar.gz (87.5 kB)

Uploaded Feb 23, 2023 Source

Built Distribution

SmoothCrawler_Cluster-0.2.0-py3-none-any.whl (109.0 kB)

Uploaded Feb 23, 2023 Python 3

Hashes for SmoothCrawler-Cluster-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`9436ddb0753fc1df53525e30fc5069371edda599e7dfe8059ea492a01edcb658`
MD5	`be7b7c606becdf4f38bc78c8c0afc72a`
BLAKE2b-256	`fc00fa98291a0d90e0c062314b0bfb615bafef26657a0d1c6a675e1d0439986b`

Hashes for SmoothCrawler_Cluster-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a7acb90151c23d5449bcf7ccc947cf88f0d17718da3beae846c17cf7bb730e99`
MD5	`4c5f126df921141049ae26842caab337`
BLAKE2b-256	`25ddaee3870f612d0299668f05825b912bd7554863b5b55a84b8ac12922947df`
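
A downloaded file can be checked against the SHA256 digests listed above with a few lines of standard-library Python. This is a minimal sketch: the function names sha256_of and matches_digest are hypothetical, and the file is assumed to be in the current directory under the name shown in the table.

```python
import hashlib

# SHA256 digest of the source distribution, copied from the table above.
EXPECTED_SDIST_SHA256 = "9436ddb0753fc1df53525e30fc5069371edda599e7dfe8059ea492a01edcb658"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the archive was downloaded to the current directory:
# matches_digest("SmoothCrawler-Cluster-0.2.0.tar.gz", EXPECTED_SDIST_SHA256)
```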