Develop and build web spider clusters in a human-friendly way.
Project description
SmoothCrawler-Cluster
SmoothCrawler-Cluster is a Python framework that encapsulates building a clustered or decentralized crawler system in a human-friendly way with SmoothCrawler.
Overview | Quick Demo | Documentation
Overview
SmoothCrawler helps you build a crawler from multiple components, like assembling LEGO bricks. SmoothCrawler-Cluster helps you build a cluster or decentralized system with those bricks. The reason is the same as why SmoothCrawler exists: SoC (Separation of Concerns). Developers can focus on how to handle the HTTP request and HTTP response, how to parse the content of the HTTP response, and so on. In addition to the crawler features, it also provides cluster or decentralized system features.
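As a concrete sketch of that SoC idea, the three factory classes used in the Quick Demo below (RequestsHTTPRequest, RequestsExampleHTTPResponseParser and ExampleDataHandler) could be implemented roughly like this. This is a minimal sketch assuming SmoothCrawler's component base classes and hook names; please verify the exact module paths against the SmoothCrawler version you install:

from bs4 import BeautifulSoup
import requests

# Assumed SmoothCrawler component base classes (check your installed version)
from smoothcrawler.components.data import BaseDataHandler, BaseHTTPResponseParser
from smoothcrawler.components.httpio import HTTP

class RequestsHTTPRequest(HTTP):
    # Send the HTTP request with the *requests* library
    def get(self, url: str, *args, **kwargs):
        return requests.get(url)

class RequestsExampleHTTPResponseParser(BaseHTTPResponseParser):
    # Tell the framework how to read this response type's status code
    def get_status_code(self, response: requests.Response) -> int:
        return response.status_code

    # Parse a successful response; here we only extract the page's <h1> text,
    # which is "Example Domain" for https://www.example.com
    def handling_200_response(self, response: requests.Response) -> str:
        soup = BeautifulSoup(response.text, "html.parser")
        return soup.find("h1").text

class ExampleDataHandler(BaseDataHandler):
    # Post-process the parsed data; a simple pass-through in this example
    def process(self, result):
        return result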
Quick Demo
The demonstration is divided into 2 parts:
- General crawler feature: demonstrates the general crawling feature, without any features related to a cluster or decentralized system.
- Cluster feature: lets developers see how it runs as a highly reliable cluster system.
General crawler feature
Currently, the cluster feature is only supported through the third-party application Zookeeper. So let's start the demonstration with the object ZookeeperCrawler:
from smoothcrawler_cluster.crawler import ZookeeperCrawler

zk_crawler = ZookeeperCrawler(runner=1,  # How many crawlers run tasks
                              backup=1,  # How many crawlers stand by as backups of the runners
                              ensure_initial=True,  # Run the initial process first
                              zk_hosts="localhost:2181")  # Zookeeper hosts
zk_crawler.register_factory(http_req_sender=RequestsHTTPRequest(),
                            http_resp_parser=RequestsExampleHTTPResponseParser(),
                            data_process=ExampleDataHandler())
zk_crawler.run()
It runs as an endless loop after calling run. To trigger the crawler instance to run a crawling task, please assign a task by setting a value on the Zookeeper node.
Note Please run the above Python code as 2 different processes, e.g., open 2 terminal tabs or windows and run the above Python code in each one.
from kazoo.client import KazooClient
from smoothcrawler_cluster.model import Initial
import json
# Initial task data
task = Initial.task(running_content=[{
"task_id": 0,
"url": "https://www.example.com",
"method": "GET",
"parameters": {},
"header": {},
"body": {}
}])
# Set the task value
zk_client = KazooClient(hosts="localhost:2181")
zk_client.start()
zk_client.set(path="/smoothcrawler/node/sc-crawler_1/task", value=bytes(json.dumps(task.to_readable_object()), "utf-8"))
After the task is assigned to the crawler instance, it runs the task and saves the result back to Zookeeper.
[zk: localhost:2181(CONNECTED) 19] get /smoothcrawler/node/sc-crawler_1/task
{"running_content": [], "cookie": {}, "authorization": {}, "in_progressing_id": "-1", "running_result": {"success_count": 1,
"fail_count": 0}, "running_status": "done", "result_detail": [{"task_id": 0, "state": "done", "status_code": 200, "response":
"Example Domain", "error_msg": null}]}
From the above info, we can get the detail of the running result in the column result_detail:
[
{
"task_id": 0,
"state": "done",
"status_code": 200,
"response": "Example Domain",
"error_msg": null
}
]
The above data means the task whose task_id is 0 has been done, and the HTTP status code it got is 200. It also got the parsing result: Example Domain.
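By the way, you don't need the zk CLI to check the result. Reading the same node with kazoo from Python works as well (the node path is the one we set the task on earlier):

import json
from kazoo.client import KazooClient

zk_client = KazooClient(hosts="localhost:2181")
zk_client.start()
# kazoo's get returns a (bytes, ZnodeStat) tuple; decode the JSON payload
value, _state = zk_client.get(path="/smoothcrawler/node/sc-crawler_1/task")
task_result = json.loads(value.decode("utf-8"))
print(task_result["result_detail"])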
Cluster feature
Now we understand how to use it as a web spider, but what does the statement below mean?
... how it runs as a highly reliable cluster system.
Do you remember that we ran 2 crawler instances? Let's check the GroupState info of these crawler instances:
[zk: localhost:2181(CONNECTED) 10] get /smoothcrawler/group/sc-crawler-cluster/state
{"total_crawler": 2, "total_runner": 1, "total_backup": 1, "standby_id": "2", "current_crawler": ["sc-crawler_1", "sc-crawler_2"],
"current_runner": ["sc-crawler_1"], "current_backup": ["sc-crawler_2"], "fail_crawler": [], "fail_runner": [], "fail_backup": []}
It shows that only one instance is a Runner and would receive tasks to run right now. So let's try to stop or kill the Runner one and observe the behavior of the crawler instances.
Note If you opened 2 terminal tabs or windows to run it, please select the first one you ran and press Ctrl + C.
You would observe that the Backup one activates itself to become a Runner, and the original Runner one is recorded in the columns fail_crawler and fail_runner.
[zk: localhost:2181(CONNECTED) 11] get /smoothcrawler/group/sc-crawler-cluster/state
{"total_crawler": 2, "total_runner": 1, "total_backup": 0, "standby_id": "3", "current_crawler": ["sc-crawler_2"], "current_runner":
["sc-crawler_2"], "current_backup": [], "fail_crawler": ["sc-crawler_1"], "fail_runner": ["sc-crawler_1"], "fail_backup": []}
The crawler instance sc-crawler_2 becomes the new Runner, waiting for tasks to run. You can also test its crawling feature as in General crawler feature.
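If you would rather observe the failover programmatically than through the zk CLI, one way is a kazoo DataWatch on the GroupState node (a sketch; the path is the same as in the CLI calls above):

import json
from kazoo.client import KazooClient

zk_client = KazooClient(hosts="localhost:2181")
zk_client.start()

# Print the membership every time the GroupState node changes, so the
# Runner/Backup switch shows up live when you kill the Runner process
@zk_client.DataWatch("/smoothcrawler/group/sc-crawler-cluster/state")
def on_state_change(data, stat):
    if data is not None:
        state = json.loads(data.decode("utf-8"))
        print("runners:", state["current_runner"],
              "backups:", state["current_backup"],
              "failed:", state["fail_runner"])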
So far, this demonstrates that besides helping developers build web crawlers with a clean software architecture, it also provides a cluster feature that makes the crawler highly reliable.
Documentation
The documentation contains more details and demonstrations.
- Quickly Start to build your own crawler cluster with SmoothCrawler-Cluster
- Detailed usage information of SmoothCrawler-Cluster's functions, classes and methods in the API References
- Know exactly what you need and want to customize something in SmoothCrawler-Cluster?
- Not sure how to use SmoothCrawler-Cluster and design your crawler cluster? The Usage Guides could be a good guide for you
- Curious about the details of SmoothCrawler-Cluster's development? The Development Documentation would be helpful to you
- The Release Notes of SmoothCrawler-Cluster
Download
SmoothCrawler-Cluster is still a young open source project that keeps growing. Here's its download state:
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
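In most cases, though, you would install the package from PyPI with pip instead of downloading the files manually (the package name matches the file names listed below):

pip install SmoothCrawler-Cluster==0.2.0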
Source Distribution
SmoothCrawler-Cluster-0.2.0.tar.gz
Built Distribution
SmoothCrawler_Cluster-0.2.0-py3-none-any.whl
File details
Details for the file SmoothCrawler-Cluster-0.2.0.tar.gz.
File metadata
- Download URL: SmoothCrawler-Cluster-0.2.0.tar.gz
- Upload date:
- Size: 87.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9436ddb0753fc1df53525e30fc5069371edda599e7dfe8059ea492a01edcb658
MD5 | be7b7c606becdf4f38bc78c8c0afc72a
BLAKE2b-256 | fc00fa98291a0d90e0c062314b0bfb615bafef26657a0d1c6a675e1d0439986b
File details
Details for the file SmoothCrawler_Cluster-0.2.0-py3-none-any.whl.
File metadata
- Download URL: SmoothCrawler_Cluster-0.2.0-py3-none-any.whl
- Upload date:
- Size: 109.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | a7acb90151c23d5449bcf7ccc947cf88f0d17718da3beae846c17cf7bb730e99
MD5 | 4c5f126df921141049ae26842caab337
BLAKE2b-256 | 25ddaee3870f612d0299668f05825b912bd7554863b5b55a84b8ac12922947df