
Develop and build a web spider cluster in a human-friendly way.


SmoothCrawler-Cluster


SmoothCrawler-Cluster is a Python framework that encapsulates building a cluster or decentralized crawler system with SmoothCrawler in a human-friendly way.

Overview | Quick Demo | Documentation


Overview

SmoothCrawler helps you build a crawler from multiple components, like combining LEGO bricks. SmoothCrawler-Cluster helps you build a cluster or decentralized system with those bricks. It exists for the same reason SmoothCrawler does: SoC (Separation of Concerns). Developers can focus on how to handle the HTTP request and HTTP response, how to parse the content of the HTTP response, and so on. In addition to the crawler features, it also provides the cluster or decentralized system feature.

Quick Demo

The demonstration is divided into 2 parts:

  • General crawler feature

    Demonstrates a general crawling feature, without anything related to the cluster or decentralized system.

  • Cluster feature

    Shows developers how it runs as a highly reliable cluster system.

General crawler feature

Currently, the cluster feature is supported only through the third-party application Zookeeper, so the demonstration uses the ZookeeperCrawler object:

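from smoothcrawler_cluster.crawler import ZookeeperCrawler

# NOTE: RequestsHTTPRequest, RequestsExampleHTTPResponseParser and ExampleDataHandler
# are not imported here. They are assumed to be your own SmoothCrawler component
# implementations, defined or imported elsewhere in your project.

zk_crawler = ZookeeperCrawler(runner=1,    # How many crawlers run tasks
                              backup=1,    # How many crawlers stand by as backup of the runners
                              ensure_initial=True,    # Run the initial process first
                              zk_hosts="localhost:2181")    # Zookeeper hosts
# Register the components that send requests, parse responses and process data
zk_crawler.register_factory(http_req_sender=RequestsHTTPRequest(),
                            http_resp_parser=RequestsExampleHTTPResponseParser(),
                            data_process=ExampleDataHandler())
zk_crawler.run()    # Blocks and keeps waiting for tasks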

After run is called, it runs in an endless loop. To trigger the crawler instance to run a crawling task, assign the task by setting a value on the Zookeeper node.

Note Please run the above Python code as 2 different processes, e.g., open 2 terminal tabs or windows and run the above Python code in each one.

from kazoo.client import KazooClient
from smoothcrawler_cluster.model import Initial
import json

# Initial task data
task = Initial.task(running_content=[{
    "task_id": 0,
    "url": "https://www.example.com",
    "method": "GET",
    "parameters": {},
    "header": {},
    "body": {}
}])

# Set the task value
zk_client = KazooClient(hosts="localhost:2181")
zk_client.start()
zk_client.set(path="/smoothcrawler/node/sc-crawler_1/task", value=bytes(json.dumps(task.to_readable_object()), "utf-8"))
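
The assignment step can be wrapped in a small helper so the node path and the byte encoding live in one place. This is only a sketch: the functions task_node_path, encode_task and assign_task are hypothetical names that are not part of SmoothCrawler-Cluster, and the path layout is an assumption inferred from the single example path above. The client argument is expected to offer a set(path, value) method, as KazooClient does.

```python
import json


def task_node_path(crawler_name: str) -> str:
    # Assumption: layout inferred from the example path
    # "/smoothcrawler/node/sc-crawler_1/task"; it may differ in your setup.
    return f"/smoothcrawler/node/{crawler_name}/task"


def encode_task(readable_task: dict) -> bytes:
    # Zookeeper nodes store bytes, so serialize the task as UTF-8 encoded JSON.
    return json.dumps(readable_task).encode("utf-8")


def assign_task(zk_client, crawler_name: str, readable_task: dict) -> str:
    """Write the task to the crawler's task node and return the path used."""
    path = task_node_path(crawler_name)
    zk_client.set(path, encode_task(readable_task))
    return path
```

With the objects from the snippet above, a call could look like assign_task(zk_client, "sc-crawler_1", task.to_readable_object()).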

After the task is assigned to the crawler instance, it runs the task and saves the result back to Zookeeper.

[zk: localhost:2181(CONNECTED) 19] get /smoothcrawler/node/sc-crawler_1/task
{"running_content": [], "cookie": {}, "authorization": {}, "in_progressing_id": "-1", "running_result": {"success_count": 1,
"fail_count": 0}, "running_status": "done", "result_detail": [{"task_id": 0, "state": "done", "status_code": 200, "response":
"Example Domain", "error_msg": null}]}

From the above info, we can get the running result details in the field result_detail:

[
  {
    "task_id": 0,
    "state": "done",
    "status_code": 200,
    "response": "Example Domain",
    "error_msg": null
  }
]

The above data means that the task whose task_id is 0 is done and that the HTTP status code it got is 200. It also got the parsing result: Example Domain.
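
The same reading can be done programmatically. The sketch below parses the node value shown above with the standard json module and collects the outcome of each task. The function name summarize_task is hypothetical, and treating any state other than "done" as a failure is an assumption, since the output above shows only the "done" state.

```python
import json

# The task node value from the Zookeeper output above, as raw bytes.
RAW_TASK = (
    b'{"running_content": [], "cookie": {}, "authorization": {}, '
    b'"in_progressing_id": "-1", '
    b'"running_result": {"success_count": 1, "fail_count": 0}, '
    b'"running_status": "done", '
    b'"result_detail": [{"task_id": 0, "state": "done", "status_code": 200, '
    b'"response": "Example Domain", "error_msg": null}]}'
)


def summarize_task(raw: bytes) -> dict:
    """Summarize a task node value: overall status, counts and responses."""
    task = json.loads(raw.decode("utf-8"))
    details = task.get("result_detail", [])
    return {
        "status": task.get("running_status"),
        "success_count": task["running_result"]["success_count"],
        "fail_count": task["running_result"]["fail_count"],
        # Assumption: every state other than "done" counts as not successful.
        "failed_task_ids": [d["task_id"] for d in details if d.get("state") != "done"],
        "responses": {d["task_id"]: d.get("response") for d in details},
    }


summary = summarize_task(RAW_TASK)
print(summary["status"], summary["responses"])
```

In a live setup the raw bytes would come from the first element of the tuple returned by KazooClient.get for the task node.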

Cluster feature

Now we understand how to use it as a web spider, but what does the following statement mean?

... how it runs as a highly reliable cluster system.

Remember that we ran 2 crawler instances. Let's check the GroupState info of these crawler instances:

[zk: localhost:2181(CONNECTED) 10] get /smoothcrawler/group/sc-crawler-cluster/state
{"total_crawler": 2, "total_runner": 1, "total_backup": 1, "standby_id": "2", "current_crawler": ["sc-crawler_1", "sc-crawler_2"],
"current_runner": ["sc-crawler_1"], "current_backup": ["sc-crawler_2"], "fail_crawler": [], "fail_runner": [], "fail_backup": []}

It shows that only one instance is the Runner, which receives and runs tasks right now. So let's stop or kill the Runner and observe how the crawler instances behave.

Note If you opened 2 terminal tabs or windows, select the one you started first and press Control + C.

You would observe that the Backup activates itself to become the Runner, and that the original Runner is recorded in the fields fail_crawler and fail_runner.

[zk: localhost:2181(CONNECTED) 11] get /smoothcrawler/group/sc-crawler-cluster/state
{"total_crawler": 2, "total_runner": 1, "total_backup": 0, "standby_id": "3", "current_crawler": ["sc-crawler_2"], "current_runner":
["sc-crawler_2"], "current_backup": [], "fail_crawler": ["sc-crawler_1"], "fail_runner": ["sc-crawler_1"], "fail_backup": []}
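
To make the change between the two GroupState snapshots explicit, they can be compared field by field. This is a minimal sketch that uses only the standard json module and the two values shown above. The function describe_failover is a hypothetical helper, not part of SmoothCrawler-Cluster.

```python
import json

# GroupState before the Runner was stopped (copied from the first output above).
BEFORE = json.loads(
    '{"total_crawler": 2, "total_runner": 1, "total_backup": 1, "standby_id": "2", '
    '"current_crawler": ["sc-crawler_1", "sc-crawler_2"], '
    '"current_runner": ["sc-crawler_1"], "current_backup": ["sc-crawler_2"], '
    '"fail_crawler": [], "fail_runner": [], "fail_backup": []}'
)

# GroupState after the Runner was stopped (copied from the second output above).
AFTER = json.loads(
    '{"total_crawler": 2, "total_runner": 1, "total_backup": 0, "standby_id": "3", '
    '"current_crawler": ["sc-crawler_2"], "current_runner": ["sc-crawler_2"], '
    '"current_backup": [], "fail_crawler": ["sc-crawler_1"], '
    '"fail_runner": ["sc-crawler_1"], "fail_backup": []}'
)


def describe_failover(before: dict, after: dict) -> dict:
    """Report which backups became runners and which crawlers newly failed."""
    promoted = sorted(set(after["current_runner"]) & set(before["current_backup"]))
    newly_failed = sorted(set(after["fail_crawler"]) - set(before["fail_crawler"]))
    return {
        "promoted": promoted,
        "newly_failed": newly_failed,
        "backups_left": len(after["current_backup"]),
    }


print(describe_failover(BEFORE, AFTER))
```

For these two snapshots, the comparison reports sc-crawler_2 as promoted, sc-crawler_1 as newly failed, and no backups left.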

The crawler instance sc-crawler_2 becomes the new Runner, which waits for tasks and runs them. You can also test its crawling feature as described in General crawler feature.

So far, the demonstration shows that, besides helping developers build a web crawler with a clean software architecture, it also provides a cluster feature that makes the crawler highly reliable.

Documentation

The documentation contains more details and demonstrations.

  • Quickly Start to build your own crawler cluster with SmoothCrawler-Cluster
  • Detailed SmoothCrawler-Cluster usage information about functions, classes and methods in API References
  • I'm clear about what I need and want to customize something in SmoothCrawler-Cluster
  • Not sure how to use SmoothCrawler-Cluster and design your crawler cluster? Usage Guides could be a good guide for you
  • Curious about the details of SmoothCrawler-Cluster development? Development Documentation would be helpful to you
  • The Release Notes of SmoothCrawler-Cluster

Download

SmoothCrawler is still a young open source project that keeps growing.

