
Build a crawler humanly, as different roles which can be combined with different components.

Project description

SmoothCrawler

[Badges: supported versions, release, PyPI version, license, Codacy, documentation status; CI build and coverage status for Linux/MacOS and Windows]

SmoothCrawler is a Python framework that makes it faster and easier to build crawlers (also called web spiders). The core concept of its implementation is SoC (Separation of Concerns). It lets you build a crawler humanly, as different roles which can be combined with different components.

Overview | Quick Demo | Documentation | Code Example


Overview

Implementing a web crawler in Python is easy and simple; there are already many frameworks and libraries for it. However, each of them focuses on a single concern, so each one has its own responsibility and faces a different part of the job:

  • For sending HTTP requests, you think of urllib3 or requests.
  • For parsing the HTTP response, BeautifulSoup (bs4).
  • For a full framework, scrapy or selenium.

How about a library to build a crawler system?

Every crawler mostly does the same things and follows the same procedure:

[Diagram: the common crawler procedure]

In general, crawler code tends to be unstable and sometimes difficult (e.g. parsing complex HTML element content). So you keep facing challenges while developing a web spider, to say nothing of maintaining the crawler program (for example, changes to web element locations will be your nightmare) or handling changing requirements.
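
For contrast, here is a minimal, illustrative sketch (not from the project; it just uses requests and BeautifulSoup against http://www.example.com) of what such a hand-rolled crawler typically looks like, with request sending, parsing and data processing tangled together in one function:

import requests
from bs4 import BeautifulSoup

def crawl_example_title(url: str) -> str:
    # Send the HTTP request.
    _response = requests.get(url)
    if _response.status_code != 200:
        raise RuntimeError("Unexpected status code: " + str(_response.status_code))
    # Parse the HTML content to get the header text.
    _bs = BeautifulSoup(_response.text, "html.parser")
    _title = _bs.find_all("h1")[0].text
    # Process the parsed data.
    return "This is the example.com website header text: " + _title

print(crawl_example_title("http://www.example.com"))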

SmoothCrawler is like LEGO blocks: it splits crawling into several components. Every component has its own responsibility, and components can reuse each other when needed. One component focuses on one thing. Finally, the components are combined to form a crawler.

Quick Demo

Install smoothcrawler via pip:

pip install smoothcrawler

Let's write a simple crawler to crawl data.

  • Component 1: Send HTTP requests

Implement it with the Python package requests. Of course, it could also be implemented with urllib3; a rough urllib3 sketch follows this example.

from smoothcrawler.components.httpio import HTTP
import requests

class FooHTTPRequest(HTTP):

    __Http_Response = None

    def get(self, url: str, *args, **kwargs):
        # Send the HTTP GET request with *requests* and keep the response object.
        self.__Http_Response = requests.get(url)
        return self.__Http_Response
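
If you prefer urllib3, the same component could look roughly like the following sketch (an illustrative variant, not an official example; note that the parser in the next component would then have to read urllib3's HTTPResponse object instead of requests.Response):

from smoothcrawler.components.httpio import HTTP
import urllib3

class FooUrllib3Request(HTTP):

    __Http_Response = None

    def get(self, url: str, *args, **kwargs):
        # Send the HTTP GET request with a urllib3 PoolManager and keep the response.
        _http = urllib3.PoolManager()
        self.__Http_Response = _http.request("GET", url)
        return self.__Http_Response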
  • Component 2: Get and parse HTTP response

Get the HTTP response object and parse the content data from it.

from smoothcrawler.components.data import BaseHTTPResponseParser
from bs4 import BeautifulSoup
import requests


class FooHTTPResponseParser(BaseHTTPResponseParser):

    def get_status_code(self, response: requests.Response) -> int:
        return response.status_code

    def handling_200_response(self, response: requests.Response) -> str:
        _bs = BeautifulSoup(response.text, "html.parser")
        _example_web_title = _bs.find_all("h1")
        return _example_web_title[0].text
  • Component 3: Handle data processing

Demonstrate that some data processing can be done here.

from smoothcrawler.components.data import BaseDataHandler

class FooDataHandler(BaseDataHandler):

    def process(self, result):
        return "This is the example.com website header text: " + result
  • Product: Components combine to form a crawler

Now there are 3 components: an HTTP sender, an HTTP response parser and a data processing handler. They can be combined into a crawler that crawls data from the target URL(s) via the crawler role SimpleCrawler; a small sketch for several URLs follows the example below.

from smoothcrawler.crawler import SimpleCrawler
from smoothcrawler.factory import CrawlerFactory

_cf = CrawlerFactory()
_cf.http_factory = FooHTTPRequest()
_cf.parser_factory = FooHTTPResponseParser()
_cf.data_handling_factory = FooDataHandler()

# Crawler Role: Simple Crawler
sc = SimpleCrawler(factory=_cf)
data = sc.run("GET", "http://www.example.com")
print(data)
# This is the example.com website header text: Example Domain
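
If there are several target URLs, a minimal sketch (assuming only what the example above shows, i.e. reusing the same SimpleCrawler instance and its run method; the second URL is purely hypothetical) could be:

# Reuse the same crawler instance for each (hypothetical) target URL.
_urls = ["http://www.example.com", "http://www.example.net"]
for _url in _urls:
    print(sc.run("GET", _url))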
  • An even easier implementation, in one object.

You may think: come on, I just want to get some simple data easily, so I don't want to split a simple implementation across many different objects; it's neither clear nor graceful.

Don't worry, you can also implement it all in one object which extends SimpleCrawler, like the following:

from smoothcrawler.crawler import SimpleCrawler
from bs4 import BeautifulSoup
import requests

class ExampleEasyCrawler(SimpleCrawler):

    def send_http_request(self, method: str, url: str, retry: int = 1, *args, **kwargs) -> requests.Response:
        _response = requests.get(url)
        return _response

    def parse_http_response(self, response: requests.Response) -> str:
        _bs = BeautifulSoup(response.text, "html.parser")
        _example_web_title = _bs.find_all("h1")
        return _example_web_title[0].text

    def data_process(self, parsed_response: str) -> str:
        return "This is the example.com website header text: " + parsed_response

Finally, you could instantiate and use it directly:

_example_easy_crawler = ExampleEasyCrawler()    # Instantiate your own crawler object
_example_result = _example_easy_crawler.run("get", "http://www.example.com")    # Run the web spider task with function *run* and get the result
print(_example_result)
# This is the example.com website header text: Example Domain

See how easy the usage is and how clear the code reads!

Documentation

The documentation contains more details and examples.

Download

SmoothCrawler is still a young open source project which keeps growing. Here is its download state:

[Badges: download counts]



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SmoothCrawler-0.2.0.tar.gz (31.5 kB)


Built Distribution

SmoothCrawler-0.2.0-py3-none-any.whl (39.9 kB)

