
Build crawlers humanly, as different roles combined from different components.


SmoothCrawler


SmoothCrawler is a Python framework that makes building a crawler (also called a web spider) faster and easier. Its core design concept is SoC (Separation of Concerns): it builds crawlers humanly, as different roles combined from different components.

Overview | Quick Demo | Documentation | Code Example


Overview

Implementing a web crawler in Python is easy and simple, and there are already many frameworks and libraries for it. However, each of them focuses on one point; they all have their own responsibility and face different things:

  • For sending HTTP requests, you think of urllib3 or requests.
  • For parsing the HTTP response, BeautifulSoup (bs4).
  • For a framework that handles the whole flow, scrapy or selenium.

How about a library to build a crawler system?

Every crawler does mostly the same things and follows mostly the same procedure:

(figure: the common crawling procedure)
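In other words, the shared procedure looks roughly like this (a minimal sketch using plain requests and bs4, not the SmoothCrawler API):

from bs4 import BeautifulSoup
import requests

def crawl(url: str) -> str:
    # 1. Send the HTTP request
    response = requests.get(url)
    # 2. Check the HTTP status of the response
    response.raise_for_status()
    # 3. Parse the content out of the response
    bs = BeautifulSoup(response.text, "html.parser")
    title = bs.find_all("h1")[0].text
    # 4. Process (and possibly persist) the parsed data
    return "Parsed title: " + title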

In general, crawler code tends to be unstable and can even be difficult to write (e.g. parsing a complex HTML element's content). So you keep facing challenges while developing a web spider, to say nothing of maintaining the crawler program (for example, changes to web element locations will be your nightmare) or handling changing requirements.

SmoothCrawler works like LEGO blocks: it divides crawling into components. Every component has its own responsibility, and components can reuse one another when needed. One component focuses on one thing. Finally, the components are combined to form a crawler.

Quick Demo

Install smoothcrawler via pip:

pip install smoothcrawler

Let's write a simple crawler to crawl data.

  • Component 1: Send HTTP requests

Implemented with the Python package requests. Of course, it could be implemented with urllib3 as well; a urllib3 sketch follows the requests example below.

from smoothcrawler.components.httpio import HTTP
import requests

class FooHTTPRequest(HTTP):

    __Http_Response = None

    def get(self, url: str, *args, **kwargs):
        self.__Http_Response = requests.get(url)
        return self.__Http_Response
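
For instance, a minimal urllib3-based sketch might look like the following (the class name FooUrllib3HTTPRequest is only illustrative, not part of SmoothCrawler):

from smoothcrawler.components.httpio import HTTP
import urllib3

class FooUrllib3HTTPRequest(HTTP):

    __Http_Response = None

    def get(self, url: str, *args, **kwargs):
        # urllib3 returns its own HTTPResponse object, so a parser component
        # downstream would read .status and .data instead of requests'
        # .status_code and .text.
        self.__Http_Response = urllib3.PoolManager().request("GET", url)
        return self.__Http_Response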
  • Component 2: Get and parse HTTP response

Get the HTTP response object and parse the content data from it.

from smoothcrawler.components.data import BaseHTTPResponseParser
from bs4 import BeautifulSoup
import requests


class FooHTTPResponseParser(BaseHTTPResponseParser):

    def get_status_code(self, response: requests.Response) -> int:
        return response.status_code

    def handling_200_response(self, response: requests.Response) -> str:
        _bs = BeautifulSoup(response.text, "html.parser")
        _example_web_title = _bs.find_all("h1")
        return _example_web_title[0].text
  • Component 3: Handle data processing

Demonstrate that some data processing could be done here. (A variant that also persists the processed result follows the example below.)

from smoothcrawler.components.data import BaseDataHandler

class FooDataHandler(BaseDataHandler):

    def process(self, result):
        return "This is the example.com website header text: " + result
  • Product: Components combine to form a crawler

It now has 3 components: an HTTP sender, an HTTP response parser and a data processing handler. They can be combined into a crawler and crawl data from the target URL(s) via the crawler role SimpleCrawler.

from smoothcrawler.crawler import SimpleCrawler
from smoothcrawler.factory import CrawlerFactory

_cf = CrawlerFactory()
_cf.http_factory = FooHTTPRequest()
_cf.parser_factory = FooHTTPResponseParser()
_cf.data_handling_factory = FooDataHandler()

# Crawler Role: Simple Crawler
sc = SimpleCrawler(factory=_cf)
data = sc.run("GET", "http://www.example.com")
print(data)
# This is the example.com website header text: Example Domain
  • An even easier implementation in one object

You may think: come on, I just want to get some simple data easily, so I don't want to spread a simple implementation across many different objects. That's neither clear nor graceful.

Don't worry, you can also implement everything in one object that extends SimpleCrawler, like the following:

from smoothcrawler.crawler import SimpleCrawler
from bs4 import BeautifulSoup
import requests

class ExampleEasyCrawler(SimpleCrawler):

    def send_http_request(self, method: str, url: str, retry: int = 1, *args, **kwargs) -> requests.Response:
        _response = requests.get(url)
        return _response

    def parse_http_response(self, response: requests.Response) -> str:
        _bs = BeautifulSoup(response.text, "html.parser")
        _example_web_title = _bs.find_all("h1")
        return _example_web_title[0].text

    def data_process(self, parsed_response: str) -> str:
        return "This is the example.com website header text: " + parsed_response

Finally, you could instantiate and use it directly:

_example_easy_crawler = ExampleEasyCrawler()    # Instantiate your own crawler object
_example_result = _example_easy_crawler.run("get", "http://www.example.com")    # Run the web spider task with function *run* and get the result
print(_example_result)
# This is the example.com website header text: Example Domain

How easy the usage is, and how clear the code is!

Documentation

The documentation contains more details and examples.

Download

SmoothCrawler is still a young open source project that keeps growing. Here's its download state:




Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SmoothCrawler-0.2.0.tar.gz (31.5 kB)

Uploaded Source

Built Distribution

SmoothCrawler-0.2.0-py3-none-any.whl (39.9 kB)

Uploaded Python 3

File details

Details for the file SmoothCrawler-0.2.0.tar.gz.

File metadata

  • Download URL: SmoothCrawler-0.2.0.tar.gz
  • Upload date:
  • Size: 31.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.10.4

File hashes

Hashes for SmoothCrawler-0.2.0.tar.gz
Algorithm Hash digest
SHA256 daf47be5e76fe0b55a8f00dc78b8a536aa1d8bb034bc1d6751a513bb63998bd5
MD5 8b10a1b3c982fdee6b364e18917ac8ca
BLAKE2b-256 fea9a5cf896ae482b1c2c7071e7c42959c5a64c4080a785f1325eb50810ff31b

See more details on using hashes here.

File details

Details for the file SmoothCrawler-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for SmoothCrawler-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a31058984fc0d3cb5a3986717ab50c5e2eb4c9318b9b9065fe8442e388bafe10
MD5 25c394ba65501706ae9947eb5efe14a6
BLAKE2b-256 0a28651ee84c656349e49109a7b3f10e5f9f84875085f4da7c8b440ebe380522

See more details on using hashes here.
