
Build a crawler humanly, as different roles which can be combined with different components.

Project description

SmoothCrawler

[Badges: supported versions, release, PyPI version, license, Codacy, documentation status; CI build and coverage status for Linux/MacOS and Windows]

SmoothCrawler is a Python framework that makes it faster and easier to build crawlers (also called web spiders). The core concept of its implementation is SoC (Separation of Concerns). It lets you build a crawler humanly, as different roles which can be combined with different components.

Overview | Quick Demo | Documentation | Code Example


Overview

Implementing a web crawler in Python is easy and simple; there are already many frameworks and libraries for it. However, each of them focuses on a single concern, so each one has its own responsibility and faces a different part of the job:

  • For sending HTTP requests, you think of urllib3 or requests.
  • For parsing the HTTP response, BeautifulSoup (bs4).
  • For a full framework, scrapy or selenium.

How about a library to build a crawler system?

Every crawler mostly does the same things and follows the same procedure:

[Diagram: the common crawler procedure]

In general, crawler code tends to be unstable and sometimes difficult (e.g. parsing complex HTML element content). So you keep facing challenges while developing a web spider, to say nothing of maintaining the crawler program (for example, changes to web element locations will be your nightmare) or handling changing requirements.
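
For contrast, here is a minimal, illustrative sketch (not from the project; it just uses requests and BeautifulSoup against http://www.example.com) of what such a hand-rolled crawler typically looks like, with request sending, parsing and data processing tangled together in one function:

import requests
from bs4 import BeautifulSoup

def crawl_example_title(url: str) -> str:
    # Send the HTTP request.
    _response = requests.get(url)
    if _response.status_code != 200:
        raise RuntimeError("Unexpected status code: " + str(_response.status_code))
    # Parse the HTML content to get the header text.
    _bs = BeautifulSoup(_response.text, "html.parser")
    _title = _bs.find_all("h1")[0].text
    # Process the parsed data.
    return "This is the example.com website header text: " + _title

print(crawl_example_title("http://www.example.com"))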

SmoothCrawler is like LEGO blocks: it splits crawling into several components. Every component has its own responsibility, and components can reuse each other when needed. One component focuses on one thing. Finally, the components are combined to form a crawler.

Quick Demo

Install smoothcrawler via pip:

pip install smoothcrawler

Let's write a simple crawler to crawl data.

  • Component 1: Send HTTP requests

Implement it with the Python package requests. Of course, it could also be implemented with urllib3; a rough urllib3 sketch follows this example.

from smoothcrawler.components.httpio import HTTP
import requests

class FooHTTPRequest(HTTP):

    __Http_Response = None

    def get(self, url: str, *args, **kwargs):
        # Send the HTTP GET request with *requests* and keep the response object.
        self.__Http_Response = requests.get(url)
        return self.__Http_Response
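
If you prefer urllib3, the same component could look roughly like the following sketch (an illustrative variant, not an official example; note that the parser in the next component would then have to read urllib3's HTTPResponse object instead of requests.Response):

from smoothcrawler.components.httpio import HTTP
import urllib3

class FooUrllib3Request(HTTP):

    __Http_Response = None

    def get(self, url: str, *args, **kwargs):
        # Send the HTTP GET request with a urllib3 PoolManager and keep the response.
        _http = urllib3.PoolManager()
        self.__Http_Response = _http.request("GET", url)
        return self.__Http_Response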
  • Component 2: Get and parse HTTP response

Get the HTTP response object and parse the content data from it.

from smoothcrawler.components.data import BaseHTTPResponseParser
from bs4 import BeautifulSoup
import requests


class FooHTTPResponseParser(BaseHTTPResponseParser):

    def get_status_code(self, response: requests.Response) -> int:
        return response.status_code

    def handling_200_response(self, response: requests.Response) -> str:
        _bs = BeautifulSoup(response.text, "html.parser")
        _example_web_title = _bs.find_all("h1")
        return _example_web_title[0].text
  • Component 3: Handle data processing

Demonstrate that some data processing can be done here.

from smoothcrawler.components.data import BaseDataHandler

class FooDataHandler(BaseDataHandler):

    def process(self, result):
        return "This is the example.com website header text: " + result
  • Product: Components combine to form a crawler

Now there are 3 components: an HTTP sender, an HTTP response parser and a data processing handler. They can be combined into a crawler that crawls data from the target URL(s) via the crawler role SimpleCrawler; a small sketch for several URLs follows the example below.

from smoothcrawler.crawler import SimpleCrawler
from smoothcrawler.factory import CrawlerFactory

_cf = CrawlerFactory()
_cf.http_factory = FooHTTPRequest()
_cf.parser_factory = FooHTTPResponseParser()
_cf.data_handling_factory = FooDataHandler()

# Crawler Role: Simple Crawler
sc = SimpleCrawler(factory=_cf)
data = sc.run("GET", "http://www.example.com")
print(data)
# This is the example.com website header text: Example Domain
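
If there are several target URLs, a minimal sketch (assuming only what the example above shows, i.e. reusing the same SimpleCrawler instance and its run method; the second URL is purely hypothetical) could be:

# Reuse the same crawler instance for each (hypothetical) target URL.
_urls = ["http://www.example.com", "http://www.example.net"]
for _url in _urls:
    print(sc.run("GET", _url))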
  • An even easier implementation, in one object.

You may think: come on, I just want to get some simple data easily, so I don't want to split a simple implementation across many different objects; it's neither clear nor graceful.

Don't worry, you can also implement it all in one object which extends SimpleCrawler, like the following:

from smoothcrawler.crawler import SimpleCrawler
from bs4 import BeautifulSoup
import requests

class ExampleEasyCrawler(SimpleCrawler):

    def send_http_request(self, method: str, url: str, retry: int = 1, *args, **kwargs) -> requests.Response:
        _response = requests.get(url)
        return _response

    def parse_http_response(self, response: requests.Response) -> str:
        _bs = BeautifulSoup(response.text, "html.parser")
        _example_web_title = _bs.find_all("h1")
        return _example_web_title[0].text

    def data_process(self, parsed_response: str) -> str:
        return "This is the example.com website header text: " + parsed_response

Finally, you could instantiate and use it directly:

_example_easy_crawler = ExampleEasyCrawler()    # Instantiate your own crawler object
_example_result = _example_easy_crawler.run("get", "http://www.example.com")    # Run the web spider task with function *run* and get the result
print(_example_result)
# This is the example.com website header text: Example Domain

See how easy the usage is and how clear the code reads!

Documentation

The documentation contains more details and examples.

Download

SmoothCrawler is still a young open source project which keeps growing. Here is its download state:

[Badges: download counts]



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SmoothCrawler-0.2.0.tar.gz (31.5 kB)


Built Distribution

SmoothCrawler-0.2.0-py3-none-any.whl (39.9 kB)

