Build a crawler, in a human-friendly way, as different roles that are combined from different components.
Project description
SmoothCrawler
[CI build status and coverage badges for Linux/MacOS and Windows]
SmoothCrawler is a Python framework that makes it faster and easier to build a crawler (also called a web spider). The core concept of its implementation is SoC (Separation of Concerns). It lets you build a crawler, in a human-friendly way, as different roles that are combined from different components.
Overview | Quick Demo | Documentation | Code Example
Overview
Implementing a web crawler in Python is very easy and simple; there are already many frameworks and libraries for it. However, each of them focuses on a single concern, so each one has its own responsibility and covers a different part of the job:
- For sending HTTP requests, you think of urllib3 or requests.
- For parsing an HTTP response, BeautifulSoup (bs4).
- For a full framework, scrapy or selenium.
How about a library to build a crawler system?
Every crawler mostly does the same things in the same order: send the HTTP request, get and parse the HTTP response, and handle the parsed data.
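As a rough illustration (plain requests and bs4 here, not SmoothCrawler yet), that procedure usually boils down to something like this:

import requests
from bs4 import BeautifulSoup

def crawl(url: str) -> str:
    response = requests.get(url)                        # 1. send the HTTP request
    bs = BeautifulSoup(response.text, "html.parser")    # 2. parse the HTTP response
    header_text = bs.find_all("h1")[0].text             # 3. extract the target element
    return "header text: " + header_text                # 4. handle / format the data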
In general, crawler code tends to be unstable and can even be difficult to write (e.g. parsing complex HTML element content). So you keep facing challenges while developing a web spider, let alone while maintaining it (for example, changes in web element locations will be your nightmare) or adapting it to changed requirements.
smoothcrawler works like LEGO bricks: it splits crawling into several components. Every component has its own responsibility. Components can reuse one another when needed, and each one focuses on a single thing. Finally, the components are combined to form a crawler.
Quick Demo
Install smoothcrawler via pip:
pip install smoothcrawler
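The demo below also uses the requests and beautifulsoup4 (bs4) packages; if you don't already have them, install them as well:
pip install requests beautifulsoup4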
Let's write a simple crawler to crawl data.
- Component 1: Send HTTP requests
Implement it with the Python package requests. Of course, it could also be implemented with urllib3 (a sketch of that follows the requests version below).
from smoothcrawler.components.httpio import HTTP
import requests

class FooHTTPRequest(HTTP):

    __Http_Response = None

    def get(self, url: str, *args, **kwargs):
        self.__Http_Response = requests.get(url)
        return self.__Http_Response
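As noted above, the same component could be backed by urllib3 instead. Here is a minimal sketch of that idea; everything beyond the overridden get (the pool manager and the attribute names) is an assumption for illustration, not part of the official demo:

from smoothcrawler.components.httpio import HTTP
import urllib3

class FooUrllib3Request(HTTP):

    __Http_Response = None
    __Pool = urllib3.PoolManager()    # hypothetical shared connection pool for this sketch

    def get(self, url: str, *args, **kwargs):
        # urllib3 returns an HTTPResponse (status in .status, body in .data),
        # so a parser component would have to read those instead of requests' attributes.
        self.__Http_Response = self.__Pool.request("GET", url)
        return self.__Http_Response

Keep in mind that the parser component below expects a requests.Response, so if you swap in urllib3 here, adjust the parser to match.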
- Component 2: Get and parse HTTP response
Get the HTTP response object and parse the content data from it.
from smoothcrawler.components.data import BaseHTTPResponseParser
from bs4 import BeautifulSoup
import requests

class FooHTTPResponseParser(BaseHTTPResponseParser):

    def get_status_code(self, response: requests.Response) -> int:
        return response.status_code

    def handling_200_response(self, response: requests.Response) -> str:
        _bs = BeautifulSoup(response.text, "html.parser")
        _example_web_title = _bs.find_all("h1")
        return _example_web_title[0].text
- Component 3: Handle data processing
This shows where you could do some data processing.
from smoothcrawler.components.data import BaseDataHandler

class FooDataHandler(BaseDataHandler):

    def process(self, result):
        return "This is the example.com website header text: " + result
- Product: Components combine to form a crawler
Now there are 3 components: an HTTP sender, an HTTP response parser and a data processing handler. They can be combined into a crawler which crawls data from the target URL(s) via the crawler role SimpleCrawler.
from smoothcrawler.crawler import SimpleCrawler
from smoothcrawler.factory import CrawlerFactory
_cf = CrawlerFactory()
_cf.http_factory = FooHTTPRequest()
_cf.parser_factory = FooHTTPResponseParser()
_cf.data_handling_factory = FooDataHandler()
# Crawler Role: Simple Crawler
sc = SimpleCrawler(factory=_cf)
data = sc.run("GET", "http://www.example.com")
print(data)
# This is the example.com website header text: Example Domain
- An even easier implementation in one object
You may think: come on, I just want to get some simple data easily, so I don't want to scatter a simple implementation across many different objects. That's neither clear nor graceful.
Don't worry, you can also implement everything in one object that extends SimpleCrawler, like the following:
from smoothcrawler.crawler import SimpleCrawler
from bs4 import BeautifulSoup
import requests

class ExampleEasyCrawler(SimpleCrawler):

    def send_http_request(self, method: str, url: str, retry: int = 1, *args, **kwargs) -> requests.Response:
        _response = requests.get(url)
        return _response

    def parse_http_response(self, response: requests.Response) -> str:
        _bs = BeautifulSoup(response.text, "html.parser")
        _example_web_title = _bs.find_all("h1")
        return _example_web_title[0].text

    def data_process(self, parsed_response: str) -> str:
        return "This is the example.com website header text: " + parsed_response
Finally, you could instantiate and use it directly:
_example_easy_crawler = ExampleEasyCrawler() # Instantiate your own crawler object
_example_result = _example_easy_crawler.run("get", "http://www.example.com") # Run the web spider task with function *run* and get the result
print(_example_result)
# This is the example.com website header text: Example Domain
See how easy the usage is and how clear the code is!
Documentation
The documentation contains more details and examples.
- Quick Start: develop a web spider with SmoothCrawler
Download
SmoothCrawler is still a young open source project that keeps growing. Here's its download state: [download statistics badge]