
A common package for crawling

Project description

A common package for crawler projects

This package contains base classes for implementing site-specific crawlers. It aims to make implementing specific crawlers easier by letting you plug what you need into a stable, unified crawling structure.

Class Structure

DatabaseMixin

crawler.mixins.database_mixin.DatabaseMixin

This class provides a database-related interface/mixin for its children to extend.

Summary of class implementation:

__init__ constructor:

  • initializes the parent class

  • calls setup_database, which:

    • loads database environment variables from the OS

    • initializes the database connector (instantiates a concrete DB connector that extends BaseDatabaseConnector, e.g. MongoDBConnector)

other database-related methods:

  • delegate to self.database_client (an instance of a class derived from BaseDatabaseConnector) for database work, as sketched below
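
As a sketch of how this might look in practice (the save_result method and the insert_one call below are illustrative assumptions; only the database_client attribute is documented here):

# Hypothetical sketch of extending DatabaseMixin. The constructor is
# documented to load DB env vars and instantiate a concrete connector;
# `save_result` and `insert_one` below are illustrative assumptions.
from crawler.mixins.database_mixin import DatabaseMixin


class ResultStore(DatabaseMixin):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # runs setup_database()

    def save_result(self, record: dict) -> None:
        # Delegate to the connector created in setup_database().
        self.database_client.insert_one(record)  # assumed connector method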

MessageConnectorMixin

crawler.mixins.message_connector_mixin.MessageConnectorMixin

This class provides a message-related interface/mixin for its children to extend.

Summary of class implementation:

__init__ constructor:

  • initializes the parent class

  • calls setup_connector, which:

    • loads message broker environment variables from the OS

    • initializes the message broker connector (instantiates a concrete message connector that extends BaseMessageBrokerConnector, e.g. RabbitMQConnector)

other message-related methods:

  • delegate to the connector's methods (an instance of a class derived from BaseMessageBrokerConnector) to send and receive messages, as sketched below
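
A similar sketch; note that this README does not name the attribute the broker connector is stored under, so message_connector and send below are purely assumptions:

# Hypothetical sketch of extending MessageConnectorMixin. The constructor
# is documented to load broker env vars and instantiate a concrete
# connector (e.g. RabbitMQConnector); the `message_connector` attribute
# and `send` method names are assumptions, not documented API.
from crawler.mixins.message_connector_mixin import MessageConnectorMixin


class JobNotifier(MessageConnectorMixin):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # runs setup_connector()

    def notify(self, payload: bytes) -> None:
        # Delegate to the broker connector created in setup_connector().
        self.message_connector.send(payload)  # assumed attribute and method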

BaseDatabaseConnector

crawler.connectors.databases.base.BaseDatabaseConnector

This base class provides the interface/mixin for a database connector. Instances of its subclasses should be used as a member of a Crawler/Pipeline to make database calls.
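
A sketch of what a concrete connector might look like, assuming a MongoDB backend; the insert_one hook name is an assumption about the base interface, while the pymongo calls themselves are standard:

# Hypothetical concrete connector. Only the base-class import path is
# documented; the hook name (`insert_one`) is an assumption.
from pymongo import MongoClient

from crawler.connectors.databases.base import BaseDatabaseConnector


class MyMongoConnector(BaseDatabaseConnector):
    def __init__(self, uri: str, db_name: str):
        self.client = MongoClient(uri)  # standard pymongo client
        self.db = self.client[db_name]

    def insert_one(self, collection: str, record: dict):
        # Write one document into the named collection.
        return self.db[collection].insert_one(record)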

BaseMessageBrokerConnector

crawler.connectors.message_brokers.base.BaseMessageBrokerConnector

This base class provides the interface/mixin for a message broker connector. Instances of its subclasses should be used as a member of a Crawler/Pipeline to send/receive messages.
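
A sketch of what a concrete broker connector might look like, assuming RabbitMQ via pika; the send_message hook name is an assumption about the base interface:

# Hypothetical concrete connector using pika for RabbitMQ. Only the
# base-class import path is documented; method names are assumptions.
import pika

from crawler.connectors.message_brokers.base import BaseMessageBrokerConnector


class MyRabbitMQConnector(BaseMessageBrokerConnector):
    def __init__(self, host: str, queue: str):
        self.queue = queue
        self.connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
        self.channel = self.connection.channel()
        self.channel.queue_declare(queue=queue)

    def send_message(self, body: bytes) -> None:
        # Publish to the default exchange, routed by queue name.
        self.channel.basic_publish(exchange='', routing_key=self.queue, body=body)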

BaseItem

crawler.items.base.BaseItem

This base class (which extends scrapy.Item) contains the base properties for an item. Extend this class for a more detailed item.
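
For example, a more detailed item might look like this (the ProductItem name and its fields are illustrative):

# Sketch of a more detailed item. BaseItem extends scrapy.Item, so extra
# fields are declared the standard Scrapy way; the fields are illustrative.
import scrapy

from crawler.items.base import BaseItem


class ProductItem(BaseItem):
    name = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()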

BasePipeline

crawler.pipelines.base.BasePipeline

This base class (which extends DatabaseMixin and MessageConnectorMixin) provides the database-related and message-related interfaces/mixins for a Scrapy pipeline.
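
A sketch of a pipeline built on this base; process_item is the standard Scrapy pipeline hook, while the insert_one call on the documented database_client is an assumed connector method:

# Hypothetical sketch of a pipeline built on BasePipeline. `process_item`
# is the standard Scrapy pipeline hook; the `insert_one` call on the
# documented `database_client` is an assumed connector method/signature.
from crawler.pipelines.base import BasePipeline


class ProductPipeline(BasePipeline):
    def process_item(self, item, spider):
        # Persist the item via the DatabaseMixin-provided connector.
        self.database_client.insert_one('products', dict(item))  # assumed signature
        return item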

BaseCrawler

crawler.base_crawler.BaseCrawler

This base class (which extends DatabaseMixin and MessageConnectorMixin) provides the database-related and message-related interfaces/mixins for a crawler (e.g. a crawler for a shopping site's main page, category page, or product page).

Calling crawler.run() on a subclass instance loads all environment variables and starts all steps of a crawling job.

Example:

site = WebsiteCrawler()
site.run()

BaseSpider

crawler.spiders.base.BaseSpider

This base class extends scrapy.spiders.Spider.

The parse(self, response, **kwargs) method inherited from scrapy.spiders.Spider is implemented with the following steps:

  • an items object is created with default values (set_default_value); this items object is what will eventually be yielded to the pipeline
  • get_css_selector is called to select elements from the response
  • loop through all elements returned by the CSS selector
    • each returned element is processed; reimplement is_element_valid if elements need to be filtered further
  • the response and the list of filtered elements are processed with process_response, which returns the final items object
  • the items object is yielded to the pipeline

Methods to inherit/override when implementing a new spider (see the sketch after this list):

def is_element_valid(self, element): filters a response element again to confirm it is valid

def process_response(self, response, items, response_elements): final work before yielding the items object to the pipeline

def get_new_item_instance(): returns a new instance of a BaseItem subclass

def set_default_value(self): sets the item's default values

def get_css_selector(self): returns the CSS selector used to extract HTML element(s) from the response
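
A sketch of a spider overriding these hooks; the import paths and method signatures come from this README, while the selector, fields, and ProductItem are illustrative assumptions:

# Sketch of a spider overriding the documented hooks. The import paths and
# method signatures come from this README; the selector, fields, and
# ProductItem are illustrative assumptions.
import scrapy

from crawler.items.base import BaseItem
from crawler.spiders.base import BaseSpider


class ProductItem(BaseItem):  # hypothetical item with illustrative fields
    title = scrapy.Field()
    url = scrapy.Field()


class ProductSpider(BaseSpider):
    name = 'products'

    def get_css_selector(self):
        # Selector parse() uses to pull candidate elements from the response.
        return 'div.product-card'

    def is_element_valid(self, element):
        # Keep only elements that actually carry a title (assumed rule).
        return bool(element.css('h2.title::text').get())

    def get_new_item_instance(self):
        return ProductItem()

    def set_default_value(self):
        # Populate the item's defaults before elements are processed
        # (the README names the hook; the mechanics here are assumed).
        pass

    def process_response(self, response, items, response_elements):
        # Final assembly before parse() yields the items to the pipeline.
        items['url'] = response.url  # illustrative field
        return items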

Run the demo (theamall.com implementation)

import os

from dotenv import load_dotenv

from crawler.sites.shopping.website_crawler import WebsiteCrawler

if __name__ == '__main__':
    load_dotenv(os.path.abspath('.env-base'))
    site = WebsiteCrawler()
    site.run()

Download files

Download the file for your platform.

Source Distribution

msscrawler-1.3.7.tar.gz (17.2 kB)


Built Distribution

msscrawler-1.3.7-py3-none-any.whl (24.0 kB)


File details

Details for the file msscrawler-1.3.7.tar.gz.

File metadata

  • Download URL: msscrawler-1.3.7.tar.gz
  • Size: 17.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for msscrawler-1.3.7.tar.gz

  • SHA256: 9374f85aa8eee74b5ad0334fb24e44cf5880cedde77a6ed4492d19f9aa0498fd
  • MD5: dab6cde2e323c51d29a07eaa546932f1
  • BLAKE2b-256: 018fd34ee7a3e8fe9a7efc7f6cd137e4293c4824481b68b443eae8afc871edf2


File details

Details for the file msscrawler-1.3.7-py3-none-any.whl.

File metadata

  • Download URL: msscrawler-1.3.7-py3-none-any.whl
  • Size: 24.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for msscrawler-1.3.7-py3-none-any.whl

  • SHA256: 832c59990b133ed1feb6b02fc559726342d1b70689baf587cdf0334d90f0bbe5
  • MD5: 88baa10820b1f325b8dc279be976f19f
  • BLAKE2b-256: 9a2efe02b5f78e8b56164d5a32fbd74579f2c0ce551c5f101bffd521e64a1c30

