pifetcher

A scalable headless data-fetching library written in Python that uses a message queue service to parse web data quickly and easily in a distributed way.

dependencies:

  • selenium
  • BeautifulSoup4
  • boto3 (optional, but used by default)
  • ChromeDriver for Chrome 76 (by default)
  • Chrome executable v76 (by default)

how to use:

  1. set up the work queue component on the host computer (AWS Simple Queue Service by default), including credentials and region; see the AWS boto3 initial setup docs

  2. configure a fetcher by creating a field mapping config file. For example, a mapping config file for fetching amazon.com item pricing data:

{
    "price": {
        "type": "text",
        "selector": "#priceblock_ourprice",
        "attribute":".text"
    },
    "id": {
        "type": "text",
        "selector": "#ASIN",
        "attribute": "value"
    },
    "title": {
        "type": "text",
        "selector": "#productTitle",
        "attribute":".text"
    }
}
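Before handing a mapping file to a fetcher, it can help to check that every field carries the three keys used in the example above. The following is a minimal sketch and not part of pifetcher: `validate_mapping` is a hypothetical helper, and treating all three keys as required is an assumption based only on the sample file.

```python
import json

# Keys that every field in the sample mapping above defines.
# Treating all three as required is an assumption, not documented pifetcher behavior.
REQUIRED_KEYS = {"type", "selector", "attribute"}


def validate_mapping(mapping):
    """Return a list of human-readable problems found in a field mapping dict."""
    problems = []
    for field, spec in mapping.items():
        if not isinstance(spec, dict):
            problems.append(f"{field}: expected an object, got {type(spec).__name__}")
            continue
        missing = REQUIRED_KEYS - set(spec)
        if missing:
            problems.append(f"{field}: missing {', '.join(sorted(missing))}")
    return problems


sample = json.loads("""
{
    "price": {"type": "text", "selector": "#priceblock_ourprice", "attribute": ".text"},
    "id": {"type": "text", "selector": "#ASIN"}
}
""")

for problem in validate_mapping(sample):
    print(problem)
```

An empty list means every field in the file has the expected shape.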
  3. create a pifetcherConfig.json file, and add the fetcher mapping file created in the previous step to fetcher -> mappingConfigs with its name and file path
{
    "browser":{
        "browser_options":["--window-size=1920,1080", "--disable-extensions", "--proxy-server='direct://'", "--proxy-bypass-list=*", "--start-maximized","--ignore-certificate-errors", "--headless"],
        "win-driver_path":"chromedriver-win-76.exe",
        "win-binary_location": "",
        "mac-driver_path":"chromedriver-mac-76",
        "mac-binary_location": ""

    },
    "queue":
    {
        "num_works_per_time": 1,
        "queue_type":"AWSSimpleQueueService",
        "queue_name":"datafetch.fifo"
    },
    "logger":
    {
        "output":"console"
    },
    "fetcher":
    {
        "mappingConfigs":{
            "amazon":"amazon.json"
        }

    }
}
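The browser section stores one driver path per operating system under prefixed keys (`win-driver_path`, `mac-driver_path`). As an illustration only, the sketch below shows one way such keys could be resolved for the current platform. `driver_path_for` is a hypothetical helper, and the prefix-selection logic is an assumption about how the config is meant to be read, not pifetcher's actual code.

```python
import platform

# Map platform.system() values to the key prefixes used in the sample config.
# Only the "win" and "mac" prefixes appear in the config above; nothing else is assumed.
PREFIXES = {"Windows": "win", "Darwin": "mac"}


def driver_path_for(browser_config, system=None):
    """Return the driver path configured for the given (or current) OS."""
    system = system or platform.system()
    prefix = PREFIXES.get(system)
    if prefix is None:
        raise ValueError(f"no driver configured for platform {system!r}")
    return browser_config[f"{prefix}-driver_path"]


browser = {
    "win-driver_path": "chromedriver-win-76.exe",
    "mac-driver_path": "chromedriver-mac-76",
}
print(driver_path_for(browser, system="Darwin"))
```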
  4. to use the fetcher worker
  • import the fetcher worker class and config class
from pifetcher.core import Config
from pifetcher.core import FetchWorker
  • load pifetcherConfig.json into the Config class
Config.use('pifetcherConfig.json')
  • implement the event functions with your own logic:
      • on_save_result: called when a data object has been successfully parsed
      • on_empty_result_error: called after parsing an empty object; you may want to stop or pause the process to investigate the problem before continuing
      • on_start_process_signal: called when the worker receives a start process signal; you can add fetching tasks to the queue here

example:

class TestWorker(FetchWorker):
    def on_save_result(self, results):
        print(results)
    def on_empty_result_error(self):
        self.stop()
    def on_start_process_signal(self):
        work = {}
        work['url'] = 'a amazon url'
        work['fetcher_name'] = 'amazon'
        self.add_works([work])
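In practice `on_start_process_signal` will usually enqueue more than one URL, and a small helper that builds the work items keeps that method short. This is a sketch: `build_works` is a hypothetical name, and the only details taken from the example above are the `url` and `fetcher_name` keys and the fact that `add_works` receives a list of such dicts.

```python
def build_works(urls, fetcher_name):
    """Build one work item per URL, skipping blanks and duplicates."""
    seen = set()
    works = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        works.append({"url": url, "fetcher_name": fetcher_name})
    return works


works = build_works(
    ["https://example.com/item/1", "https://example.com/item/1", " "],
    "amazon",
)
print(works)
```

Inside a worker, the result could then be passed on with `self.add_works(build_works(urls, 'amazon'))`.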
  5. Run the worker and send a StartProcess signal to the queue to start the process
  • start the worker to receive and process works
tw = TestWorker()
tw.do_works()
  • to send a start signal to the queue
import json
import time
import boto3
# FIFO queues need a message group ID; the timestamp serves as the deduplication ID
sqs = boto3.resource('sqs')
queue = sqs.get_queue_by_name(QueueName='datafetch.fifo')
content = {"type": "StartProcess", "content": {}}
queue.send_message(MessageBody=json.dumps(content), MessageGroupId="FetchWork", MessageDeduplicationId=str(time.time()))
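Using `time.time()` as the deduplication ID works, but two signals sent in very quick succession could end up with the same ID. A possible variation, sketched below, builds the `send_message` arguments in one place and uses a random UUID instead. `build_start_signal` is a hypothetical helper; the message shape and the `FetchWork` group ID are taken from the snippet above.

```python
import json
import uuid


def build_start_signal(group_id="FetchWork"):
    """Return keyword arguments for queue.send_message() that start the process."""
    content = {"type": "StartProcess", "content": {}}
    return {
        "MessageBody": json.dumps(content),
        "MessageGroupId": group_id,
        # A random UUID avoids collisions between signals sent close together.
        "MessageDeduplicationId": str(uuid.uuid4()),
    }


kwargs = build_start_signal()
print(kwargs["MessageBody"])
# With a boto3 queue object as in the snippet above: queue.send_message(**kwargs)
```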

To do list items:

  • simplify initial setup process

Completed items

  • put all constants in the config file (checked)
  • complete the type conversions for different data types (checked)
  • add message type (work initiation message type) (checked)
  • logging (checked)
  • data fetching using AWS SQS
