
A Task Queue Scheduler Framework.

Project description


Welcome to pytq Documentation

pytq (Python Task Queue) is a task scheduler library.

Problem we solve:

  1. You have N tasks to do.

  2. Each task has input_data, and processing it produces output_data.

pytq provides these features out of the box (and they are all customizable):

  1. Save output_data to a data-persistence system.

  2. Filter out duplicate input data.

  3. A built-in multi-thread processor speeds up processing.

  4. A nice built-in logging system.

  5. It is easy to define how you:
    • process your input_data

    • integrate with your data persistence system

    • filter duplicate input_data

    • retrieve output_data
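
To make the feature list concrete, here is a minimal standard-library sketch of the same idea: skip inputs that were already processed, run the rest on a thread pool, and persist the outputs. It does not use pytq and is not how pytq is implemented; `run_tasks`, the `results` table, and all parameter names are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def run_tasks(conn, inputs, process, key, workers=4):
    """Process each new input once, store its output, return the count processed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (id TEXT PRIMARY KEY, output TEXT)"
    )
    done = {row[0] for row in conn.execute("SELECT id FROM results")}

    # Drop inputs that are already stored, and duplicates within this batch.
    pending = {}
    for item in inputs:
        k = key(item)
        if k not in done and k not in pending:
            pending[k] = item

    # Process the remaining inputs on a thread pool (results keep input order).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(process, pending.values()))

    # Persist from the calling thread only, so the connection is never shared.
    conn.executemany("INSERT INTO results VALUES (?, ?)", zip(pending, outputs))
    conn.commit()
    return len(pending)
```

Calling `run_tasks` a second time with overlapping inputs only processes the ones that are not yet in the table.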

Example

Suppose you have some URLs to crawl. You don’t want to re-crawl URLs you have already crawled successfully, and you want to save the crawled data in a database.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
This script implements a thread-safe, SQLite-backed task queue scheduler.
"""

from pytq import SqliteDictScheduler


# Define your input_data model
class UrlRequest(object):
    def __init__(self, url, context_data=None):
        self.url = url  # the URL to crawl
        self.context_data = context_data  # optional context data to use while processing


class Scheduler(SqliteDictScheduler):
    # (Required) define how to process your data
    def user_process(self, input_data):
        # you need to implement get_html_from_url yourself
        html = get_html_from_url(input_data.url)

        # you need to implement parse_html yourself
        output_data = parse_html(html)
        return output_data

s = Scheduler(user_db_path="example.sqlite")

# let's have some urls
input_data_queue = [
    UrlRequest(url="https://pypi.python.org/pypi/pytq"),
    UrlRequest(url="https://pypi.python.org/pypi/crawlib"),
    UrlRequest(url="https://pypi.python.org/pypi/loggerFactory"),
]

# process the queue with multiple threads
s.do(input_data_queue, multiprocess=True)

# print output
for task_id, output_data in s.items():
    ...
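
The example above leaves `get_html_from_url` and `parse_html` to the reader. One possible way to fill them in, using only the Python 3 standard library, is sketched below. This is an assumption and not part of pytq: the choice to extract only the page title, the `_TitleParser` class, and the `timeout` parameter are illustrative.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def get_html_from_url(url, timeout=10):
    """Download a page and decode it as text (illustrative helper)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class _TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_html(html):
    """Return a small dict as the output_data (here, only the page title)."""
    parser = _TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip()}
```

A real crawler would likely add retries, error handling, and a richer parser; the point here is only that `user_process` receives one input and returns one output.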

Customize:

class Scheduler(SqliteDictScheduler):
    # (Optional) define the identifier of input_data (used to detect duplicates)
    def user_hash_input(self, input_data):
        return input_data.url

    # (Optional) define how to save output_data to the database
    # Here we just use the default one
    def user_post_process(self, task):
        self._default_post_process(task)

    # (Optional) define how to skip URLs that were already crawled
    # Here we just use the default one
    def user_is_duplicate(self, task):
        return self._default_is_duplicate(task)
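
Because the identifier returned by `user_hash_input` decides what counts as a duplicate, it can help to normalize URLs first so that trivially different spellings map to the same identifier. The helper below is a hypothetical sketch using `urllib.parse` (Python 3); `normalize_url` is not part of pytq, and which differences to ignore is a choice that depends on the site being crawled.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )
```

Under that assumption, `user_hash_input` could return `normalize_url(input_data.url)` instead of the raw URL.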

TODO: more examples are coming.

Install

pytq is released on PyPI, so all you need is:

$ pip install pytq

To upgrade to the latest version:

$ pip install --upgrade pytq


Download files


Source Distribution

pytq-0.0.7.tar.gz (46.6 kB view details)

Uploaded Source

Built Distribution

pytq-0.0.7-py2-none-any.whl (67.2 kB view details)

Uploaded Python 2

File details

Details for the file pytq-0.0.7.tar.gz.

File metadata

  • Download URL: pytq-0.0.7.tar.gz
  • Upload date:
  • Size: 46.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for pytq-0.0.7.tar.gz
Algorithm Hash digest
SHA256 a8595a8e60e9f1f85f3526f3cdeccb7fe94c753f998af0016c4dc87ccd0f1452
MD5 46f614fb8552ac2d46930f33ae2013e9
BLAKE2b-256 300819fe1ea16ab2182cf6538f6fc7b177b80c8069b3dc35dc132fab5d22d5df
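
A downloaded file can be checked against the SHA256 digest listed above with a few lines of standard-library Python. This is a generic sketch; `sha256_of_file` and its `chunk_size` parameter are illustrative names, and the local file path is whatever you saved the download as.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The download is intact if `sha256_of_file("pytq-0.0.7.tar.gz")` equals the SHA256 value shown for that file.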


File details

Details for the file pytq-0.0.7-py2-none-any.whl.

File hashes

Hashes for pytq-0.0.7-py2-none-any.whl
Algorithm Hash digest
SHA256 5b8fb61a6f78438868ae3e4b9eef2f4df97ad7b15c3ff727332fcf6ed19d5d0d
MD5 b27bc5014c8da24b2c7042f4963876c5
BLAKE2b-256 cd7e1a6a6027b80d5b41c8982eef03c0b1f975ed5dcb03cafc3cca391a228825

