A Task Queue Scheduler Framework.
Project description
Welcome to pytq Documentation
pytq (Python Task Queue) is a task scheduler library.
Problem we solve:
You have N task to do.
each task has input_data, and after been processed, we got output_data.
pytq provide these feature out-of-the-box (And it’s all customizable).
Save output_data to data-persistence system.
Filter out duplicate input data.
Built-in Multi thread processor boost the speed.
Nice built-in log system.
- And its easy to define how you gonna:
process your input_data
integrate with your data persistence system
filter duplicates input_data
retrive output_data
Example
Suppose you have some url to crawl, and you don’t want to crawl those url you successfully crawled, and also you want to save your crawled data in database.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
This script implement multi-thread safe, a sqlite backed task queue scheduler.
"""
from pytq import SqliteDictScheduler
# Define your input_data model
class UrlRequest(object):
def __init__(self, url, context_data=None):
self.url = url # your have url to crawl
self.context_data = context_data # and maybe some context data to use
class Scheduler(SqliteDictScheduler):
# (Required) define how you gonna process your data
def user_process(self, input_data):
# you need to implement get_html_from_url yourself
html = get_html_from_url(input_data.url)
# you need to implement parse_html yourself
output_data = parse_html(html)
return output_data
s = Scheduler(user_db_path="example.sqlite")
# let's have some urls
input_data_queue = [
UrlRequest(url="https://pypi.python.org/pypi/pytq"),
UrlRequest(url="https://pypi.python.org/pypi/crawlib"),
UrlRequest(url="https://pypi.python.org/pypi/loggerFactory"),
]
# execute multi thread process
s.do(input_data_queue, multiprocess=True)
# print output
for id, outpupt_data in s.items():
...
Customize:
class Scheduler(SqliteDictScheduler):
# (Optional) define the identifier of input_data (for duplicate)
def user_hash_input(self, input_data):
return input_data.url
# (Optional) define how do you save output_data to database
# Here we just use the default one
def user_post_process(self, task):
self._default_post_process(task)
# (Optional) define how do you skip crawled url
# Here we just use the default one
def user_is_duplicate(self, task):
return self._default_is_duplicate(task)
TODO: more example is coming.
Quick Links
Install
pytq is released on PyPI, so all you need is:
$ pip install pytq
To upgrade to latest version:
$ pip install --upgrade pytq
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pytq-0.0.5.tar.gz
.
File metadata
- Download URL: pytq-0.0.5.tar.gz
- Upload date:
- Size: 45.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 92d02f894d289a647f8d2ca5f3d9585e36326b7e91301be9b52238773e30bdef |
|
MD5 | 7b0a72b7b450d360606e155707cfce2f |
|
BLAKE2b-256 | bf89d420b90cb0f41f558f721eec810bea7a84a017e14fc5da246245d13873a5 |
File details
Details for the file pytq-0.0.5-py2-none-any.whl
.
File metadata
- Download URL: pytq-0.0.5-py2-none-any.whl
- Upload date:
- Size: 65.9 kB
- Tags: Python 2
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | d1c48a4cf2d7d7541cb45cf316850a8f55cbc0e258eb20777dc8ca14a287fa2d |
|
MD5 | 520210b35bfa19c8c31fa68bb0bd9f14 |
|
BLAKE2b-256 | 206bff91a870dea001d575f92b2333c40f4d75988ebf7a994414f09110a68cd7 |