QSpider

An easy-to-use utility module for writing multi-threaded and multi-process programs.

Install

QSpider can be installed with pip:

$ pip install qspider

Usage

Using the module

# 1. Import the QSpider and Task classes from the qspider module,
#   along with any other modules you need.
from qspider import QSpider, Task
import requests

# 2. Define the task source list.
#   Each element of this list is called a 'task_source'.
#   A 'task_source' can be of any type, e.g. str, tuple, dict, or an
#   arbitrary object such as a requests.Session.
source = ['https://www.baidu.com' for _ in range(100)]

# 3. Create your own task by subclassing Task.
class TestTask(Task):
    """A test task.

    Attributes:
        task_source: the input for this task, i.e. one element
          of the source list.
    """
    def __init__(self, task_source):
        Task.__init__(self, task_source)

    def run(self):
        # Process self.task_source here.
        res = requests.get(self.task_source, timeout=3)
        # Return whatever value you need as this task's result.
        return res.status_code

# 4. Create the QSpider and run it.
test_spider = QSpider(source, TestTask, has_result=True)
results = test_spider.run()
print(results)

Run the script and you'll get:

[Info] 100 tasks in total.
[Input] Number of threads: 20
[  ] 100% |███████████████████████████████████| 100/100 [eta-0:00:00, 2.5s, 40.8it/s]
[200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, ... , 200]
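
To make the pattern above more concrete, here is a minimal sketch of the same idea built only on the standard library: each element of the source list is wrapped in a task object, the tasks run on a thread pool, and the return values of run() are collected. This is not QSpider's implementation. The Task class below is a hypothetical stand-in for qspider.Task, and run_tasks, LengthTask and num_threads are names invented for this illustration. The sketch returns results in source order, which is a property of this sketch and not a documented guarantee of QSpider.

```python
from concurrent.futures import ThreadPoolExecutor


class Task:
    """Hypothetical stand-in for qspider.Task: stores one source element."""

    def __init__(self, task_source):
        self.task_source = task_source

    def run(self):
        raise NotImplementedError


class LengthTask(Task):
    """Toy task that avoids network access: returns the length of its source."""

    def run(self):
        return len(self.task_source)


def run_tasks(source, task_cls, num_threads=4):
    """Wrap each source element in task_cls and run the tasks on a thread pool.

    Results are returned in the same order as the source list.
    """
    tasks = [task_cls(item) for item in source]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda task: task.run(), tasks))


if __name__ == "__main__":
    source = ["https://www.baidu.com" for _ in range(5)]
    print(run_tasks(source, LengthTask))
```

With QSpider itself, the module presumably takes care of the pool, the progress bar and the thread-count prompt shown in the output above, so you only write the Task subclass.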

Using the command line

Generate a QSpider script with the genqspider command:

$ genqspider -h
usage: Generate your qspider based on templates [-h] [-p] name

positional arguments:
  name           Your spider name

optional arguments:
  -h, --help     show this help message and exit
  -p, --process  Using multi-process instead of multi-thread template
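
For readers unfamiliar with how such a command line is interpreted, the following sketch uses argparse to define a parser with the same positional argument and flag as the help text above. It is an assumption-based illustration, not the source of genqspider. The functions build_parser and template_kind are hypothetical names, and the "thread" and "process" labels are invented here to show which template the -p flag would select.

```python
import argparse


def build_parser():
    """Build a parser resembling the genqspider help text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="genqspider",
        description="Generate your qspider based on templates",
    )
    parser.add_argument("name", help="Your spider name")
    parser.add_argument(
        "-p",
        "--process",
        action="store_true",
        help="Using multi-process instead of multi-thread template",
    )
    return parser


def template_kind(argv):
    """Return (spider name, template label) for a list of command-line arguments."""
    args = build_parser().parse_args(argv)
    # Without -p the multi-thread template is the default.
    return args.name, ("process" if args.process else "thread")


if __name__ == "__main__":
    print(template_kind(["test"]))
    print(template_kind(["-p", "test"]))
```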

Example

  1. Create a test crawler using QSpider.

    $ genqspider test
    A qspider named test is initialized.
    

    A Python script named test.py is created in your current directory.

  2. Open test.py and you'll see:

    # -*- coding: utf-8 -*-
    
    from qspider import ThreadManager, Task
    
    class TestSpider(ThreadManager):
        def __init__(self, has_result=False, add_failed=True):
            self.name = "test"
            self.has_result = has_result
            self.add_failed = add_failed
            self.source = [0]  # define your source list
            super(TestSpider, self).__init__(self.source, self.QTask, has_result=self.has_result, add_failed=self.add_failed)
    
        class QTask(Task):
            def __init__(self, task_source):
                Task.__init__(self, task_source)
    
            def run(self):
                # parse single task source
                pass
    
    if __name__=="__main__":
        qspider = TestSpider()
        qspider.test()
        # qspider.run()
    
  3. Replace the placeholder self.source = [0] with your own source list, and implement the processing of each task_source in the QTask.run method.

    # -*- coding: utf-8 -*-
    import requests
    from qspider.core import QSpider, Task
    
    class TestSpider(QSpider):
        def __init__(self, has_result=False, add_failed=True):
            self.name = "test"
            self.has_result = has_result
            self.add_failed = add_failed
            # 1. define your source list
            self.source = ['https://www.baidu.com' for i in range(100)]  
            super(TestSpider, self).__init__(self.source, self.QTask, has_result=self.has_result, add_failed=self.add_failed)
    
        class QTask(Task):
            def __init__(self, task_source):
                Task.__init__(self, task_source)
    
            # 2. Modify the run method
            def run(self):
                # process the self.task_source here.
                res = requests.get(self.task_source, timeout=3)
                # return values needed
                return res.status_code
    
    if __name__=="__main__":
        # 3. Set 'has_result' to True when QTask.run returns a value.
        qspider = TestSpider(has_result=True)
        # 4. Collect the results after running the qspider.
        results = qspider.run()
        print(results)
    
  4. Run the script and you'll get:

    [Info] 100 tasks in total.
    [Input] Number of threads: 20
    [  ] 100% |███████████████████████████████████| 100/100 [eta-0:00:00, 2.5s, 40.8it/s]
    [200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, ... , 200]
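
Since the returned results form a plain list of status codes in this example, they can be summarized with the standard library. The sketch below counts how often each code occurs. The function summarize_status_codes and the sample data are hypothetical and stand in for the list returned by qspider.run().

```python
from collections import Counter


def summarize_status_codes(results):
    """Count how many times each status code appears in the results list."""
    return Counter(results)


if __name__ == "__main__":
    # Sample data standing in for the list returned by qspider.run().
    sample_results = [200, 200, 404, 200, 500]
    for code, count in sorted(summarize_status_codes(sample_results).items()):
        print(f"{code}: {count}")
```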
    

Releases

  • v0.1.1: First release with basic classes.
  • v0.1.2: Refactored the code; added ThreadManager, ProcessManager, and other tool classes.

License

Copyright (c) 2020 tishacy.

Licensed under the MIT License.
